To play Alias, follow this link: stianjo.no/also-known-as

There are many variations of the game in which a team guesses a word that only one team member knows. One such game is Alias from Tactic, which is well known in Norway. I once searched for an online implementation of it and found nothing. (For English speakers, I guess say-another-way.online is a fully digital version you could try.) I wanted it in Norwegian, and it doesn’t need any rules implemented; just cards with some words on them is enough!

To create a word game we need some words. I thought about using a dictionary, but that contains too many advanced words. I knew it was possible to count word frequencies on Wikipedia, and decided that the most frequent words would be well known and hopefully suitable for a game of Alias. Wikipedia actually has a list of the most frequently used words at this link, but I didn’t know that at the time… This requires some serious over-engineering!

Mining Wikipedia for words

I read the official download page and first downloaded nowiki-20210101-pages-articles.xml. Since neither pandas nor Apache Spark can import XML out of the box, I decided to go for the JSON format instead. After some trial and error with converting wiki text, I found that the CirrusSearch option on this page is a JSON dump with expanded wiki text. So I ended up downloading nowiki-20210111-cirrussearch-content.json.gz.

I quickly confirmed that Python could not hold the entire JSON file in memory, which is what pandas tried to do. Something beefier is Apache Spark, and I chose it simply because I had used it before. It is supposed to be a Big Data tool you can run on standard off-the-shelf computers. Since it is meant for large data, it can keep data on disk when it does not fit in memory. And of course it supports parallelism. Although the real power comes from the ability to run in a cluster, I will just use my single computer.
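To illustrate why streaming helps with a file this size, here is a minimal sketch that reads a gzipped dump one line at a time with only the Python standard library. This is not the Spark setup used in the analysis. It assumes the CirrusSearch dump uses the Elasticsearch bulk layout, where action lines such as {"index": ...} alternate with article lines carrying fields like title and text. The function name iter_articles is made up for this example.

```python
import gzip
import json
from typing import Iterator


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article dict at a time from a gzipped, line-delimited JSON dump."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumption: bulk-format action lines look like {"index": {...}}
            # and carry no article text, so they are skipped.
            if "index" in record and "text" not in record:
                continue
            yield record
```

Because the function is a generator, only one line is held in memory at a time, so something like `for article in iter_articles("nowiki-20210111-cirrussearch-content.json.gz")` should work even when the whole file would not fit in RAM.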

I used a Jupyter notebook for the analysis, and you can read the notebook with text and code on GitHub; it will actually render it for you!

I spent some hours cleaning the data. Although the dataset didn’t contain wiki text, there were still some templates that showed up as common words and phrases. I gave up on removing these templates.

I decided to use spaCy to map words to their lemmas. Doing it in Spark was difficult for reasons explained in this blog post, so I decided to dump all the data and run spaCy on batches of it. I spent a few hours debugging the write to disk because of a Spark error that only affects Windows. The dump ended up in 44 files. It took 10 minutes and 30 seconds to run spaCy on one file. In total it took around 8 hours to run on all files, so it is highly unoptimised!
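The batch lemmatisation step could look roughly like the sketch below. It is not the code from the notebook. The helper names lemmatise and load_norwegian_pipeline are made up for this example, and it assumes spaCy's small Norwegian Bokmål model nb_core_news_sm has been installed separately.

```python
from typing import Iterable, Iterator, List


def lemmatise(texts: Iterable[str], nlp, batch_size: int = 100) -> Iterator[List[str]]:
    """Yield the lowercased lemmas of each text, one list per text."""
    # nlp.pipe streams the texts through the pipeline in batches,
    # which is usually faster than calling nlp(text) once per text.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # Keep alphabetic tokens only, dropping numbers and punctuation.
        yield [token.lemma_.lower() for token in doc if token.is_alpha]


def load_norwegian_pipeline():
    """Load the Norwegian model (assumed to be installed), without the NER step."""
    import spacy  # imported lazily so lemmatise() has no hard dependency on spaCy

    return spacy.load("nb_core_news_sm", disable=["ner"])
```

With the model in place, `lemmatise(texts, load_norwegian_pipeline())` would produce one list of lemmas per article. Disabling pipeline components that are not needed is one of the simpler ways to cut the running time.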

Not surprisingly, the top words were just stopwords. I ended up removing both English and Norwegian stopwords. The top 10 words ended up being these:

besøke
norsk
mye
annen
under
to
artikkel
få
hos
år

If we compare with the frequency list on Wikipedia, some of the same words are at the top. The Wikipedia list has not had stopwords removed, it is from 2011, and it does not seem to have been lemmatised either. In the top 250 of the Wikipedia list we find different inflections of the word “artikkel” (article), which could explain why I have artikkel in my top 10.
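The counting and stopword removal behind a list like this can be sketched in a few lines of plain Python. This is only an illustration, not the Spark code from the notebook. The tiny STOPWORDS set is a stand-in for full English and Norwegian stopword lists, and top_words is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Stand-in stopword set; a real run would use complete Norwegian and English lists.
STOPWORDS: Set[str] = {"og", "i", "er", "det", "en", "the", "of", "and"}


def top_words(lemmas: Iterable[str], stopwords: Set[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent lemmas, ignoring stopwords and non-alphabetic tokens."""
    counts = Counter(
        lemma.lower()
        for lemma in lemmas
        if lemma.isalpha() and lemma.lower() not in stopwords
    )
    return counts.most_common(n)
```

Counting lemmas instead of raw tokens means all inflections of a word such as artikkel add up to a single entry, which is one reason a lemmatised list can rank words differently from a raw frequency list.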

Some limitations with my list (top 1000) are:

  1. I didn’t remove the templates, and I strongly suspect the templates have skewed the result. For example I believe “wikipedia”, “wikispecies”, “wikimedia” and “commons” are mentioned in many templates, and they will thus have been counted more than they should.
  2. I didn’t manage to preserve the capitalisation of names and places. These are uncommon in an Alias game, so ideally I should have removed them.
  3. It contains random English words.
  4. It contains all the months, and probably a lot of the number words.

The first three limitations I find hard to fix in code, which is why I will just accept this list and use it in the game. I manually removed the English words I found as well as the Wikipedia-specific ones.

Creating the UI

Using React with TypeScript, I created a simple UI with swipeable cards. The code can be found on GitHub and you can play it (in Norwegian only) at stianjo.no/also-known-as/.

[Image: Alias online on a phone]


Last updated April 1 2021

Comments on this text? Create an issue on Github!