Analyzing cultural biases in Wikispeedia

Data story made in the context of the course at

A story about biases

Today, content on the internet is still mostly skewed towards Western societies [1] [2]. Interestingly, those same societies also produce most human knowledge, as proxied by the number of citable publications [3]. Wikispeedia is an online game built on 4604 Wikipedia articles from 2007, in which players navigate from a given start article to a target article by following the links contained in the articles. In this project, we investigate how players navigate through the game and how this navigation is influenced by the production of scientific knowledge in the world. More precisely, we want to understand whether players are attracted towards articles linked to countries that produce a lot of scientific knowledge.

Why does this make sense?

As a small appetizer, let us take a quick look at Figures 1 and 2. We can already see the link between the two issues at hand: the distribution of articles per country in the Wikispeedia graph seems very closely related to the distribution of scientific knowledge production in the world. But how strong is this link? And how does it impact the players’ behavior in the game? Let’s dive into the details to find out!

What is the plan?

The first step will be to understand the navigation patterns of players in the game. For this, we will compare two hypotheses, namely the “passive” and the “active” hypothesis:

  • “Passive” hypothesis: we assume that players “passively” play on a graph that over-represents countries producing a lot of scientific knowledge. The players are not biased in themselves; they are only influenced by the graph. Thus, by removing the bias of the graph (i.e., by controlling as much as possible for the confounders of the graph), there shouldn't be any bias detectable in the players’ behavior.
  • “Active” hypothesis: players “actively” add their own intrinsic biases when playing on that graph. That is, players are inherently attracted towards some countries, while disregarding others. Under this hypothesis, the players’ preference for some countries would still be visible after removing the graph bias.
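
To make the distinction concrete, here is a minimal sketch of one way the two hypotheses could be told apart. It is an illustration under assumptions, not the method used in this project: it supposes that we already have click counts per country from the players, plus click counts from some graph-only baseline (for instance a simulated navigation that ignores countries). The function names `click_shares` and `preference_ratio` and the toy numbers are all hypothetical.

```python
def click_shares(clicks_by_country):
    """Turn raw click counts per country into shares that sum to 1."""
    total = sum(clicks_by_country.values())
    return {country: n / total for country, n in clicks_by_country.items()}


def preference_ratio(observed_clicks, baseline_clicks):
    """Ratio of the players' click share to the baseline share, per country.

    Assumption: the baseline captures what the graph alone would produce.
    A ratio close to 1 everywhere is what the "passive" hypothesis predicts;
    ratios clearly above or below 1 point towards the "active" hypothesis.
    """
    observed = click_shares(observed_clicks)
    baseline = click_shares(baseline_clicks)
    return {
        country: observed.get(country, 0.0) / share
        for country, share in baseline.items()
        if share > 0  # skip countries the baseline never reaches
    }


# Toy, made-up counts for two fictional countries.
players = {"Country A": 60, "Country B": 40}
graph_only = {"Country A": 50, "Country B": 50}
ratios = preference_ratio(players, graph_only)
```

In this toy example, the ratio is above 1 for "Country A" and below 1 for "Country B", which is the pattern the “active” hypothesis would predict. How the baseline is built is the hard part, and this sketch leaves it open.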

Once this first analysis is done, we will be in a good position to investigate what the players’ intrinsic bias is, and whether it is related in any way to the production of scientific knowledge in the world.

Now, to succeed in this quest, we need to meet three requirements. First, as we are working with a global database and intend to investigate worldwide geographical biases, we need to associate each article with a country. Next, to quantify the players’ behavior in the game, we will use their clicking patterns, more precisely the number of times each article is clicked. Finally, as we are interested in showing how the production of scientific knowledge impacts players’ behavior, we need to match each of these countries to the number of publications produced there during the year 2007.
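
The three requirements above can be sketched as a small pipeline. This is only an illustration under assumptions: it supposes that each played path is a string of article names separated by `;`, with `<` marking a back click, and that an article-to-country mapping and a per-country publication table are already available. The helper names, the toy paths and the publication numbers are hypothetical placeholders, not real data.

```python
from collections import Counter


def count_clicks(paths):
    """Count how many times each article appears across all paths."""
    clicks = Counter()
    for path in paths:
        for step in path.split(";"):
            if step != "<":  # assumption: "<" is a back click, not an article
                clicks[step] += 1
    return clicks


def clicks_per_country(clicks, article_to_country):
    """Aggregate article clicks by country, skipping unmapped articles."""
    per_country = Counter()
    for article, n in clicks.items():
        country = article_to_country.get(article)
        if country is not None:
            per_country[country] += n
    return per_country


def join_with_publications(per_country, publications):
    """Pair each country's clicks with its publication count (inner join)."""
    return {
        country: (n, publications[country])
        for country, n in per_country.items()
        if country in publications
    }


# Toy inputs: two made-up paths, a hand-written mapping, placeholder numbers.
paths = ["Paris;France;<;Berlin", "Berlin;Germany"]
article_to_country = {
    "Paris": "France", "France": "France",
    "Berlin": "Germany", "Germany": "Germany",
}
publications_2007 = {"France": 100, "Germany": 200}  # not real figures

per_country = clicks_per_country(count_clicks(paths), article_to_country)
table = join_with_publications(per_country, publications_2007)
```

Articles without a country and countries without a publication count are silently dropped here. In a real analysis, it would be worth tracking how many are lost at each step.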

Let’s jump to the next section to gain precious insights into our methods and data preprocessing!