A quick word about articles!

As we said in the introduction, we want to associate each article with a country. But why does this actually make sense? Well, if we hover our mouse over the nodes in figures 1 and 2 below, we can see that almost all articles among the 20 most clicked can be associated with a country (e.g. English_language with England). The same is true for the 20 least clicked articles (e.g. Afghan hound with Afghanistan). But we also see that the most clicked articles are densely connected with each other, whereas the least clicked articles are not.

This motivates our decision to analyze the Wikispeedia game in terms of countries. In the end, we are interested in whether players of the Wikispeedia game really tend to click more on articles that are associated with Western countries, or whether this pattern is due to the properties of the Wikipedia graph itself. Let’s dive into the analysis and see if we can conclude anything!

Which countries are most represented in the Wikipedia graph?

Now that most articles are associated with a country, we can look at the distribution of those countries. Is there a country that is associated with more articles than the others? (We have our little idea, but let’s check.)

We see that 8 countries account for half of the articles in Wikipedia, namely the US, UK, Australia, France, Germany, Italy, India and China. These are all among the top 10 countries that publish the most (see here!).

With figure 3, we get a first intuition of the cultural bias that is present in the Wikipedia graph: countries that publish the most are also represented by more articles. When players play on this already biased graph, they will obviously click most on articles from those countries. So, later, when we analyze the behavior of the players, we will have to keep this in mind and normalize the players' click counts by the number of articles per country! But this is for later, let’s go on here!

Figure 3: Proportion of articles per country in Wikispeedia (2007)

Let’s now look at the connectivity between countries.

Two countries are said to be connected if at least one article from the first country contains a link to an article associated with the other country.
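
To make this definition concrete, here is a minimal sketch of how such country-to-country connections could be counted. The names `links`, `article_country`, `country_edge_counts` and `filter_edges` are hypothetical, the toy data is made up, and skipping links between two articles of the same country is our assumption, not something the analysis states.

```python
from collections import Counter


def country_edge_counts(links, article_country):
    """Count how many article links go from one country to another.

    links: iterable of (source_article, target_article) pairs.
    article_country: dict mapping an article name to its country.
    """
    counts = Counter()
    for src, dst in links:
        c_src = article_country.get(src)
        c_dst = article_country.get(dst)
        # Assumption: ignore articles without a country and links
        # that stay inside a single country.
        if c_src is None or c_dst is None or c_src == c_dst:
            continue
        counts[(c_src, c_dst)] += 1
    return counts


def filter_edges(counts, min_count):
    """Keep only the connections that occur more than min_count times."""
    return {edge: n for edge, n in counts.items() if n > min_count}


# Made-up toy data, for illustration only.
toy_links = [
    ("Paris", "London"),
    ("Paris", "Big_Ben"),
    ("London", "Paris"),
    ("Paris", "Louvre"),  # same country: skipped
    ("Moon", "Paris"),    # "Moon" has no country: skipped
]
toy_country = {
    "Paris": "France",
    "Louvre": "France",
    "London": "United Kingdom",
    "Big_Ben": "United Kingdom",
}
edge_counts = country_edge_counts(toy_links, toy_country)
strong_edges = filter_edges(edge_counts, 1)
```

Two countries are connected as soon as their pair appears in `edge_counts`; the count itself tells us how often the connection occurs.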

We construct a graph of the Wikispeedia network overlaid on a world map to better visualize the countries’ importance. On this graph (figure 4):

  • each node is a country
  • the size and color of a node are proportional to the number of articles associated with that country
  • edges represent connections between countries in the Wikipedia graph

With the slider in the top left part of the map, we can select edges that occur more than a certain number of times. We can see that the edges that appear most often (more than 130 times) are those connecting the countries associated with the most articles (e.g. USA, UK, France, Germany, Italy, India, Australia, China and Russia).

We also see that those same countries are central hubs of the network. They are part of most edges and are also connected to most other countries. This makes sense since the more articles there are, the more out-links there are and ultimately the higher the chance to be referenced by other articles.

Figure 4: World map of connections between countries

Lastly, let’s look at the in- and out-degree of each country.

For a given article A, the in-degree is the number of links found in other articles targeting A, while the out-degree is the number of outgoing links found in A. The in-degree of a country is then defined as the sum of the in-degrees of its articles. The out-degree of a country is defined the same way.

Thus, the higher the in-degree of a country, the more central it is, meaning that it is more accessible from other countries.

As expected, we observe that those same central hub countries, which account for most of the articles in the graph, also have many more links leading in and out of them.
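
As a quick illustration, the country-level degrees can be computed in a single pass over the links, since summing the in-degrees of a country's articles is the same as counting the links that point to any of them. This is only a sketch: `country_degrees` and the toy data are hypothetical.

```python
from collections import Counter


def country_degrees(links, article_country):
    """Return (in_degree, out_degree) counters keyed by country.

    Each link adds 1 to the out-degree of the source article's country
    and 1 to the in-degree of the target article's country.
    """
    in_degree = Counter()
    out_degree = Counter()
    for src, dst in links:
        if src in article_country:
            out_degree[article_country[src]] += 1
        if dst in article_country:
            in_degree[article_country[dst]] += 1
    return in_degree, out_degree


# Made-up toy data, for illustration only.
toy_links = [
    ("Paris", "London"),
    ("Paris", "Big_Ben"),
    ("London", "Paris"),
    ("Paris", "Louvre"),
    ("Moon", "Paris"),  # "Moon" has no country, but the link still targets France
]
toy_country = {
    "Paris": "France",
    "Louvre": "France",
    "London": "United Kingdom",
    "Big_Ben": "United Kingdom",
}
in_degree, out_degree = country_degrees(toy_links, toy_country)
```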

Figure 5: Node degree of countries

Distribution of countries among source and target articles

This is a crucial analysis. Indeed, neither the source nor the target article is chosen by the player, yet both are counted as clicks in the way we calculated the click count variable. It is thus important to get an idea of the distribution of the countries among those pseudo-clicks, so that we can decide to remove them if the distribution is not uniform.

To investigate this we need to look at the finished paths in order to extract source and target articles.

In the following steps of the analysis the source and target articles will be discarded from the click counts.
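
Here is a minimal sketch of both steps: counting the countries of source and target articles, and counting clicks without them. We assume each finished path is stored as a string of article names separated by `;`, with `<` marking a back-click. The function names and the toy paths are hypothetical.

```python
from collections import Counter


def source_target_counts(finished_paths, article_country):
    """Count the countries of the first (source) and last (target) articles."""
    sources, targets = Counter(), Counter()
    for path in finished_paths:
        steps = path.split(";")
        src_country = article_country.get(steps[0])
        tgt_country = article_country.get(steps[-1])
        if src_country is not None:
            sources[src_country] += 1
        if tgt_country is not None:
            targets[tgt_country] += 1
    return sources, targets


def click_counts_without_endpoints(finished_paths, article_country):
    """Count clicks per country, ignoring the source and target of each path."""
    clicks = Counter()
    for path in finished_paths:
        for article in path.split(";")[1:-1]:
            if article == "<":  # a back-click is not an article
                continue
            country = article_country.get(article)
            if country is not None:
                clicks[country] += 1
    return clicks


# Made-up toy data, for illustration only.
toy_paths = ["Paris;London;Big_Ben", "London;Paris;<;Louvre;Berlin"]
toy_country = {
    "Paris": "France",
    "Louvre": "France",
    "London": "United Kingdom",
    "Big_Ben": "United Kingdom",
    "Berlin": "Germany",
}
sources, targets = source_target_counts(toy_paths, toy_country)
clicks = click_counts_without_endpoints(toy_paths, toy_country)
```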

Figure 6: Top countries by source and target articles in finished paths

Naive analysis of players’ click counts

Now, let us analyze the players’ clicking behavior in the Wikispeedia game.

As seen previously, there is an unequal distribution of articles in Wikipedia: some countries are more represented than others. But are those countries clicked more often by players within the Wikispeedia game? Or are there other countries that are clicked more often? We will now investigate whether we see a bias that is independent of the distribution of countries, and whether the click count can be a good approximation of players’ intentions.

To do so, a first naive approach is simply to look for countries with higher click counts (see figure 7). With this approach, players seem highly biased in the way they play Wikispeedia: some countries like the United States, the United Kingdom, and Australia are represented by enormous dots due to their higher click counts, while others are almost invisible on the map. Edges between countries represent game paths. Darker paths are the most used ones; among those, paths linking the United States to the United Kingdom, Australia, France, China, Germany or Japan dominate. There also seem to be commonly used paths between different countries in Europe.

However, a high click count can simply be due to the high number of articles associated with a particular country within the game. This does not necessarily tell us anything about players’ biases. Therefore, we focus instead on the ratio of the click count to the number of articles, to get a result closer to reality. In figure 8, we see an overrepresentation of some countries like Vatican City, Brazil, or South Africa, which differ from the previous ones. Vatican City is a particular case in this dataset: only one article is associated with this country, so dividing by the number of articles leaves its click count unchanged. It could be considered an outlier, not necessarily indicating anything about players’ biases. The scaled click count map therefore indicates that part of a high click count can simply be explained by the high number of articles associated with a particular country. But scaling creates artefacts like Vatican City, so it does not seem to be the best approach. There appears to be another factor influencing the click count per country, as some countries remain more represented than others even with the scaled version of the click count.

But can we rationally explain a differentially distributed click count? Are there other factors influencing the players’ click counts? Onto the next topic to figure it out!
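
The scaling itself is a one-liner, and a tiny made-up example shows how a country with a single article can jump to the top. This is only a sketch: `clicks_per_article`, the country names and the numbers are all hypothetical.

```python
from collections import Counter


def clicks_per_article(click_counts, article_country):
    """Divide each country's click count by its number of articles."""
    articles_per_country = Counter(article_country.values())
    return {
        country: click_counts.get(country, 0) / n_articles
        for country, n_articles in articles_per_country.items()
    }


# Made-up toy data: Bigland has 10 articles, Midland 2 and Tinyland only 1.
toy_country = {f"big_{i}": "Bigland" for i in range(10)}
toy_country.update({"mid_0": "Midland", "mid_1": "Midland", "tiny_0": "Tinyland"})
toy_clicks = {"Bigland": 100, "Midland": 30, "Tinyland": 20}

scaled = clicks_per_article(toy_clicks, toy_country)
ranking = sorted(scaled, key=scaled.get, reverse=True)
```

In this toy example the raw counts rank Bigland first, while the scaled counts rank Tinyland first, simply because its only article keeps all of its clicks.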

What do we call a dead end?

Dead ends in the context of Wikipedia navigation are articles that players encounter but struggle to move forward from, leading to either backtracking or abandoning the path altogether. Understanding dead ends provides valuable insight into user behavior, article connectivity, and potential biases in navigation patterns. By identifying these points of friction, we can better interpret how graph structures and content influence player decisions.

In this analysis, we will:

  1. Identify countries as potential dead ends based on click counts and mean failure ratio (unique). Are highly connected countries inherently more prone to becoming dead ends due to visibility and accessibility? Or are less-connected countries genuinely more challenging for navigation?
  2. Examine last articles in unfinished paths. Which countries frequently appear at the end of failed navigation attempts? What does scaling these occurrences by outgoing links reveal about genuine dead ends versus highly connected hubs?
  3. Analyze backtracking behavior. What countries or articles trigger backtracking, and how does scaling highlight less obvious dead ends that trap players?

Spotting dead ends through click counts and mean failure ratio

In this part, we sort the countries based on their click count and unique mean failure ratio. To calculate the unique failure ratio, we count each occurrence of a country in unfinished paths only once. This approach eliminates circular patterns and repeated entries, ensuring a clearer and more accurate representation of how often each country contributes to navigation failures.
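
One possible reading of this unique failure ratio, sketched below, is: among all the paths in which a country appears at least once, what fraction are unfinished? This definition is our assumption, as are the names `countries_in_path` and `unique_failure_ratio` and the toy paths; paths are assumed to be `;`-separated article names with `<` marking a back-click.

```python
from collections import Counter


def countries_in_path(path, article_country):
    """Set of countries visited in a path, so each one counts only once."""
    return {article_country[a] for a in path.split(";") if a in article_country}


def unique_failure_ratio(finished_paths, unfinished_paths, article_country):
    """Fraction of the paths containing a country that are unfinished."""
    seen, failed = Counter(), Counter()
    for path in finished_paths:
        seen.update(countries_in_path(path, article_country))
    for path in unfinished_paths:
        countries = countries_in_path(path, article_country)
        seen.update(countries)
        failed.update(countries)
    return {country: failed[country] / seen[country] for country in seen}


# Made-up toy data, for illustration only.
toy_country = {
    "Paris": "France",
    "Louvre": "France",
    "London": "United Kingdom",
    "Big_Ben": "United Kingdom",
    "Berlin": "Germany",
}
toy_finished = ["Paris;London"]
toy_unfinished = ["Paris;Louvre;Paris;Berlin", "London;<;Big_Ben"]
ratios = unique_failure_ratio(toy_finished, toy_unfinished, toy_country)
```

Note how France is counted once for the first unfinished path, even though three of its articles are visited there.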

Uh oh, this plot might be biased. Not only has the United States received an overwhelmingly high number of clicks, but it also has significantly more outgoing links (16,338) than other countries. This likely inflates its visibility and accessibility, making it appear more frequently in player paths. Familiarity bias (e.g., cultural or linguistic factors) further skews the data toward countries like the US and UK. To reduce this bias, we scale the number of clicks by the sum of outgoing links per country. This reveals a more nuanced picture, where less-connected countries such as Greenland, Bolivia, and Brazil emerge as significant dead ends. Scaling sheds light on genuine navigation patterns, helping us differentiate between structural biases and true player challenges.

Figure 9: Top 10 countries by click count and mean failure ratio (scaled vs. unscaled)

Analyzing last article in unfinished paths: which countries trap players?

What happens when a player’s navigation ends unsuccessfully? By analyzing the last articles in unfinished paths, we identify which countries are the most frequent dead ends. Initially, highly connected countries like the United States dominate this list, reflecting their prominence in raw data.

However, scaling by outgoing links tells a different story. Countries like Greenland, Bolivia, and South Africa emerge as true dead ends, suggesting specific navigational patterns or challenges that lead players to abandon these paths. These insights highlight the limitations of raw data in capturing genuine player behavior.
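
As a rough sketch of this step, we can resolve the back-clicks of each unfinished path to find where the player actually stopped, then divide the per-country counts by the country's total number of outgoing links. We assume paths are `;`-separated article names with `<` marking a back-click. The helper names, the toy paths and the out-degree values are hypothetical.

```python
from collections import Counter


def final_article(path):
    """Article the player is on when the path ends, after resolving back-clicks."""
    stack = []
    for step in path.split(";"):
        if step == "<":
            if len(stack) > 1:
                stack.pop()  # go back to the previous article
        else:
            stack.append(step)
    return stack[-1] if stack else None


def scaled_last_counts(unfinished_paths, article_country, country_out_degree):
    """Count last articles per country, divided by the country's out-degree."""
    last = Counter()
    for path in unfinished_paths:
        country = article_country.get(final_article(path))
        if country is not None:
            last[country] += 1
    return {
        country: n / country_out_degree[country]
        for country, n in last.items()
        if country_out_degree.get(country, 0) > 0
    }


# Made-up toy data, for illustration only.
toy_country = {
    "Paris": "France",
    "London": "United Kingdom",
    "Berlin": "Germany",
    "Nuuk": "Greenland",
}
toy_out_degree = {"United Kingdom": 100, "Germany": 50, "Greenland": 2}
toy_unfinished = ["Paris;London", "Berlin;London;<", "Paris;Nuuk"]
scaled_last = scaled_last_counts(toy_unfinished, toy_country, toy_out_degree)
```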

Backtracking behavior: what leads players to retreat?

Before players give up, they often hit a point where they backtrack. By examining the most common articles preceding the “go back” action, we identify key friction points. Unsurprisingly, highly connected countries like the United States and the United Kingdom again dominate the raw data due to their frequent presence in navigation paths.

However, scaling by outgoing links uncovers, once more, a more nuanced picture. While prominent countries remain significant, others like South Africa, Italy, and Spain also emerge as frequent backtracking points. This highlights specific challenges players face in navigation, where even well-connected or moderately connected countries can become roadblocks, revealing the complexity of player decision-making beyond raw prominence. We also observe a relatively similar distribution between last country articles and backtracked country articles. This suggests that the same navigational patterns and challenges that lead players to abandon paths often also influence their decision to retreat, and it emphasizes the importance of connectivity and link structure in shaping player behavior.
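
To illustrate how the articles preceding a “go back” action can be extracted, here is a minimal sketch. It assumes paths are `;`-separated article names in which `<` marks a back-click, and it keeps a stack of visited articles so that consecutive back-clicks are handled correctly. `backtracked_articles`, `backtrack_counts_by_country` and the toy paths are hypothetical.

```python
from collections import Counter


def backtracked_articles(path):
    """Articles the player was on each time they clicked "go back"."""
    stack, abandoned = [], []
    for step in path.split(";"):
        if step == "<":
            if len(stack) > 1:
                # The article on top of the stack is the one being left.
                abandoned.append(stack.pop())
        else:
            stack.append(step)
    return abandoned


def backtrack_counts_by_country(paths, article_country):
    """Count backtracked articles per country over a collection of paths."""
    counts = Counter()
    for path in paths:
        for article in backtracked_articles(path):
            country = article_country.get(article)
            if country is not None:
                counts[country] += 1
    return counts


# Made-up toy data, for illustration only.
toy_country = {
    "Paris": "France",
    "Louvre": "France",
    "London": "United Kingdom",
    "Berlin": "Germany",
    "Nuuk": "Greenland",
}
toy_paths = ["Paris;London;<;Louvre", "Berlin;Paris;London;<;<;Nuuk"]
backtracks = backtrack_counts_by_country(toy_paths, toy_country)
```

These raw counts could then be divided by each country's number of outgoing links, in the same way as for the last articles of unfinished paths.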

Accounting for the influence of the graph

As we saw in the two parts above, the click count metric, even when scaled by the number of articles per country or by the number of outgoing links, is still not a good enough proxy for analyzing the players’ intentions. Indeed, it seems to be heavily influenced by the graph’s structure. We need something more advanced to account for this influence. So now let’s jump into the next section!

What variables influence the click count?

As seen before, countries with more articles have more clicks. This is very much expected, as the more articles a country has, the more likely it is that a user will encounter an article of that country. This is the most obvious example of a confounding variable that influences the click count. But this is not necessarily the only one. Indeed, it could be that other variables like the in-degree, out-degree, or even the categories of the articles have an influence on the click count. For example, if the articles of a given country have higher in-degrees than average, players will see more links to that country, and thus might click on it more often.

Ideal rebalancing

To make sure we account for as many confounders as possible, the ideal thing to do would be to set up some kind of propensity score matching. But how? Usually, propensity score matching is done between the two groups that are compared in an experiment: a treatment group and a control group. However, in our case, we are comparing click counts across countries, meaning we do not have 2, but rather a maximum of 195 groups that are compared with one another (one group per country: only 195 of the 249 LLM-classified countries actually have articles associated with them).

The natural thing to do would then be to simply extend the propensity score matching to the 195 groups! Instead of looking for pairs of articles that have a similar propensity score, we would look for k-tuples, with k being the number of countries. But there are two issues with this approach:

  1. Propensity score matching requires an algorithm that finds maximum cardinality matchings. Although such algorithms exist and run in polynomial time when k=2, the problem is NP-hard when k>2, meaning all currently known exact algorithms for it run in exponential time (pretty bad).
  2. Alright, but our dataset is not that big! Couldn’t we just use an exponential time algorithm and be done with it? Well, another problem would then arise: for a lot of countries, the number of articles assigned to them is 1. This means that we would only be able to create a single k-tuple, and then we would already have exhausted all articles for a lot of countries. That would mean that our rebalanced dataset would contain only one article per country, which of course is not enough to make any kind of analysis.

A simpler approach

This means we need to consider something simpler. We will first analyze how much each variable influences the click count, and then manually normalize the click count by the variables that seem to have the most influence. This is a very naive approach, but it is the best we can do given the constraints of the problem.

Normalizing the click count

A regression analysis shows that the number of links leading into an article is strongly positively correlated with the players’ click counts (\(\rho = 0.78\), \(R^2 = 0.63\)). We will thus only consider the two confounders that seem to have the most influence on the click count: the number of articles per country and the in-degree of each article. We define a new metric, the normalized click count: for each country, we divide the total number of clicks made on articles of that country by the total in-degree of that country. Note that, given the high correlation between the in-degree of a country and the number of articles in that country (the more articles, the more links to those articles), this new metric essentially accounts for both confounders at once.

Although it might seem like the way we computed this new metric is quite arbitrary, it actually still makes a lot of sense: we are essentially counting, for a given country, the average number of clicks that a single link to that country receives.
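
Here is a minimal sketch of that computation, starting from per-article clicks and in-degrees. The name `normalized_click_count`, the input dictionaries and the numbers are hypothetical.

```python
from collections import defaultdict


def normalized_click_count(article_clicks, article_in_degree, article_country):
    """Total clicks per country divided by the total in-degree of its articles."""
    clicks = defaultdict(int)
    in_degree = defaultdict(int)
    for article, country in article_country.items():
        clicks[country] += article_clicks.get(article, 0)
        in_degree[country] += article_in_degree.get(article, 0)
    # Countries whose articles receive no link at all are left out,
    # since the ratio is undefined for them.
    return {c: clicks[c] / in_degree[c] for c in in_degree if in_degree[c] > 0}


# Made-up toy data, for illustration only.
toy_country = {"a1": "Country_X", "a2": "Country_X", "b1": "Country_Y"}
toy_clicks = {"a1": 30, "a2": 10, "b1": 12}
toy_in_degree = {"a1": 15, "a2": 5, "b1": 3}
normalized = normalized_click_count(toy_clicks, toy_in_degree, toy_country)
```

In this toy example, Country_X gets more clicks overall (40 against 12), but each link to Country_Y is clicked twice as often on average.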

As we can see in the graph below, this already looks a lot more interesting. The top countries are no longer dominated by Western countries. We see countries like South Africa, Jordan or Mexico among the top. The USA is still pretty high (ending up in second position), but it and Canada are now the only two Western countries left in the top 10.

Figure 12: World map of the normalized click count per country

PageRank as a way to account for a lot of confounders at once

We will now dive into a last attempt at accounting for the influence of the graph on our analysis of the players’ behavior. To do so, we will use the PageRank algorithm. This algorithm simulates a random player who starts on a random article and then randomly clicks on links. It then outputs a score for each article, which is the probability that this random player is on that article at any given time. We will then compare this PageRank probability with the probability that a player from our dataset clicked on that article (we call this second probability the player rank). For a given article, if the two probabilities are very different, it means that players from our dataset click more often (or less often) on that article than would be expected from a random player.
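
To make the comparison concrete, here is a small self-contained sketch of PageRank by power iteration, next to a player rank computed from click counts. It is an illustration, not the exact setup of our analysis: the damping factor of 0.85 (the usual default), the uniform handling of articles without outgoing links, the function names and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """PageRank scores by power iteration on a list of (source, target) links."""
    nodes = sorted({node for edge in links for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in links:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Probability mass sitting on articles without outgoing links
        # is spread uniformly over all articles.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping + damping * dangling) / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def player_rank(click_counts):
    """Turn raw click counts into click probabilities."""
    total = sum(click_counts.values())
    return {article: n / total for article, n in click_counts.items()}


def rank_difference(player, random_walker):
    """Player rank minus PageRank, for each article in the graph."""
    return {a: player.get(a, 0.0) - p for a, p in random_walker.items()}


# Made-up toy graph and click counts, for illustration only.
toy_links = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
toy_clicks = {"A": 5, "B": 4, "C": 1}

pr = pagerank(toy_links)
player = player_rank(toy_clicks)
diff = rank_difference(player, pr)
```

In this toy graph, B and C are structurally identical, so they get the same PageRank. The players nevertheless click B more often than C, and that difference shows up in `diff`, which is exactly the kind of signal the graph structure alone cannot explain.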

What confounders do we account for?

By studying the difference between the PageRank probability and the player rank probability, we are essentially accounting for all confounding variables coming from the graph structure. Indeed, given that both our players and the PageRank algorithm were given the exact same graph, if they behaved differently, it cannot be explained by variables coming from the graph!

However, it is important to note that we are only accounting for confounders in the structure of the graph, and not in the content of the articles. Variables like categories or titles might have a significant influence on the players’ behavior, and their distribution might change from one country to another. This is something that we cannot account for with the PageRank algorithm, and that we decided to ignore in this analysis.

PageRank analysis

As we can see in the plot in figure 13, the player rank is very similar to the PageRank for the top 10 countries. This shows a strong influence of the graph structure on the players’ behavior. In figure 14, we subtract the PageRank from the player rank. With that, we are essentially doing a change of reference frame, now computing how much more (or less) often a player clicked on an article compared to a random player. The differences seem very small, but it is good to remember that they must be interpreted as probabilities. To make sure the difference is significant, we computed a chi-square test (the null hypothesis being that the player behaves exactly like a random walker), and the p-value was found to be very close to 0 (\(p \ll 0.05\)). This indicates that although the players are highly influenced by the graph, they still have some intrinsic biases. However, those biases are very small (less than 4% in probability), so most of the players’ behavior can be explained by the graph structure.
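
For the curious, here is a rough sketch of what such a chi-square goodness-of-fit statistic looks like: the expected clicks are the total number of clicks multiplied by the random walker's probabilities. The function name and the numbers are hypothetical, and this is not necessarily the exact setup of our test. In practice, a p-value could be obtained by passing the observed and expected counts to a library routine such as `scipy.stats.chisquare`.

```python
def chi_square_statistic(observed_clicks, expected_prob):
    """Chi-square goodness-of-fit statistic and its degrees of freedom.

    observed_clicks: dict mapping a category (article or country) to a click count.
    expected_prob: dict mapping the same categories to probabilities under the
    null hypothesis (here, the random walker); all must be positive.
    """
    total = sum(observed_clicks.values())
    statistic = 0.0
    for key, prob in expected_prob.items():
        expected = total * prob
        observed = observed_clicks.get(key, 0)
        statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(expected_prob) - 1
    return statistic, degrees_of_freedom


# Made-up toy numbers, for illustration only.
toy_observed = {"X": 60, "Y": 30, "Z": 10}
toy_expected_prob = {"X": 0.5, "Y": 0.3, "Z": 0.2}
statistic, dof = chi_square_statistic(toy_observed, toy_expected_prob)
```

The larger the statistic for a given number of degrees of freedom, the less plausible it is that the players click like a random walker.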

Let’s wrap up!

We were able to show that the Wikipedia graph accurately represents world knowledge, and that the countries dominating the web are also the major world powers (i.e. USA, UK, China). Since the Wikispeedia game is built on Wikipedia, it is biased toward the countries producing the most publications.

Then, we focused on the behavior of players and tried to show that they are intrinsically biased toward some countries. A first naive analysis showed that the players’ clicks are strikingly skewed towards the major world powers. But after scaling and accounting for multiple factors influencing players’ behavior, all countries seem to be approximately equally represented in terms of player clicks. We observe small preferences towards some countries, but they are too small to be considered in our analysis. This indicates that players are mostly biased due to the graph structure and do not have substantial additional intrinsic biases. The passive hypothesis is thereby validated.

📢📢📢📢📢 The take home message here is: a biased web makes you biased 📢📢📢📢📢!!