In this article of The Practical Corner, we’re taking a small break from simulating neurons and doing something light instead. We’ll be looking at how to generate a word cloud from site traffic data, what we can learn from such visualizations, and some quirks that might pop up when working with text data across more than one language. As a bonus, we’ll get some idea about what people search for before landing on this site and we’ll try to see if we can spot some cross-language differences. Fair warning, this isn’t groundbreaking data science, but a vanilla intro to the topic of bilingual data and language detection. If you’d like to follow along or try out fancier stuff, you can do so with the code and data from here.

Step one: get the data, make a cloud

For today, we’ll be using the search queries and impression counts for our site in April 2025, i.e., what people typed into Google before they landed here and how often that happened.

Unlike more typical text data, which contains capitalized words and punctuation, the search queries are relatively clean. Still, it’s good practice to clean the input data before analyzing it, and that’s what I also did here: I changed everything to lower case (just in case), removed punctuation (some queries contained apostrophes or dashes), and made sure there were no extra spaces between words.
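The three cleaning steps above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not necessarily the code used for this post: the function name clean_query is made up for illustration, and dropping apostrophes while turning other punctuation into spaces is an assumption about what "removed punctuation" means in practice.

```python
import re
import string

# Every ASCII punctuation character becomes a space, so "nocebo-effect"
# splits into two words instead of fusing into one.
_PUNCT_TABLE = {ch: " " for ch in string.punctuation}
# Assumption: apostrophes (straight and curly) are dropped entirely,
# so "what's" becomes "whats" rather than "what s".
_PUNCT_TABLE["'"] = None
_PUNCT_TABLE["\u2019"] = None
_PUNCT_TO_SPACE = str.maketrans(_PUNCT_TABLE)


def clean_query(query: str) -> str:
    """Lower-case a query, strip punctuation, and collapse extra spaces."""
    lowered = query.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", no_punct).strip()


print(clean_query("  What's   the Nocebo-Effect? "))
```

Note that string.punctuation only covers ASCII, so any other Unicode punctuation would pass through untouched in this version.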

Then I jumped directly into action and threw everything into our first word cloud (and yes, I did shape it like a brain, because this is a neuroscience website and we really like brains here. Hmm, something we have in common with zombies, it seems.)
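For anyone following along, the input to a cloud like this is essentially a mapping from query to impression count. Below is a small sketch of that aggregation step, assuming the data arrives as (query, impressions) pairs; the function name build_frequencies and the numbers in the example are invented for illustration and are not the real site data.

```python
from collections import Counter


def build_frequencies(rows):
    """Sum impression counts per query.

    `rows` is any iterable of (query, impressions) pairs. Queries that
    become identical after cleaning are merged into a single entry.
    """
    frequencies = Counter()
    for query, impressions in rows:
        frequencies[query] += int(impressions)
    return dict(frequencies)


# Made-up numbers, for illustration only.
rows = [("glial cells", 10), ("nocebo effect", 4), ("glial cells", 5)]
print(build_frequencies(rows))
```

A dictionary like this can then be handed to a plotting library. One common option is the wordcloud package, whose WordCloud class has a generate_from_frequencies method and accepts a mask image to control the shape, which is one way to get a brain outline.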

A word cloud in the shape of a brain. The largest entries, shown in red, are "glial cells", "nocebo effect" and "replication crisis".

This is by no means a sophisticated analysis, but it’s pretty and it allows us to make a couple of observations right off the bat:

  • unsurprisingly, people find us both for English and Romanian queries;
  • we see that the three most frequent searches are in English, and they refer to glial cells, the replication crisis (most likely the one in psychology, which came to wide attention around 2015), and the nocebo effect (the opposite of the placebo effect);
  • English also appears to be the dominant language, with the most queries, though we’ll check this in the next step;
  • finally, there are a couple of large queries in Romanian as well, namely ce este ptsd (what is PTSD) and neurofeedback pareri (opinions about neurofeedback).

So far, so good, but it looks like we really need to analyze the two languages separately.

Step two: language split

To do that, I need to split the queries by language. I could do this manually, but:

  1. it’s incredibly boring;
  2. if there were a lot of queries, I’d die of old age before finishing;
  3. if people found us in many languages I barely recognize, I wouldn’t even know where to begin.

Luckily, there are automatic language detection packages in Python. Since I’m no expert in the topic, I started with the easiest out-of-the-box solution, a little package called langid.py. This has a function which automatically classifies text into one of 97 different languages and it runs quite fast. So these are the results after splitting for English:

A word cloud in the shape of a brain, containing the results for the English language after the language classification step. The largest entries, shown in red, are now "replication crisis", "hill bill" and "types of glial cells".

and for Romanian:

A much sparser word cloud in the shape of a brain, containing only the queries for the Romanian language after the language classification step. The largest entries, shown in red, are "ce este ptsd", "antidepresive cu serotonina" and "harta gustului".
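The language split behind these two clouds can be sketched as follows. To keep the example runnable without installing anything, the classifier is passed in as an argument; split_by_language is a hypothetical helper name, and the only assumption about the classifier is that it returns a (language code, score) pair, which is the shape langid.classify returns.

```python
from collections import defaultdict


def split_by_language(queries, classify):
    """Group queries by the language code that `classify` assigns.

    `classify` is any callable that takes a string and returns a
    (language_code, score) pair.
    """
    groups = defaultdict(list)
    for query in queries:
        language, _score = classify(query)
        groups[language].append(query)
    return dict(groups)


# With langid installed (pip install langid), the call would look like:
#   import langid
#   groups = split_by_language(cleaned_queries, langid.classify)
#   english = groups.get("en", [])
#   romanian = groups.get("ro", [])
```

Keeping every language code in the result, rather than only "en" and "ro", makes it easy to see later which queries ended up somewhere unexpected.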

The differences are quite striking, but let’s go through them one by one and check whether they’re all coming from the underlying data and not something that went wrong during the analysis. In English:

  • the word cloud is still very rich;
  • we identify some of the queries from the aggregated cloud, particularly those related to glial cells and the replication crisis;
  • we also see more prominently some searches related to hill bill. This is definitely in relation to our article about Bill from King of the Hill and borderline personality disorder;
  • if we zoom in a lot, we can also see some funnier searches, such as how to tell if a fuse is bad (don’t know, but maybe I should learn?), or what will be invented in the next 100 years (good question, if I knew the answer, I’d hopefully be rich);
  • on a more serious note, the nocebo effect is gone – that’s weird;
  • and we see queries such as ptsd simptome – that’s in Romanian and it means “PTSD symptoms”, but it doesn’t belong here.

From the last two bullet points above, we already have a strong indication that something must’ve gone awry, but let’s look at the Romanian word cloud and we’ll discuss all the problems together at the end. This cloud:

  • is much sparser, meaning there are fewer search queries;
  • we see clearly now that people want to learn about PTSD;
  • we also find out that they’re interested in the effects of alcohol on the brain, serotonin-based antidepressants, and the taste map;
  • again, it’s problematic that neurofeedback pareri is gone;
  • and we have an English query (membrane fluidity) where it clearly doesn’t belong.

The last two points, namely the missing and misclassified queries, point to potential issues during the language classification part.

Step three: diagnosis and potential solution

The misclassified queries are easy to spot (we see them in the corresponding word clouds). To confirm that the missing ones are also a symptom of a wrongly attributed language, we can look at the queries that were classified as neither English nor Romanian. Sure enough, inspecting these results shows that, for example, “nocebo effect” was classified as Italian and “neurofeedback pareri” as French. Many other queries were also wrongly classified by langid.
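That sanity check is simple to automate. The sketch below collects every query whose label falls outside the languages we expect; unexpected_labels is a made-up helper name, and the four labelled examples are the ones mentioned in this post, not the full dataset.

```python
from collections import defaultdict


def unexpected_labels(labelled, expected=("en", "ro")):
    """Return {language: [queries]} for labels outside `expected`.

    `labelled` is an iterable of (query, language_code) pairs.
    """
    leftovers = defaultdict(list)
    for query, language in labelled:
        if language not in expected:
            leftovers[language].append(query)
    return dict(leftovers)


labelled = [
    ("replication crisis", "en"),
    ("ce este ptsd", "ro"),
    ("nocebo effect", "it"),
    ("neurofeedback pareri", "fr"),
]
print(unexpected_labels(labelled))
# {'it': ['nocebo effect'], 'fr': ['neurofeedback pareri']}
```

This only catches queries that landed in a third language. A Romanian query labelled as English, like ptsd simptome, still has to be spotted by eye. As an aside, langid also offers a set_languages function to restrict the candidate languages, which might reduce the stray Italian and French labels, although I haven't tested whether it helps on this data.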

But why? Well, for a couple of reasons. Automatic language classification is no easy feat, and the task is even harder for short text like ours. The less context you have, the more difficult it is to make an accurate decision, especially for words that are identical across multiple languages, like “nocebo” or “neurofeedback”. Another reason is that langid is not exactly the sharpest tool in the shed. If you have a look at this comparison, you’ll see that tools like fastText and CLD3 tend to perform better, especially on short inputs.

In this case, the solution would be to pick another tool, use it to classify the queries, and re-plot the word clouds. And if there are still misclassifications that no tool can handle, one would have to manually assign the queries to their correct languages. However, I’ve already gotten what I needed from this analysis, so the follow-up will be left as an exercise for the curious reader.

Conclusion and other improvements

I think word clouds are great for getting a quick impression of the underlying data, and they look pretty neat too. However, even an analysis as simple as this one tends to be complicated by the messiness of real-world data. As we saw above, the text needs to be cleaned and brought into a standardized form.

Side note: we didn’t even consider semantic equivalence. For example, “neurofeedback pareri” and “pareri neurofeedback” mean the same thing, “opinions about neurofeedback”. But the order of the words is reversed, so they’re treated as distinct searches. Similarly, misspelled words are currently also treated separately. Ideally, all of these should be clustered and aggregated to get a clearer picture.
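The word-order case is the easiest of these to handle. Here is a minimal sketch, assuming the frequencies are stored as a query-to-count dictionary: each query is reduced to a key with its words sorted alphabetically, and counts that share a key are summed. Both function names are invented for this example.

```python
from collections import Counter


def order_insensitive_key(query: str) -> str:
    """Sort the words of a query so that word order no longer matters."""
    return " ".join(sorted(query.split()))


def merge_reordered(frequencies):
    """Sum the counts of queries that contain the same words in any order."""
    merged = Counter()
    for query, count in frequencies.items():
        merged[order_insensitive_key(query)] += count
    return dict(merged)


# Made-up counts, for illustration only.
print(merge_reordered({"neurofeedback pareri": 7, "pareri neurofeedback": 3}))
# {'neurofeedback pareri': 10}
```

The trade-off is that the sorted key is not always a natural phrase to display, and it would wrongly merge queries where word order changes the meaning. Misspellings need a different approach, for instance fuzzy matching with difflib.get_close_matches from the standard library, which I haven't tried on this data.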

Working across languages adds another layer of complexity. Automatic classification, especially for short text, is a non-trivial problem that needs the right mix of proper tool selection, sanity checks, and, sometimes, manual input.

All in all, even a “quick and simple” project like this can teach us a lot: about structuring a task, checking assumptions, and spotting where tools fall short. And if it also gives us a pretty picture to show at the end, that’s just a bonus.

What did you think about this post? Let us know in the comments below. And if you’d like to support our work, feel free to share it with your friends, buy us a coffee here, or even both.
