The main purpose of this article is to explain the basic theoretical concepts of topic modeling and to walk through an example of building such a model on a collection of articles.

Topic modeling is an efficient way to discover the topics that occur in a corpus (collection of documents). A topic is defined as a collection of words that frequently appear together in the same context. Words can have a different meaning depending on the context. Let's assume the following example, where we have a collection of 4 articles and the resulting 2 topics:

You'll notice that we have a few homonyms. Spring and Hibernate are both software development frameworks and also have a more natural meaning. Topic modeling must take that into account and detect the meaning/topic that the word is used in.

In our example the document collection is made up of all the articles from an internal newsletter that is sent every month here at Poppulo. We're going to identify the topics used in these articles and see if we can find any interesting information: what the main topics are, whether they vary over time, and whether there's any correlation between the weight of a topic in an article and its click rate. Running such a model gives you the collection of topics (each topic having a weight in the corpus/document), and for each topic we get the collection of words it's made from (again, each word having a weight assigned, some words being more important in the topic than others).

You can see this clearly illustrated in the "Health" topic above. The words having a bigger weight are displayed in a bigger font. When doing the analysis, the topics are not predefined, they are automatically built from the context. What's predefined is the number of topics. This may seem arbitrary (given that we don't know what the actual topics are) but you can try different values until you're satisfied with the result.
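The output structure described above can be sketched as plain data. This is a hypothetical illustration; all topic names, words and weights here are made up:

```ruby
# Hypothetical shape of a topic model's output: each topic is a set of
# weighted words, and each document is a weighted mix of topics
# (all names and numbers below are invented for illustration)
topics = {
  "Health" => { "flu" => 0.12, "virus" => 0.09, "wash" => 0.05 },
  "Sales"  => { "team" => 0.14, "wins" => 0.07, "year" => 0.06 }
}
doc_topics = { "article_42" => { "Health" => 0.7, "Sales" => 0.3 } }

# The words with the biggest weights dominate a topic's word cloud
puts topics["Health"].max_by { |_word, weight| weight }.first  # flu
```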

A bit of theory

Topic modeling is easy to use and complex to understand, which is always a dangerous combination. It is very easy to come to the wrong conclusions based on it!

There are different topic models in use, but the most common and simplest is LDA (Latent Dirichlet Allocation), and it's the one we're going to use in our example. I find the math behind it quite complex, but I wanted to understand at least 2 basic things: how it works and where its probabilistic nature comes from.

To achieve this, LDA uses the Gibbs sampling algorithm. The steps of building the model are the following:

  1. Create the topics (remember that we know the number of topics, this being the input for the model)
  2. Randomly assign words to topics
  3. For each word (W) in each document (D) and for each topic (Z) calculate the probability that the word W belongs to the topic Z
  4. Use this probability to reassign the word (W) to another topic (Z): the one with the highest probability.
  5. Repeat steps 3 and 4 until the model becomes consistent (the words are stable in topics)

This is basically what probabilistic topic modeling is. You can actually run the model and get data out of it without understanding what it's all about. Before doing that, though, you should be aware of LDA's assumptions:

  1. "bag of words" assumption - LDA assumes that the order of words doesn't count. This is quite restrictive and depending on the goals of the research (for example, language generation) it might be a deal breaker. You can read more here if you're interested in relaxing this assumption.
  2. the number of topics is assumed to be known - this shouldn't be too much of a problem since we can try different values until we're satisfied with the results
  3. probabilistic nature of the process - this was explained in the section above and it should be taken into account when deciding how to use the results.
  4. expensive - due to its iterative nature building the topic model might take a lot of computer resources and time. In our example this is far from being a problem since we only have 400 documents (articles) which are rather small in size

Getting the data

We're going to need 2 sets of data:

  1. the list of articles. In our system an article's content is made of 2 parts: synopsis and body. The synopsis is displayed in the newsletter e-mail, and since this is what the employee sees before clicking through to the full article, I've decided to use it in this research.
  2. the click rate for each article. We calculate the click rate as
Relative click rate = Number of clicks for the article / Average number of clicks for articles included in the email
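As a small sketch, here is the formula above applied to some made-up click counts for three articles included in one email (article names and numbers are invented):

```ruby
# Hypothetical click counts for the articles in a single newsletter email
clicks = { "article_1" => 120, "article_2" => 80, "article_3" => 40 }

# Average number of clicks across the articles in this email
average = clicks.values.sum.to_f / clicks.size  # => 80.0

# Relative click rate = article clicks / average clicks
relative_click_rate = clicks.transform_values { |c| c / average }
# => 1.5, 1.0 and 0.5 respectively
```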

Preparing the data

As with any statistical research, a good part of the time is spent on preparing the data. The main problem with our data was that the article synopsis is in HTML and needed to be cleaned up; otherwise we'd risk having topics made of words like "div" and "span". Initially I tried html2text, but it didn't quite work out the way I was hoping, so I ended up writing a small Ruby script using Nokogiri.

require 'rubygems'
require 'nokogiri'

# Parse the HTML export and keep only its text content
document_html = Nokogiri::HTML(File.open("synopsisHTML.csv"))
document_text = document_html.text

# Write the cleaned-up text to a new file
File.open("synopsis.csv", 'w') { |file| file.write(document_text) }

Run topic model in Mallet

There are different software implementations of LDA. For our example I've used Mallet. It allows estimating an LDA model on a new set of documents as well as applying an existing model to new documents (topic inference), using a highly scalable implementation of Gibbs sampling.

First we'll import the data into Mallet format:

 mallet import-file --input synopsis.csv --output synopsis-input.mallet --keep-sequence --remove-stopwords

The main option here is remove-stopwords, which tells Mallet to ignore a standard list of very common English words like the, is, at etc. (this operation is typical in natural language processing). Mallet gives the option to extend the default dictionary in case the analyzed text contains some common domain-specific stopwords (using the --extra-stopwords option).
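For illustration, stopword removal amounts to something like the following sketch (the stopword list here is a tiny assumed sample, not Mallet's full English dictionary):

```ruby
# A tiny, assumed stopword list; Mallet ships a much larger one
STOPWORDS = %w[the is at its in].freeze

text = "the flu is at its worst in the cold season"

# Drop every token that appears in the stopword list
filtered = text.split.reject { |token| STOPWORDS.include?(token) }

puts filtered.join(" ")  # flu worst cold season
```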

It is also possible to reuse the results of a previously trained model for a new document (topic inference). In this case use the --use-pipe-from option when importing. One could use this feature to train the model on a large collection of articles (documents) and then when a new article is created just run the model on this new article only to get its topics. If we knew topic preferences for an employee (from previous emails) we could decide if the employee would be interested in this new article.

Once we have the data imported we can run the model:

mallet train-topics --input synopsis-input.mallet --num-topics 20 --output-state all-topic-state.gz --output-topic-keys topic_keys.txt --output-doc-topics topics_composition.txt --word-topic-counts-file word-topic.txt

The main input here is the number of topics. I've tried different values and finally settled on 20. The model produces a few output files. You can find a full description of each of them here.


Now that we have the results we can view the list of topics and the weight of each topic in the whole collection (how often each topic appears). Due to the large number of topics (20) it's hard to display all of them in a chart, so I've selected the top 7.

I've named each topic based on the words it was made of. The ??? topic is a rather vague one, so I couldn't find a proper name for it.

For each of the 20 topics we can also see the words that make up the topic and the weight of each word in it. Again, displaying all of them in a graph is quite hard, so I've selected the top 4 only:

Also, it would be interesting to see whether any topic's weight shows a seasonal pattern. We've selected the 'Health' topic for that. This topic is made of the following words: spread virus flu website nhs germs information irish tissues hands wash symptoms cover surfaces regularly sanitizers worst home viruses cold. This looks like content warning employees about seasonal health problems, like the flu.

As expected, this is quite seasonal and shows up mostly in the cold season (tempting as it may be, one should not take this chart as proof of the existence of Irish seasons).

Topic weight - click rate correlation

Now, let's get to analyzing a possible correlation between the topics and the click rate. To calculate that we're going to use R, a free software environment for statistical analysis.

First, I've imported the generated topic model data into R

topics <- read.csv("/work/research/topics_composition.txt",header=TRUE)

Then I got greedy and tried to visually spot a correlation between the click rate and the topic weight in articles. It didn't give any results because, as expected, the correlation is quite low.

Afterwards I calculated the correlation coefficient between each topic weight in the article and the click rate.


The correlation coefficient can go between -1 and 1, where -1 means that the variables are negatively linearly related, 1 means that they are positively linearly related and 0 means that there's no relation at all. After visually analyzing the data I wasn't expecting any values above 0.5 or below -0.5, but I was still curious to see the results and whether they match the expectations dictated by common sense.
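The coefficient described above is the Pearson correlation, which is what R's cor() computes. A minimal sketch, applied to made-up topic-weight and click-rate values (the numbers are invented, not taken from our data):

```ruby
# Pearson correlation: covariance divided by the product of the
# standard deviations of the two variables
def pearson(xs, ys)
  n  = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# Made-up per-article topic weights and relative click rates
topic_weight = [0.10, 0.30, 0.05, 0.60, 0.20]
click_rate   = [1.1, 0.9, 1.0, 1.2, 0.8]

puts pearson(topic_weight, click_rate).round(2)  # 0.44
```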

If you expect to find a very strong correlation between the topics and the click rate you'll probably be more disappointed than my parents. Remember that the number of topics is an arbitrary choice in topic modeling, and also that an employee looks at an internal email as an opportunity to get up to date with everything that happens in the company, so they'll give attention to all the articles.

Most popular topic

  1. with a correlation coefficient of 0.15, it's the Sales topic, made from the following words: team sales wins year <sales person #1> read good <sales person #2> <sales person #3> weeks <sales person #4> <sales person #5> <sales person #6> strong significant number financial <sales person #7>

Most 'unpopular' topic

  1. with a correlation coefficient of -0.15, it's the Release topic, made of the following words:
    release sunday iteration planned enterprise codenamed module users customers work email features brings subscribers improvements scheduled account code-named beta introduction

The low values of the correlation coefficients indicate that there's very little correlation (if any) between the topics included in an article and its click rate. However, it's clear that there's still a difference between the two main topics (Sales vs Release), and this information can be considered a starting point for further research. One easy explanation is that the Release information is already available to employees from other channels, which decreases the click rate for this topic, whereas Sales is usually brand-new information for most employees. This shows that, apart from knowing how to run a statistical model and interpret its results, it's even more important to know the business environment you're analyzing.


Topic modeling is an easy way to go through a large collection of documents and figure out what it's all about. I found it great fun to research from a theoretical perspective and to try on our own content. A few warnings, though:

  1. Probabilistic nature. It can easily be overlooked, and one can mistake the model's results for the ultimate truth.
  2. Human factor. Due to the need to arbitrarily set the number of topics (and interpret them) there's still a need for human supervision. There are methods to do this automatically: split the dataset in two parts, a training set and a test set; run the model on the training set and then evaluate its performance against the test data. Model performance can be measured using perplexity. However, some researchers warn that even when using automatic methods, human supervision is still needed.
  3. High expectations. I found some topics rather vague. This might be normal, but I blame it on my lack of experience with Mallet's configuration parameters and the low volume of data. Just be prepared for it.