
Topic Modeling using Latent Dirichlet Allocation (LDA)


Topic Modeling is an unsupervised approach used for finding a set of words, called "topics", in a text document. These topics consist of words that frequently occur together and usually share a common theme, and hence these topics, with their associated sets of words, can be used to describe the entire document.

Topic modeling provides us with methods to organize, understand and summarize large collections of text data.

There are many approaches for obtaining topics from a text document. In this post, I will explain one of the most widely used topic models, Latent Dirichlet Allocation (LDA).

What is LDA?

Latent Dirichlet Allocation (LDA) is an example of a topic model in which each document is considered a mixture of topics and each word in the document is associated with one of those topics.

So, given a document, LDA essentially clusters its words into topics, where each topic contains a set of words that best describe it.

For example, consider the following product reviews:

Review 1: A Five Star Book: I just finished reading. I expected an average romance read, but instead I found one of my favorite books of all time. If you are a lover of romance novels then this is a must read.

Review 2: Delicious cookie mix: This is the first time I have ever tried baking with a cookie mix. Mixing up the dough can get VERY messy. However, with a cookie mix like this you have a lot of flexibility in the ratio of ingredients (I like to add some extra butter) and was able to make no mess super delicious cookies.

Review 3: A fascinating insight into the life of modern Japanese teens: I thoroughly enjoyed reading this book. Steven Wardell is clearly a talented young author, adopted for some of his schooling into this family of four teens, and thus able to view family life in Japan from the inside out. A great read!

In this case, LDA considers each review as a document and finds the topics corresponding to these documents. Each topic contains a set of words along with their percentage contributions to the topic.

In the case of the above reviews, the results of LDA might look like:

Topic 1: 40% books, 30% read, 20% romance

Topic 2: 45% japan, 30% read, 20% author

Topic 3: 30% cookie, 30% mix, 20% delicious

From the above, we could interpret that Topic 3 is related to Review 2, Topic 1 to Review 1, and Topic 2 to Review 3.

To get a much better understanding, let me explain this by implementing LDA in Python.

Implementing LDA in Python

The following are the steps to implement LDA in Python.

  1. Import the dataset
  2. Preprocess the text data
  3. Create the Gensim dictionary and corpus
  4. Build the topic model
  5. Analyze the results
  6. Find the dominant topic within documents

Import the dataset:

Here we will be using the Amazon reviews dataset, which contains customer reviews of different Amazon products.
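A minimal sketch of loading the data with pandas; the file name 'amazon_reviews.csv' is a hypothetical placeholder for your copy of the dataset:

```python
import pandas as pd

# Hypothetical file name; point this at your copy of the Amazon reviews dataset
df = pd.read_csv('amazon_reviews.csv')
print(df.head())
```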


Preprocess the text data:

Importing text preprocessing libraries
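A sketch of the imports assumed in the rest of this section, using NLTK for the stop word list and the stemmer:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')  # one-time download of NLTK's stop word list
```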

Here we are using three functions to preprocess the text data.

The initial_clean function performs an initial clean by removing punctuation, lowercasing the text, and so on.

The words are then tokenized, i.e. the text is split into its individual words.
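A minimal sketch of what initial_clean could look like (the exact cleaning rules are an assumption):

```python
def initial_clean(text):
    """Lowercase the text, strip punctuation and digits, and tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return text.split()

# e.g. initial_clean("A Five Star Book!") returns ['a', 'five', 'star', 'book']
```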

The next function, remove_stop_words(), removes all stop words from the text data. Stop words are the most commonly used words in the English language, such as "the", "an", "is", etc.

It is common to remove these stop words from text data as they can be considered noise or distracting features in text algorithms.
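A sketch of remove_stop_words() based on NLTK's English stop word list:

```python
stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    """Drop common English words such as 'the', 'an', 'is'."""
    return [word for word in tokens if word not in stop_words]
```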

The next function, stem_words(), stems words to their base forms to reduce the number of variant word forms.

For example, the sentence "obesity causes many problems" would be stemmed to "obes caus mani problem". Here we are using the Porter stemming algorithm to perform the stemming.
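A sketch of stem_words() using NLTK's Porter stemmer:

```python
stemmer = PorterStemmer()

def stem_words(tokens):
    """Reduce each word to its Porter stem, e.g. 'causes' -> 'caus'."""
    return [stemmer.stem(word) for word in tokens]
```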

Finally, the apply_all() function applies all of the above preprocessing steps to each review.
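A sketch of apply_all() chaining the three functions; the column name 'reviewText' is a hypothetical stand-in for wherever the raw review text lives in your dataframe:

```python
def apply_all(text):
    """Run the full preprocessing pipeline on one review."""
    return stem_words(remove_stop_words(initial_clean(text)))

# 'reviewText' is an assumed column name for the raw review text
df['tokenized'] = df['reviewText'].apply(apply_all)
```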

The new cleaned and tokenized data looks as below.

[Screenshot: sample of the cleaned and tokenized dataset]

Create Gensim Dictionary and Corpus:

Importing the gensim LDA libraries
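A sketch of the imports, assuming a standard gensim installation:

```python
import gensim
from gensim import corpora
```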

To perform topic modeling using LDA, the two main inputs are the dictionary (id2word) and the corpus. Here we are using the gensim library to build both.

In gensim, the words are referred to as "tokens" and the index of each word in the dictionary is called its "id". The dictionary is simply the collection of unique words and their ids, and the corpus is a mapping of (word_id, word_frequency) pairs. Let's create them as below.
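A minimal sketch, assuming the tokenized reviews from the preprocessing step are in df['tokenized']:

```python
# Dictionary: assigns a unique integer id to every token
dictionary = corpora.Dictionary(df['tokenized'])

# Corpus: each document as a bag of (word_id, word_frequency) pairs
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokenized']]
print(corpus[0])
```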

In the resulting corpus, a pair like (0, 1) means that the word with id 0 occurs once in the first document, and (1, 1) means that the word with id 1 also occurs once, and so on. The corpus simply maps word ids to their frequency of occurrence.

Let's make the corpus human-readable by mapping the word ids back to the words themselves using the code below.
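One way to do this is to index the dictionary by word id:

```python
# Map each (word_id, frequency) pair back to (word, frequency)
readable_corpus = [[(dictionary[word_id], freq) for word_id, freq in doc]
                   for doc in corpus]
print(readable_corpus[0])
```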

The corpus created this way is also called the document-term matrix and is given as input to the LDA topic model.

Building the Topic Model:
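A minimal sketch of training the model with gensim; num_topics=7 matches the results discussed below, while passes=10 is an assumed value:

```python
# Train the LDA topic model on the corpus
lda_model = gensim.models.LdaModel(corpus=corpus,
                                   id2word=dictionary,
                                   num_topics=7,
                                   passes=10)
```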

Here num_topics is the number of topics to be created and passes is the number of times to iterate through the entire corpus.

Analyze the results:

The LDA algorithm creates two matrices: the document-topic matrix and the topic-word matrix.

The topic-word matrix contains the probability distribution of the words generated within each topic. Running the LDA algorithm on the above data produces the outputs below.
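A sketch of inspecting the topic-word matrix with gensim's print_topics():

```python
# Show the top 4 words for each of the 7 topics
for topic_id, words in lda_model.print_topics(num_topics=7, num_words=4):
    print(topic_id, words)
```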

This output shows the topic-word matrix for the 7 topics created and the 4 words within each topic that best describe it. From the output we can see that each topic and its corresponding words revolve around a common theme (for example, Topic 1 relates to Alexa and the Echo's music, whereas Topic 2 is about reading books on the Amazon Kindle).

The document-topic matrix contains the probability distribution of the topics present in each document. Now, let's use it to find the topic distribution of a single review.
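A sketch using gensim's get_document_topics() to read off the topic proportions for a single document:

```python
# Topic proportions for the first document in the corpus
doc_topics = lda_model.get_document_topics(corpus[0])
print(doc_topics)
```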

Running the above code for the first review, shown below:

“I thought it would be as big as small paper but turn out to be just like my palm. I think it is too small to read on it… not very comfortable as regular Kindle. Would definitely recommend a paperwhite instead.”

The topic proportions produced are

It is evident from the output that this review, which is about the readability of the Kindle's screen, is 95% related to Topic 4 (0.029*"kindl" + 0.017*"read" + 0.016*"one" + 0.015*"screen"), which seems pretty accurate.

Visualizing topics using pyLDAvis:

Using the pyLDAvis library with our gensim model, the topics we created can be visualized as below.
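A sketch of preparing the visualization in a Jupyter notebook; note that the module path varies by pyLDAvis version (pyLDAvis.gensim in older releases, pyLDAvis.gensim_models in newer ones):

```python
import pyLDAvis
import pyLDAvis.gensim_models  # on older pyLDAvis versions: import pyLDAvis.gensim

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)  # renders the interactive plot in a notebook
```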

[pyLDAvis visualization: inter-topic distance map and the top relevant terms for the selected topic]

The above display shows how the topics relate to one another, as well as the most relevant terms for the selected topic (Topic 1 in this case).

Dominant Topic within documents:

Now, to get a better idea and to verify our results, let's create a function called dominant_topic() which finds the most dominant topic for each review and displays it along with its topic proportion and keywords.
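A sketch of what such a dominant_topic() helper could look like; the column names are illustrative, and 'reviewText' is the assumed raw-text column from earlier:

```python
def dominant_topic(ldamodel, corpus, texts):
    """Find the highest-proportion topic for each document."""
    rows = []
    for i, doc in enumerate(corpus):
        # Sort this document's topics by proportion, largest first
        topics = sorted(ldamodel.get_document_topics(doc),
                        key=lambda pair: pair[1], reverse=True)
        topic_num, proportion = topics[0]
        keywords = ", ".join(word for word, _ in
                             ldamodel.show_topic(topic_num, topn=4))
        rows.append((i, topic_num, round(proportion, 4), keywords, texts[i]))
    return pd.DataFrame(rows, columns=['Document', 'Dominant_Topic',
                                       'Topic_Proportion', 'Keywords', 'Text'])

df_dominant = dominant_topic(lda_model, corpus, list(df['reviewText']))
print(df_dominant.head())
```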

From the output it is clear that the topics created and their percentage contributions relate closely to the context of the reviews.

Conclusion:

So, to summarize: in this article we explained Topic Modeling using LDA, how it works, the steps involved in creating an LDA topic model, visualizing the topics, and finding the dominant topics.

Hope this article helped you get an overall idea of LDA topic modeling. Do let us know your comments and feedback below.
