Topic modeling is an unsupervised approach for finding sets of words, called “topics”, in a collection of text documents. Each topic consists of words that frequently occur together and usually share a common theme, so these topics and their associated words can be used as factors that best describe the documents.
Topic modeling provides us with methods to organize, understand and summarize large collections of text data.
There are many approaches for obtaining topics from text documents. In this post, I will explain one of the most widely used topic models: Latent Dirichlet Allocation (LDA).
What is LDA?
Latent Dirichlet Allocation (LDA) is an example of a topic model, where each document is considered a mixture of topics and each word in the document is attributed to one of those topics.
So, given a set of documents, LDA essentially clusters the words in them into topics, where each topic contains a set of words that best describe it.
For example, consider the following product reviews:
Review 1: A Five Star Book: I just finished reading. I expected an average romance read, but instead I found one of my favorite books of all time. If you are a lover of romance novels then this is a must read.
Review 2: Delicious cookie mix: This is the first time I have ever tried baking with a cookie mix. Mixing up the dough can get VERY messy. However, with a cookie mix like this you have a lot of flexibility in the ratio of ingredients (I like to add some extra butter) and was able to make no mess super delicious cookies.
Review 3: A fascinating insight into the life of modern Japanese teens: I thoroughly enjoyed reading this book. Steven Wardell is clearly a talented young author, adopted for some of his schooling into this family of four teens, and thus able to view family life in Japan from the inside out. A great read!
In this case LDA considers each review as a document and finds the topics corresponding to these documents. Each topic group contains a set of words along with their percentage contribution to the topic.
For the above reviews, the results of LDA might look like:
Topic 1: 40% books, 30% read, 20% romance
Topic 2: 45% japan, 30% read, 20% author
Topic 3: 30% cookie, 30% mix, 20% delicious
From the above, we could interpret that Topic 3 corresponds to Review 2, Topic 1 to Review 1, and Topic 2 to Review 3.
To get a better understanding, let me explain this by implementing LDA in Python.
Implementing LDA in Python
The following are the steps to implement LDA in Python.
- Import the dataset
- Preprocess the text data
- Create the Gensim dictionary and corpus
- Build the topic model
- Analyze the results
- Find the dominant topic within documents
Import the dataset:
Here we will be using an Amazon reviews dataset, which contains customer reviews of different Amazon products.
import pandas as pd
import numpy as np

# read the csv file with amazon reviews
reviews_df = pd.read_csv('reviews.csv', error_bad_lines=False)
reviews_df['Reviews'] = reviews_df['Reviews'].astype(str)
reviews_df.head(6)

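One caveat about the read_csv call above: the error_bad_lines parameter was deprecated in pandas 1.3 and removed in pandas 2.0. If you are running a recent pandas version, the equivalent call would be:

# on pandas >= 1.3, use on_bad_lines instead of error_bad_lines
reviews_df = pd.read_csv('reviews.csv', on_bad_lines='skip')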
Preprocess the text data:
Importing text preprocessing libraries
# text processing
import re
import string
import nltk
from gensim import corpora, models, similarities
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
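If NLTK's tokenizer or stop word list has not been used on your machine before, the corresponding data files need to be downloaded once:

import nltk
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('stopwords')  # English stop word list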
Here we are using three functions to preprocess the text data.
The initial_clean() function performs an initial clean-up by removing punctuation and non-letter characters, lowercasing the text, and tokenizing it.
def initial_clean(text):
    """
    Clean text: keep only letters, lowercase, and tokenize
    """
    text = re.sub("[^a-zA-Z ]", "", text)  # remove punctuation and digits
    text = text.lower()                     # lowercase text
    text = nltk.word_tokenize(text)         # split into word tokens
    return text
The text is also tokenized, i.e., split into individual words. For example, consider the text below.
text = "All work and no play makes jack a dull boy, all work and no play"
The cleaned and tokenized output would be:

initial_clean(text)
['all', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', 'all', 'work', 'and', 'no', 'play']
The next function, remove_stop_words(), removes all the stop words from the text data. Stop words are the most commonly used words in the English language, such as “the”, “an”, “is”, etc.
It is common to remove these stop words from text data, as they act as noise or distracting features in text algorithms.
stop_words = stopwords.words('english')
stop_words.extend(['news', 'say', 'use', 'not', 'would', 'say', 'could', '_', 'be', 'know',
                   'good', 'go', 'get', 'do', 'took', 'time', 'year', 'done', 'try', 'many',
                   'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy', 'easily', 'lot',
                   'lack', 'make', 'want', 'seem', 'run', 'need', 'even', 'right', 'line',
                   'even', 'also', 'may', 'take', 'come', 'new', 'said', 'like', 'people'])

def remove_stop_words(text):
    return [word for word in text if word not in stop_words]
The next function, stem_words(), stems words to their base forms to reduce the number of variant forms.
For example, the sentence “obesity causes many problems” would be stemmed to “obes caus mani problem”. Here we use the Porter stemming algorithm.
stemmer = PorterStemmer()

def stem_words(text):
    """
    Stem words to their root forms
    """
    try:
        text = [stemmer.stem(word) for word in text]
        text = [word for word in text if len(word) > 1]  # drop single-letter words
    except IndexError:
        pass
    return text
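As a quick check, running the example sentence from above through the cleaning and stemming functions reproduces the stemmed form:

print(stem_words(initial_clean("obesity causes many problems")))
# ['obes', 'caus', 'mani', 'problem']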
We apply all of the above preprocessing steps with an apply_all() function.
def apply_all(text):
    """
    Apply all the preprocessing functions above in one pass
    """
    return stem_words(remove_stop_words(initial_clean(text)))
# clean reviews and create the new column "tokenized_reviews"
import time

t1 = time.time()
reviews_df['tokenized_reviews'] = reviews_df['Reviews'].apply(apply_all)
t2 = time.time()
print("Time to clean and tokenize", len(reviews_df), "reviews:", (t2 - t1) / 60, "min")
# Time to clean and tokenize 3209 reviews: 0.21254388093948365 min
The new cleaned and tokenized data looks like this:
[Screenshot: reviews dataframe with the new tokenized_reviews column]
Create Gensim Dictionary and Corpus:
Importing the LDA and gensim libraries
# LDA
import gensim
import pyLDAvis.gensim
To perform topic modeling using LDA, the two main inputs are the dictionary (id2word) and the corpus. Here we use the gensim library to build both.
In gensim, words are referred to as “tokens” and the index of each word in the dictionary is called its “id”. The dictionary is the collection of unique tokens and their ids, and the corpus maps each document to a list of (word_id, word_frequency) pairs. Let's create them as below.
# Create a Gensim dictionary from the tokenized data
tokenized = reviews_df['tokenized_reviews']

# Create a term dictionary of the corpus, where each unique term is assigned an index
dictionary = corpora.Dictionary(tokenized)

# Filter out terms that occur in fewer than 1 review or in more than 80% of the reviews
dictionary.filter_extremes(no_below=1, no_above=0.8)

# Convert the tokenized reviews to a bag-of-words corpus
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
print(corpus[:1])
Below is the output:

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1)]]
In the corpus output above, (0, 1) means that the word with id 0 occurred once in the first document, (1, 1) means that the word with id 1 occurred once, and so on. The corpus simply maps word ids to their frequency of occurrence in each document.
Let's decode the corpus back to the words and their frequencies using the code below.
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]
The output is:
[[('big', 1), ('comfort', 1), ('definit', 1), ('instead', 1), ('kindl', 1), ('palm', 1), ('paper', 1), ('paperwhit', 1), ('read', 1), ('recommend', 1), ('regular', 1), ('small', 2), ('thought', 1), ('turn', 1)]]
The corpus created above is also called the document-term matrix and is given as input to the LDA topic model.
Building the Topic Model:
# LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=7, id2word=dictionary, passes=15)
ldamodel.save('model_combined.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
Here num_topics is the number of topics to be created, and passes is the number of times the model iterates over the entire corpus during training.
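Note that num_topics is a hyperparameter rather than something the model learns. One common way to choose it (not part of the original run above) is to train models with a few candidate values and compare their topic coherence. A minimal sketch using gensim's CoherenceModel, reusing the tokenized, dictionary, and corpus variables from earlier:

from gensim.models import CoherenceModel

# try a few topic counts and compare c_v coherence (higher is better)
for k in [5, 7, 10]:
    model_k = gensim.models.ldamodel.LdaModel(corpus, num_topics=k,
                                              id2word=dictionary, passes=15)
    coherence = CoherenceModel(model=model_k, texts=tokenized,
                               dictionary=dictionary, coherence='c_v').get_coherence()
    print(k, coherence)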
Analyze the results:
The LDA algorithm produces two matrices: a document-topic matrix and a topic-word matrix.
The topic-word matrix contains the probability distribution over words for each topic. Running the LDA algorithm on the above data produces the output below.
(0, '0.046*"echo" + 0.033*"alexa" + 0.026*"show" + 0.025*"music"')
(1, '0.049*"read" + 0.047*"book" + 0.040*"kindl" + 0.029*"love"')
(2, '0.042*"kid" + 0.023*"great" + 0.018*"tablet" + 0.014*"set"')
(3, '0.025*"work" + 0.024*"great" + 0.023*"amazon" + 0.022*"app"')
(4, '0.029*"kindl" + 0.017*"read" + 0.016*"one" + 0.015*"screen"')
(5, '0.107*"love" + 0.065*"bought" + 0.040*"gift" + 0.038*"one"')
(6, '0.088*"tablet" + 0.051*"great" + 0.031*"price" + 0.026*"fire"')
This output shows the topic-word matrix for the 7 topics created, with the 4 words that best describe each topic. From it, we can see that each topic's words revolve around a common theme (for example, topic 0 relates to Alexa and Echo devices playing music, while topic 1 is about reading books on the Amazon Kindle).
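If four words per topic feels too terse, the full word distribution of any topic can be inspected directly from the model; for example, the ten most probable words of topic 0:

# top 10 (word, probability) pairs for topic 0
for word, prob in ldamodel.show_topic(0, topn=10):
    print(f"{word}: {prob:.3f}")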
The document-topic matrix contains the probability distribution of the topics present in each document. Now, let's use it to find the topic distribution of a single document.
get_document_topics = ldamodel.get_document_topics(corpus[0])
print(get_document_topics)
Running the above code on the first review, shown below:
“I thought it would be as big as small paper but turn out to be just like my palm. I think it is too small to read on it… not very comfortable as regular Kindle. Would definitely recommend a paperwhite instead.”
The topic proportions produced are:

[(4, 0.94627726)]
It is evident from the output that this review, which talks about the readability of Kindle screens, is about 95% associated with topic 4 (0.029*"kindl" + 0.017*"read" + 0.016*"one" + 0.015*"screen"), which seems quite accurate.
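The same lookup works for a review the model has never seen, as long as it goes through the same preprocessing pipeline and the same dictionary. A small sketch (the review text here is made up for illustration):

new_review = "The tablet is great for kids and the battery lasts long"  # hypothetical review
bow = dictionary.doc2bow(apply_all(new_review))
print(ldamodel.get_document_topics(bow))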
Visualizing topics using pyLDAvis:
Using the pyLDAvis library, the topics created can be visualized as below.
# visualizing topics
lda_viz = gensim.models.ldamodel.LdaModel.load('model_combined.gensim')
lda_display = pyLDAvis.gensim.prepare(lda_viz, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)
[Screenshot: pyLDAvis interactive visualization of the topics]
The above display shows the relationships between the topics as well as the most relevant terms for the selected topic (topic 1 in this case).
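If you want to share the visualization outside a notebook, pyLDAvis can also write it to a standalone HTML file:

# save the interactive visualization as a standalone HTML page
pyLDAvis.save_html(lda_display, 'lda_topics.html')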
Dominant Topic within documents:
Now, to get a better idea and to verify our results, let's create a function called dominant_topic() that finds the most dominant topic for each review and displays it along with its topic proportion and keywords.
def dominant_topic(ldamodel, corpus, texts):
    # Function to find the dominant topic in each review
    sent_topics_df = pd.DataFrame()
    # Get the main topic in each review
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: x[1], reverse=True)
        # Get the dominant topic, percentage contribution and keywords for each review
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num, topn=4)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
df_dominant_topic = dominant_topic(ldamodel=ldamodel, corpus=corpus, texts=reviews_df['Reviews'])
df_dominant_topic.head()
[Output: dominant topic, percentage contribution, topic keywords, and review text for the first reviews]
From this output, it is clear that the topics assigned and their percentage contributions relate closely to the content of the reviews.
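As an additional sanity check, counting how many reviews fall under each dominant topic shows how the corpus is spread across the 7 topics:

# number of reviews assigned to each dominant topic
print(df_dominant_topic['Dominant_Topic'].value_counts())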
Conclusion:
To summarize, in this article we covered topic modeling with LDA: how it works, the steps involved in building an LDA topic model, visualizing the topics, and finding the dominant topic of each document.
I hope this article helped you get an overall idea of LDA topic modeling. Do let us know your comments and feedback below.