Topic modeling in Python

In this post, I will introduce you to topic modeling (also known as topic identification) in Python, which you can apply to any text you encounter in the wild.

Using basic NLP (Natural Language Processing) techniques, we will identify the topics of texts based on term frequencies.

We will learn a simple method, bag-of-words, and then use preprocessing techniques such as lemmatization to find the topic of a Wikipedia article.

Word counts with bag-of-words

Bag-of-words is a basic method for finding topics in a text.
First, we need to create tokens using tokenization and then count up all the tokens.

The more frequent a word, the more important it might be, so counting words can be a great way to determine the significant terms in a text.

Example
Text: “The cat is in the box. The cat likes the box. The box is over the cat.”

[Image: generating a bag-of-words Counter from the example text]
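A minimal sketch of this step, using NLTK’s word_tokenize and Python’s Counter (the code in the screenshot may differ slightly):

from collections import Counter
from nltk.tokenize import word_tokenize

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Split the text into individual word tokens
# (word_tokenize needs the NLTK 'punkt' data; see the SSL note below if the download fails)
tokens = word_tokenize(text)

# Count how many times each token appears
bow = Counter(tokens)

# The two most frequent tokens (ties keep first-seen order)
print(bow.most_common(2))   # [('The', 3), ('cat', 3)]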

You might get the error “[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)” when NLTK tries to download its tokenizer data after you import word_tokenize and Counter.

To resolve that error:

1. Go to the /Applications/Python 3.7 folder and run Install Certificates.command. This should resolve the SSL error.

[Image: the Install Certificates script in the Python 3 applications folder]

2. Alternatively, open a Python shell and run the commands given below.
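A commonly used workaround (a sketch of the usual fix, not necessarily the exact commands from the original screenshot) is to relax SSL verification just for the NLTK download:

import ssl
import nltk

# Fall back to an unverified SSL context so the NLTK downloader
# can fetch data despite the missing local issuer certificate
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass  # older Python builds without this attribute are unaffected
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download the tokenizer data that word_tokenize needs
nltk.download('punkt')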

Building a Counter with bag-of-words

Let’s build another bag-of-words counter using a Wikipedia article.

We will build the bag-of-words without looking at the full article text and try to guess what the topic is! If you’d like to peek, look at the title of the Wikipedia link here.

Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

[Image: the Wikipedia article on debugging, loaded as raw text]

[Image: bag-of-words Counter built from the raw article tokens]
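A minimal sketch of this step, assuming the raw article text has already been read into a string called article (the variable names here are my own):

from collections import Counter
from nltk.tokenize import word_tokenize

# Tokenize the raw article text and lowercase every token
tokens = word_tokenize(article)
lower_tokens = [t.lower() for t in tokens]

# Count the tokens and inspect the 10 most common ones
bow_simple = Counter(lower_tokens)
print(bow_simple.most_common(10))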

As you can see, tokenization and a plain Counter alone do not reveal the topic of the article: the most common tokens are dominated by stop words and punctuation.

Let’s do some preprocessing on the article to clean it up: removing stop words like ‘the’, punctuation characters like quotes, and so on.

Simple text preprocessing

Preprocessing produces better input data, which is useful when performing machine learning or other statistical methods.

Examples of preprocessing:

Tokenization to create a bag of words.

Lowercasing words.

Lemmatization/Stemming: shortening words to their root stems.

Removing stop words, punctuation, or unwanted tokens.

It is good to experiment with different approaches.

Text preprocessing for the initial example

Let’s take the sample text we used for generating the bag-of-words and apply these preprocessing techniques. Note the output now and compare it with the initial output.

The initial code gave us the output [('The', 3), ('cat', 3)], while the preprocessed version gives us [('cat', 3), ('box', 3)].

[Image: preprocessing the sample text before counting]
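A minimal sketch of that cleanup, using only a tiny stop-word set for this toy example (the full english_stops list appears later in the post):

from collections import Counter
from nltk.tokenize import word_tokenize

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# A tiny subset of stop words, enough for this example
stops = {'the', 'is', 'in', 'over'}

# Lowercase the tokens, keep only alphabetic ones, and drop stop words
tokens = [t.lower() for t in word_tokenize(text)]
alpha_only = [t for t in tokens if t.isalpha()]
no_stops = [t for t in alpha_only if t not in stops]

print(Counter(no_stops).most_common(2))   # [('cat', 3), ('box', 3)]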

Text preprocessing for the Wikipedia article

Now, let’s apply the preprocessing techniques you’ve learned to clean up the text of the Wikipedia article for better NLP results.

We’ll need to remove stop words and non-alphabetic tokens, lemmatize what remains, and build a new bag-of-words from the cleaned text.

Let’s use the tokens created in the last Wikipedia parsing example: lower_tokens.

We will import the WordNetLemmatizer class from nltk.stem. Then, we will create a list called alpha_only that iterates through lower_tokens and retains only alphabetic tokens. You can use the .isalpha() string method to check for this.

Create another list called no_stops in which you remove all stop words, which are held in a list called english_stops.

english_stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn', '']

Initialize a WordNetLemmatizer object called wordnet_lemmatizer and use its .lemmatize() method on the tokens in no_stops to create a new list called lemmatized.

Finally, create a new Counter called bow with the lemmatized words and show the 10 most common tokens.
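Putting those steps together, a sketch of the full pipeline (assuming lower_tokens from the earlier Wikipedia example and the english_stops list above are already defined):

from collections import Counter
from nltk.stem import WordNetLemmatizer

# Keep only alphabetic tokens
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove the stop words listed above
no_stops = [t for t in alpha_only if t not in english_stops]

# Reduce each remaining token to its lemma
# (requires the WordNet corpus: nltk.download('wordnet'))
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Build the bag-of-words from the cleaned tokens and show the 10 most common tokens
bow = Counter(lemmatized)
print(bow.most_common(10))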

Looking at the output of this code, the word with the most occurrences is ‘debugging’, and the Wikipedia article we used was indeed about debugging!

[Image: preprocessing the article with WordNetLemmatizer and the resulting Counter output]

That’s it for now on topic modeling. We will learn advanced concepts of data science in my live sessions.

Hope you liked this article! Feel free to share your suggestions on this topic modeling article in the comments section below.

 
