Natural Language Processing with Python

In this post, we will learn to perform natural language processing with Python.

Natural language processing, also called NLP, is the ability of a software program to understand human language.

NLP can be done in Python using libraries such as NLTK (the Natural Language Toolkit) and Gensim, which is one of the most commonly used NLP libraries.

We will learn to use Gensim dictionaries and the Tf-idf model.

Introduction to Gensim

Gensim is a popular open-source Natural Language Processing library.

It uses well-established academic models to perform complex tasks, like building document or word vectors, topic identification, and document comparison.

What is a document or word vector? Let's look at some examples.

A word embedding or vector is trained from a larger corpus and is a multi-dimensional representation of a word. You can think of it as a multi-dimensional array, often with sparse features, which means a lot of zeros and a few ones.

With these vectors, we can see relationships among words or documents, based on how near or far they are from each other. For example, we can see that the vector operation (king – queen) is approximately equal to (man – woman), or that Spain is to Madrid as Italy is to Rome.
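As a rough sketch of this idea, the snippet below loads a small set of pretrained GloVe vectors through Gensim's downloader API; the model name and the example words are arbitrary illustrative choices.

import gensim.downloader as api

# Load a small set of pretrained GloVe word vectors (downloads on first use)
word_vectors = api.load("glove-wiki-gigaword-50")

# Each word is represented as a 50-dimensional vector
print(word_vectors["king"].shape)

# The classic analogy: king - man + woman is closest to queen
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))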

Gensim allows you to build corpora or dictionaries using simple classes and functions.

A corpus (plural: corpora) is a set of texts useful for performing natural language processing tasks.

How to create a Gensim dictionary and corpus?

Take a look at an example of creating a Gensim dictionary. Here, our documents are a list of strings containing movie reviews about sci-fi films.

First, we need to do some basic pre-processing. For brevity, we will only lowercase and tokenize the documents.

Then we can pass the tokenized documents to Gensim's Dictionary class. This will create a mapping with an id for each token, and it is the beginning of our corpus!

We can now represent whole documents using the list of their token ids and how often each token appears in that document.

We can see tokens and their ids using the token2id attribute, which returns a dictionary of all tokens and their respective ids.
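A minimal sketch of these steps, using a few made-up movie-review strings as the documents, might look like this:

from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' data: nltk.download('punkt')
from gensim.corpora.dictionary import Dictionary

# Hypothetical sci-fi movie reviews standing in for our documents
my_documents = [
    "The movie was about a spaceship and aliens.",
    "I really liked the movie!",
    "Awesome action scenes, but boring characters.",
    "The movie was awful! I hate alien films.",
]

# Basic pre-processing: lowercase each document and tokenize it
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]

# Build the Gensim dictionary: a mapping from each token to an integer id
dictionary = Dictionary(tokenized_docs)

# token2id returns a dict of all tokens and their respective ids
print(dictionary.token2id)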

Using this dictionary, we can create a Gensim corpus. A Gensim corpus is a bit different from a normal corpus, which is usually just a collection of documents.

How to create a Gensim bag-of-words corpus?

Gensim uses a simple bag-of-words model, which transforms each document into a bag of words using the token ids and the frequency of each token in the document.

Here we can see that the Gensim corpus is a list of lists, each list item representing one document.

Each document is now a series of tuples: the first item in each tuple is the token id from the dictionary, and the second item is the token frequency in the document.

In only a few lines of code, we have a new bag of words model and corpus available, thanks to Gensim.

Unlike a plain Counter-based bag of words, this Gensim model can be easily saved, updated, and reused, thanks to the extra tools available in Gensim.
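Continuing the sketch above, the bag-of-words corpus might be built and saved like this (the file names are just placeholders):

from gensim.corpora import MmCorpus

# Create the Gensim bag-of-words corpus from the tokenized documents
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Each document is a list of (token_id, token_frequency) tuples
print(corpus[1])

# Unlike a plain Counter, the dictionary and corpus can be saved and reloaded later
dictionary.save("movie_reviews.dict")
MmCorpus.serialize("movie_reviews.mm", corpus)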

How to query a Gensim corpus?

It's time to apply the methods from the previous example to create another Gensim dictionary and corpus, but this time we will use articles from Wikipedia.

We'll use these data structures to investigate word trends and potentially interesting topics in our document set.

To get started, I have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation.

These were then stored in a list of document tokens called articles.


We’ll need to do some light preprocessing and then generate the Gensim dictionary and corpus.

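Assuming the preprocessed token lists are stored in a list called articles, as described above, generating the dictionary and corpus might look like this:

from gensim.corpora.dictionary import Dictionary

# 'articles' is a list of token lists, one per Wikipedia article,
# already lowercased with stop words and punctuation removed
dictionary = Dictionary(articles)
corpus = [dictionary.doc2bow(article) for article in articles]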

How to find the most common terms per document and across all documents using Gensim?

We will use the dictionary and corpus objects created in the previous example, as well as the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.

The fifth document from the corpus is stored in the variable doc. Sorting it by token frequency in descending order gives bow_doc.

We will print the top five words of bow_doc, using each word_id to look up the word in the dictionary alongside its word_count. The word for a given word_id can be accessed using the .get() method of the dictionary.

Now, let’s print the top 5 words across all documents alongside the count:

To do this, let's create a defaultdict called total_word_count, in which the keys are all the token ids (word_id) and the values are the sums of their occurrences across all documents (word_count).

Remember to specify int when creating the defaultdict, and, inside the for loop, increment the entry for each word_id in total_word_count by word_count.

We’ll create a sorted list from the defaultdict, using words across the entire corpus. To achieve this, use the .items() method on total_word_count inside sorted().

Similar to how we printed the top five words of bow_doc earlier, we will print the top five words of sorted_word_count as well as the number of occurrences of each word across all the documents.

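A sketch of the code for both scenarios, assuming the dictionary and corpus built from the Wikipedia articles above, might look like this:

from collections import defaultdict
import itertools

# Fifth document from the corpus, sorted by token frequency in descending order
doc = corpus[4]
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Top five words in this single document
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

# Sum each token's occurrences across all documents
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Top five words across the entire corpus
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)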

How to build a Tf-idf model with Gensim?

Tf-idf stands for term frequency–inverse document frequency. It is a commonly used natural language processing model that helps you determine the most important words in each document in the corpus.

The idea is that each corpus might have shared words beyond just stop words. These common words act like stop words and might need to be removed or down-weighted in importance.

For example, if I am an astronomer, “sky” might be used often but is not important, so I want to down-weight that word.

Tf-idf does precisely that. It will take texts that share a common language and ensure that the most common words across the entire corpus don't show up as keywords.

Tf-idf helps keep document-specific frequent words weighted high, and words that are common across the corpus weighted low.

Tf-idf formula

weight(i, j) = tf(i, j) × log( N / df(i) )

where tf(i, j) is the number of occurrences of term i in document j, df(i) is the number of documents containing term i, and N is the total number of documents.

Let's unpack this a bit.

The weight will be low if a term doesn't appear often in the document, because the tf term will then be low.

The weight will also be low if the logarithm is close to zero, meaning the expression inside it is close to one.

If the total number of documents divided by the number of documents containing the word is close to 1, the logarithm will be close to zero.

So words that occur across many or all documents will have a very low tf-idf weight. On the contrary, if a word occurs in only a few documents, the logarithm will return a higher number.
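As a quick check with made-up numbers: in a corpus of 100 documents, a word appearing in 90 of them contributes log(100/90) = log(1.11…), which is close to zero, so its weight is heavily discounted; a word appearing in only 2 documents contributes log(100/2) = log(50), a much larger value, so its tf-idf weight stays high.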

Process to build a Tf-idf model using Gensim

Let us build a Tf-idf model using the corpus of sci-fi movie reviews that we developed in the first example.

Simply pass the bag-of-words corpus when initializing the Tf-idf model.

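A minimal sketch of this step, reusing the corpus of sci-fi movie reviews built earlier, might look like this:

from gensim.models import TfidfModel

# Initialize the Tf-idf model with the bag-of-words corpus of movie reviews
tfidf = TfidfModel(corpus)

# Apply it to the second document to get (token_id, weight) tuples
tfidf_weights = tfidf[corpus[1]]
print(tfidf_weights)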

We can then look up each document by using it like a dictionary key with the new Tf-idf model.

For the second document in our corpus, we can see the token weights along with the token ids.

Notice there are some large differences. Token id 12 has a weight of 0.77, whereas tokens 5, 7, and 9 have weights below 0.18.

These weights can help you determine good topics in a corpus with a shared vocabulary!

Now let us determine new significant terms for the corpus generated from the Wikipedia articles by applying Gensim's Tf-idf model.

We will use the same dictionary, corpus, and doc objects that we created in the previous example from the Wikipedia articles.

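A sketch of this analysis, assuming the dictionary, corpus, and doc objects from the Wikipedia example are still available, might look like this:

from gensim.models import TfidfModel

# Build the Tf-idf model from the Wikipedia bag-of-words corpus
tfidf = TfidfModel(corpus)

# Weights for the document stored in doc
tfidf_weights = tfidf[doc]

# Print the top five weighted terms in this document
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)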

Does Tf-idf make for more interesting results on the document level?

Feel free to comment using the section below.
