An introduction to using NLTK with Python


Natural language processing is a branch of machine learning that lets programs work with text written in everyday human language. Once processed, these texts become machine-readable, and you can run analysis algorithms on them as you please.

The logic behind this captivating technology may seem complex, but it isn't. Even with just a solid grasp of basic Python programming, you can build your own DIY text processor with the Natural Language Toolkit (NLTK).

Here’s how to get started with Python’s NLTK.

What is NLTK and how does it work?

Written in Python, NLTK offers a variety of string manipulation features. It is a versatile natural language library with a large repository of models for various natural language applications.

With NLTK, you can process raw texts and extract meaningful characteristics from them. It also offers text analysis models, feature-based grammars, and rich lexical resources to build a comprehensive language model.

How to configure NLTK

First, create a project root folder anywhere on your PC. To start using the NLTK library, open your terminal in the root folder you created earlier and create a virtual environment.
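If you're new to virtual environments, the steps above can be sketched like this (the folder name nltk-project is just an example):

```shell
# Create a project root folder and move into it:
mkdir nltk-project
cd nltk-project

# Create a virtual environment named "venv" inside it:
python3 -m venv venv

# Activate it (on Windows, use: venv\Scripts\activate):
. venv/bin/activate
```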

Then install the Natural Language Toolkit into this environment using pip:

pip install nltk

NLTK also offers a variety of datasets that serve as the basis for new natural language models. To access them, you need to launch NLTK's built-in data downloader.

So, once you have successfully installed NLTK, open your Python file using any code editor.

Then import the nltk module and instantiate the data downloader using the following code:

import nltk
nltk.download()

Running the above code brings up a graphical user interface for selecting and downloading data packages. Here, you'll need to choose a package and click the Download button to get it.

Any data package you download goes to the directory specified in the Download Directory field. You can change this if you like, but try to keep the default location.


Note: Data packages are added to system variables by default, so you can keep using them in future projects regardless of which Python environment you're working in.
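If you'd rather skip the GUI, you can also fetch individual packages from a script. A minimal sketch (the package names below are the ones this tutorial relies on):

```python
import nltk

# Download specific data packages without opening the GUI:
nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('wordnet')                     # lexical database
```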

How to use NLTK tokenizers

Out of the box, NLTK offers trained tokenization models for words and sentences. With these tools, you can generate a list of words from a sentence, or turn a paragraph into a meaningful array of sentences.

Here is an example of how to use the NLTK word_tokenize function:

import nltk
from nltk.tokenize import word_tokenize

word = "This is an example text"
tokenWord = word_tokenize(word)
print(tokenWord)

Output:
['This', 'is', 'an', 'example', 'text']

NLTK also ships with a pre-trained sentence tokenizer called PunktSentenceTokenizer. It works by breaking a paragraph up into a list of sentences.

Let's see how it works with a two-sentence paragraph:

import nltk
from nltk.tokenize import PunktSentenceTokenizer

sentence = "This is an example text. This is a tutorial for NLTK"
token = PunktSentenceTokenizer()
tokenized_sentence = token.tokenize(sentence)
print(tokenized_sentence)

Output:
['This is an example text.', 'This is a tutorial for NLTK']

You can further tokenize each sentence in the array generated from the above code using word_tokenize and a Python for loop.

Examples of using NLTK

While we can't demonstrate every possible use case of NLTK, here are a few examples of how you can start using it to solve real-world problems.

Get definitions of words and their parts of speech

NLTK offers models for determining parts of speech and obtaining the detailed semantics and possible contextual uses of various words.

You can use the wordnet module to generate the synset variants of a word, then determine their meanings and parts of speech.

For example, let's check the possible variants for "monkey":

import nltk
from nltk.corpus import wordnet as wn

print(wn.synsets('monkey'))

Output:
[Synset('monkey.n.01'), Synset('imp.n.02'), Synset('tamper.v.01'), Synset('putter.v.02')]

The above code outputs the possible related synsets and parts of speech for "monkey".

Now check the meaning of "monkey" using the definition method:

Monkey = wn.synset('monkey.n.01').definition()
print(Monkey)

Output:
any of various long-tailed primates (excluding the prosimians)

You can replace the string in parentheses with other generated alternatives to see what NLTK generates.

The pos_tag function, however, determines the parts of speech of words. You can use it with word_tokenize, or with PunktSentenceTokenizer() if you're dealing with longer paragraphs.

Here is how it works:

import nltk
from nltk.tokenize import word_tokenize, PunktSentenceTokenizer

word = "This is an example text. This is a tutorial on NLTK"
token = PunktSentenceTokenizer()
tokenized_sentence = token.tokenize(word)
for i in tokenized_sentence:
    tokenWordArray = word_tokenize(i)
    partsOfSpeech = nltk.pos_tag(tokenWordArray)
    print(partsOfSpeech)

Output:
[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('text', 'NN'), ('.', '.')]
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('tutorial', 'JJ'), ('on', 'IN'), ('NLTK', 'NNP')]

The code above pairs each tokenized word with its part-of-speech tag in a tuple. You can check the meaning of these tags in the Penn Treebank tag set.

For a cleaner result, you can remove the periods from the output using the replace() method:

for i in tokenized_sentence:
    tokenWordArray = word_tokenize(i.replace('.', ''))
    partsOfSpeech = nltk.pos_tag(tokenWordArray)
    print(partsOfSpeech)

Cleaner output:
[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('text', 'NN')]
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('tutorial', 'JJ'), ('on', 'IN'), ('NLTK', 'NNP')]

Extracting features from raw text is often tedious and time-consuming. But you can surface the strongest feature determinants in a text using NLTK's frequency distribution trend plot.

NLTK also integrates with matplotlib, which you can take advantage of to plot a specific trend in your data.

The code below, for example, compares sets of positive and negative words on a distribution plot using their last two letters:

from nltk import ConditionalFreqDist

# Lists of negative and positive words:
negatives = [
    'abnormal', 'abolish', 'abominable',
    'abominably', 'abominate', 'abomination'
]
positives = [
    'abound', 'abounds', 'abundance',
    'abundant', 'accessable', 'accessible'
]

# Divide the items in each array into labeled tuple pairs
# and combine both arrays:
pos_negData = ([("negative", neg) for neg in negatives]
               + [("positive", pos) for pos in positives])

# Extract the last two letters from the resulting array:
f = ((pos, i[-2:]) for (pos, i) in pos_negData)

# Create a distribution plot of these letter pairs:
cfd = ConditionalFreqDist(f)
cfd.plot()

The letter distribution plot looks like this:

[Image: NLTK letter distribution plot]

Looking closely at the graph, words ending in ce, ds, nd, and nt are more likely to be positive, while those ending in al, ly, te, and on are more likely to be negative.
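If you want the exact numbers behind the plot rather than the picture, you can read the counts straight from the ConditionalFreqDist; a minimal sketch using the same word lists:

```python
from nltk import ConditionalFreqDist

negatives = ['abnormal', 'abolish', 'abominable',
             'abominably', 'abominate', 'abomination']
positives = ['abound', 'abounds', 'abundance',
             'abundant', 'accessable', 'accessible']

# Build (label, last-two-letters) pairs and count them:
cfd = ConditionalFreqDist(
    (label, word[-2:])
    for label, words in [("negative", negatives), ("positive", positives)]
    for word in words
)

# Print the counts as a table instead of a plot:
cfd.tabulate()

# 'accessable' and 'accessible' both end in 'le':
print(cfd['positive']['le'])  # → 2
```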

Note: While we've used hand-made data here, you can access some of NLTK's built-in datasets through its corpus reader by calling them from the nltk.corpus module. You may want to read the corpus package documentation to see how you can use it.

With the emergence of technologies like Alexa, spam detection, chatbots, and sentiment analysis, natural language processing seems to be entering an exciting new phase. While we've only considered a few examples of what NLTK has to offer in this article, the tool has more advanced applications beyond the scope of this tutorial.

After reading this article, you should have a good idea of how to use NLTK at a basic level. All that's left is to put this knowledge into practice yourself!
