Want to perform natural language processing tasks such as text prediction, handwriting recognition, or sentiment analysis of tweets and blogs in your GUI app? This post shows how to use the NLTK Python library through Python4Delphi in a Delphi/C++ Builder application and perform some basic NLP tasks. NLTK is a leading platform for building Python programs that work with human language data. Natural language processing (NLP for short) is used here in a wide sense to cover any kind of computer manipulation of natural language.
Python for Delphi (P4D) is a set of free components that wrap the Python DLL for Delphi and Lazarus (FPC). They let you easily execute Python scripts and create new Python modules and new Python types. You can use Python4Delphi in a number of different ways, such as:
- Create a Windows GUI around your existing Python app.
- Add Python scripting to your Delphi Windows apps.
- Add parallel processing to your Python apps through Delphi threads.
- Enhance your speed-sensitive Python apps with functions from Delphi for more speed.
Prerequisites.
- If Python and Python4Delphi are not installed on your machine, check this post on how to run a simple Python script in a Delphi application using the Python4Delphi sample app.
- Open the Windows command prompt and type pip install -U nltk to install NLTK. For more information on installing Python modules, check here.
- Run the Python interpreter and type the commands:
    >>> import nltk
    >>> nltk.download()
- First, run the Demo1 project, which executes Python scripts in Python for Delphi. Then load the NLTK sample script into the Memo1 field and press the Execute Script button to see the result. Clicking the Execute Script button runs the script lines with the code below. Go to GitHub to download the Demo1 source.
    procedure TForm1.Button1Click(Sender: TObject);
    begin
      PythonEngine1.ExecStrings( Memo1.Lines );
    end;
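Before loading the NLTK sample, it can help to confirm which interpreter the embedded engine is using and whether it can see the nltk package. The following is a minimal sketch that uses only the standard library; the function name environment_report is a hypothetical helper introduced for illustration and is not part of Python4Delphi or NLTK.

```python
import importlib.util
import sys


def environment_report(module_name="nltk"):
    """Return basic facts about the interpreter that is running this script."""
    # find_spec returns None when a top-level module cannot be located.
    spec = importlib.util.find_spec(module_name)
    return {
        "python_version": sys.version.split()[0],
        "executable": sys.executable,
        "module": module_name,
        "module_found": spec is not None,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Paste it into Memo1 and press Execute Script. If module_found is False, the interpreter loaded by the engine is probably not the one that pip installed NLTK into.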
Key NLP terminologies.
Token: Each linguistic unit in an input text, such as a word, punctuation mark, number, or alphanumeric string, is known as a token.
Sentence: An ordered sequence of tokens.
Tokenization: The process of splitting a sentence into its constituent tokens.
Corpus: A body of text, usually containing a large number of sentences.
Part-of-speech (POS) Tag: A word can be classified into one or more of a set of lexical or part-of-speech categories such as nouns, verbs, adjectives, and articles, to name a few. A POS tag is a symbol representing such a lexical category: NN (noun), VB (verb), JJ (adjective), AT (article).
Parse Tree: A tree defined over a given sentence that represents the syntactic structure of the sentence as defined by a formal grammar.
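To make these terms concrete, here is a small sketch that uses only the Python standard library. It is not NLTK's implementation: simple_tokenize is a plain regular-expression tokenizer, parse_tagged only imitates the word/TAG notation, and the parse tree is written by hand as nested tuples. All three helper names are hypothetical and introduced for illustration.

```python
import re

# Hypothetical helper pattern: runs of word characters, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


def parse_tagged(text, sep="/"):
    """Turn 'word/TAG' into a (word, TAG) tuple, splitting on the last separator."""
    if sep not in text:
        return (text, None)
    word, _, tag = text.rpartition(sep)
    return (word, tag.upper())


def leaves(tree):
    """Collect the tokens at the leaves of a (label, child, ...) tuple tree."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [leaf for child in children for leaf in leaves(child)]


tokens = simple_tokenize("Arthur felt good.")
tagged_token = parse_tagged("fly/NN")
# Hand-written tree for illustration only: a sentence made of a noun phrase
# and a verb phrase, with a POS tag directly above each word.
parse_tree = ("S", ("NP", ("NNP", "Arthur")), ("VP", ("VBD", "felt"), ("JJ", "good")))

print(tokens)
print(tagged_token)
print(leaves(parse_tree))
```

Reading the leaves of the tree from left to right gives back the tokens of the sentence, which is the link between a parse tree and tokenization.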
NLTK Python Library sample script details:
- How to tokenize the input text, get the part-of-speech tags for the tokens, and represent a tagged token.
- Extracting information from text, such as identifying named entities from the tagged tokens with a technique called chunking (which segments and labels multi-token sequences).
- Tagged words from the Brown corpus in NLTK.
- A simple classification task (choosing the correct class label for a given input, e.g., deciding whether a name is male or female). During training, a feature extractor is used to convert each input value to a feature set. These feature sets capture the basic information about each input that should be used to classify it. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model. During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels. For more details check here.
- Looking up synsets, lemma names, and definitions in WordNet.
- Display a parse tree from the treebank corpus.
    import random

    import nltk
    from nltk.corpus import treebank
    from nltk.corpus import names
    from nltk.corpus import wordnet as wn

    sentence = """At eight o'clock on Thursday morning. Arthur didn't feel very good."""

    # Tokenization
    tokens = nltk.word_tokenize(sentence)
    print(tokens)

    # POS tagging
    tagged = nltk.pos_tag(tokens)
    print(tagged[0:6])

    # Representing a tagged token
    tagged_token = nltk.tag.str2tuple('fly/NN')
    print(tagged_token)

    # Named entity chunking
    entities = nltk.chunk.ne_chunk(tagged)
    print(entities)

    # Tagged words from the Brown corpus
    print(nltk.corpus.brown.tagged_words())

    # Classification: gender identification
    def gender_features(word):
        return {'last_letter': word[-1]}

    print(gender_features('Shrek'))

    labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                     [(name, 'female') for name in names.words('female.txt')])
    random.shuffle(labeled_names)
    featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
    train_set, test_set = featuresets[500:], featuresets[:500]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(gender_features('Neo')))
    print(classifier.classify(gender_features('Trinity')))
    print(nltk.classify.accuracy(classifier, test_set))

    # WordNet
    print(wn.synsets('motorcar'))
    print(wn.synset('car.n.01').lemma_names())
    print(wn.synset('car.n.01').definition())

    # Display a parse tree from the treebank corpus
    t = treebank.parsed_sents('wsj_0001.mrg')[0]
    t.draw()  # opens a new window
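The train and predict flow described for the classification task can also be sketched without NLTK, which may make it easier to see what the feature extractor and the model each do. The following is a simplified Naive Bayes with add-one smoothing written against the standard library only; it is not how nltk.NaiveBayesClassifier is implemented. The train and classify helpers are hypothetical, and the eight names are made-up toy data, not the NLTK names corpus.

```python
import math
from collections import Counter, defaultdict


def gender_features(word):
    """Feature extractor from the sample script: the last letter of the name."""
    return {"last_letter": word[-1]}


def train(labeled_featuresets):
    """Count labels and (feature, value) pairs seen with each label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    vocabulary = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for item in features.items():
            value_counts[label][item] += 1
            vocabulary.add(item)
    return label_counts, value_counts, len(vocabulary)


def classify(model, features):
    """Return the label with the highest add-one-smoothed log probability."""
    label_counts, value_counts, vocab_size = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior for this label
        for item in features.items():
            seen = value_counts[label][item]
            score += math.log((seen + 1) / (count + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training data for illustration only.
toy_names = [("Anna", "female"), ("Maria", "female"), ("Sophia", "female"),
             ("Emma", "female"), ("John", "male"), ("Mark", "male"),
             ("Kevin", "male"), ("Peter", "male")]

# Training: the extractor turns each input into a feature set paired with its label.
featuresets = [(gender_features(name), label) for name, label in toy_names]
model = train(featuresets)

# Prediction: the same extractor is applied to unseen inputs.
print(classify(model, gender_features("Olivia")))
print(classify(model, gender_features("Martin")))
```

The important design point is that gender_features is called in both phases, so the model only ever sees feature sets and never the raw names.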
Note: The samples used for demonstration were taken from here; the only difference is that the outputs are printed. You can check the APIs and some more samples in the same place.
You have now read a quick overview of the NLTK library. Download the library from here and use it in your applications to access text corpora and lexical resources, process raw text, categorize and classify text, extract information from text, and more. Check out Python4Delphi and easily build Python GUIs for Windows using Delphi.