Do you want to train large-scale semantic NLP models in your Delphi GUI app? This post will help you understand how to use the Gensim Python library in a Delphi/C++ application using Python4Delphi, and will walk through the core concepts of Gensim – a superfast, proven, data-streaming, platform-independent library that also offers pretrained models for specific domains such as legal or health.
Python for Delphi (P4D) is a set of free components that wrap the Python DLL for Delphi. They let you easily execute Python scripts and create new Python modules and types. You can use Python4Delphi in a number of different ways, such as:
- Create a Windows GUI around your existing Python app.
- Add Python scripting to your Delphi Windows apps.
- Add parallel processing to your Python apps through Delphi threads.
- Enhance your speed-sensitive Python apps with faster functions written in Delphi.
Prerequisites:
- If Python and Python4Delphi are not installed on your machine, check this guide on how to run a simple Python script in a Delphi application using the Python4Delphi sample app.
- Open the Windows command prompt and type pip install -U gensim to install Gensim. For more information on installing Python modules, check here.
- First, run the Demo1 project for executing Python scripts in Python for Delphi. Then load the Gensim sample script into the Memo1 field and press the Execute Script button to see the result. When the Execute Script button is clicked, the script strings are executed by the code below. Go to GitHub to download the Demo1 source.
```delphi
procedure TForm1.Button1Click(Sender: TObject);
begin
  PythonEngine1.ExecStrings( Memo1.Lines );
end;
```
Gensim core concepts:
- Document: an object of the text sequence type (commonly known as str in Python 3). A document could be anything from a short 140-character tweet to a single paragraph (e.g., a journal article abstract), a news article, or a book.
- Corpus: a collection of documents. A corpus serves two purposes:
- Input for training a Model. During training, the models use this training corpus to look for common themes and topics, initializing their internal model parameters.
- Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for Similarity Queries, queried by semantic similarity, clustered, etc.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm for transforming vectors from one representation to another.
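To make these four concepts concrete, here is a minimal, Gensim-free sketch (an illustrative example, not taken from the Gensim docs) that maps each concept to a few lines of plain Python:

```python
corpus = [                        # Corpus: a collection of documents
    "human computer interface",   # Document: a plain text string
    "computer system response",
]

# Shared vocabulary over the whole corpus
vocab = sorted({word for doc in corpus for word in doc.split()})

def to_vector(doc):
    # Vector: a bag-of-words count vector over the shared vocabulary
    words = doc.split()
    return [words.count(term) for term in vocab]

# Model: any transformation from one vector representation to another --
# here, a trivial normalization by document length
def normalize(vec):
    total = sum(vec) or 1
    return [count / total for count in vec]

vectors = [to_vector(doc) for doc in corpus]
print(vocab)     # ['computer', 'human', 'interface', 'response', 'system']
print(vectors)   # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
print(normalize(vectors[0]))
```

Gensim's Dictionary and doc2bow play the roles of vocab and to_vector here, but with sparse vectors and streaming-friendly data structures.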
Gensim Python library sample script details: the sample script shows how the core concepts are implemented for a simple corpus.
- The corpus consists of 9 documents, each a single-sentence string.
- Create a set of frequent stopwords, then lowercase each document, split it on whitespace, and filter out the stopwords.
- Count word frequencies and keep only the words that occur more than once.
- Build a dictionary with corpora.Dictionary and print each token and its id by calling token2id.
- Create the bag-of-words representation for a new document using doc2bow, and convert the entire original corpus to a list of vectors.
- Train a tf-idf model, which transforms vectors from the bag-of-words representation to a vector space where frequency counts are weighted by the relative rarity of each word in the corpus.
- Transform the "system minors" string to a bag-of-words vector with doc2bow and apply the trained model to it.
- Transform the whole corpus via TfIdf and index it, in preparation for similarity queries.
```python
import pprint
from gensim import corpora
from gensim import models
from gensim import similarities

# Corpus - 9 documents, where each document is a string consisting of a single sentence.
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1]
                    for text in texts]
pprint.pprint(processed_corpus)

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
pprint.pprint(dictionary.token2id)

# Create the bag-of-words representation for a document using doc2bow
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

# Convert our entire original corpus to a list of vectors:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

# Model tf-idf - transforms vectors from the bag-of-words representation to a
# vector space where the frequency counts are weighted according to the
# relative rarity of each word in the corpus.
# Initialize the tf-idf model by training it on our corpus:
tfidf = models.TfidfModel(bow_corpus)

# Transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

# Transform the whole corpus via TfIdf and index it, in preparation for
# similarity queries:
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

# Query the similarity of our query document ``query_document`` against
# every document in the corpus:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

# Document 3 has a similarity score of 0.718 (72%), document 2 has a
# similarity score of 42%, etc.
# We can make this slightly more readable by sorting:
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)
```
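To see what the tf-idf step in the script is actually computing, here is a hand-computed sketch of Gensim's default weighting scheme (raw term frequency multiplied by log2(N/df), followed by L2 normalization). The toy bag-of-words corpus below is made up for illustration and is not the one from the script:

```python
import math

# Toy bag-of-words corpus: 3 documents over a 3-term vocabulary (term ids 0-2),
# each document a list of (term_id, count) pairs, like doc2bow output.
bow_corpus = [
    [(0, 1), (1, 1)],   # doc 0 contains terms 0 and 1 once each
    [(0, 1), (2, 2)],   # doc 1 contains term 0 once and term 2 twice
    [(1, 1)],           # doc 2 contains term 1 once
]
num_docs = len(bow_corpus)

# Document frequency: in how many documents does each term appear?
df = {}
for doc in bow_corpus:
    for term_id, _count in doc:
        df[term_id] = df.get(term_id, 0) + 1

def tfidf(doc):
    # weight = tf * log2(N / df), then L2-normalize the document vector
    weights = [(t, c * math.log2(num_docs / df[t])) for t, c in doc]
    norm = math.sqrt(sum(w * w for _, w in weights)) or 1.0
    return [(t, w / norm) for t, w in weights]

# Terms 0 and 1 each appear in 2 of 3 documents, so doc 0's two weights are
# equal and normalize to 1/sqrt(2) each.
print(tfidf(bow_corpus[0]))
```

One detail this sketch glosses over: Gensim's TfidfModel also drops terms whose weight comes out as zero (terms appearing in every document), keeping the vectors sparse.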
Note: The samples used for demonstration were picked from here, the only difference being that the outputs are printed. You can check the APIs and some more samples in the same place.
Now that you have read this quick overview of the Gensim library, download it from here and perform NLP tasks quickly with the help of models such as Word2Vec, the Latent Dirichlet Allocation model, FastText, and more. Check out Python4Delphi, which makes it easy to build Python GUIs for Windows using Delphi.