Do you want to perform text mining or natural language processing tasks such as topic modeling or similarity queries in your GUI app? This post shows how to use the Gensim Python library through Python4Delphi (P4D) in a Delphi/C++Builder application to perform some interesting text mining tasks.
Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. As of 2018, Gensim had been used and cited in over 1400 commercial and academic applications, in a diverse array of disciplines from medicine to insurance claim analysis to patent search.
Gensim is implemented in Python and Cython. Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.
Design principles of Gensim:
- Practicality – As industry experts, the Gensim developers focus on proven, battle-hardened algorithms to solve real industry problems. More focus on engineering, less on academia.
- Memory independence – There is no need for the whole training corpus to reside fully in RAM at any one time. Gensim can process large, web-scale corpora using data streaming.
- Performance – Highly optimized implementations of popular vector space algorithms using C, BLAS and memory-mapping.
The Gensim project describes the library as, by now, the most robust, efficient and hassle-free piece of software for unsupervised semantic modeling from plain text.
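To make the memory-independence principle concrete, here is a minimal pure-Python sketch of a corpus that is streamed one document at a time. It is not Gensim's own code: the names `StreamedCorpus`, `tokenize`, and `open_source` are hypothetical, introduced only for this illustration, and an in-memory `io.StringIO` stands in for what would normally be a large file on disk.

```python
import io
from collections import Counter

# Same stop words as the similarity-query script later in this post
STOPLIST = frozenset("for a of the and to in".split())


def tokenize(line):
    """Lowercase a line, split on whitespace, and drop stop words."""
    return [word for word in line.lower().split() if word not in STOPLIST]


class StreamedCorpus:
    """Hypothetical corpus that yields one sparse vector per document.

    `open_source` is an assumed zero-argument callable returning a fresh
    iterable of lines (for example, a newly opened file), so the corpus
    can be iterated several times without keeping all documents in RAM.
    """

    def __init__(self, open_source):
        self.open_source = open_source
        self.token2id = {}
        # First pass: assign an integer id to every distinct token
        for line in self.open_source():
            for token in tokenize(line):
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        # Later passes: build each bag-of-words vector only when requested
        for line in self.open_source():
            counts = Counter(self.token2id[token] for token in tokenize(line))
            yield sorted(counts.items())


# An in-memory stand-in for a large text file with one document per line
text = "Human machine interface\nGraph minors A survey\nGraph of trees\n"
corpus = StreamedCorpus(lambda: io.StringIO(text))

for vector in corpus:
    print(vector)
```

Only one document's vector exists at any moment during iteration. Gensim follows the same idea: a corpus can be any iterable that yields sparse bag-of-words vectors, so it does not have to be a list held in memory.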
Hands-On
This post will guide you through performing similarity queries with Python’s Gensim and displaying the results in a Delphi Windows GUI app.
First, open and run our Python GUI using the Demo1 project from Python4Delphi with RAD Studio. Then insert the script into the lower Memo, click the Execute button, and get the result in the upper Memo. You can find the Demo1 source on GitHub. The behind-the-scenes details of how Delphi manages to run your Python code in this amazing Python GUI can be found at this link.
Let’s run a demo of the Gensim library: a similarity queries example. The following code is credited to Radim Řehůřek, the creator of Gensim (visit the original source here):
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # Creating the Corpus
    from collections import defaultdict
    from gensim import corpora

    documents = [
        "Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random binary unordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV Widths of trees and well quasi ordering",
        "Graph minors A survey",
    ]

    # Remove common words and tokenize
    stoplist = set('for a of the and to in'.split())
    texts = [
        [word for word in document.lower().split() if word not in stoplist]
        for document in documents
    ]

    # Remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [
        [token for token in text if frequency[token] > 1]
        for text in texts
    ]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Similarity interface
    from gensim import models
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

    doc = "Human computer interaction"
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]  # Convert the query to LSI space
    print(vec_lsi)

    # We will be considering `cosine similarity <http://en.wikipedia.org/wiki/Cosine_similarity>`_
    # to determine the similarity of two vectors.

    # Initializing query structures
    from gensim import similarities
    index = similarities.MatrixSimilarity(lsi[corpus])  # Transform corpus to LSI space and index it

    index.save('C:/Users/ASUS/deerwester.index')
    index = similarities.MatrixSimilarity.load('C:/Users/ASUS/deerwester.index')

    # Performing queries
    sims = index[vec_lsi]  # Perform a similarity query against the corpus
    print(list(enumerate(sims)))  # Print (document_number, document_similarity) 2-tuples

    # Cosine measure returns similarities in the range `<-1, 1>` (the greater, the more similar),
    # so that the first document has a score of 0.99809301 etc.

    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    for doc_position, doc_score in sims:
        print(doc_score, documents[doc_position])
The result in the Python GUI:
Congratulations, you have now learned how to perform similarity queries using Python’s Gensim and display the results in a Delphi Windows GUI app.
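If you are curious about what the cosine measure mentioned in the script's comments computes, the following pure-Python sketch may help. It is an illustration only, not Gensim's implementation: the function `cosine_similarity` and the toy vectors are hypothetical, and the vectors use the same sparse `(id, weight)` format that the script prints.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors of (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(index, 0.0) for index, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen for this sketch: empty vectors match nothing
    return dot / (norm_a * norm_b)


# Toy query and candidate vectors (made up for illustration)
query = [(0, 1.0), (1, 1.0)]
candidates = {
    "same direction": [(0, 2.0), (1, 2.0)],
    "orthogonal": [(2, 3.0)],
    "opposite": [(0, -1.0), (1, -1.0)],
}

# Rank candidates from most to least similar, as the script does with `sims`
ranked = sorted(
    ((name, cosine_similarity(query, vector)) for name, vector in candidates.items()),
    key=lambda item: -item[1],
)
for name, score in ranked:
    print(name, score)
```

The scores fall in the range from -1 to 1: a vector pointing the same way as the query scores close to 1, an unrelated one scores 0, and an opposite one scores close to -1. Vector length does not matter, only direction.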
Check out the Gensim library for Python and use it in your projects: https://pypi.org/project/gensim/
Also check out Python4Delphi, which makes it easy to build Python GUIs for Windows using Delphi: https://github.com/pyscripter/python4delphi