Do you want to train large-scale semantic NLP models in your Delphi GUI app? This post will help you understand how to use the Gensim Python library in a Delphi/C++ application using Python4Delphi, and will walk through the core concepts of Gensim – a superfast, proven, data-streaming, platform-independent library that also offers pretrained models for specific domains such as legal or health.
Python for Delphi (P4D) is a set of free components that wrap the Python DLL for Delphi. They let you easily execute Python scripts and create new Python modules and types. You can use Python4Delphi in a number of different ways, such as:
- Create a Windows GUI around your existing Python app.
- Add Python scripting to your Delphi Windows apps.
- Add parallel processing to your Python apps through Delphi threads.
- Enhance your speed-sensitive Python apps with faster functions written in Delphi.
Prerequisites:
- If Python and Python4Delphi are not installed on your machine, check this guide on how to run a simple Python script in a Delphi application using the Python4Delphi sample app.
- Open the Windows command prompt and type pip install -U gensim to install Gensim. For more information on installing Python modules, check here.
- First, run the Demo1 project for executing Python scripts in Python for Delphi. Then load the Gensim sample script into the Memo1 field and press the Execute Script button to see the result. When the Execute Script button is clicked, the script strings are executed by the code below. Go to GitHub to download the Demo1 source.
```delphi
procedure TForm1.Button1Click(Sender: TObject);
begin
  PythonEngine1.ExecStrings( Memo1.Lines );
end;
```
Gensim core concepts:
- Document: an object of the text sequence type (commonly known as str in Python 3). A document could be anything from a short 140-character tweet to a single paragraph (e.g., a journal article abstract), a news article, or a book.
- Corpus: a collection of documents. A corpus serves two purposes:
- Input for training a Model. During training, the models use this training corpus to look for common themes and topics, initializing their internal model parameters.
- Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for Similarity Queries, queried by semantic similarity, clustered, etc.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm for transforming vectors from one representation to another.
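To make these four concepts concrete, here is a minimal, Gensim-free sketch (an illustrative example, not taken from the Gensim docs) that maps each concept to a few lines of plain Python:

```python
corpus = [                        # Corpus: a collection of documents
    "human computer interface",   # Document: a plain text string
    "computer system response",
]

# Shared vocabulary over the whole corpus
vocab = sorted({word for doc in corpus for word in doc.split()})

def to_vector(doc):
    # Vector: a bag-of-words count vector over the shared vocabulary
    words = doc.split()
    return [words.count(term) for term in vocab]

# Model: any transformation from one vector representation to another --
# here, a trivial normalization by document length
def normalize(vec):
    total = sum(vec) or 1
    return [count / total for count in vec]

vectors = [to_vector(doc) for doc in corpus]
print(vocab)     # ['computer', 'human', 'interface', 'response', 'system']
print(vectors)   # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
print(normalize(vectors[0]))
```

Gensim's Dictionary and doc2bow play the roles of vocab and to_vector here, but with sparse vectors and streaming-friendly data structures.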
Gensim Python library sample script details: the sample script shows how the core concepts are implemented for a simple corpus.
- The corpus consists of 9 documents, each a single-sentence string.
- Create a set of frequent stopwords, then lowercase each document, split it on whitespace, and filter out the stopwords.
- Count word frequencies and keep only the words that occur more than once.
- Build a dictionary with corpora.Dictionary and print each token and its id by calling token2id.
- Create the bag-of-words representation for a new document using doc2bow, and convert the entire original corpus to a list of vectors.
- Train a tf-idf model, which transforms vectors from the bag-of-words representation to a vector space where frequency counts are weighted by the relative rarity of each word in the corpus.
- Transform the "system minors" string to a bag-of-words vector with doc2bow and apply the trained model to it.
- Transform the whole corpus via TfIdf and index it, in preparation for similarity queries.
```python
import pprint
from gensim import corpora
from gensim import models
from gensim import similarities

# Corpus - 9 documents, where each document is a string consisting of a single sentence.
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1]
                    for text in texts]
pprint.pprint(processed_corpus)

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
pprint.pprint(dictionary.token2id)

# Create the bag-of-words representation for a document using doc2bow
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

# Convert our entire original corpus to a list of vectors:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)

# Model tf-idf - transforms vectors from the bag-of-words representation to a
# vector space where the frequency counts are weighted according to the
# relative rarity of each word in the corpus.
# Initialize the tf-idf model by training it on our corpus:
tfidf = models.TfidfModel(bow_corpus)

# Transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

# Transform the whole corpus via TfIdf and index it, in preparation for
# similarity queries:
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

# Query the similarity of our query document ``query_document`` against
# every document in the corpus:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

# Document 3 has a similarity score of 0.718 (72%), document 2 has a
# similarity score of 42%, etc.
# We can make this slightly more readable by sorting:
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)
```
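To see what the tf-idf step in the script is actually computing, here is a hand-computed sketch of Gensim's default weighting scheme (raw term frequency multiplied by log2(N/df), followed by L2 normalization). The toy bag-of-words corpus below is made up for illustration and is not the one from the script:

```python
import math

# Toy bag-of-words corpus: 3 documents over a 3-term vocabulary (term ids 0-2),
# each document a list of (term_id, count) pairs, like doc2bow output.
bow_corpus = [
    [(0, 1), (1, 1)],   # doc 0 contains terms 0 and 1 once each
    [(0, 1), (2, 2)],   # doc 1 contains term 0 once and term 2 twice
    [(1, 1)],           # doc 2 contains term 1 once
]
num_docs = len(bow_corpus)

# Document frequency: in how many documents does each term appear?
df = {}
for doc in bow_corpus:
    for term_id, _count in doc:
        df[term_id] = df.get(term_id, 0) + 1

def tfidf(doc):
    # weight = tf * log2(N / df), then L2-normalize the document vector
    weights = [(t, c * math.log2(num_docs / df[t])) for t, c in doc]
    norm = math.sqrt(sum(w * w for _, w in weights)) or 1.0
    return [(t, w / norm) for t, w in weights]

# Terms 0 and 1 each appear in 2 of 3 documents, so doc 0's two weights are
# equal and normalize to 1/sqrt(2) each.
print(tfidf(bow_corpus[0]))
```

One detail this sketch glosses over: Gensim's TfidfModel also drops terms whose weight comes out as zero (terms appearing in every document), keeping the vectors sparse.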
Note: The samples used for demonstration were picked from here, the only difference being that the outputs are printed. You can check the APIs and some more samples in the same place.
Now that you have read this quick overview of the Gensim library, download it from here and perform NLP tasks quickly with the help of models such as Word2Vec, the Latent Dirichlet Allocation model, FastText, and more. Check out Python4Delphi, which makes it easy to build Python GUIs for Windows using Delphi.