To solve complex problems, deep learning algorithms need large volumes of data and substantial computational power, and they can work with nearly any form of data. This article takes an in-depth look at Long Short-Term Memory (LSTM) networks, one of the best-known deep learning techniques.
What is Deep Learning?
Deep learning, a branch of machine learning, addresses intricate problems through the utilization of artificial neural networks. These networks consist of interconnected nodes organized in multiple layers, extracting features from input data. Extensive datasets are employed to train these models, enabling them to identify patterns and correlations that might be challenging or impossible for humans to perceive.
The impact of deep learning on artificial intelligence has been substantial. It has paved the way for the development of intelligent systems capable of independent learning, adaptation, and decision-making. Deep learning has led to remarkable advancements in various domains, encompassing image and speech recognition, natural language processing, machine translation, autonomous driving, and numerous others.
Why Use Python for Deep Learning, Machine Learning, and Artificial Intelligence?
Python has gained widespread popularity as a programming language due to its versatility and ease of use in diverse domains of computer science, especially in the field of deep learning, machine learning, and AI.
We have discussed why Python is a great fit for Deep Learning, Machine Learning, and Artificial Intelligence (along with all the prerequisites) in several earlier articles.
What is a Long-Short-Term Memory Network (LSTM)?
Long Short-Term Memory (LSTM) networks are a modified version of recurrent neural networks (RNNs) that makes it easier to retain past data in memory. LSTM was introduced to address the performance degradation of RNNs on long sequences (the vanishing gradient problem). You can read more about it in the paper by Hochreiter and Schmidhuber (reference [4]).
LSTM is well suited to classifying, processing, and predicting time series with time lags of unknown duration. The model is trained using back-propagation. An LSTM network has three gates: the input gate, the forget gate, and the output gate.
I talked a little about LSTM and GRU as advanced RNN architectures in the previous deep learning article.
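To make the role of the three gates concrete, here is a minimal NumPy sketch of a single LSTM time step in its commonly described form. The function name `lstm_step`, the stacked weight layout, and the random toy weights are assumptions made for illustration; this is not how Keras stores or computes its LSTM internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout, not the Keras internals).

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    The four row blocks hold the input gate, forget gate,
    candidate values, and output gate, in that order.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate cell values
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g              # new cell state (long-term memory)
    h_t = o * np.tanh(c_t)                # new hidden state (short-term output)
    return h_t, c_t

# Toy dimensions and random weights, chosen only for this demonstration
rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The additive update of the cell state (`f * c_prev + i * g`) is the part usually credited with easing the vanishing gradient problem, since information can pass through many time steps without being repeatedly squashed.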
How does LSTM work for Machine Translation?
Sequence to Sequence modeling is one of the many intriguing uses of natural language processing. Both language translation systems and question-answering systems make extensive use of it.
The goal of sequence-to-sequence (Seq2Seq) modeling is to develop models that can convert sequences from one domain to another, such as translating English to German. The LSTM encoder and decoder execute this Seq2Seq modeling [7].
Here’s how it works:
- Feed the embedding vectors for the source sequences (German) to the encoder network, one word at a time.
- Encode the input sentences into fixed-dimension state vectors. At this step, we get the hidden and cell states from the encoder LSTM and feed them to the decoder LSTM.
- The decoder uses these states as its initial states. It also receives the embedding vectors for the target words (English).
- Decode and output the translated sentence, one word at a time. In this step, the output of the decoder is sent to a softmax layer over the entire target vocabulary.
A typical seq2seq model has two major components: an encoder and a decoder. These are essentially two separate recurrent neural network (RNN) models combined into one large network.
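The last step in the list above, a softmax over the target vocabulary, can be sketched in a few lines of plain Python. The vocabulary and the score values below are hypothetical and stand in for what a trained decoder would produce at a single time step.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical target vocabulary and decoder scores for one time step
target_vocab = ["<pad>", "i", "am", "happy", "sad"]
decoder_scores = [0.1, 0.3, 0.2, 2.5, 1.0]

probs = softmax(decoder_scores)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the most likely word
predicted_word = target_vocab[best]
print(predicted_word)
```

Picking the highest-probability word at every step is the simplest decoding strategy; it is the one both examples in this article use.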
How do I build and train an LSTM for Machine Translation from scratch?
Let’s get hands-on with some Python code to build and train your own LSTMs from scratch.
In this article, we will create a language translation model using the seq2seq architecture and an LSTM network. Neural machine translation is one of the best-known applications of this approach (Google Translate is a prominent example). Brace yourself: this article is a little more intense than my previous tutorials.
We will work with the Kaggle Bilingual Sentence Pairs dataset (reference [3]) to train the LSTM so that it can translate sentences it has not seen during training. The original source of the dataset is the Tatoeba Project (to download the dataset, see references [1] and [3]).
The actual data contains over 150,000 sentence pairs. However, to reduce the training time of the model, we will use only the first 50,000 sentence pairs in the 1st demo and the first 20,000 sentence pairs in the 2nd demo. Of course, this leads to not-really-satisfying results, but the article still serves its purpose as a proof of concept. You can increase these numbers if you have a powerful computer.
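Both demos below read the dataset as tab-separated lines, with the English sentence in the first column and the German sentence in the second; the second demo also unpacks a third column. As a small sketch of that parsing step, the helper name `parse_pair` and the sample line here are made up for illustration and are not taken from the dataset.

```python
def parse_pair(line):
    """Split one tab-separated line into (english, german), ignoring extra columns."""
    fields = line.rstrip("\n").split("\t")
    english, german = fields[0], fields[1]
    return english, german

# Hypothetical line in the assumed "english<TAB>german<TAB>extra" layout
sample_line = "Go away!\tGeh weg!\thypothetical attribution text\n"
english, german = parse_pair(sample_line)
print(english, "|", german)
```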
Hands-on and selected outputs (1st example)
This example is modified from reference [5]. The original reference already shows excellent results when predicting the unseen data in English. However, I modified the code slightly, so we can test it to predict the unseen data in German.
The following is the code example of implementing LSTM for machine translation:
```python
# Import libraries
# Note: this listing assumes Keras 2.x (for example, the version bundled with TensorFlow 2.x);
# keras.preprocessing.text.Tokenizer is not available in Keras 3.
import string
from numpy import array, argmax
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint
from keras import optimizers
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', 200)

# Function to read raw text file
def read_text(filename):
    # Open the file, read all text, and close it again
    with open(filename, mode='rt', encoding='utf-8') as file:
        return file.read()

# Split a text into sentences
def to_lines(text):
    sents = text.strip().split('\n')
    sents = [i.split('\t') for i in sents]
    return sents

# Load dataset
data = read_text("data/bilingual-sentence-pairs/deu.txt")
deu_eng = to_lines(data)
deu_eng = array(deu_eng)

## We will use only the first 50,000 sentence pairs to reduce the training time of the model
deu_eng = deu_eng[:50000,:]

# Remove punctuation
deu_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,0]]
deu_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,1]]

# Convert text to lowercase
for i in range(len(deu_eng)):
    deu_eng[i,0] = deu_eng[i,0].lower()
    deu_eng[i,1] = deu_eng[i,1].lower()

# Empty lists
eng_l = []
deu_l = []

# Populate the lists with sentence lengths
for i in deu_eng[:,0]:
    eng_l.append(len(i.split()))

for i in deu_eng[:,1]:
    deu_l.append(len(i.split()))

## Plot the distributions
length_df = pd.DataFrame({'eng':eng_l, 'deu':deu_l})
length_df.hist(bins = 30)
plt.suptitle("Distributions of sentence lengths (eng vs deu)")
plt.show()

## Find the max sentence length for each language
max_eng_sentence_length = max(length_df['eng'])
max_deu_sentence_length = max(length_df['deu'])
print('Max sentence length for eng: %d' % max_eng_sentence_length)
print('Max sentence length for deu: %d' % max_deu_sentence_length)

# Function to build a tokenizer
def tokenization(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# Prepare English tokenizer
eng_tokenizer = tokenization(deu_eng[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
## Choose "7" as the max sentence length
eng_length = 7
print('English Vocabulary Size: %d' % eng_vocab_size)

# Prepare Deutsch tokenizer
deu_tokenizer = tokenization(deu_eng[:, 1])
deu_vocab_size = len(deu_tokenizer.word_index) + 1
## Choose "7" as the max sentence length
deu_length = 7
print('Deutsch Vocabulary Size: %d' % deu_vocab_size)

# Encode and pad sequences
def encode_sequences(tokenizer, length, lines):
    seq = tokenizer.texts_to_sequences(lines)
    ## Pad sequences with 0 values
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return seq

# Model building
## Split data into train and test set
train, test = train_test_split(deu_eng, test_size=0.2, random_state = 12)

# Prepare training data (German is the input, English is the target)
trainX = encode_sequences(deu_tokenizer, deu_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])

# Prepare validation data
testX = encode_sequences(deu_tokenizer, deu_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])

# Define the model
## Build NMT model
def define_model(in_vocab, out_vocab, in_timesteps, out_timesteps, units):
    model = Sequential()
    model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
    model.add(LSTM(units))
    model.add(RepeatVector(out_timesteps))
    model.add(LSTM(units, return_sequences=True))
    model.add(Dense(out_vocab, activation='softmax'))
    return model

## Model compilation
model = define_model(deu_vocab_size, eng_vocab_size, deu_length, eng_length, 512)
## "learning_rate" replaces the deprecated "lr" argument
rms = optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')

# Fit the model
## Save the model with the lowest validation loss
filename = 'model.h1.28_may_23'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# Train model
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30, batch_size=512, validation_split = 0.2,
                    callbacks=[checkpoint], verbose=1)

## Plot training loss vs validation loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()

# Prediction on unseen data
model = load_model('model.h1.28_may_23')
## predict_classes() was removed from recent Keras versions;
## taking the argmax over the vocabulary axis gives the same class indices
preds = argmax(model.predict(testX.reshape((testX.shape[0], testX.shape[1]))), axis=-1)

# Map an integer index back to its word
def get_word(n, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == n:
            return word
    return None

# Convert predicted index sequences to text, blanking padding and repeated words
def preds_to_text(preds, tokenizer):
    texts = []
    for i in preds:
        temp = []
        for j in range(len(i)):
            t = get_word(i[j], tokenizer)
            if j > 0:
                if (t == get_word(i[j-1], tokenizer)) or (t is None):
                    temp.append('')
                else:
                    temp.append(t)
            else:
                if t is None:
                    temp.append('')
                else:
                    temp.append(t)
        texts.append(' '.join(temp))
    return texts

# Present the results in dataframe (eng)
preds_text_eng = preds_to_text(preds, eng_tokenizer)
pred_df_eng = pd.DataFrame({'actual' : test[:,0], 'predicted' : preds_text_eng})
## Print 15 rows randomly
print(pred_df_eng.sample(15))

# Present the results in dataframe (deu)
## Note: preds holds indices from the English output vocabulary;
## they are mapped through the German tokenizer here as in the original experiment
preds_text_deu = preds_to_text(preds, deu_tokenizer)
pred_df_deu = pd.DataFrame({'actual' : test[:,1], 'predicted' : preds_text_deu})
## Print 15 rows randomly
print(pred_df_deu.sample(15))
```
To execute the code provided, we can utilize the PyScripter IDE. Here are a few selected outputs:
1. Visualizing the distribution of sentence lengths (eng vs deu)
We will generate a plot to illustrate the distribution of sentence lengths. For this purpose, we will store the lengths of all English sentences in one list and the lengths of all German sentences in another.
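As a small illustration of the idea, the following sketch counts words per sentence and tallies how often each length occurs. The four sentences are invented for this example; the real script does the same over the dataset columns and then draws the histograms with pandas.

```python
from collections import Counter

# Hypothetical sentences standing in for one language column of the dataset
sentences = ["go away", "tom is here", "hello", "i am happy"]

# Sentence length = number of whitespace-separated words
lengths = [len(s.split()) for s in sentences]

histogram = Counter(lengths)   # maps each length to the number of sentences with it
max_length = max(lengths)

print(dict(histogram), max_length)
```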
2. Maximum sentence length and vocabulary size (eng vs deu)
The output shows that the maximum length of German sentences is 15, whereas for English sentences it is 7. In the code, both languages are then padded or truncated to 7 tokens (eng_length and deu_length).
To use a Seq2Seq model, we need to convert both input and output sentences into fixed-length integer sequences. To achieve this, we employ the Tokenizer() class from Keras, which transforms our sentences into sequences of integers. These sequences are then padded with zeros to ensure uniform length across all sequences.
To prepare for this process, we create tokenizers for both German and English sentences. At the same time, we count the vocabulary size for both languages and print them out.
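The following plain-Python sketch mimics, in simplified form, what the tokenizer and the padding step do. The helper names (`build_word_index`, `texts_to_sequences`, `pad_post`) and the toy sentences are assumptions for illustration; the real Keras classes handle more cases, such as punctuation filtering.

```python
from collections import Counter

def build_word_index(sentences):
    """Assign integer ids by descending word frequency; 0 is reserved for padding."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])   # stable sort keeps first-seen order on ties
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def texts_to_sequences(sentences, word_index):
    """Replace each known word with its integer id."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

def pad_post(sequences, length):
    """Pad with zeros at the end; overly long sequences keep their last `length` ids,
    which mirrors the Keras default of truncating from the front."""
    padded = []
    for seq in sequences:
        seq = seq[-length:]
        padded.append(seq + [0] * (length - len(seq)))
    return padded

# Hypothetical sentences
sentences = ["tom is here", "tom is not here today", "hello"]
word_index = build_word_index(sentences)
sequences = texts_to_sequences(sentences, word_index)
padded = pad_post(sequences, length=4)

print(word_index)
print(padded)
```

Note the second sentence: it has five words, so with a length of 4 its first word is dropped. This is worth keeping in mind when the chosen length is shorter than the longest sentence.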
3. Model training and saving the best result
For the training process, we will run the model for 30 epochs, utilizing a batch_size of 512 and a validation_split of 20%. This means that 80% of the training data will be used to fit the model, while the remaining 20% will be used for evaluation. Feel free to experiment and modify these hyperparameters to suit your needs.
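Since the script first holds out a test set and then applies validation_split to what is left, it helps to work out the actual counts. The sketch below assumes that all 50,000 pairs load successfully; the variable names are made up for this calculation.

```python
total_pairs = 50_000        # pairs kept from the dataset in the 1st demo
test_percent = 20           # train_test_split(test_size=0.2)
validation_percent = 20     # model.fit(validation_split=0.2)
batch_size = 512

test_pairs = total_pairs * test_percent // 100        # held out for the final predictions
fit_pairs = total_pairs - test_pairs                  # passed to model.fit()
val_pairs = fit_pairs * validation_percent // 100     # used for val_loss
train_pairs = fit_pairs - val_pairs                   # actually used for weight updates
steps_per_epoch = -(-train_pairs // batch_size)       # ceiling division

print(test_pairs, val_pairs, train_pairs, steps_per_epoch)
```

Under these assumptions the model updates its weights on 32,000 pairs, which is 64% of the 50,000 pairs rather than 80%.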
To ensure that we capture the model’s best performance, we will employ the ModelCheckpoint() function, which saves the model with the lowest validation loss (val_loss). The resulting model with the best performance will be automatically stored in the “model.h1.28_may_23” folder.
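Conceptually, saving only the best model boils down to remembering the lowest validation loss seen so far. Here is a minimal sketch of that logic; the function name `best_epoch` and the loss values are hypothetical, and the real callback also writes the model to disk at each improvement.

```python
def best_epoch(val_losses):
    """Return the lowest loss and the epochs at which a new best would be saved."""
    best_loss = float("inf")
    saved_at = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:          # improvement: this is where a checkpoint is written
            best_loss = loss
            saved_at.append(epoch)
    return best_loss, saved_at

# Hypothetical validation losses for six epochs
val_losses = [3.2, 2.9, 2.7, 2.8, 2.6, 2.65]
best_loss, saved_at = best_epoch(val_losses)
print(best_loss, saved_at)
```

Epochs 4 and 6 do not improve on the best value seen so far, so no checkpoint would be written for them.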
The validation loss score is printed for each of the 30 epochs (Epoch-01 through Epoch-30) during training.
The model training ran to completion within the PyScripter IDE without errors or noticeable delays.
4. Plot train loss vs validation loss
Let’s plot and compare the training loss and the validation loss.
As you can see in the plot above, the validation loss plateaus after the 20th epoch, indicating that the model has likely converged and will not improve further with additional training.
5. Generating predictions for unseen data
The generated predictions consist of sequences represented by integers. To make these predictions more understandable, we must convert these integers back into their respective words.
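A compact way to do this conversion is to invert the tokenizer's word index once and then look up each predicted integer. The sketch below is a simplified variant of the approach in the script (it skips padding and repeated words instead of inserting empty strings); the function name `decode_prediction` and the tiny vocabulary are assumptions for illustration.

```python
def decode_prediction(indices, index_word):
    """Map predicted integer ids back to words, skipping padding and consecutive repeats."""
    words = []
    previous = None
    for idx in indices:
        word = index_word.get(idx)              # None for the padding id 0 or unknown ids
        if word is not None and word != previous:
            words.append(word)
        previous = word
    return " ".join(words)

# Hypothetical word index, in the same word -> id form as tokenizer.word_index
word_index = {"tom": 1, "is": 2, "here": 3}
index_word = {i: w for w, i in word_index.items()}   # inverted once, so lookups are O(1)

sentence = decode_prediction([1, 2, 2, 3, 0, 0], index_word)
print(sentence)
```

Inverting the dictionary up front avoids scanning the whole vocabulary for every predicted word, which matters when decoding thousands of sentences.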
Once the conversion is complete, we place the actual sentences from the test set and the predicted sentences side by side in a data frame and randomly display some rows. This allows us to assess the performance of our model:
Prediction results (eng):
For the English results, the model is doing a pretty decent job.
Let’s do similar things for Deutsch, and here are the prediction results (deu):
Unfortunately, the predictions for Deutsch are still unsatisfactory, and in some cases, completely incorrect. For instance, the model failed to identify that “Boston” refers to a city and “Tom” is a person’s name. To identify these errors, you can cross-reference them using Google Translate. Note that the model in this example is trained with German as input and English as output, so its predictions are indices into the English vocabulary; mapping them through the German tokenizer is therefore not expected to yield correct German.
With enough training epochs, more training data, and a better (or more complex) model (if you have the computational power to do so), the results will gradually improve. These are challenges we face regularly in NLP, but they aren’t immovable obstacles.
This is how you would use LSTM to solve a sequence prediction task. Let’s try another scenario of implementation, in the next subsection.
Hands-on and selected outputs (2nd example)
This second implementation follows reference [7]. That excellent blog post implements neural machine translation from English to French; in this article, we will try English to Deutsch instead.
In this 2nd example of LSTM implementation for machine translation, we still use the same dataset, but, for the word embedding, we will utilize the GloVe (Global Vectors for Word Representation) word embeddings (see reference [8]).
The following is the second example of implementing LSTM for machine translation:
```python
# Import libraries
# Note: this listing assumes Keras 2.x (for example, the version bundled with TensorFlow 2.x);
# plot_model additionally requires pydot and Graphviz to be installed.
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import plot_model  # For plotting DL models
import numpy as np
from numpy import asarray, zeros  # For loading GloVe data

# Set values for different parameters
BATCH_SIZE = 64
EPOCHS = 30
LSTM_NODES = 256
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
EMBEDDING_SIZE = 100

# Data preprocessing
input_sentences = []
output_sentences = []
output_sentences_inputs = []

count = 0
with open("data/bilingual-sentence-pairs/deu.txt", encoding="utf-8") as data_file:
    for line in data_file:
        count += 1
        if count > NUM_SENTENCES:
            break
        if '\t' not in line:
            continue
        input_sentence, output, _ = line.rstrip().split('\t')
        output_sentence = output + ' <eos>'
        output_sentence_input = '<sos> ' + output
        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

print("num samples input:", len(input_sentences))
print("num samples output:", len(output_sentences))
print("num samples output input:", len(output_sentences_inputs))

# Print the sentences at index 172
print(input_sentences[172])
print(output_sentences[172])
print(output_sentences_inputs[172])

# Tokenization (for inputs)
input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)
word2idx_inputs = input_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))
max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

# Tokenization (for outputs)
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)
word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))
num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

# Padding
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[172]:", encoder_input_sequences[172])

## Verify the integer values for "go" and "away" (sentence index 172)
print(word2idx_inputs["go"])
print(word2idx_inputs["away"])

## In the same way, pad the decoder inputs (deu):
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[172]:", decoder_input_sequences[172])

### Print the corresponding integers from the word2idx_outputs (sentence index 172)
print(word2idx_outputs["<sos>"])
print(word2idx_outputs["mach"])
print(word2idx_outputs["’ne"])
print(word2idx_outputs["fliege!"])

# Create word embeddings for the inputs by loading the GloVe word vectors into memory
embeddings_dictionary = dict()
with open("data/glove/glove.6B.100d.txt", encoding="utf-8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

## Create a matrix where the row number represents the integer value of the word
## and the columns correspond to the dimensions of the word vector
num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))
for word, index in word2idx_inputs.items():
    embedding_vector = embeddings_dictionary.get(word)
    # Skip words without a GloVe vector and indices beyond the matrix size
    if embedding_vector is not None and index < num_words:
        embedding_matrix[index] = embedding_vector

## Print the word embeddings for the word "go" using the GloVe word embedding dictionary.
print(embeddings_dictionary["go"])
print(embedding_matrix[20])

## Creates the embedding layer for the input
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)

# Create the model
## The final shape of the output: (number of inputs, length of the output sentence, the number of words in the output)
## output_integer_seq was built from the "<eos>"-terminated sentences and contains no "<sos>" token,
## so it is used unchanged as the decoder target (dropping its first element would misalign the targets)
decoder_output_sequences = output_integer_seq

## Creates the empty output array:
decoder_targets_one_hot = np.zeros((
        len(input_sentences),
        max_out_len,
        num_words_output
    ),
    dtype='float32'
)

## Prints the shape of the decoder targets:
print(decoder_targets_one_hot.shape)

## To make predictions, the final layer of the model will be a dense layer,
## therefore we need the outputs in the form of one-hot encoded vectors.
for i, d in enumerate(decoder_output_sequences):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1

## Create the encoder for LSTM:
encoder_inputs_placeholder = Input(shape=(max_input_len,))
x = embedding_layer(encoder_inputs_placeholder)
encoder = LSTM(LSTM_NODES, return_state=True)
encoder_outputs, h, c = encoder(x)
encoder_states = [h, c]

## Create the decoder for LSTM:
decoder_inputs_placeholder = Input(shape=(max_out_len,))
decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)
decoder_lstm = LSTM(LSTM_NODES, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

## Pass the output from the decoder LSTM through a dense layer, to predict decoder outputs
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Compile the model
model = Model([encoder_inputs_placeholder, decoder_inputs_placeholder], decoder_outputs)
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

## Plot our model
plot_model(model, to_file='plot_LSTMModelForMachineTranslation.png', show_shapes=True, show_layer_names=True)

# Train the model using the fit() method:
r = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.1,
)

# Modifying the model for predictions
## The encoder model remains the same:
encoder_model = Model(encoder_inputs_placeholder, encoder_states)

## Modify our model to accept the hidden and cell states
decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

## At each time step, there will be only a single word in the decoder input,
## so we need to modify the decoder embedding layer as follows:
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

## Create the placeholder for decoder outputs:
decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)

## To make predictions, the decoder output is passed through the dense layer:
decoder_states = [h, c]
decoder_outputs = decoder_dense(decoder_outputs)

## The final step is to define the updated decoder model, as shown here:
decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

## Plot our modified decoder LSTM that makes predictions:
plot_model(decoder_model, to_file='plot_modifiedLSTMModelForMachineTranslation.png', show_shapes=True, show_layer_names=True)

# Making predictions
## Create new dictionaries for both inputs and outputs where the keys are the integers
## and the corresponding values are the words:
idx2word_input = {v:k for k, v in word2idx_inputs.items()}
idx2word_target = {v:k for k, v in word2idx_outputs.items()}

## translate_sentence() accepts a padded English input sequence (in integer form)
## and returns the translated German sentence.
def translate_sentence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']
    eos = word2idx_outputs['<eos>']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        if eos == idx:
            break

        word = ''

        if idx > 0:
            word = idx2word_target[idx]
            output_sentence.append(word)

        # Feed the predicted word and the updated states back into the decoder
        target_seq[0, 0] = idx
        states_value = [h, c]

    return ' '.join(output_sentence)

# Testing the model
i = np.random.choice(len(input_sentences))
input_seq = encoder_input_sequences[i:i+1]
translation = translate_sentence(input_seq)
print('-')
print('Input:', input_sentences[i])
print('Response:', translation)
```
Let’s run the above code using PyScripter IDE. And the following are some selected outputs:
1. Word embeddings
This part shows the implementation of word embeddings for neural machine translation. Here is the printed result of the word embeddings for the word “go” using the GloVe word embedding dictionary.
Checking index 20 of the word embedding matrix (the word “go”) shows a consistent result:
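The consistency check above works because each row of the embedding matrix is simply a copy of the pretrained vector for the word with that integer id. Here is a toy sketch of that construction; the 3-dimensional vectors and the small vocabulary are invented for illustration and are not real GloVe values.

```python
embedding_size = 3

# Hypothetical pretrained vectors (real GloVe vectors have 100 dimensions in this article)
pretrained = {"go": [0.1, 0.2, 0.3], "away": [0.4, 0.5, 0.6]}

# Hypothetical tokenizer word index (word -> integer id, 0 reserved for padding)
word_index = {"go": 1, "away": 2, "tom": 3}

# One row per id, initialised to zeros
matrix = [[0.0] * embedding_size for _ in range(len(word_index) + 1)]
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:        # words without a pretrained vector keep the zero row
        matrix[idx] = list(vector)

print(matrix[word_index["go"]])   # same values as pretrained["go"]
print(matrix[word_index["tom"]])  # all zeros: "tom" has no pretrained vector here
```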
2. Plot our LSTM model for machine translation
Another interesting part of this second approach is that, after we compile the model, we can plot it using tf.keras.utils.plot_model. This helps us keep track of, and communicate clearly about, all the inputs, outputs, steps, layers, and so on.
3. Plot the modified model for prediction
Before making any predictions, first, we need to modify our model.
The following is the plot of our model after some modification performed to make predictions:
4. Train the model
I trained the model for 30 epochs; you can modify the number of epochs to see if you can get better results. With validation_split=0.1, the model is trained on 18,000 sentences and validated on the remaining 2,000 sentences. You can also increase the number of records if you want better results and have capable computational resources.
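During training, the decoder in this second example receives the target sentence prefixed with `<sos>` and is asked to predict the same sentence ending in `<eos>`, so that at every step it sees one word and must predict the next. The sketch below shows this alignment on an invented German sentence; the helper name `make_decoder_pair` is an assumption for illustration.

```python
def make_decoder_pair(target_sentence):
    """Build the decoder input and the decoder target for one training sentence."""
    tokens = target_sentence.split()
    decoder_input = ["<sos>"] + tokens     # what the decoder is given at each step
    decoder_target = tokens + ["<eos>"]    # what it should predict at each step
    return decoder_input, decoder_target

# Hypothetical target sentence
decoder_input, decoder_target = make_decoder_pair("ich bin hier")

for step, (given, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, given, "->", expected)
```

Both sequences have the same length, and position t of the target is always the word that follows position t of the input. If the two are shifted against each other by more than one position, the model learns the wrong mapping.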
5. Test the model
To test the code, we will randomly choose a sentence from the input_sentences list, retrieve the corresponding padded sequence for the sentence, and will pass it to the translate_sentence() method. The method will return the translated sentence as shown below.
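The loop inside translate_sentence() is a form of greedy decoding: start from `<sos>`, pick the most likely next word, feed it back in, and stop at `<eos>`. The following self-contained sketch shows the same control flow with a scripted stand-in for the decoder; `greedy_decode`, `toy_step`, and the tiny vocabulary are hypothetical and replace the trained Keras models purely for illustration.

```python
def greedy_decode(step_fn, initial_state, sos, eos, max_len):
    """Generate token ids one at a time, always taking the highest-scoring token."""
    token, state = sos, initial_state
    output = []
    for _ in range(max_len):
        scores, state = step_fn(token, state)                       # one decoder step
        token = max(range(len(scores)), key=scores.__getitem__)     # argmax over the vocabulary
        if token == eos:
            break
        output.append(token)
    return output

# Hypothetical vocabulary and a scripted "decoder" that always emits: geh, weg, <eos>
vocab = ["<pad>", "<sos>", "<eos>", "geh", "weg"]
script = [3, 4, 2]

def toy_step(token, state):
    """Stand-in for decoder_model.predict: the state is just a position in the script."""
    scores = [0.0] * len(vocab)
    scores[script[state]] = 1.0
    return scores, state + 1

ids = greedy_decode(toy_step, initial_state=0, sos=1, eos=2, max_len=10)
translation = " ".join(vocab[i] for i in ids)
print(translation)
```

In the real model the state is the pair of hidden and cell vectors returned by the encoder, and the scores come from the softmax layer.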
1st attempt of the test:
2nd attempt:
3rd attempt:
Again, it seems that the results in Deutsch are still far from satisfying. To identify these errors, you can cross-reference them using Google Translate.
With enough training epochs, more training data, and a better (or more complex) model (if you have the computational power to do so), the results will get better over time. These are challenges we face regularly in NLP.
Endnotes
LSTMs are a very promising solution to sequence-related problems. They have many useful applications, including time series prediction, weather forecasting, machine translation, speech recognition, and many more. According to Google Scholar, no other computer science paper of the 20th century is receiving as many citations per year as the original 1997 journal publication on Long Short-Term Memory (LSTM) artificial neural networks (NNs) [9].
However, the one disadvantage I find is that they are difficult to train. Even a simple model needs a lot of data, training epochs/time, and system resources. That is largely a hardware constraint, though, and the lightweight PyScripter IDE handled the training runs without errors or lag.
I hope this article was successful in giving you a basic understanding and workflow of how these networks work.
Check out the full repository here:
github.com/Embarcadero/DL_Python03_LSTM
Click here to get started with PyScripter, a free, feature-rich, and lightweight Python IDE.
Download RAD Studio to create more powerful Python GUI Windows Apps in 5x less time.
Check out Python4Delphi, which makes it simple to create Python GUIs for Windows using Delphi.
Also, look into DelphiVCL, which makes it simple to create Windows GUIs with Python.
References & further readings
[1] All sentences and translations are from Tatoeba’s (tatoeba.org) massive and awesome dataset, released under a CC-BY License.
[2] Biswal, A. (2023). Top 10 Deep Learning Algorithms You Should Know in 2023. Simplilearn. simplilearn.com/tutorials/deep-learning-tutorial/deep-learning-algorithm
[3] Cijov, A. (2021). Bilingual Sentence Pair: Dataset for Translator Projects. Kaggle. kaggle.com/datasets/alincijov/bilingual-sentence-pairs
[4] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
[5] Jain, H. (2021). Machine Translation | Seq2Seq | LSTMs. Kaggle. kaggle.com/code/harshjain123/machine-translation-seq2seq-lstms
[6] Kumar, V. (2020). Sequence-to-Sequence Modeling using LSTM for Language Translation. Analytics India Magazine. analyticsindiamag.com/sequence-to-sequence-modeling-using-lstm-for-language-translation
[7] Malik, U. (2022). Python for NLP: Neural Machine Translation with Seq2Seq in Keras. StackAbuse. stackabuse.com/python-for-nlp-neural-machine-translation-with-seq2seq-in-keras
[8] Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). nlp.stanford.edu/projects/glove
[9] Schmidhuber, J. (2022). 2022: 25th anniversary of 1997 papers: Long Short-Term Memory. All computable metaverses. Hierarchical reinforcement learning (RL). Meta-RL. Abstractions in generative adversarial RL. Soccer learning. Low-complexity neural nets. Low-complexity art. Others. AI Blog. IDSIA, Lugano, Switzerland.
[10] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … & Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.