Unlock the Power of Python for Deep Learning with Transformer Architecture – The Engine Behind ChatGPT

Deep learning algorithms need large amounts of data and substantial computing power to solve complex problems, and they can handle many types of data. This article explores the Transformer architecture, a prominent member of the deep learning family and the engine behind ChatGPT, which is currently discussed everywhere.

Please note: As the model weights of GPT-3 and later versions are not publicly available at the time of writing, the demonstration in this article uses GPT-2.

Before we begin, let’s see the remarkable ability of ChatGPT to describe itself:

chatgpt demo transformer architecture

It can even praise itself very highly :). Read References [2] and [3] if you want to see how ChatGPT can help you understand a research paper in the deep learning field.

What is Deep Learning?

Deep learning is a subfield of machine learning that solves complex problems using artificial neural networks. These neural networks are made up of interconnected nodes arranged in multiple layers that extract features from input data. Large datasets are used to train these models, allowing them to detect patterns and correlations that humans would find difficult or impossible to detect.

Deep learning has had a significant impact on artificial intelligence. It has facilitated the development of intelligent systems capable of learning, adapting, and making decisions on their own. Deep learning has enabled remarkable progress in a variety of fields, including image and speech recognition, natural language processing, machine translation, large language models, chatbots, and content generators (as reviewed in this article), image generation, autonomous driving, and many others.

Example of an AI-generated image, created with the Stable Diffusion XL model using the prompt “Illustration of deep learning and AI community”.

Why Python for Deep Learning, Machine Learning, and Artificial Intelligence?

Python has gained widespread popularity as a programming language due to its versatility and ease of use in diverse domains of computer science, especially in the field of deep learning, machine learning, and AI. 

We have explained several times why Python is great for Deep Learning, Machine Learning, and Artificial Intelligence (along with all the requirements) in earlier articles.

What is GPT?

GPT stands for “Generative Pre-trained Transformer,” and it is a type of artificial intelligence language model developed by OpenAI. The GPT models are built on the Transformer architecture, which is a deep learning model architecture specifically designed for natural language processing tasks.

An example of the GPT architecture used for language understanding. Image source: Reference [5].

The key features of GPT are:

  1. Generative: GPT is capable of generating human-like text based on the input it receives. It can produce coherent and contextually relevant responses, making it useful for a variety of natural language generation tasks.
  2. Pre-trained: GPT models are initially trained on large-scale datasets that contain diverse text from the internet, books, articles, and other sources. This pre-training phase helps the model learn grammar, syntax, semantics, and factual knowledge from the vast amount of data it processes.
  3. Transformer architecture: The Transformer architecture is a neural network architecture that allows the model to process input text in parallel, making it more efficient and scalable compared to earlier sequential models. The self-attention mechanism in Transformers enables the model to weigh the importance of different words in the input context, leading to better contextual understanding.
  4. Transfer Learning: GPT leverages the concept of transfer learning. After pre-training on a large corpus of text, the model can be further fine-tuned on specific tasks or datasets to adapt its capabilities to more targeted use cases.

GPT has seen several iterations, with each version being an improvement over its predecessors in terms of size, performance, and language understanding capabilities. These models have found applications in various domains, including chatbots, language translation, text summarization, content generation, and more, due to their ability to understand and generate human-like text.

What is GPT-2?

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
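To make the "predict the next word" objective concrete, here is a minimal sketch that uses simple word-pair counts instead of a neural network. It is only a toy illustration of the objective, not how GPT-2 works internally; the function names and the tiny corpus are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text, only for illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # the word seen most often after "the"
```

GPT-2 replaces the count table with a Transformer that looks at all previous words rather than only the last one, but the training signal is the same idea: the next word in the text is the correct answer.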

GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data.

What is Transformer architecture?

The Transformer architecture is a deep learning model architecture specifically designed for natural language processing (NLP) tasks. It was introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017 (see Reference [8]) and has since become a fundamental building block for many state-of-the-art NLP models, including GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

The key innovation of the Transformer architecture is the concept of self-attention. Traditional sequence-to-sequence models, like RNNs (Recurrent Neural Networks), process input sequentially, which can lead to inefficiencies and limitations in capturing long-range dependencies. In contrast, the Transformer allows for parallel processing of input sequences, making it highly scalable and efficient.

The Transformer model architecture. Image source: Reference [8].

What are the main components of Transformer architecture?

1. Encoder-Decoder Structure

The Transformer architecture consists of an encoder and a decoder. In NLP tasks like machine translation, the encoder processes the input sequence, while the decoder generates the output sequence.

2. Self-Attention Mechanism

Self-attention allows the model to weigh the importance of the words in the input sequence relative to each other. It computes attention scores between all pairs of words in the sequence, and the words with higher attention scores have a stronger influence on the representation of each word. This enables the model to capture dependencies between words regardless of their distance in the sequence.
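The following is a minimal sketch of scaled dot-product self-attention in plain Python, following the formula from Reference [8]. As a simplification, the same toy vectors are used as queries, keys, and values, while a real Transformer first applies learned projection matrices. The function names and the input vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per (query word, key word) pair, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "word" vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])  # one row of attention weights per word
```

Each printed row shows how strongly one word attends to every word in the sequence, including itself. The distance between the words plays no role in the computation.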

3. Multi-Head Attention

The self-attention mechanism is extended with multiple attention heads. Each head learns a different representation of the input sequence, allowing the model to capture different types of relationships and dependencies.
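As a rough sketch of the idea, the code below splits every input vector into equal slices, runs attention on each slice separately, and concatenates the results. Slicing stands in for the learned per-head projections of a real Transformer, so this is a simplification; the function names and numbers are assumptions for illustration.

```python
import math


def attend(q_rows, k_rows, v_rows):
    """Scaled dot-product attention for a single head."""
    d_k = len(k_rows[0])
    result = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in k_rows]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        result.append([sum(w * v[j] for w, v in zip(weights, v_rows))
                       for j in range(len(v_rows[0]))])
    return result


def multi_head_attention(x, num_heads):
    """Split each vector into `num_heads` slices, attend per slice, concatenate."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        # Slicing replaces the learned projections of a real model.
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        head_outputs.append(attend(part, part, part))
    # Concatenate the heads back into d_model values per position.
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(x))]


# Three toy vectors with d_model = 4, processed by 2 heads (made-up numbers).
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 1.0]]
out = multi_head_attention(x, num_heads=2)
print([[round(v, 3) for v in row] for row in out])
```

Because every head sees a different slice, each one computes its own attention weights, which is what lets the heads focus on different relationships.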

4. Feed-Forward Neural Networks

After the self-attention layers, the model typically employs feed-forward neural networks to process the attended representations further.

5. Positional Encoding

Since the Transformer processes words in parallel, it lacks the inherent order information found in sequential models. To address this, positional encoding is added to the input embeddings, providing the model with information about the word’s position in the sequence.
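Here is a minimal sketch of the sinusoidal positional encoding described in Reference [8], where even dimensions use a sine and odd dimensions use a cosine of the position. Note that GPT-2 itself learns its position embeddings during training instead of using this fixed formula; the function names are assumptions for this example.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: one vector of size d_model per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the positional vector to each word embedding, element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, pe)]


pe = positional_encoding(seq_len=4, d_model=6)
print([round(v, 3) for v in pe[0]])  # position 0 alternates sin(0) and cos(0)
print([round(v, 3) for v in pe[1]])
```

Every position gets a different vector, so two identical words at different positions end up with different inputs after add_positions.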

The Transformer architecture outperforms other, more classic models. Image source: Reference [7].

The Transformer architecture has shown remarkable performance improvements in various NLP tasks, and its ability to capture long-range dependencies and context has been instrumental in the success of modern language models. Its widespread adoption has transformed the NLP landscape, leading to the development of more powerful and efficient AI language models.

What are the requirements for running the GPT-2?

To perform this experiment, I use the following hardware, software, and environment setup:


Regular laptop (I am not using any additional GPU).

Processor: Intel(R) Core(TM) i5-8th Gen

Memory: 12 GB of RAM

Graphics: Intel(R) UHD Graphics 620

OS: Windows 10


Please follow each installation step carefully to avoid errors that are hard to debug!

Create new Python environment using conda

This step is important: an isolated environment keeps the old dependencies that GPT-2 needs separate from your other Python projects.

The following is the command to create a new Python environment named gpt2_4D:

create environment for transformer architecture

Activate the environment using this command:

activate or deactivate environment for transformer architecture

Python version: Python 3.6. 

GPT-2 needs TensorFlow 1.x (this article uses 1.14, with 1.12 as a fallback), and TensorFlow 1.12 does not support Python versions newer than 3.6, so I had to downgrade my Python version.

Use the following command to install specific Python version:

install python 36 for transformer architecture

Set up PyScripter IDE so it can find your recently installed Python 3.6 distribution on Windows

On PyScripter IDE, click the following menu, to load additional Python versions, in our case Python 3.6 (it even can successfully load Python that is installed in virtual environments):

Run -> Python Versions -> Setup Python Versions -> Click the + Button -> Add the C:/Users/YOUR_USERNAME/anaconda3/envs/gpt2_4D environment directory. 

If successfully loaded, it will show up like the following screenshots:

load python 36 env to pyscripter for transformer architecture

Python 3.6 and the environment for running GPT-2 are loaded successfully:

successfully load python 36 env to pyscripter for transformer architecture

Install TensorFlow v1.14

Use the following command to install a specific version of TensorFlow (v1.14):

install tensorflow 1140 for transformer architecture

If TensorFlow version 1.14.0 does not work for you, try the following version:

install tensorflow 1120 for transformer architecture

Other library requirements

The following command installs all the required libraries individually:

Or, simply run this command, and you will install all of the requirements seamlessly:

install requirements for transformer architecture

Download models

Use the following command to download the 124M model (make sure you are in the gpt-2 folder; if not, go there using cd gpt-2):

download model 124m for transformer architecture

Or, if you prefer a bigger model, use the following command to download the 345M model (be aware that it consumes much more memory and may not run at all on a regular laptop without an additional GPU):


download model 345m for transformer architecture

How to test if the GPT-2 is already installed and working correctly?

Test it with the temperature parameter

Run the file using the following command:

If you installed everything correctly, you will get the following output:

test installation gpt 2 for transformer architecture

For the sake of curiosity, I let the code run for hours, and here is the output of SAMPLE 100:

test installation gpt 2 for transformer architecture

Test it by generating samples

Next, you can test the installation by running the code using the following command:

test installation gpt 2 for transformer architecture

Again, for the sake of curiosity, I let the code run for hours, and here is the output of SAMPLE 100:

test installation gpt 2 for transformer architecture

Test it with interactive conversations in cmd

To do this, run the following command (if you use the smallest model, 124M):

And I used the following prompts:

  • First prompt: “What is Python language?”
test installation gpt 2 for transformer architecture
  • Second prompt: “What is Delphi programming language?”
test installation gpt 2 for transformer architecture
  • Third prompt: “How to cook delicious eggs?”
test installation gpt 2 for transformer architecture

For the fourth prompt, I used the same example as the one in the official article by OpenAI (see Reference [6]): “Please continue the following paragraph: "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."”

test installation gpt 2 for transformer architecture

How are the results? Convincing enough to make up sci-fi stories or fairy tales, right (or, even worse, to generate fake news and hoaxes, which is what worries the creators of the GPT products)? 🙂

It is stated by OpenAI that GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input; the model is chameleon-like in that it adapts to the style and content of the conditioning text, allowing the user to generate realistic and coherent continuations about a topic of their choice, as demonstrated by the sample above.

Why not use bigger models?

“Why not use bigger models?”, you might ask.

For example, to use a bigger model (here, 345M), use the following command:

The answer is simple: it would exceed the memory available on a regular laptop.

test installation gpt 2 for transformer architecture

And if you still insist on using the bigger model to answer your prompts, you might get less satisfying answers than those provided by the smallest model (124M).

Here are the answers to the same questions:

test installation gpt 2 for transformer architecture
test installation gpt 2 for transformer architecture
test installation gpt 2 for transformer architecture
test installation gpt 2 for transformer architecture

How to retrain GPT-2 to generate custom text content?

Retrain GPT-2 to build a Progressive Metal lyrics generator

In this section, we will retrain GPT-2 to generate any text that suits our purposes. We will explore all the required steps to retrain GPT-2 using any custom text, on Windows.

We can use any kind of text data, as long as it is in English. For example:

  1. Song lyrics
  2. Poems
  3. Short stories, light novels, novels
  4. Questions and answers
  5. Synopsis or abstract
  6. News, letters, articles, or papers

In this article, I retrained GPT-2 with Progressive Metal lyrics, out of curiosity. I chose the Progressive Metal genre because it has very specific characteristics: its lyrics are powerful, deep, dark, and more intricate and complex than those of other genres (even other Metal subgenres). The lyrics often belong to concept albums or follow philosophical, spiritual, or existential themes, exploring unusual sides of relationships and humanity.

So, I think it must be fun to see how the machine can handle such complicated things. 🙂

Download required files

The required repository for this training is a fork of the original GPT-2 repository, extended for fine-tuning and provided by @nshepperd. Go to the following GitHub link [4] and click on the “Clone or download” button.

Download 124M model

The next thing to do is to download the base model, as we did in the previous sections. This time, however, we download the model for the forked GPT-2 version.

First, don’t forget to navigate to the gpt-2-finetuning directory, and run the following command:

Preparing custom text dataset

For the experiment in this article, I use collections of Progressive Metal lyrics as training data. I collected all the studio album lyrics by four Progressive Metal legends (Dream Theater, Opeth, Porcupine Tree, and TOOL), which add up to a text file of 20,000+ lines that I saved as lyrics.txt.

I built the training data manually by copying and pasting from lyrics websites. Once you have completed your training data, move the file to the src directory.

Encode the data

First, don’t forget to navigate to the gpt-2-finetuning directory, and run the following command:

If you run it successfully, you will get the following output:

encode custom dataset for transformer architecture

Train the data

Change the directory to gpt-2-finetuning\src, and then use the following command to train the model on the new dataset:

If everything is working correctly, the training should start and you should have the following output after a while:

load and train custom dataset for transformer architecture

From the screenshot above, we can interpret the output [1 | 45.25] loss=3.92 avg=3.92 as follows:

  1. 1: Refers to the number of training steps. Think of it as a counter that will increase by 1 after each run.
  2. 45.25: Time elapsed since the start of training in seconds. You can use the first step as reference to determine how long it takes to run one step.
  3. loss and avg: Both of them refer to the cross-entropy (log loss) and the average loss. You can use this to determine the performance of your model. In theory, as training steps increase, the loss should decrease until it converges at a certain value. The lower, the better.
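If you want to track these numbers over a long run, a small script can pull them out of the log. The sketch below parses one log line in the format shown above and also converts the loss into a perplexity, assuming the cross-entropy uses the natural logarithm. The regular expression and the function name are my own assumptions, not part of the training script.

```python
import math
import re

# Matches lines such as "[1 | 45.25] loss=3.92 avg=3.92".
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_log_line(line):
    """Extract step, elapsed seconds, loss, and average loss from one log line."""
    match = LOG_LINE.search(line)
    if match is None:
        return None  # not a training progress line
    step, elapsed, loss, avg = match.groups()
    return {"step": int(step), "elapsed": float(elapsed),
            "loss": float(loss), "avg": float(avg)}


record = parse_log_line("[1 | 45.25] loss=3.92 avg=3.92")
print(record)
# Perplexity is exp(cross-entropy), assuming a natural-log loss.
print(round(math.exp(record["loss"]), 1))
```

A perplexity of about 50 at the first step roughly means the model is as uncertain as if it had to choose among 50 equally likely tokens; this number should shrink as the loss decreases.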

How to stop the training?

You can stop the training by using Ctrl+C.

ctrl+c to stop the training of transformer architecture

By default, the model will be saved once every 1000 steps and a sample will be generated once every 100 steps.

The following are the automatically generated samples after the first 100 steps of training:

train and generate sample step100 for transformer architecture

And the following are the checkpoint files inside the run1 folder:

run1 folder for transformer architecture

The following is the complete directory to the checkpoints:

The outputs related to our last training step:

  1. model-431.data-00000-of-00001
  2. model-431.index
  3. model-431.meta

How to resume training from the last checkpoint?

You can simply use the following code to resume the training:

Or the following command:

That’s the same as before.

The following is the log loss and average loss achieved after 1000 steps of training:

loss and avg step after 1000 steps of training the transformer architecture

Let’s see what the model can do, after such a long training.

Generate samples

Create a folder for the model

In the src/models folder, you should have just one folder called 124M (if you only installed this model). 

Create another folder to store your model alongside the original model. I made a new folder called lyric. Now, I have two folders in src/models: one is called 124M and the other is called lyric.

Go to src/checkpoint/run1 folder, and copy the following files:

  1. checkpoint
  2. model-xxxx.data-00000-of-00001
  3. model-xxxx.index
  4. model-xxxx.meta

xxxx refers to the step number. Since I have trained for 1394 steps, I have model-1394.index.

Paste them into the newly created folder (in my case, the folder is called lyric). Next, go to the 124M folder and copy the following files:

  1. encoder.json
  2. hparams.json
  3. vocab.bpe

Paste them into the lyric folder. Double-check that you have 7 files in it. With this, we are ready to generate samples.

output lyric folder after 1394 steps of training for transformer architecture
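If you prefer not to copy the files by hand, the steps above can be sketched as a small Python script. The function name assemble_model_dir is hypothetical, and the folder names in the usage comment follow the layout described in this article; adjust them to your own setup.

```python
import shutil
from pathlib import Path


def assemble_model_dir(checkpoint_dir, base_model_dir, target_dir, step):
    """Copy one checkpoint plus the vocabulary/config files into target_dir."""
    checkpoint_dir = Path(checkpoint_dir)
    base_model_dir = Path(base_model_dir)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # The "checkpoint" file and every file that belongs to the chosen step.
    sources = [checkpoint_dir / "checkpoint"]
    sources += sorted(checkpoint_dir.glob(f"model-{step}.*"))
    # Vocabulary and hyperparameter files from the base model.
    sources += [base_model_dir / name
                for name in ("encoder.json", "hparams.json", "vocab.bpe")]

    for src in sources:
        shutil.copy2(src, target_dir / src.name)
    # Return the file names so the caller can check that there are 7 of them.
    return sorted(p.name for p in target_dir.iterdir())


# Example usage from inside the src directory (paths follow this article):
# files = assemble_model_dir("checkpoint/run1", "models/124M", "models/lyric", 1394)
# print(len(files), files)
```

The script only copies files, so the original checkpoint and the 124M folder stay untouched and you can repeat it for a later training step.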

In general, there are two ways to generate samples: unconditional samples and interactive conditional samples.

Generate unconditional sample

Unconditional sample refers to randomly generated samples without taking into account any user input. Think of it as a random sample. 

Make sure you are in the src directory and simply use the following code to generate samples.

Beware of the output! Some of it may be quite humorous, and some may contain strong language and profanity (for an unknown reason, but likely because of some of the lyrics written by the band TOOL), while some of it seems good enough, producing lyrics that are somewhat similar to the progressive metal style of the training data.

The quality of the lyrics might improve if we add more training data or stop the training at the right step (to avoid overfitting or underfitting).

You can also get better results by removing the bad words using a profanity filter API, as follows:
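A minimal sketch of such a filter, assuming a small local blocklist rather than a hosted API. The words in the list are harmless placeholders, and both BLOCKLIST and mask_profanity are hypothetical names for this example.

```python
import re

# Hypothetical placeholder list; a real project would load a maintained word list.
BLOCKLIST = {"darn", "heck"}


def mask_profanity(text, blocklist=BLOCKLIST):
    """Replace every blocklisted word with asterisks of the same length."""
    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in blocklist else word

    # Visit each run of letters and apostrophes and mask it if needed.
    return re.sub(r"[A-Za-z']+", mask, text)


print(mask_profanity("Well, heck, that darn riff again"))
```

Masking keeps the line lengths of the lyrics intact; dropping the whole line would be an alternative if you want cleaner output.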

output generate samples for transformer architecture

For the sake of curiosity and completeness, I saved all 100 output samples into separate files and uploaded them to the /samples/ directory inside the following code repository.

Generate unconditional sample using parameter tuning

The most important parameters we need to know before tuning them are top_k and temperature.

top_k: Integer value controlling diversity. 1 means only 1 word is considered for each step (token), resulting in deterministic completions, while 40 means 40 words are considered at each step. 0 (default) is a special setting that means no restrictions. 40 generally is a good value.

temperature: Float value controlling randomness in the Boltzmann distribution. Lower temperature results in less random completions. As the temperature approaches zero, the model becomes deterministic and repetitive. Higher temperature results in more random completions. The default value is 1.
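To see how the two parameters interact, here is a minimal sketch of top_k filtering combined with temperature scaling on a handful of made-up scores. It illustrates the general technique and is not the sampling code of the GPT-2 repository; the function name and the scores are assumptions.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, rng=random):
    """Pick a token index from raw scores using temperature and top_k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring tokens; 0 means no restriction.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k] if top_k > 0 else ranked
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)  # fixed seed so runs are repeatable
print(sample_next_token(logits, temperature=0.8, top_k=40, rng=rng))
print(sample_next_token(logits, top_k=1, rng=rng))  # top_k=1 always picks index 0
```

With top_k=1 only the best-scoring token survives, which is why the completions become deterministic, and a very low temperature has almost the same effect even when all tokens are kept.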

For example, to generate unconditional samples using a temperature of 0.8 and a top_k of 40, type the following command:

For the sake of curiosity and completeness, I saved all 100 sample outputs produced by the command above and uploaded them to the /samples/ directory inside the following code repository.

Generate interactive conditional sample

Interactive conditional sample refers to generating samples based on user input. In other words, you input some text, and GPT-2 will do its best to fill in the rest.

Type the following command to generate interactive conditional samples:

After running the command above, type your desired prompt, for example:

And here is the screenshot of the excerpt of the output:

output generate samples for transformer architecture

Generate interactive conditional sample on PyScripter IDE

Before running the script in PyScripter IDE, you need to change the model_name parameter of the interact_model function from 124M to lyric (see line 12 of the file).

And then, you can run the code normally, and type your prompt. For example:

output generate samples pyscripter for transformer architecture
output generate lyrics samples pyscripter for transformer architecture


The Transformer architecture has revolutionized the field of natural language processing and machine learning as a whole. Its self-attention mechanism and parallel processing capabilities have led to remarkable breakthroughs in tasks like machine translation, sentiment analysis, and text generation (as we demonstrated in this article).

This article has highlighted and demonstrated the potential use of deep learning, specifically the Transformer architecture, in the domain of music and the arts, in this case for producing lyrics in a particular musical genre.

I hope this article was successful in giving you a basic understanding and workflow of how to retrain GPT-2 according to your project goals.

Check out the full repository here:

Click here to get started with PyScripter, a free, feature-rich, and lightweight Python IDE.

Download RAD Studio to create more powerful Python GUI Windows Apps in 5x less time.

Check out Python4Delphi, which makes it simple to create Python GUIs for Windows using Delphi.

Also, look into DelphiVCL, which makes it simple to create Windows GUIs with Python

References & further readings

[1] Foong, Ng Wai. (2019). Beginner’s Guide to Retrain GPT-2 (117M) to Generate Custom Text Content. AI2 Labs, Medium.

[2] Hakim, M. A. (2023). How to read Machine Learning/Deep Learning papers for a busy (or lazy) man. Paper-001: “Hinton et al., 2015”. hkaLabs AI blog.

[3] Hakim, M. A. (2023). How to bypass the ChatGPT information cutoff? A busy (or lazy) man guide to read more recent ML/DL papers. Paper-001: “Rombach et al., 2022”. hkaLabs AI blog.

[4] nshepperd. (2021). Forked and fine-tuned version of GPT-2 for custom datasets: Code for the paper “Language Models are Unsupervised Multitask Learners”. GitHub repository.

[5] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.

[6] Radford, A., Wu, J., Amodei, D., Amodei, D., Clark, J., Brundage, M., & Sutskever, I. (2019). Better language models and their implications. OpenAI blog, 1(2).

[7] Uszkoreit, J. (2017). Transformer: A Novel Neural Network Architecture for Language Understanding. Google Research Blog.

[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
