Deep learning algorithms need large amounts of data and substantial computing power to solve intricate problems, and they can handle many types of data. This article explores the Transformer architecture, a prominent deep learning architecture and the driving force behind ChatGPT, which is currently ubiquitous and widely discussed across different platforms.
Please note: As the model weights of GPT-3 and later versions are not publicly available at the time of writing, the demonstration in this article uses GPT-2.
Before we begin, let’s see the remarkable ability of ChatGPT to describe itself:
It can even praise itself very highly :). Read the two Hakim (2023) entries in the references if you want to see how powerful ChatGPT is in assisting you to understand a research paper in the deep learning field.
What is Deep Learning?
Deep learning is a subfield of machine learning that solves complex problems using artificial neural networks. These neural networks are made up of interconnected nodes arranged in multiple layers that extract features from input data. Large datasets are used to train these models, allowing them to detect patterns and correlations that humans would find difficult or impossible to detect.
Deep learning has had a significant impact on artificial intelligence. It has facilitated the development of intelligent systems capable of learning, adapting, and making decisions on their own. Deep learning has enabled remarkable progress in a variety of fields, including image and speech recognition, natural language processing, machine translation, large language models, chatbots, and content generators (as reviewed in this article), image generation, autonomous driving, and many others.
Why Python for Deep Learning, Machine Learning, and Artificial Intelligence?
Python has gained widespread popularity as a programming language due to its versatility and ease of use in diverse domains of computer science, especially in the field of deep learning, machine learning, and AI.
We have covered why Python is great for Deep Learning, Machine Learning, and Artificial Intelligence (along with all the requirements) in several earlier articles.
What is GPT?
GPT stands for “Generative Pre-trained Transformer,” and it is a type of artificial intelligence language model developed by OpenAI. The GPT models are built on the Transformer architecture, which is a deep learning model architecture specifically designed for natural language processing tasks.
The key features of GPT are:
- Generative: GPT is capable of generating human-like text based on the input it receives. It can produce coherent and contextually relevant responses, making it useful for a variety of natural language generation tasks.
- Pre-trained: GPT models are initially trained on large-scale datasets that contain diverse text from the internet, books, articles, and other sources. This pre-training phase helps the model learn grammar, syntax, semantics, and factual knowledge from the vast amount of data it processes.
- Transformer architecture: The Transformer architecture is a neural network architecture that allows the model to process input text in parallel, making it more efficient and scalable compared to earlier sequential models. The self-attention mechanism in Transformers enables the model to weigh the importance of different words in the input context, leading to better contextual understanding.
- Transfer Learning: GPT leverages the concept of transfer learning. After pre-training on a large corpus of text, the model can be further fine-tuned on specific tasks or datasets to adapt its capabilities to more targeted use cases.
GPT has seen several iterations, with each version being an improvement over its predecessors in terms of size, performance, and language understanding capabilities. These models have found applications in various domains, including chatbots, language translation, text summarization, content generation, and more, due to their ability to understand and generate human-like text.
What is GPT-2?
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
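As a rough illustration of the "predict the next word" objective described above, the sketch below computes the average cross-entropy of a toy model's predictions. The three-token vocabulary, the probability table, and the name `next_token_loss` are made up for illustration; GPT-2 itself computes this kind of loss over subword tokens inside TensorFlow.

```python
import math


def next_token_loss(probabilities, token_ids):
    """Average cross-entropy of predicting token t+1 from tokens 0..t.

    probabilities[t] is a hypothetical model's predicted distribution over
    the vocabulary after reading token_ids[0..t]; the target it is scored
    against is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        losses.append(-math.log(probabilities[t][target]))
    return sum(losses) / len(losses)


# Toy example: a vocabulary of 3 tokens and the sequence [0, 2, 1].
token_ids = [0, 2, 1]
probabilities = [
    [0.1, 0.1, 0.8],  # after token 0 the model favors token 2 (the true next token)
    [0.2, 0.5, 0.3],  # after tokens 0, 2 the model favors token 1 (the true next token)
]
loss = next_token_loss(probabilities, token_ids)
print("average next-token loss:", round(loss, 4))
```

The more probability the model puts on the true next token, the lower this value gets, which is the same quantity reported as the loss during the retraining later in this article.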
GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data.
What is Transformer architecture?
The Transformer architecture is a deep learning model architecture specifically designed for natural language processing (NLP) tasks. It was introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017 (see Vaswani et al., 2017, in the references) and has since become a fundamental building block for many state-of-the-art NLP models, including GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).
The key innovation of the Transformer architecture is the concept of self-attention. Traditional sequence-to-sequence models, like RNNs (Recurrent Neural Networks), process input sequentially, which can lead to inefficiencies and limitations in capturing long-range dependencies. In contrast, the Transformer allows for parallel processing of input sequences, making it highly scalable and efficient.
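To make the idea of self-attention more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the token vectors are made up, the queries, keys, and values are the raw token vectors rather than learned projections, and there is no causal mask as used in GPT-style models. Real implementations do the same arithmetic with tensor libraries.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    Each argument is a list of vectors, one per token. Every token's output
    is a weighted mix of all value vectors, so each position can be computed
    independently of the others (hence in parallel).
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up token vectors; Q, K, and V are the same vectors here (an assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, regardless of how far apart they are.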
What are the main components of Transformer architecture?
1. Encoder-Decoder Structure
The Transformer architecture consists of an encoder and a decoder. In NLP tasks like machine translation, the encoder processes the input sequence, while the decoder generates the output sequence.
2. Self-Attention Mechanism
Self-attention allows the model to weigh the importance of the words in the input sequence relative to each other. It computes attention scores between all pairs of words in the sequence, and the words with higher attention scores have a stronger influence on the representation of each word. This enables the model to capture dependencies between words regardless of their distance in the sequence.
3. Multi-Head Attention
The self-attention mechanism is extended with multiple attention heads. Each head learns a different representation of the input sequence, allowing the model to capture different types of relationships and dependencies.
4. Feed-Forward Neural Networks
After the self-attention layers, the model typically employs feed-forward neural networks to process the attended representations further.
5. Positional Encoding
Since the Transformer processes words in parallel, it lacks the inherent order information found in sequential models. To address this, positional encoding is added to the input embeddings, providing the model with information about the word’s position in the sequence.
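The positional encoding step can be sketched as follows, using the sinusoidal scheme from the original Transformer paper. The function names and the toy embeddings are made up for illustration. Note that GPT-2 itself uses learned position embeddings rather than this fixed table, so treat this as an illustration of the idea, not of GPT-2's internals.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position table: even dimensions use sin, odd dimensions use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position table to a list of token embeddings, element by element."""
    table = positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(embedding, position)]
        for embedding, position in zip(embeddings, table)
    ]


# Two identical made-up embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(embeddings)
print(with_positions[0])
print(with_positions[1])
```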
The Transformer architecture has shown remarkable performance improvements in various NLP tasks, and its ability to capture long-range dependencies and context has been instrumental in the success of modern language models. Its widespread adoption has transformed the NLP landscape, leading to the development of more powerful and efficient AI language models.
What are the requirements for running the GPT-2?
To perform this experiment, I use the following hardware, software, and environment setup:
- Regular laptop (I am not using any additional GPU)
- Processor: Intel(R) Core(TM) i5-8th Gen
- Memory: 12 GB of RAM
- Graphics: Intel(R) UHD Graphics 620
- OS: Windows 10
Please follow each installation step carefully to avoid hard-to-debug errors!
Create a new Python environment using conda
This step keeps the older dependencies isolated from your main Python installation and helps you avoid conflicts in the future.
The following command creates a new Python environment named gpt2_4D:
conda create --name gpt2_4D
Activate the environment using this command:
conda activate gpt2_4D
Python version: Python 3.6
GPT-2 needs TensorFlow 1.12, which supports Python versions only up to 3.6, so I needed to downgrade my Python version.
Use the following command to install a specific Python version:
conda install python=3.6
Setup for PyScripter IDE, so it can find your recently installed Python 3.6 distribution in Windows
In PyScripter IDE, use the following menu path to load additional Python versions, in our case Python 3.6 (it can even load a Python that is installed in a virtual environment):
Python Versions -> Setup Python Versions -> click the + button -> add the C:/Users/YOUR_USERNAME/anaconda3/envs/gpt2_4D environment directory.
If successfully loaded, it will show up like the following screenshots:
Python 3.6 and the whole environment for running GPT-2 are loaded successfully:
TensorFlow v1.14
Use the following command to install a specific version of TensorFlow (v1.14.0):
pip install tensorflow==1.14.0
If TensorFlow 1.14.0 does not work well for you, try one of the following versions:
conda install -c issxia tensorflow=1.12
conda install -c issxia tensorflow=1.13.1
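Before moving on, it can help to confirm that the active environment really has the expected versions. The following is a small sketch, not part of the GPT-2 repository; the helper names `python_matches` and `tensorflow_matches` are hypothetical, and the check simply assumes that any TensorFlow 1.x release is acceptable.

```python
import sys

REQUIRED_PYTHON = (3, 6)   # the Python version used in this article
EXPECTED_TF_PREFIX = "1."  # assumption: any TensorFlow 1.x release is acceptable


def python_matches(version_info, required=REQUIRED_PYTHON):
    """True if the major.minor part of version_info equals the required version."""
    return tuple(version_info[:2]) == required


def tensorflow_matches(version_string, prefix=EXPECTED_TF_PREFIX):
    """True if the TensorFlow version string belongs to the expected release line."""
    return version_string.startswith(prefix)


if __name__ == "__main__":
    print("Python OK:", python_matches(sys.version_info))
    try:
        import tensorflow as tf
        print("TensorFlow", tf.__version__, "OK:", tensorflow_matches(tf.__version__))
    except ImportError:
        print("TensorFlow is not installed in this environment")
```

Run it with the gpt2_4D environment activated; if either check prints False, revisit the installation steps above.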
Other library requirements
The following commands install all the required libraries individually (the quotes around the first requirement keep the shell from treating > as an output redirection):
pip install "fire>=0.1.3"
pip install regex==2017.4.5
pip install requests==2.21.0
pip install tqdm==4.31.1
pip install toposort==1.5
Or, simply run this command, and you will install all of the requirements seamlessly:
pip install -r requirements.txt
Use the following command to download the 124M model (make sure you are in the gpt-2 folder; if not, go to the gpt-2 folder using cd):
python download_model.py 124M
Or, if you prefer a bigger model, use the following command to download the 345M model (as a consequence, it would consume a very large amount of memory, or might not run at all, on a regular laptop without an additional GPU):
python download_model.py 345M
How to test if the GPT-2 is already installed and working correctly?
Test it by running generate_unconditional_samples.py with the temperature parameter
Run the generate_unconditional_samples.py file using the following command:
python src/generate_unconditional_samples.py --temperature=2.0
If you installed everything correctly, you will get the following output:
For the sake of curiosity, I let the code run for hours, and here is the output of
Test it by running generate_unconditional_samples.py with the model_name parameter
Next, you can test the installation by running the generate_unconditional_samples.py code using the following command:
python src/generate_unconditional_samples.py --model_name=124M
Again, for the sake of curiosity, I let the code run for hours, and here is the output of
Test it to perform interactive conversations on cmd
To do this, run interactive_conditional_samples.py with the --model_name parameter set to the model you downloaded (124M is the lowest-spec model).
And I used the following prompts:
- First prompt: “What is Python language?”
- Second prompt: “What is Delphi programming language?”
- Third prompt: “How to cook delicious eggs?”
For the fourth prompt, I tested it using the same example as the one provided in the official article by OpenAI (see Radford et al., 2019, in the references):
“Please continue the following paragraph: "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."”
How are the results? Convincing enough to make up sci-fi or fairy-tale stories, right (or even worse, to generate fake news and hoaxes, which is what worries the creators of the GPT products)? 🙂
It is stated by OpenAI that GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input; the model is chameleon-like in that it adapts to the style and content of the conditioning text, allowing the user to generate realistic and coherent continuations about a topic of their choice, as demonstrated by the sample above.
Why not use bigger models?
“Why not use bigger models?” you might ask.
For example, to use a bigger model (in this example, 345M), use the following command:
python src/interactive_conditional_samples.py --model_name=345M
The answer is simple: it would exceed the memory available on a regular laptop.
And if you still insist on using the bigger model to answer your prompts, you might get less satisfying answers compared to the answers provided by the lowest-spec model (124M).
Here are the answers to the same questions:
How to retrain GPT-2 to generate custom text content?
Retrain GPT-2 to build a Progressive Metal lyrics generator
In this section, we will retrain GPT-2 to generate any text that suits our purposes. We will explore all the required steps to retrain GPT-2 using any custom text, on Windows.
We can use any kind of text data, as long as they are in English. For example:
- Song lyrics
- Short stories, light novels, novels
- Questions and answers
- Synopsis or abstract
- News, letters, articles, or papers
In this article, I retrained GPT-2 with Progressive Metal lyrics, out of curiosity. I chose the Progressive Metal genre because its lyrics have very specific characteristics: they are powerful, deep, dark, and more intricate and complex than those of other genres (even compared to other Metal subgenres). The lyrics often belong to concept albums or follow a theme, and they deal with philosophical, spiritual, and existential topics, unusual sides of relationships and humanity, and so on.
So, I think it must be fun to see how the machine can handle such complicated things. 🙂
Download required files
The required repository for this training is a fork of the original GPT-2 repository with fine-tuning support, provided by @nshepperd. Go to the following GitHub link and click on the “Clone or download” button.
Download 124M model
The next thing to do is to download the base model, as we did in the previous sections. This time, however, we download the model into the forked GPT-2 repository.
First, don’t forget to navigate to the gpt-2-finetuning directory, and run the following command:
python download_model.py 124M
Preparing custom text dataset
For the experiment performed in this article, I am using collections of Progressive Metal lyrics as training data. I collected the studio album lyrics of four Progressive Metal legends, including Porcupine Tree and TOOL, which add up to a text file of 20,000+ lines that I saved as lyrics.txt.
I built the training data manually by copying and pasting from the following website:
azlyrics.com. Once you have completed your training data, move the file to the
Encode the data
First, don’t forget to navigate to the gpt-2-finetuning directory, and run the following command:
python encode.py lyrics.txt lyrics.npz
If you run it successfully, you will get the following output:
Train the data
Change the directory to gpt-2-finetuning\src, and then use the following command to train the model using the new dataset:
python train.py --dataset lyrics.npz --batch_size 2 --learning_rate 0.0001
If everything is working correctly, the training should start and you should have the following output after a while:
From the screenshot above, we can interpret the output [1 | 45.25] loss=3.92 avg=3.92 as follows:
- 1: The number of training steps. Think of it as a counter that increases by 1 after each step.
- 45.25: Time elapsed since the start of training, in seconds. You can use the first step as a reference to determine how long it takes to run one step.
- loss and avg: The cross-entropy (log loss) of the current step and the running average loss, respectively. You can use them to judge the performance of your model. In theory, as training steps increase, the loss should decrease until it converges at a certain value. The lower, the better.
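If you save the console output to a file, lines in this format are easy to process. The sketch below parses one log line and estimates the time per step from two entries. The names `parse_log_line` and `seconds_per_step` are hypothetical helpers, and the second log line is made up for illustration.

```python
import re

# Matches lines such as: [1 | 45.25] loss=3.92 avg=3.92
LOG_PATTERN = re.compile(
    r"\[(?P<step>\d+) \| (?P<seconds>[\d.]+)\] "
    r"loss=(?P<loss>[\d.]+) avg=(?P<avg>[\d.]+)"
)


def parse_log_line(line):
    """Return the fields of one training log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "step": int(match.group("step")),
        "seconds": float(match.group("seconds")),
        "loss": float(match.group("loss")),
        "avg": float(match.group("avg")),
    }


def seconds_per_step(first, last):
    """Average wall-clock time per step between two parsed log entries."""
    return (last["seconds"] - first["seconds"]) / (last["step"] - first["step"])


first = parse_log_line("[1 | 45.25] loss=3.92 avg=3.92")
later = parse_log_line("[3 | 135.75] loss=3.80 avg=3.86")  # made-up line for illustration
print(first)
print("seconds per step:", seconds_per_step(first, later))
```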
How to stop the training?
You can stop the training by using Ctrl+C.
By default, the model will be saved once every 1000 steps and a sample will be generated once every 100 steps.
The following are the automatically generated samples after the first 100 steps of training:
And the following are the checkpoint files inside the
The following is the complete directory to the checkpoints:
The outputs related to our last step of training:
How to resume training from the last checkpoint?
You can simply use the following command to resume the training:
python train.py --dataset lyrics.npz
Or the following command:
python train.py --dataset lyrics.npz --batch_size 2 --learning_rate 0.0001
That’s the same as before.
The following is the log loss and average loss achieved after 1000 steps of training:
Let’s see what the model can do, after such a long training.
Create a folder for the model
In the src/models folder, you should have just one folder called 124M (if you only installed this model).
Create another folder to store your model alongside the original model. I made a new folder called lyric. Now, I have two folders in src/models: one is called 124M and the other is called lyric.
Go to the src/checkpoint/run1 folder, and copy the following files:
xxxx refers to the step number. Since I have trained for
1394 steps, I have
Paste them into the newly created folder (in my case, the folder is called lyric). Next, go to the 124M folder and copy the following files:
Paste them into the lyric folder. Double-check that you have 7 files in it. With this, we are ready to generate samples.
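The copy steps above can also be scripted. The sketch below is an assumption-laden convenience, not part of the repository: the function name `assemble_model_folder` is hypothetical, and the file names are assumptions based on what TensorFlow 1.x checkpoints and the downloaded GPT-2 model folder usually contain (four checkpoint files plus three tokenizer and hyperparameter files, which matches the 7 files mentioned above). Check the names against your own folders before relying on it.

```python
import shutil
from pathlib import Path

# Assumed file names: the usual TensorFlow 1.x checkpoint layout.
CHECKPOINT_FILES = [
    "checkpoint",
    "model-{step}.data-00000-of-00001",
    "model-{step}.index",
    "model-{step}.meta",
]
# Assumed file names: the files shipped with the downloaded base model.
BASE_MODEL_FILES = ["encoder.json", "hparams.json", "vocab.bpe"]


def assemble_model_folder(checkpoint_dir, base_model_dir, target_dir, step):
    """Copy the checkpoint of one training step plus the base model files into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    sources = [Path(checkpoint_dir) / name.format(step=step) for name in CHECKPOINT_FILES]
    sources += [Path(base_model_dir) / name for name in BASE_MODEL_FILES]
    for source in sources:
        shutil.copy2(str(source), str(target / source.name))
    return sorted(p.name for p in target.iterdir())


# Example call, run from the src directory (paths follow the folders named above):
# assemble_model_folder("checkpoint/run1", "models/124M", "models/lyric", 1394)
```

A missing source file raises an error, which doubles as a check that the step number matches a saved checkpoint.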
In general, there are two ways to generate samples: By generating unconditional vs interactive conditional samples.
Generate unconditional sample
Unconditional sample refers to randomly generated samples without taking into account any user input. Think of it as a random sample.
Make sure you are in the src directory and simply use the following command to generate samples.
python generate_unconditional_samples.py --model_name lyric
Beware of the output! Some of it may be quite humorous, and some may contain strong language and profanity (for unknown reasons, but likely because of some lyrics written by the band TOOL), while some of it seems good enough at producing lyrics that are somewhat similar to the progressive metal style used in the training data.
The quality of the lyrics might improve if we add more training data, or stop the training at the right training step (to avoid overfitting or underfitting).
You can also get better results by removing the bad words using a profanity filter API.
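As a stand-in for a profanity filter API, the idea can be sketched locally with a word list. This is a minimal sketch under stated assumptions: `mask_profanity` is a hypothetical helper, the blocked word list is a harmless placeholder that you would replace with your own, and a hosted filter service would likely catch more variants than this exact-word matching does.

```python
import re


def mask_profanity(text, blocked_words, mask_char="*"):
    """Replace every whole-word occurrence of a blocked word with mask characters."""
    if not blocked_words:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in blocked_words) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda match: mask_char * len(match.group(0)), text)


# Placeholder word list for illustration only.
blocked = ["darn", "heck"]
sample = "Darn this road, what the heck is a darned soul"
print(mask_profanity(sample, blocked))
```

Only whole words are masked, so a longer word that merely contains a blocked word is left untouched.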
For the sake of curiosity and completeness, I saved all 100 output samples into separate files and uploaded them to the /samples/generate_unconditional_samples.py--model_name_lyric_outputs directory inside the following code repository.
Generate unconditional sample using parameter tuning
The most important parameters we need to know before tuning them are top_k and temperature:
- top_k: Integer value controlling diversity. 1 means only 1 word is considered for each step (token), resulting in deterministic completions, while 40 means 40 words are considered at each step. 0 (default) is a special setting that means no restrictions. 40 generally is a good value.
- temperature: Float value controlling randomness in the Boltzmann distribution. Lower temperature results in less random completions. As the temperature approaches zero, the model will become deterministic and repetitive. Higher temperature results in more random completions. Default value is 1.
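To see how these two parameters interact, here is a minimal sketch of top-k filtering followed by temperature scaling, written in plain Python. It is an illustration of the general technique, not the code from the GPT-2 repository; the function names and the toy scores are made up, and the scores stand in for the model's raw outputs (logits) for each candidate token.

```python
import math
import random


def top_k_temperature_probs(logits, temperature=1.0, top_k=0):
    """Turn raw scores into sampling probabilities.

    top_k=0 means no restriction; otherwise only the top_k highest scores keep
    a non-zero probability. Lower temperature sharpens the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k > 0:
        cutoff = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [x if x >= cutoff else -math.inf for x in logits]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, top_k=0, rng=random):
    """Draw one token index according to the filtered, scaled probabilities."""
    probs = top_k_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
for temperature in (0.5, 1.0, 2.0):
    probs = top_k_temperature_probs(toy_logits, temperature=temperature)
    print(temperature, [round(p, 3) for p in probs])
print("top_k=1 always picks token", sample_token(toy_logits, top_k=1))
```

With top_k set to 1 the highest-scoring token is always chosen, which is why that setting produces deterministic completions.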
For example, to generate unconditional samples using a temperature of 0.8 and a top_k of 40, type the following command:
python generate_unconditional_samples.py --temperature 0.8 --top_k 40 --model_name lyric
For the sake of curiosity and completeness, I saved all 100 sample outputs from the command above and uploaded them to the /samples/generate_unconditional_samples.py--temperature0.8top_k40 directory inside the following code repository.
Generate interactive conditional sample
Interactive conditional sample refers to generating samples based on user input. In other words, you input some text, and GPT-2 will do its best to fill in the rest.
Type the following command to generate interactive conditional samples:
python interactive_conditional_samples.py --temperature 0.8 --top_k 40 --model_name lyric
After running the command above, type your desired prompt, for example:
Write me a lyrics in the style of Dream Theater band
And here is the screenshot of the excerpt of the output:
Generate interactive conditional sample on PyScripter IDE
Before running interactive_conditional_samples.py in PyScripter IDE, you need to change the default value of the model_name parameter of the interact_model function to lyric (see line 12 of the script).
And then, you can run the code normally, and type your prompt. For example:
Please continue the following lyrics: "Somewhere like a scene from a memory"
The Transformer architecture has revolutionized the field of natural language processing and machine learning as a whole. Its self-attention mechanism and parallel processing capabilities have led to remarkable breakthroughs in tasks like machine translation, sentiment analysis, and text generation (as we demonstrated in this article).
This article has demonstrated the potential of deep learning, specifically the Transformer architecture, in the domain of music and arts, namely in producing lyrics in a certain musical genre.
I hope this article was successful in giving you a basic understanding and workflow of how to retrain GPT-2 according to your project goals.
Check out the full repository here:
Click here to get started with PyScripter, a free, feature-rich, and lightweight Python IDE.
Download RAD Studio to create more powerful Python GUI Windows Apps in 5x less time.
Check out Python4Delphi, which makes it simple to create Python GUIs for Windows using Delphi.
Also, look into DelphiVCL, which makes it simple to create Windows GUIs with Python
References & further readings
 Foong, Ng Wai. (2019). Beginner’s Guide to Retrain GPT-2 (117M) to Generate Custom Text Content. AI2 Labs, Medium. medium.com/ai-innovation/beginners-guide-to-retrain-gpt-2-117m-to-generate-custom-text-content-8bb5363d8b7f
 Hakim, M. A. (2023). How to read Machine Learning/Deep Learning papers for a busy (or lazy) man. Paper-001: “Hinton et al., 2015”. hkaLabs AI blog. hkalabs.com/blog/how-to-read-deep-learning-papers-for-a-busy-or-lazy-man-paper-001
 Hakim, M. A. (2023). How to bypass the ChatGPT information cutoff? A busy (or lazy) man guide to read more recent ML/DL papers. Paper-001: “Rombach et al., 2022”. hkaLabs AI blog. hkalabs.com/blog/how-to-read-deep-learning-papers-using-bing-chat-ai-001
 nshepperd. (2021). Forked and fine-tuned version of GPT-2 for custom datasets: Code for the paper “Language Models are Unsupervised Multitask Learners”. GitHub repository. github.com/nshepperd/gpt-2
 Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
 Radford, A., Wu, J., Amodei, D., Amodei, D., Clark, J., Brundage, M., & Sutskever, I. (2019). Better language models and their implications. OpenAI blog, 1(2).
 Uszkoreit, J. (2017). Transformer: A Novel Neural Network Architecture for Language Understanding. Google Research Blog.
 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.