May 29

So, what is ChatGPT and how does it work?


This post provides an intuitive, high-level overview of what a language transformer such as GPT is and the principles on which it operates. There is no space here to barrel down the rabbit hole of what neural networks are, or what machine learning is. But these could be the topics of future posts, if you are interested.


The letters GPT stand for Generative Pretrained Transformer, a type of neural network with a structure and type of training that lend themselves to language processing tasks. The company OpenAI has developed various versions (GPT-1, GPT-2, GPT-3, and now GPT-4) with increasing capacity to understand and generate human-like conversations and text manipulations.

ChatGPT is a conversational variant of the GPT model, designed specifically for chat-like applications. It is fine-tuned on a narrower dataset, often including conversation-like exchanges, to make it more capable of carrying on interactive, coherent, and context-aware conversations compared to the general GPT model.

Both are statistical software that have nothing to do with this 😉


If you are interested in understanding what a neural network is, let me know in the comments, and I’ll prepare a post for this. For now, let’s just say it’s a piece of software that takes an input (a cardiogram, an image, a text …) and produces an output (a classification such as “yes, this is a cat”, another image, a diagnosis, another text …) using a set of mathematical functions that each operate in a way that mimics the neurons in human brains.

A brief history of Natural Language Processing (NLP)

This non-technical history aims at providing an intuitive understanding of what processes NLP relied on over the years.

  • 1950s and 1960s: The field of NLP emerged with early research in machine translation, where scientists aimed to automatically translate languages.
  • 1970s: The focus shifted to rule-based systems, where linguists and computer scientists created grammars and formal language rules to process natural language. Expert systems also emerged at that period.
  • 1980s: As computing power increased, statistical approaches gained popularity, and researchers started applying machine learning techniques to NLP, such as Hidden Markov Models and n-grams.
  • 1990s: Publication on the internet made vast amounts of text data available to all, so data-driven approaches, such as statistical models and other machine learning algorithms came to the fore, helped by the development of corpora and linguistic resources for training and evaluation.
  • 2000s: With the growth of the web and social media (everyone became a media, yielding more content), new goals emerged in NLP, such as sentiment analysis, named entity recognition, and information extraction, driving research towards new techniques such as Support Vector Machines (SVMs), Conditional Random Fields (CRFs), and neural networks.
  • 2010s: Deep learning revolutionized NLP, with the introduction of dedicated neural network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Language representation improved.
  • 2020s: Large-scale pre-trained language models, containing billions of neurons organized in dedicated topologies and trained on enormous amounts of data, such as GPT (Generative Pre-trained Transformer), hogged all the limelight. These can generate text, translate, summarize, converse …

As you can see, processing power and the greater availability of training data gradually enabled the use of larger and more power-hungry models. And, as this happened, methods became less and less human-driven (expert rules, corpora …) and more and more empirical (data-driven). Early models knew very little, but with more certainty; today’s models “guess” much more, but with such huge training data and network size that the guessing is eerily good.

But the tendency towards totally generic models (early neural networks) shifted towards topologies that favour specific uses. So the human input of early programs, which appeared in the rules and data (corpora), is now present in the move to more specialized topologies.

A statue of a transformer robot to illustrate an article on generative pretrained transformers
Those are not the transformers you are looking for

Topology wars: CNN vs RNN vs Transformer

All modern NLP software uses neural networks.

Imagine those as a huge grid of team members, each performing an identical, simple task. It’s the sheer number of team members that achieves something meaningful. In an early neural network, those team members were all aligned in rows. Each performed a calculation and each handed the result on to all the other team members in the row immediately in front of theirs.

Picture the vast grids of soldiers in the Clone Wars or The Lord of the Rings, and you get a fairly accurate idea. Each company could be a small neural network.

Clone troopers in a grid
(c) JRT2010

The team members / clone troopers at the back were handed the input data (a pixel colour each, for instance), did their stuff and handed their results to all the guys in front, who repeated the same type of work on the new data, and so on. When every row had performed its tasks, the neurons at the front handed out an output (say, a new pixel colour). If you have 196,608 (256 × 256 × 3) neurons in each row, you can process a 256 × 256 pixel colour image in this way. And, generally speaking, the more rows you have, the more advanced the work you can perform on this image. Morphing a dog into a cat is theoretically possible through this process.

Important: No neuron/clone trooper has any idea what they are doing. They have a simple mathematical function to execute. Nothing in the neural network has any instructions to change a dog into a cat or anything else.

The network is trained by making each neuron propagate its work from one row to the next and evaluating the quality of the work at the end of the network. That evaluation involves calculating the “loss”, i.e. the difference between the actual result and the intended result (the training data: the image of a cat), then sending information back to each neuron on how to adjust the maths function it performs, in order to correct the errors (minimise the loss), via a reverse process called backpropagation.

Each clone trooper (neuron) performs the exact same type of simple function; the only difference is the parameters of the function for each neuron. For example: one might calculate 2x + 3, its neighbour 2.7x + 2, the next 2.3x + 2.6, and so on. After each propagation/backpropagation cycle, starting with an input (a picture of a dog, say) and ending with a comparison of the transformed image to the photo of a cat, each neuron sees the parameters of its own simple function adjusted. For instance, 2.3x + 2.6 might be updated to 2.33x + 2.61. After a large number of back-and-forth cycles, each function is as optimized as it will get, and the whole network performs the best image transformation it can, given its topology and training data.
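To make this concrete, here is a toy sketch (in Python, and far simpler than any real network): a single “neuron” computing a·x + b, whose two parameters are nudged after each cycle to reduce the loss, much like the 2.3x + 2.6 → 2.33x + 2.61 update described above. The function and numbers are purely illustrative.

```python
def train_neuron(inputs, targets, lr=0.01, cycles=1000):
    a, b = 0.0, 0.0  # arbitrary starting parameters
    for _ in range(cycles):
        for x, y in zip(inputs, targets):
            pred = a * x + b      # forward pass: the neuron's simple function
            error = pred - y      # loss signal: actual result vs intended result
            a -= lr * error * x   # "backpropagation": adjust each parameter...
            b -= lr * error       # ...in the direction that shrinks the loss
    return a, b

# Learn the hidden mapping y = 2x + 3 from a few examples
a, b = train_neuron([0, 1, 2, 3, 4], [3, 5, 7, 9, 11])
print(round(a, 2), round(b, 2))  # ends up close to 2.0 and 3.0
```

A real network does this with billions of such parameters at once, but the principle is the same: propagate, measure the loss, nudge, repeat.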

A 3D wireframe of a human head that represents what most humans erroneously imagine an artificial intelligence system to be.

If this seems weird, consider the wireframe above. Now, imagine that each neuron in the network was responsible for drawing a segment (a simple mathematical function). Initially, each segment would have a random length, x & y position and orientation (the 4 parameters of each neuron, in this example). If each comparison with an image of a human being produced a small correction to those parameters, in the end the segments would line up to create something like this picture.

Early AI systems were given rules to follow by experts in the field. This worked for knowledge that can be explicitly described via simple rules. But how do you formally describe what a cat looks like, or what a sarcastic sentence means? It’s impossible.

New models became entirely empirical, letting the training teach each neuron what function to calculate so that the whole would perform the global task expected of the network.

If that sounds weird, think of your favourite symphonic orchestra. Every musician has a “simple” task to perform, which never changes: read a score (the input) and move your fingers, or blow, or … (the output). None of the musicians performs the symphony. Each is restricted to their own score and instrument. But the whole produces the wonderful musical experience thousands have gathered to listen to. In this example, the network only has one row of neurons/musicians. Each reads the score and creates its output. During rehearsals (the training phase of the orchestra) the conductor evaluates the result based on his/her tastes and provides feedback to each musician (or group of musicians) to alter their function slightly, so that the whole performs better. After a number of rehearsals, the orchestra plays the piece as close to the conductor’s vision as possible.


Soon, though, research highlighted that this totally empirical approach has its limits. But instead of inserting rules into the network, specific structures were formed between the neurons, in an attempt to build specific functionality into the network. Instead of having each neuron propagate its work forward to the buddies in front, some passed data a few rows back, for instance. This made it possible to inject some feedback into the network, or to “remember” some data … Imagine the clone troopers in the middle rows waving at those in the back row to send them info, rather than the troopers at the back relying only on the input data.

Two types of networks using new topologies became useful for language processing: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

Convolutional Neural Networks work hierarchically, allowing some areas of the network to work on specific portions of the input data (for example, identifying edges in an image). Unlike generic neural networks (in which data flows through all neurons, and all neurons in one layer are connected to all of those immediately in front and behind), CNNs have special types of layers. In convolutional layers, neurons apply mathematical filters to identify patterns; pooling layers reduce the dimensionality of the data to focus only on the most important features; and fully connected layers combine the work of the previous two to achieve the end result. So, CNNs can be used to analyse the contents of a sentence, for example.
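A minimal sketch of those two special layer types, on a tiny 1-D signal (real CNNs work on 2-D images with learned filters; this hand-picked edge filter is just for illustration):

```python
import numpy as np

signal = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)

# Convolutional layer: slide a small filter that responds to "edges" (changes).
edge_filter = np.array([-1.0, 1.0])
conv = np.array([signal[i:i+2] @ edge_filter for i in range(len(signal) - 1)])
# conv is +1 where the signal rises, -1 where it falls, 0 elsewhere

# Pooling layer: keep only the strongest response in each window of 2,
# reducing dimensionality while preserving the most important features.
pooled = np.array([np.abs(conv[i:i+2]).max() for i in range(0, len(conv), 2)])

print(conv)    # [ 0.  1.  0.  0. -1.  0.]
print(pooled)  # [1. 0. 1.]
```

The pooled output is half the size of the convolution output, yet still records that the signal has two edges: exactly the “focus on what matters” behaviour described above.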

Recurrent Neural Networks are designed to deal with sequential data, and create loops that feed back data output by some neurons to others further back. This allows the network to capture dependencies and patterns that exist across time or sequence.
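The recurrence can be sketched in a few lines: a hidden state carries a summary of everything seen so far and loops back into the next step’s computation. The weights here are fixed and arbitrary; a real RNN would learn them during training.

```python
import numpy as np

def rnn_step(x, h, w_in=0.5, w_rec=0.9):
    # New hidden state mixes the current input with the "memory" of past inputs
    return np.tanh(w_in * x + w_rec * h)

h = 0.0
history = []
for x in [1.0, 0.0, 0.0, 0.0]:  # a "signal" at the first time step only
    h = rnn_step(x, h)
    history.append(h)

print([round(v, 3) for v in history])  # the echo of the first input fades slowly
```

Even though the input is zero after the first step, the hidden state stays positive for a while: that is the loop “remembering” earlier data, the dependency across time mentioned above.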

Not that type of self-attention

Unlike CNNs and RNNs, Transformers do not rely on sequential processing or convolutional filters. Instead, they use a mechanism called self-attention to capture relationships between different elements in the input sequence, through a specific layer called the “self-attention layer”, in which multiple “attention heads” are each responsible for learning different dependencies and patterns within the sequence being analyzed.

For example, consider the sentence “The food at the restaurant was terrible, which is why Jane didn’t like it.” For a sequential model, by the time it gets to “Jane didn’t like it”, it might have forgotten about “The food at the restaurant was terrible”. But a Transformer model can pay “attention” to all parts of the sentence equally, making it better at understanding the overall context.

Whereas previous models were more or less able to count words in a sentence, self-attention calculates the importance of each element in the sequence by considering its relationship with all other elements. This allows the model to assign higher weights to relevant elements and lower weights to irrelevant or unrelated ones, hence finding or creating “meaning” in sentences. Transformers are able to consider the context and relationships between all elements simultaneously, resulting in a more comprehensive understanding of the input.

What about “Generative”?

So, now we know that GPT, Generative Pretrained Transformer, is a neural network using a transformer (self-attention) topology, that has been pre-trained on textual data. The pre-training is what gives the various neurons in the network the parameters of the simple calculations they each make (like the orchestra learning to play the symphony), and the topology is what makes the network well suited to the processing of language and other textual data (like the orchestra is structured differently for chamber music, choral, symphonic …).

Generative, in this context, simply means the network has the ability to generate outputs similar to the inputs, but not present in the input data, or in the training data. ChatGPT isn’t a database that trawls websites to regurgitate some content to the user. Instead, it generates entirely new content based on the statistical analysis of those websites, storing relationships between terms in the parameters of its neuron functions.

Some neural networks have been used for classification. Give them protein sequences as input, and they will tell you what family these belong to. CNNs tend to perform well for this, because of their ability to process long sequences. The output is a class number.

Other networks can be used for clustering. Give them the descriptions of 200 000 000 human beings as input, and they will find similarities and sort them into groups (clusters). The output is a list of groups.

Generative networks output stuff that is similar to the input. Feed them sentences and they output sentences. Feed them images and they output images. Or feed them sentences and they can output images, as Midjourney does. Or feed them images and they can output text describing the contents of the image (as GPT-4 does). The important point is that they generate new content that is not directly present in the training data.

No feelings, no intentions, just a maths model

In an example, above, we saw that a neural network could learn to orient black segments on a white background to create an image that looks like a human face (wireframe). GPT networks go through a similar – but far more extensive – process to learn patterns (small and large) in text. It will know that “top of the” is often followed by “morning”. Or that “Proin fermentum ipsum eu rutrum viverra. Curabitur pretium enim ex, ut sagittis orci pellentesque ut. Suspendisse ac justo odio. Nam luctus nulla nec risus varius, semper tempus quam volutpat. Sed vestibulum elementum odio, quis dignissim lacus lobortis ut. Nam nec gravida augue. Phasellus ut iaculis orci. Pellentesque a dui tincidunt velit rutrum hendrerit ut sit amet orci. Integer tristique condimentum erat, sed lacinia diam imperdiet quis. Mauris lacus neque, ultricies in scelerisque ut, gravida a nisl. Nam a ornare felis. Nullam convallis, ipsum vitae laoreet volutpat, turpis mi efficitur risus, accumsan congue augue tellus consectetur ante. Integer quis neque molestie, placerat mi sit amet, sollicitudin erat.” is often found close to “Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla congue scelerisque dui, sit amet blandit magna varius ut. Cras ut elit eros. Donec urna dui, cursus in faucibus id, molestie vel ipsum. Phasellus facilisis diam odio, in tincidunt elit mollis a. Maecenas tempus mi non risus mollis, ac faucibus lectus iaculis. Morbi condimentum neque dui, id tempor metus hendrerit a. Fusce non purus sed purus aliquet pharetra eu vel ligula. Aliquam convallis tortor sit amet tristique malesuada. Phasellus hendrerit in mi ac tincidunt.”

This for almost all possible combinations of words and paragraphs, thanks to its ability to pay “attention” to various segments in a text sequence.
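A toy version of that “top of the” → “morning” statistic: count which word follows each word in a tiny made-up corpus, then predict the most frequent follower. GPT does something vastly richer (attention over whole passages, not single-word counts), but the statistical spirit is similar.

```python
from collections import Counter, defaultdict

corpus = "top of the morning to you and the morning was top of the class".split()

# Count, for each word, how often each other word follows it
followers = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    followers[current][nxt] += 1

def predict_next(word):
    # Return the most frequent follower seen in the corpus
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # "morning" follows "the" most often in this corpus
```

Scale the corpus up to a large slice of the internet, and the “predictions” replace the lookup table with billions of learned parameters, and you are on the road towards GPT.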


In summary

So that’s it. GPT is a piece of software that is (1) based on a neural network containing self-attention layers that allow it to analyse the context and relative importance of various elements in the input sequence, (2) pretrained on (massive amounts of) textual data and (3) aimed at generating new sentences that are not present as such in the training data. And ChatGPT has been optimized for conversation purposes rather than other types of textual manipulation.

In other words, the huge difference from a vast database containing the whole internet is that ChatGPT does not spit out text memorized from a website, but creates sentences from scratch, adapting to the context and meaning its structure allows it to infer from the sentences fed to it. It so happens that we often feed it questions, so it feels like ChatGPT is some sort of conversing sentient being. But it’s just a mathematical model, albeit an advanced one, and one that denies having any form of understanding of what it reads or generates. Here’s what ChatGPT has to say about ChatGPT:

Remember, while GPT is powerful and can generate impressively human-like text, it doesn’t truly understand the text it’s generating or processing. It’s essentially learning patterns in the data and using those patterns to make predictions. It does not have consciousness, beliefs, desires, or intentions.

Besides that lack of intention or consciousness, the other thing to remember about the above text is that ChatGPT constructs sentences from scratch. I once searched for a bibliography and it found plenty, all of it fake (read about this here). You cannot use ChatGPT as a trusted source of information! As of mid-2023, do fact-check everything you read.

Final note. While text-manipulation and prediction features are built into the network of transformers such as GPT, and their training is based on massive amounts of textual data, not all of the abilities of those networks were intended. GPT-4’s ability to write programming code that runs smoothly is one example. Nothing in the training data, or in the network structure, was specifically intended to give GPT-4 the ability to generate valid code. Abilities such as this are called emergent properties. They are what scares observers of AI the most, and you can read more about them here.

