Mastering Transformers

1. Introduction 

  • Hook: Large Language Models have transformed artificial intelligence, bringing breakthroughs to natural language understanding and generation. At the core of this innovation is the transformer architecture, which replaced the RNNs and CNNs that dominated earlier NLP systems.
  • Purpose: This blog will introduce the basics of transformers and walk you through building a simple transformer model from scratch.
  • Audience: This guide is aimed at beginner developers and AI enthusiasts who want an accessible, hands-on introduction to transformers.

2. What Are Transformers? 

  • Brief history: Introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al., the transformer architecture redefined NLP. Before transformers, RNNs and LSTMs processed data sequentially, which made long-range dependencies hard to capture and training computationally inefficient.
  • Core idea: At the heart of this innovation is a mechanism called self-attention, which lets the model weigh every word in a sequence against every other word, regardless of position. This removes the need for recurrence and enables full parallelization, which greatly improves computational efficiency.
  • Importance: Transformers are the heart of most modern LLMs, including GPT and BERT, and are the reason these models now produce state-of-the-art results for translation, summarization, question answering, and many other tasks.

3. Understanding the Transformer Architecture 

Transformers have many core parts that work together to handle and comprehend sequential data:

  • Key Components:
    • Encoder-Decoder Structure: The encoder transforms the input sequence into a set of encoded representations, and the decoder uses those representations to produce the output sequence. This structure makes transformers well suited to tasks like translation, where the input and output sequences differ.
    • Multi-Head Attention: Multi-head attention enables the model to attend to different parts of the sequence simultaneously. Computing attention scores across multiple heads captures finer relationships between words, enhancing the model’s ability to understand context.
    • Feedforward Neural Networks: A feedforward network transforms each position’s intermediate output, adding non-linearity that allows the model to learn complex patterns.
    • Positional Encoding: Because transformers have no inherent notion of sequence order, positional encodings are added to the input embeddings to inject information about each token’s position, ensuring that order is not lost during processing.

  • Mathematical Insights (Optional):
The central computation is scaled dot-product attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. The softmax turns the scores into weights, so the tokens that matter most for a given query receive the highest weights.

4. Coding a Transformer from Scratch 

  • Setup:

To get started, you’ll need the following:

  • Required Libraries: Python 3, PyTorch, and NumPy (this walkthrough uses PyTorch).
  • Environment Setup: Install the necessary libraries using

pip install torch numpy

  • Step-by-Step Implementation:
    • Create positional encodings.

import torch
import math

def positional_encoding(max_len, d_model):
    # Pre-compute a (max_len, d_model) table of sine/cosine position signals
    pe = torch.zeros(max_len, d_model)
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))      # even dimensions
            pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))  # odd dimensions
    return pe.unsqueeze(0)  # add a batch dimension: (1, max_len, d_model)
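As a quick sanity check (this snippet is not part of the walkthrough itself, just an illustration), you can inspect the shape of the table this function returns:

pe = positional_encoding(max_len=100, d_model=512)
print(pe.shape)  # torch.Size([1, 100, 512]); the leading 1 lets it broadcast over a batch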

  • Define the attention mechanism.

def scaled_dot_product_attention(query, key, value):
    # Compare every query with every key, scaled by the square root of the key dimension
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    # Softmax turns the scores into attention weights that sum to 1
    weights = torch.nn.functional.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, value)
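A minimal usage sketch (the shapes below are arbitrary and purely illustrative): the function returns one output vector per query position, and because of the softmax the attention weights over the keys sum to 1.

q = torch.rand(2, 5, 64)  # (batch, seq_len, dim), chosen arbitrarily for this demo
k = torch.rand(2, 5, 64)
v = torch.rand(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64])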

  • Build the encoder and decoder blocks.

import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # batch_first=True so inputs are (batch, seq_len, d_model), matching the rest of this post
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src):
        # Self-attention followed by a residual connection and layer normalization
        attn_output, _ = self.attention(src, src, src)
        src = self.norm1(src + attn_output)
        # Position-wise feedforward network, again with a residual connection
        ff_output = self.feedforward(src)
        return self.norm2(src + ff_output)
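The rest of this walkthrough only uses the encoder stack. For completeness, here is a minimal sketch of what a matching decoder block could look like; the class name TransformerDecoderLayer and the tgt/memory argument names are choices made for this sketch, not part of the original walkthrough. The decoder adds masked self-attention over the target sequence and cross-attention over the encoder output.

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Masked self-attention over the target sequence
        self.self_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Cross-attention over the encoder's output ("memory")
        self.cross_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory, tgt_mask=None):
        attn_output, _ = self.self_attention(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + attn_output)
        cross_output, _ = self.cross_attention(tgt, memory, memory)
        tgt = self.norm2(tgt + cross_output)
        ff_output = self.feedforward(tgt)
        return self.norm3(tgt + ff_output)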

  • Combine components to form the transformer model.

class Transformer(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, max_len):
        super().__init__()
        # Stack of identical encoder layers
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads) for _ in range(num_layers)
        ])
        # Fixed (non-learned) positional encodings
        self.positional_encoding = positional_encoding(max_len, d_model)

    def forward(self, src):
        # Add position information, then run the input through every encoder layer
        src = src + self.positional_encoding[:, :src.size(1), :]
        for layer in self.encoder_layers:
            src = layer(src)
        return src

5. Testing the Model 

  • To test the transformer, you can use a toy dataset, such as a simple sequence-to-sequence translation task. Load the data, tokenize it, and pass it through the model:

model = Transformer(num_layers=6, d_model=512, num_heads=8, max_len=100)
input_sequence = torch.rand(10, 100, 512)  # example input: (batch=10, seq_len=100, d_model=512)
output = model(input_sequence)
print(output.shape)  # torch.Size([10, 100, 512]); matches the input dimensions

  • Expected outputs should align with the input sequence length and embedding dimensions, demonstrating the model’s ability to handle sequences. However, this basic implementation may lack the refinement of pretrained models.

6. Next Steps and Applications 

  • This simple transformer can be applied to text classification, question answering, or even machine translation; fine-tune it on larger datasets for better performance (a minimal classification sketch follows this list).
  • Pretrained transformer-based models such as BERT, GPT, or T5 deliver state-of-the-art performance on NLP tasks.
  • For more details, refer to Hugging Face’s Transformers library or the PyTorch documentation for advanced implementations.
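As a rough illustration of the text-classification route mentioned above, the sketch below puts a small classification head on top of the encoder and trains it with a standard PyTorch loop. Everything here, including the ToyClassifier name, the random stand-in data, and the hyperparameters, is hypothetical and only meant to show the overall shape of the code; a real task would use tokenized, embedded text and a proper dataset.

import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Reuse the encoder-only Transformer built earlier in this post
        self.encoder = Transformer(num_layers=2, d_model=64, num_heads=4, max_len=20)
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        encoded = self.encoder(x)     # (batch, seq_len, d_model)
        pooled = encoded.mean(dim=1)  # average over the sequence
        return self.head(pooled)

model = ToyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in data: 8 "sentences" of 20 tokens, already embedded to d_model=64
inputs = torch.rand(8, 20, 64)
labels = torch.randint(0, 2, (8,))

for step in range(10):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()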

7. Conclusion 

Transformers have become the pillars of modern NLP, driving breakthrough models and applications. This blog served as a primer on how the architecture works, along with step-by-step instructions for building one from scratch.

Are you intrigued by the possibilities of AI? Let’s chat! We’d love to answer your questions and show you how AI can transform your industry. Contact Us
