Hugging Face Transformers Library

Hugging Face is a popular machine learning platform featuring open-source models, datasets, spaces, libraries, and more. Today, I'd like to share what I've learned while exploring one of their libraries: the Transformers library. This library provides tools for training and running inference with pretrained ML models. It's called the Transformers library because it works with models that use the transformer architecture. Most Large Language Models (LLMs) are built on this architecture for a good reason: it performs exceptionally well!

We’ll discuss two parts of the library: the high-level Pipelines API and the lower-level Tokenizer and Model APIs.

Part 1: Pipelines

Let’s start with Pipelines. The pipeline function is a great way to get started with model inference. Instead of dealing with tokenizers, prompt templates, and other low-level details of generative models, you can simply use a call like the following:

from transformers import pipeline

my_pipeline = pipeline("task")
result = my_pipeline(input)

It’s that simple! What can you use for the “task” argument? Here are some examples:

Sentiment Analysis

classifier = pipeline("sentiment-analysis", device="cuda")
result = classifier("Yes, the Santa Clara Vanguard (SCV), a drum and bugle corps, is returning to the field for the 2024 and 2025 seasons!")
print(result)

Output:

[{'label': 'POSITIVE', 'score': 0.9978967905044556}]

Model: distilbert/distilbert-base-uncased-finetuned-sst-2-english

Named Entity Recognition

ner = pipeline("ner", grouped_entities=True, device="cuda")
result = ner("Do you like ChatGPT or Claude?")
print(result)

Output:

[{'entity_group': 'ORG', 'score': np.float32(0.73873335), 'word': 'ChatGPT', 'start': 12, 'end': 19}, {'entity_group': 'ORG', 'score': np.float32(0.5686558), 'word': 'Claude', 'start': 23, 'end': 29}]

Model: dbmdz/bert-large-cased-finetuned-conll03-english

Question Answering with Context

question_answerer = pipeline("question-answering", device="cuda")
result = question_answerer(question="What model is currently at the top of the LMArena Leaderboard?", context="Google's Gemini 2.5 is currently leading the LMArena Leaderboard.")
print(result)

Output:

{'score': 0.955433189868927, 'start': 9, 'end': 19, 'answer': 'Gemini 2.5'}

Model: distilbert/distilbert-base-cased-distilled-squad

Text Summarization

summarizer = pipeline("summarization", device="cuda")
text = """Pipelines
The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use.

There are two categories of pipeline abstractions to be aware about:

The pipeline() which is the most powerful object encapsulating all other pipelines.
Task-specific pipelines are available for audio, computer vision, natural language processing, and multimodal tasks.
"""
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])

Output:

Pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks . The pipeline() is the most powerful object encapsulating all other pipelines .

Model: sshleifer/distilbart-cnn-12-6

Translation

translator = pipeline("translation_en_to_hi", model="Helsinki-NLP/opus-mt-en-hi", device="cuda")
result = translator("Pipelines API is very cool!")
print(result[0]['translation_text'])

Output:

पाइपलाइन एपीआई बहुत अच्छा है!
"Pipeline API bahut accha hai!"

Note that in this case, I specified which model to use.
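
The model argument works for any task. For example, here is a minimal sketch that pins the sentiment-analysis pipeline to the model listed earlier, instead of relying on the default:

classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")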

Classification

classifier = pipeline("zero-shot-classification", device="cuda")
result = classifier("Para-diddle-diddle shot", candidate_labels=["Percussion", "AI", "Astronomy"])
print(result)

Output:

{'sequence': 'Para-diddle-diddle shot', 'labels': ['Percussion', 'AI', 'Astronomy'], 'scores': [0.7227801084518433, 0.15953893959522247, 0.11768091470003128]}

Model: facebook/bart-large-mnli

I was surprised it got this right, as these are relatively small models.

Other available tasks include "text-generation", "automatic-speech-recognition", and "image-classification".
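
For instance, here is a minimal text-generation sketch; the model choice (gpt2) is mine, purely for illustration, and any causal language model on the Hub works:

generator = pipeline("text-generation", model="gpt2", device="cuda")
result = generator("The Transformers library is", max_new_tokens=20)
print(result[0]['generated_text'])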

Image Generation

This next example uses DiffusionPipeline, which comes from Hugging Face's companion Diffusers library rather than the pipeline function itself. It works similarly but is designed specifically for diffusion models, a paradigm used for generative visual computing. Here is an example:

import torch
from diffusers import DiffusionPipeline

image_gen = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16"
).to("cuda")

text = "Bats under the Bridge in Austin, Texas."
image = image_gen(prompt=text).images[0]  # the pipeline returns a list of PIL images
image
[Generated image of bats under a bridge]

If you want to start generating content with AI, the pipeline function is an excellent option!

Part 2: Tokenizers and Models

If you want to peel back the abstraction and learn more about what’s happening under the hood, the Transformers library offers classes like AutoTokenizer and AutoModelForCausalLM. We will discuss these in this section.

Let’s start with tokenizers. What is a tokenizer? For models to understand natural language like English, we must convert our text input into a numerical format. In other words, we need to turn English into numbers.

text = "Open Source Models are Magical!"
tokens = tokenizer.encode(text)
tokens

Output:

[128000, 5109, 8922, 27972, 73810]

To convert the tokens back into natural language, use tokenizer.decode():
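
tokenizer.decode(tokens)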

Output:

<|begin_of_text|>Open Source Models are Magical!

Notice the <|begin_of_text|> token? We didn’t include that in our initial encode() input. This is a special token, which tokenizers use to help the model understand concepts beyond natural language, such as where a prompt begins. As an AI model is trained, it learns to associate these tokens with what they represent. For example, the token ID 128000 (which corresponds to <|begin_of_text|>) means nothing to an untrained model. However, a model trained on many examples where 128000 indicates the start of a text will learn to associate this token with the beginning of a prompt.
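
If you're curious, you can confirm that the tokenizer adds this token on its own, since encode() accepts an add_special_tokens flag:

tokenizer.encode(text, add_special_tokens=False)  # same IDs as before, without the leading 128000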

It's important to differentiate tokenization from embeddings. The tokenization process assigns a unique numeric ID to each token. The model then uses this ID to look up a corresponding vector of floating-point numbers in its embedding layer. This process assigns a vector (not just a single number) to each token. We will revisit the magical concept of embeddings later.

So far, we’ve considered a general tokenizer. However, the models we typically interact with, like OpenAI’s ChatGPT or Anthropic’s Claude, are fine-tuned for chat. These models expect a specific input format, often a sequence of system, user, and assistant prompts. When working with open-source chat models, we can use the tokenizer.apply_chat_template() function to format our input correctly. Let’s look at some code:

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', trust_remote_code=True)

Note the word 'Instruct' at the end of the model’s name. This indicates that the model has been fine-tuned for chat purposes, a common naming convention on Hugging Face.

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

Output:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you explain what an LLM is<|eot_id|><|start_header_id|>assistant<|end_header_id|>

You can see interesting special tokens here, such as <|begin_of_text|> and <|start_header_id|>, as well as a specific format for the system, user, and assistant roles. Not all models work this way; in fact, each model generally has its own tokenizer for this very reason. Models are trained on a corpus of text with a specific format. If a model's input doesn't match the format it was trained on (e.g., if we use a different tokenizer), it will likely produce nonsensical output.

Now that we understand what tokenizers do, let's generate some text! We'll use one of DeepSeek's models: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.

DEEPSEEK = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Can you explain what an LLM is? Don't think too much, just give me a concise answer that explains what an LLM is"}
]

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(DEEPSEEK)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")

The line tokenizer.pad_token = tokenizer.eos_token tells the tokenizer which token to use as padding when a batch contains sequences of different lengths; many tokenizers don't define a pad token by default, so a common convention is to reuse the end-of-sequence token.
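
Padding only matters when you tokenize a batch of inputs with different lengths. A quick sketch of what that looks like:

batch = tokenizer(["Hi!", "A much longer sentence about LLMs."], padding=True, return_tensors="pt")
batch["attention_mask"]  # zeros mark the padded positions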

Once our input is prepared, let's download the model.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
model = AutoModelForCausalLM.from_pretrained(DEEPSEEK, device_map="auto", quantization_config=quant_config)

There's a bit to unpack here. I use a quant_config to quantize the model. Quantization reduces the precision of the model weights so they fit in a much smaller memory footprint. Full-precision weights are 32-bit floating-point numbers, and here we load them as 4-bit numbers, an eightfold reduction (a fourfold reduction if, like many recent models, the weights ship in 16-bit).

We map into the 4-bit space using a technique called 'Normal Float 4' (NF4), specified by the bnb_4bit_quant_type="nf4" parameter. This mapping uses scaling factors (constants) to squeeze numbers from the higher-precision space into the 4-bit space, and setting bnb_4bit_use_double_quant=True quantizes those scaling factors as well, saving a little more memory.

While the weights are stored in 4-bit precision, computations are performed in the bfloat16 data type, specified by bnb_4bit_compute_dtype=torch.bfloat16: the 4-bit weights are temporarily dequantized to bfloat16 for each calculation. bfloat16, as the name suggests, stores numbers in 16 bits, but it splits those bits between the mantissa and exponent differently than the standard 16-bit float (FP16), trading precision for greater numeric stability and a range similar to 32-bit floats (FP32). This 'hack' is quite common in ML systems.

AutoModelForCausalLM is a class from the Transformers library for loading autoregressive models. Autoregressive models, which predict future tokens by analyzing past ones, are the most common type of language model today.
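
To make "predict future tokens by analyzing past ones" concrete, here is a toy greedy decoding loop. This is only a sketch; model.generate(), used later, does this (and much more) for us:

generated = inputs  # the chat-formatted token IDs from earlier
for _ in range(20):
    with torch.no_grad():
        logits = model(generated).logits  # a score for every vocabulary token at every position
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedily pick the most likely next token
    generated = torch.cat([generated, next_token], dim=-1)      # append it and predict again
print(tokenizer.decode(generated[0]))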

If you're interested, this model took up 5,962.8 MB of memory!
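
You can check the footprint of a loaded model with model.get_memory_footprint(). The rough math: 8 billion parameters at about 4 bits each is roughly 4 GB of weights, plus the quantization constants and the embedding and output layers, which are left unquantized (note below that they appear as Embedding and Linear rather than Linear4bit):

print(f"{model.get_memory_footprint() / 1e6:,.1f} MB")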

Before running the model, let’s look at its architecture. Executing print(model) shows the following structure:

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 4096)
    (layers): ModuleList(
      (0-35): 36 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=12288, bias=False)
          (down_proj): Linear4bit(in_features=12288, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((4096,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((4096,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=151936, bias=False)
)

We see the model has two main parts: the core model (Qwen3Model) and the language modeling head (lm_head). The Qwen3Model is the main body of the transformer. (Remember, the Transformers library serves models with the transformer architecture, which is the most common architecture for LLMs like this DeepSeek model.) For each token position, the core model outputs a 4096-dimensional vector, and the language modeling head projects it into a 151,936-dimensional vector. This final vector contains a likelihood score for each of the 151,936 possible tokens in the model's vocabulary.
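
You can see this projection in the output shapes. A quick sketch, using output_hidden_states=True to ask the model to also return its intermediate activations:

with torch.no_grad():
    out = model(inputs, output_hidden_states=True)
print(out.hidden_states[-1].shape)  # (1, sequence_length, 4096): the core model's output
print(out.logits.shape)             # (1, sequence_length, 151936): after the lm_head projection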

Let’s look more closely at the model. The first component is the embed_tokens layer we discussed earlier. Its shape, (151936, 4096), means the model's vocabulary contains 151,936 tokens, and each token is represented by a 4096-dimensional vector.
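
We can look up one of these vectors directly; get_input_embeddings() returns that (151936, 4096) table as a standard PyTorch embedding module:

embedding_layer = model.get_input_embeddings()  # Embedding(151936, 4096)
vector = embedding_layer(inputs[:, 0])          # embed the first token ID of our prompt
print(vector.shape)                             # torch.Size([1, 4096])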

How are these embeddings associated with each token? This is an interesting question, as we are essentially asking how to associate semantic meaning with vectors. These embedding weights are learned during the training process (via backpropagation, gradient descent, etc.). This is one of the most fascinating aspects of how NLP systems work. These embeddings capture complex relationships between words and tokens, allowing machines to interpret our natural language and generate text that makes sense to us. (More on this topic in a future blog post!) After the embedding layer, we see 36 identical decoder layers.

These layers are at the heart of the model. We see Query, Key, and Value projection layers (q_proj, k_proj, v_proj), an o_proj (output projection), and normalization layers (q_norm, k_norm). The Query, Key, and Value projections are used in the self-attention operation. Self-attention is the mechanism that allows each token to weigh the importance of other tokens in the sequence, encoding contextual meaning into the vectors. (This is another fascinating computation, which I'll cover in a future blog post!) The decoder layers also include an MLP (Multi-Layer Perceptron) block, a fully connected network that introduces critical non-linearity, along with input_layernorm and post_attention_layernorm layers that normalize the activations entering the attention and MLP sub-blocks, respectively. That wraps up the decoder layers!
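
As a small preview of that future post, the heart of self-attention is only a few lines. This is a simplified single-head sketch that ignores masking, multiple heads, and rotary position embeddings:

import math
import torch

def self_attention(q, k, v):
    # Compare each token's query against every token's key...
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # ...turn the scores into attention weights...
    weights = torch.softmax(scores, dim=-1)
    # ...and blend the value vectors accordingly.
    return weights @ v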

Finally, after the decoder layers, we find a final normalization layer (RMSNorm) and a Qwen3RotaryEmbedding layer, which is used to encode positional information about the tokens.

As fun as it is to peek at the underlying model architecture, it's even more fun to run inference. Let's generate some text.

from transformers import TextStreamer

streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, max_new_tokens=1000, streamer=streamer)
print(tokenizer.decode(outputs[0]))

Here was the output:

<|begin of sentence|>You are a helpful assistant<|User|>Can you explain what an LLM is? Don't think too much, just give me a concise answer that explains what an LLM is<|Assistant|><think>
Okay, the user is asking for a concise explanation of what an LLM is. They specifically said "Don't think too much," so they probably want a straightforward answer without jargon or fluff.

Hmm, judging by the phrasing, they might be someone who's heard the term but isn't familiar with it, or perhaps they're just looking for a quick definition. The tone is neutral, so no urgency or frustration detected.

I should keep it simple. LLM stands for Large Language Model, and it's basically a really big AI system trained on tons of text data. The key thing is it can generate human-like text based on what it's learned.

But "really big" and "human-like text" are vague. Maybe I should mention that it's a type of AI, and highlight its ability to understand and produce text. The user didn't ask for technical details, so I shouldn't overload them with parameters or training methods.

They also said "Don't think too much," so I'll avoid overcomplicating it. Just give the core idea: AI that mimics human language. That should cover what they need without diving into specifics.

Wait, should I mention examples like ChatGPT or GPT-4? Probably not, since they didn't ask for applications. Stick to the definition.

Final answer: Keep it short, define LLM, and emphasize its text generation capability. No need to expand unless they ask follow-up questions.
</think>
A Large Language Model (LLM) is a type of AI trained on vast amounts of text data to understand and generate human-like text.<|end of sentence|>

It's fascinating to look at the 'thinking' process of these reasoning models, visible in the text between the <think> and </think> special tokens. After experimenting with a few prompts, it seems that telling the model to 'think less' does indeed shorten its reasoning process and produce an output faster.

Awesome!

I learned a great deal about this from Ed Donner’s 'LLM Engineering: Master AI, Large Language Models & Agents' course and Professor Akella’s Systems for ML course at UT Computer Science. Many thanks to both of them!
