Python · Intermediate

HuggingFace Text Generation with Streaming

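Generate text locally with a HuggingFace causal language model and stream the output to the console as it is produced. Generation runs in a background thread while the main thread prints each decoded chunk from a TextIteratorStreamer.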

python
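from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = 'Qwen/Qwen2.5-0.5B-Instruct'

# float16 is mainly useful on a GPU; fall back to float32 when running on CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)

prompt = 'Explain what a Python decorator is in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt').to(device)  # inputs must live on the model's device

# The streamer is an iterator that yields decoded text while generate() is still running.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

gen_kwargs = {**inputs, 'streamer': streamer, 'max_new_tokens': 200, 'do_sample': True, 'temperature': 0.7}

# generate() blocks until it finishes, so run it in a background thread and read the streamer here.
t = Thread(target=model.generate, kwargs=gen_kwargs)
t.start()

print(prompt, end='')
for text in streamer:  # each item is a chunk of decoded text, not necessarily one token
    print(text, end='', flush=True)
print()
t.join()  # wait for the generation thread to exit cleanly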
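
The threading pattern above can be tried without downloading a model. The sketch below is a simplified stand-in, not the library's implementation: `QueueStreamer`, `fake_generate`, and `stream_to_string` are hypothetical names made up for illustration. It assumes only that the streamer behaves like a thread-safe queue that a producer thread fills and the main thread iterates over, which is roughly how TextIteratorStreamer is used here.

```python
from queue import Queue
from threading import Thread

_STOP = object()  # sentinel marking the end of the stream


class QueueStreamer:
    """Toy streamer: put() is called on one thread, iteration happens on another."""

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(_STOP)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer supplies something
        if item is _STOP:
            raise StopIteration
        return item


def fake_generate(streamer, pieces):
    """Hypothetical producer standing in for model.generate()."""
    try:
        for piece in pieces:
            streamer.put(piece)
    finally:
        streamer.end()  # always signal completion so the consumer cannot hang


def stream_to_string(pieces):
    """Run the producer in a background thread and collect its chunks in this thread."""
    streamer = QueueStreamer()
    worker = Thread(target=fake_generate, args=(streamer, pieces))
    worker.start()
    chunks = [chunk for chunk in streamer]
    worker.join()
    return ''.join(chunks)
```

Calling `stream_to_string(['A ', 'decorator ', 'wraps ', 'a ', 'function.'])` reassembles the pieces in order. The `finally` clause matters: if the producer fails before sending the end signal, the consuming loop would otherwise block indefinitely.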

Use Cases

  • local LLM
  • streaming generation
  • open-source models
