</>SnippetsLabBuild faster with production-ready snippets

pythonintermediate

HuggingFace Text Generation with Streaming

Run local text generation with HuggingFace models and stream output token-by-token to the console.

pythonPress ⌘/Ctrl + Shift + C to copy

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = 'Qwen/Qwen2.5-0.5B-Instruct'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = 'Explain what a Python decorator is in simple terms:'
inputs = tokenizer(prompt, return_tensors='pt')

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

gen_kwargs = {**inputs, 'streamer': streamer, 'max_new_tokens': 200, 'do_sample': True, 'temperature': 0.7}
t = Thread(target=model.generate, kwargs=gen_kwargs)
t.start()

print(prompt, end='')
for token in streamer:
    print(token, end='', flush=True)
print()

Use Cases

local LLM
streaming generation
open-source models

Tags

#huggingface #text-generation #streaming #local

Related Snippets

Similar patterns you can reuse in the same workflow.

typescriptintermediate

OpenAI Chat Completion with Streaming

Stream GPT responses token-by-token using the OpenAI SDK with async iteration.

Best for: chatbot UI

#openai#streaming

typescriptintermediate

Next.js AI Streaming Route Handler

Stream OpenAI responses from a Next.js App Router route handler using the Vercel AI SDK.

Best for: AI chatbot backend

pythonintermediate

Hugging Face Inference API

Run ML models via the Hugging Face Inference API for text generation, classification, and embeddings.

Best for: Text classification

#huggingface#inference

typescriptintermediate

Google Gemini API Integration

Call the Google Gemini API for text generation with streaming, safety settings, and system prompts.

Best for: Google AI-powered code generation

#gemini#google-ai