Python · Advanced
Serverless GPU AI with Modal
Run GPU-accelerated ML inference on Modal's serverless platform. Containers scale automatically with demand, and an idle timeout keeps a warm container available to reduce cold starts.
```python
import modal

app = modal.App('ml-inference')

# Container image with the ML dependencies preinstalled.
image = modal.Image.debian_slim().pip_install('torch', 'transformers')


# Keep an idle container alive for 60 s so follow-up calls skip the cold start.
# Note: newer Modal releases rename container_idle_timeout to scaledown_window.
@app.function(image=image, gpu='T4', timeout=120, container_idle_timeout=60)
def run_inference(texts: list[str]) -> list[str]:
    # Imported inside the function: transformers is installed in the remote image.
    from transformers import pipeline

    classifier = pipeline(
        'sentiment-analysis',
        model='distilbert-base-uncased-finetuned-sst-2-english',
        device=0,  # first GPU
    )
    results = classifier(texts)
    return [f"{r['label']}: {r['score']:.3f}" for r in results]


@app.local_entrypoint()
def main():
    texts = ['I love this product!', 'This is terrible.', 'Not bad, could be better.']
    with modal.enable_output():
        results = run_inference.remote(texts)
    for text, result in zip(texts, results):
        print(f'{text!r} -> {result}')
```

Use Cases
- serverless ML
- GPU inference
- scalable AI deployment
Related Snippets
Similar patterns you can reuse in the same workflow.
Python · Intermediate
BentoML Model Serving Service
Package and serve a scikit-learn model as a REST API with BentoML in Python.
Best for: model deployment
#bentoml #model-serving
Python · Advanced
Ray Serve ML Model Deployment
Deploy a scalable ML serving endpoint with Ray Serve, handling concurrent requests and model loading.
Best for: ML serving
#ray #model-serving
Python · Intermediate
ONNX Runtime Fast ML Inference
Export a PyTorch model to ONNX and run fast CPU inference with ONNX Runtime.
Best for: model deployment
#onnx #inference
Python · Advanced
vLLM High-Throughput LLM Serving
Serve open-source LLMs at high throughput in production using vLLM's PagedAttention.
Best for: high-throughput LLM
#vllm #open-source-llm