python · intermediate
GPT PDF Summarisation Pipeline
Extract text from a PDF, summarise each fixed-size chunk, then combine the chunk summaries into one final summary (map-reduce) using the OpenAI API.
python
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MAX_CHUNK = 3000  # chars per chunk

def extract_text(pdf_path: str) -> str:
    # Pages with no extractable text (e.g. scanned images) contribute ''.
    reader = PdfReader(pdf_path)
    return '\n'.join(page.extract_text() or '' for page in reader.pages)

def chunk_text(text: str, size: int = MAX_CHUNK) -> list[str]:
    # Fixed-size character windows; a split can fall mid-sentence.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarise_chunk(chunk: str) -> str:
    resp = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {'role': 'system', 'content': 'Summarise this text concisely.'},
            {'role': 'user', 'content': chunk},
        ],
        max_tokens=200,
    )
    # message.content can be None, so fall back to an empty string.
    return resp.choices[0].message.content or ''

def summarise_pdf(pdf_path: str) -> str:
    text = extract_text(pdf_path)
    if not text.strip():
        return ''  # nothing to summarise; skip the API call
    chunks = chunk_text(text)
    summaries = [summarise_chunk(c) for c in chunks]  # Map
    combined = '\n'.join(summaries)
    return summarise_chunk(combined)  # Final reduce

print(summarise_pdf('report.pdf'))
Use Cases
- document summarisation
- research automation
- PDF analysis
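For long PDFs, the joined chunk summaries can themselves grow larger than one chunk, so a single final reduce may not be enough. The following is a minimal sketch of a hierarchical reduce, not part of the original snippet. It assumes a pluggable `summarise` callable (in the pipeline above this would be `summarise_chunk`), and the names `map_reduce_summarise` and `max_rounds` are hypothetical, introduced only for illustration.

```python
from typing import Callable


def chunk_text(text: str, size: int = 3000) -> list[str]:
    """Split text into pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def map_reduce_summarise(
    text: str,
    summarise: Callable[[str], str],
    size: int = 3000,
    max_rounds: int = 10,  # hypothetical safety cap on reduce rounds
) -> str:
    """Summarise chunks, then re-summarise the joined summaries
    until everything fits into a single chunk."""
    if not text.strip():
        return ''
    for _ in range(max_rounds):
        chunks = chunk_text(text, size)
        if len(chunks) == 1:
            # Everything fits in one call: this is the final reduce.
            return summarise(chunks[0])
        # Map step: summarise each chunk, then join for the next round.
        text = '\n'.join(summarise(c) for c in chunks)
    # Assumed fallback: if the summaries never shrink enough,
    # truncate to one chunk rather than loop forever.
    return summarise(text[:size])
```

With the functions from the snippet above, this could be called as `map_reduce_summarise(extract_text('report.pdf'), summarise_chunk)`. Passing the summariser as an argument also lets the control flow be exercised offline with a stub function, without calling the API.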
Related Snippets
Similar patterns you can reuse in the same workflow.
typescript · intermediate
OpenAI Chat Completion with Streaming
Stream GPT responses token-by-token using the OpenAI SDK with async iteration.
Best for: chatbot UI
#openai #streaming
typescript · beginner
Generate Text Embeddings with OpenAI
Create vector embeddings for semantic search and similarity matching using text-embedding-3-small.
Best for: semantic search
#openai #embeddings
typescript · advanced
RAG Pipeline (Retrieve + Augment + Generate)
Minimal RAG implementation: embed a query, retrieve top-k chunks, inject into prompt.
Best for: document Q&A
#rag #embeddings
typescript · intermediate
OpenAI Tool Calling (Function Calling)
Define tools for GPT to call, parse the response, execute the function, and return results.
Best for: AI agents
#openai #tool-calling