PDF Processing Toolkit

Official Anthropic skill enabling comprehensive PDF manipulation through Python libraries and command-line tools, supporting extraction, creation, transformation, and form processing at scale.

Essential for data analysts, document processors, and automation engineers who need to extract data from PDFs, create reports, or manipulate PDF documents programmatically.

Core Purpose

This toolkit provides end-to-end PDF processing capabilities:

Extract text and data with layout preservation
Parse tables and convert to structured formats
OCR scanned documents for text recognition
Merge and split PDF documents
Create new PDFs with custom content
Encrypt and protect sensitive documents

Key Capabilities

Text & Data Extraction

Advanced Text Extraction

pdfplumber for Precision:

Text extraction with layout preservation
Column detection and ordering
Whitespace and formatting retention
Character-level positioning data

Table Extraction:

Automatic table detection
Export to Excel or pandas DataFrames
Handle complex table structures
Preserve cell formatting

OCR for Scanned Documents:

pytesseract integration
Multi-language support
Image preprocessing for accuracy
Batch processing capabilities

Document Manipulation

PDF Operations

Merging:

Combine multiple PDFs into one
Preserve bookmarks and metadata
Maintain page quality
Handle large document sets

Splitting:

Extract specific pages or ranges
Create individual page files
Split by bookmarks or criteria
Batch extraction

Transformation:

Rotate pages programmatically
Extract metadata (author, creation date)
Apply watermarks to pages
Resize and crop pages

PDF Creation

Document Generation

reportlab Canvas:

Draw text, shapes, and images
Precise positioning control
Custom fonts and styling
Low-level PDF construction

Platypus Framework:

Multi-page report generation
Flowable content layout
Built-in style sheets
Tables, paragraphs, and images

Primary Python Libraries

pypdf

Core Operations:

PYTHON

from pypdf import PdfReader, PdfWriter

# Merge PDFs
writer = PdfWriter()
for pdf_file in pdf_files:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
writer.write("merged.pdf")

# Split pages
reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    writer.write(f"page_{i}.pdf")

# Encrypt
writer.encrypt(user_password="user123",
               owner_password="owner456")

pdfplumber

Advanced Extraction:

PYTHON

import pdfplumber

# Extract text with layout
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()

# Extract tables
with pdfplumber.open("data.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    # Convert to DataFrame
    import pandas as pd
    df = pd.DataFrame(tables[0][1:],
                     columns=tables[0][0])

reportlab

PDF Creation:

PYTHON

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

# Simple document
c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World")
c.save()

# Multi-page report with Platypus
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf")
styles = getSampleStyleSheet()
story = [Paragraph("Title", styles['Title']),
         Paragraph("Content here", styles['Normal'])]
doc.build(story)

Command-Line Utilities

qpdf

PDF Manipulation:

BASH

# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Extract pages
qpdf input.pdf --pages . 1-5 -- output.pdf

# Decrypt
qpdf --decrypt --password=pass input.pdf output.pdf

pdftotext

Text Extraction:

BASH

# Basic extraction
pdftotext document.pdf output.txt

# Preserve layout
pdftotext -layout document.pdf output.txt

# Specific pages
pdftotext -f 1 -l 10 document.pdf output.txt

OCR Processing

Scanned Document Workflow:

Extract images from PDF pages
Preprocess images (enhance contrast, denoise)
Run OCR with pytesseract
Post-process text (clean up errors)
Create searchable PDF with text layer

This enables full-text search on scanned documents.

OCR Implementation

PYTHON

import pytesseract
from PIL import Image
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text_pages = []
for image in images:
    text = pytesseract.image_to_string(image)
    text_pages.append(text)

# Combine results
full_text = '\n\n--- Page Break ---\n\n'.join(text_pages)

Security & Protection

Password Encryption

User vs. Owner Passwords:

User password: Required to open document
Owner password: Controls editing permissions

PYTHON

from pypdf import PdfWriter

writer = PdfWriter()
# Add pages...

writer.encrypt(
    user_password="view_only",
    owner_password="full_access",
    permissions_flag=0b0100  # Allow printing only
)
writer.write("protected.pdf")

Decryption

Remove Password Protection:

PYTHON

from pypdf import PdfReader, PdfWriter

reader = PdfReader("encrypted.pdf", password="password123")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.write("decrypted.pdf")

Practical Use Cases

Data Extraction Pipeline

Scenario: Extract tables from financial reports

Load PDF with pdfplumber
Detect tables automatically
Extract to pandas DataFrames
Clean and validate data
Export to Excel or database

Benefits: Automate manual data entry, reduce errors, process hundreds of documents

Report Generation

Scenario: Generate monthly business reports

Query data from databases
Create charts and visualizations
Build PDF with reportlab Platypus
Apply company branding
Distribute automatically

Benefits: Consistent formatting, automated distribution, time savings

Document Archival

Scenario: Digitize paper records with OCR

Scan documents to PDF
Run OCR on each page
Create searchable PDFs
Extract metadata
Index in document management system

Benefits: Searchable archives, space savings, disaster recovery

Performance Optimization

Large Document Handling:

Stream processing: Process pages one at a time
Parallel processing: Use multiprocessing for batch operations
Memory management: Close files after processing
Caching: Cache extracted data for reuse

For 1000+ page documents, process in chunks to avoid memory issues.

Table Extraction Best Practices

Improving Accuracy:

Use pdfplumber's explicit table settings
Define table boundaries manually if needed
Handle merged cells appropriately
Validate extracted data against known patterns

Example:

PYTHON

with pdfplumber.open("data.pdf") as pdf:
    page = pdf.pages[0]

    # Explicit table settings
    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "intersection_tolerance": 3
    }

    tables = page.extract_tables(table_settings)

Error Handling

Common Issues:

Corrupted PDFs: Use qpdf for repair
Password protection: Handle decryption errors gracefully
Malformed tables: Fallback to text extraction
OCR errors: Implement spell-checking post-processing

Technical Requirements

Dependencies:

Python Libraries:

pypdf: Core PDF operations
pdfplumber: Advanced extraction
reportlab: PDF generation
pytesseract: OCR capabilities
pdf2image: Image conversion
Pillow: Image processing

System Tools:

Tesseract OCR engine
Poppler utilities
qpdf command-line tool

About This Skill

This skill is an official Anthropic skill from the Anthropic Skills Repository. It represents best practices for PDF processing in Claude Code.

Official Skills are maintained by Anthropic and provide production-ready, well-tested functionality for document workflows.

Official Anthropic skill for comprehensive PDF manipulation including text extraction, table parsing, merging, splitting, OCR, and document creation.