Article Extractor

Extract clean, distraction-free text from web articles and blog posts. This skill removes advertisements, navigation elements, newsletter prompts, and other clutter to give you just the content you want to read.

Perfect for researchers, content curators, and anyone who wants to save articles for offline reading without the noise.

Core Functionality

Transform cluttered web pages into clean text:

Remove ads, sidebars, and navigation
Extract article headlines and body text
Save to readable text files
Preview content before saving
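
As a rough illustration of the preview step, the sketch below shows the first few lines of extracted text before anything is written to disk. The function name `preview` and the default of 10 lines are assumptions made for this example, not part of the skill itself.

```python
def preview(text: str, max_lines: int = 10) -> str:
    """Return the first max_lines non-empty lines of text for a quick quality check."""
    lines = [line for line in text.splitlines() if line.strip()]
    shown = lines[:max_lines]
    remaining = len(lines) - len(shown)
    if remaining > 0:
        # Tell the reader how much was left out of the preview.
        shown.append(f"... ({remaining} more lines)")
    return "\n".join(shown)
```

If the preview shows menu items or cookie notices instead of the opening paragraphs, the extraction probably needs another attempt with a different tool.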

How It Works

Extraction Process

1. Fetch Article - Downloads the page from the URL
2. Parse & Clean - Removes non-content elements
3. Extract Text - Pulls out the headline and body
4. Save File - Creates a text file with the article title as its filename
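
The four steps can be sketched with the Python standard library alone. This is a deliberately simplified stand-in, not the Readability or trafilatura algorithm: it assumes the headline is the first `<h1>` and the body is every `<p>` outside a small set of clutter tags. The names `SKIP_TAGS`, `ArticleParser`, `fetch_article`, `parse_and_extract`, and `save_article` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen

# Tags whose contents are treated as clutter in this simplified stand-in.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "form"}


class ArticleParser(HTMLParser):
    """Collects the first <h1> as the headline and <p> text as the body."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a clutter element
        self.current = None   # "h1" or "p" while text is being collected
        self.buffer = []
        self.headline = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ("h1", "p"):
            self.current, self.buffer = tag, []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current:
            # Join the collected fragments and collapse internal whitespace.
            text = " ".join("".join(self.buffer).split())
            if tag == "h1" and not self.headline:
                self.headline = text
            elif tag == "p" and text:
                self.paragraphs.append(text)
            self.current = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current:
            self.buffer.append(data)


def fetch_article(url: str) -> str:
    """Step 1: download the raw HTML."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_and_extract(html: str) -> tuple[str, str]:
    """Steps 2 and 3: skip clutter, then pull out the headline and body."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.headline, "\n\n".join(parser.paragraphs)


def save_article(headline: str, body: str, directory: Path) -> Path:
    """Step 4: write a text file named after the headline."""
    safe = "".join(c for c in headline if c.isalnum() or c in " -_").strip()
    path = directory / f"{safe or 'article'}.txt"
    path.write_text(f"{headline}\n\n{body}\n", encoding="utf-8")
    return path
```

A call such as `save_article(*parse_and_extract(fetch_article(url)), Path("."))` chains the steps together. Real pages are far messier than this parser assumes, which is why the skill relies on dedicated extraction tools.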

Tools Used

Primary: reader-cli (Mozilla Readability)

Built on Mozilla's Readability library
Same algorithm as Firefox Reader View
Handles most article-style pages well

Fallback: trafilatura (Python)

Python library for gathering text from web pages
Specialized for main-text extraction
Use when reader-cli fails or returns poor results
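
The primary/fallback arrangement can be sketched as a loop over extractors that stops at the first usable result. Here "failure" is assumed to mean either an exception or empty output. `extract_with_fallback` and `trafilatura_extractor` are hypothetical names; only `trafilatura.fetch_url` and `trafilatura.extract` are calls from the library itself. The primary extractor is left for the caller to supply, since how reader-cli is invoked depends on the installed package.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    url: str, extractors: Iterable[tuple[str, Extractor]]
) -> tuple[str, str]:
    """Try each (name, extractor) pair in order; return (name, text) from the first success."""
    failures = []
    for name, extractor in extractors:
        try:
            text = extractor(url)
        except Exception as exc:  # a crashed or missing tool counts as a failure
            failures.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        failures.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(failures))


def trafilatura_extractor(url: str) -> Optional[str]:
    """Fallback extractor; requires `pip install trafilatura`."""
    import trafilatura  # imported lazily so the sketch loads without it

    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None
```

Passing `[("primary", my_reader_wrapper), ("trafilatura", trafilatura_extractor)]` would try the tools in the order described above, where `my_reader_wrapper` is whatever function wraps the primary tool.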

Usage Examples

Extract from URL:

Extract the article from https://example.com/blog/post

Process multiple articles:

Extract articles from these URLs:
- https://site1.com/article
- https://site2.com/post
- https://site3.com/blog
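
When several URLs are processed at once, one bad page should not stop the rest. A minimal sketch of that behaviour follows, assuming a caller-supplied `extract_one` function that returns text or raises on failure. Both `extract_many` and `extract_one` are hypothetical names.

```python
from typing import Callable


def extract_many(
    urls: list[str], extract_one: Callable[[str], str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Run extract_one on each URL; return (results, errors), both keyed by URL."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    for url in urls:
        try:
            results[url] = extract_one(url)
        except Exception as exc:
            errors[url] = str(exc)  # record the failure and keep going
    return results, errors
```

Returning the errors alongside the results makes it easy to report which articles need a retry with the other tool.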

Output Format

Generated filename: the article title, with special characters removed

Content structure:

[Article Headline]

[Clean article body text with paragraphs preserved]
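
One way the filename and content structure could be produced is sketched below. The exact set of characters removed, the 100-character limit, the `.txt` extension, and the `article` fallback name are all assumptions for illustration. `title_to_filename` and `format_article` are hypothetical helper names.

```python
import re
import unicodedata


def title_to_filename(title: str, max_length: int = 100) -> str:
    """Turn an article title into a filesystem-safe .txt filename."""
    # Normalize Unicode so visually identical titles map to the same name.
    name = unicodedata.normalize("NFKC", title)
    # Drop characters that are invalid or awkward on common filesystems.
    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", name)
    # Collapse runs of whitespace and trim stray dots and spaces.
    name = re.sub(r"\s+", " ", name).strip(" .")
    return (name[:max_length].rstrip(" .") or "article") + ".txt"


def format_article(headline: str, body: str) -> str:
    """Headline, a blank line, then the body with one blank line between paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return headline.strip() + "\n\n" + "\n\n".join(paragraphs) + "\n"
```

For example, `title_to_filename("What's New: AI/ML in 2024?")` yields `What's New AIML in 2024.txt`.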

Installation

Option 1: npm (recommended)

BASH

npm install -g @mozilla/readability-cli

Option 2: Python

BASH

pip install trafilatura
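
After installing, it can help to confirm which tools are actually reachable. The sketch below checks the PATH for command-line tools and the import system for Python modules. `available_tools` is a hypothetical helper, and the executable name that the npm package installs is not stated here, so it is passed in as a parameter rather than hard-coded.

```python
import importlib.util
import shutil


def available_tools(cli_names: list[str], python_modules: list[str]) -> dict[str, bool]:
    """Report which command-line tools are on PATH and which Python modules are importable."""
    # shutil.which returns None when no executable of that name is found.
    status = {name: shutil.which(name) is not None for name in cli_names}
    # find_spec returns None when a top-level module is not installed.
    status.update(
        {mod: importlib.util.find_spec(mod) is not None for mod in python_modules}
    )
    return status
```

A call such as `available_tools(["reader"], ["trafilatura"])` would then show which option is usable, assuming `reader` is the command name on your system.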

Best Practices

Do:

Respect website terms of service
Use for personal reading and research
Verify extraction quality on preview
Keep extracted content private

Don't:

Republish extracted content without permission
Use for bulk commercial scraping
Assume 100% accuracy on all sites
Extract from paywalled content you don't have access to

Common Use Cases

Research & Learning:

Save academic articles for offline study
Build personal knowledge base
Extract tutorials for later reference

Content Curation:

Clean up articles for sharing with team
Build reading lists
Archive important content

Accessibility:

Convert cluttered pages to screen-reader friendly text
Remove distracting elements
Create plain text versions

Limitations

May struggle with:

Heavily JavaScript-rendered content
Paywalled articles
Content behind login walls
Sites with aggressive anti-scraping measures

Solutions:

Try both reader-cli and trafilatura
Make sure you are logged in when accessing subscription content
Use your browser's Reader View as an alternative
Ask the site owner about API access, if available

Part of Tapestry Suite

Tapestry Workflow

Article Extractor is one component of the Tapestry Skills Suite. The full workflow:

Article Extractor - Clean content from web
YouTube Transcript - Extract video transcripts
Ship-Learn-Next - Turn content into action plans
Tapestry - Orchestrate all skills together

About This Skill

This skill was created by michalparkola as part of the Tapestry Skills for Claude Code collection.

Philosophy: "Learning = doing better, not knowing more." The Tapestry suite emphasizes turning passive content consumption into active implementation.
