Article Extractor

Extracts clean, readable content from web articles and blog posts, removing ads, navigation, and clutter using reader-cli or trafilatura.

Extract clean, distraction-free text from web articles and blog posts. This skill removes advertisements, navigation elements, newsletter prompts, and other clutter to give you just the content you want to read.

Perfect for researchers, content curators, and anyone who wants to save articles for offline reading without the noise.

Core Functionality

Transform cluttered web pages into clean text:

  • Remove ads, sidebars, and navigation
  • Extract article headlines and body text
  • Save to readable text files
  • Preview content before saving
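
The preview step can be as simple as showing the first few non-empty lines before anything is written to disk. The sketch below is illustrative only: `preview` and its `max_lines` parameter are hypothetical names, not part of the skill or of either extraction tool.

```python
def preview(text: str, max_lines: int = 10) -> str:
    """Return the first `max_lines` non-empty lines of extracted text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    shown = lines[:max_lines]
    if len(lines) > max_lines:
        # Signal that the preview was cut short.
        shown.append(f"... ({len(lines) - max_lines} more lines)")
    return "\n".join(shown)
```

Skipping blank lines keeps the preview dense enough to judge quickly whether the body text or leftover navigation was captured.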

How It Works

Extraction Process

  1. Fetch Article - Downloads content from URL
  2. Parse & Clean - Removes non-content elements
  3. Extract Text - Pulls out headline and body
  4. Save File - Creates text file with article title as filename
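
The first three steps can be sketched in Python as a small function with injectable fetch and clean steps. This is a minimal sketch rather than the skill's actual implementation: `fetch_and_clean` and `trafilatura_steps` are hypothetical helper names, while `trafilatura.fetch_url` and `trafilatura.extract` are the library's own functions. Saving the file (step 4) is left out here.

```python
from typing import Callable, Optional

Fetch = Callable[[str], Optional[str]]  # URL -> raw HTML, or None on failure
Clean = Callable[[str], Optional[str]]  # raw HTML -> article text, or None


def trafilatura_steps() -> tuple[Fetch, Clean]:
    """Return trafilatura's fetch and extract functions (imported lazily)."""
    import trafilatura  # requires `pip install trafilatura`

    return trafilatura.fetch_url, trafilatura.extract


def fetch_and_clean(url: str, fetch: Fetch, clean: Clean) -> Optional[str]:
    """Download the page, strip non-content elements, and return the text."""
    html = fetch(url)
    if not html:
        return None  # download failed or returned nothing
    text = clean(html)
    return text.strip() if text else None
```

With trafilatura installed, `fetch_and_clean(url, *trafilatura_steps())` would run the real download and extraction. Passing the steps in as arguments also makes the flow easy to exercise with stand-in functions.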

Tools Used

Primary: reader-cli (Mozilla Readability)

  • Widely used content-extraction library
  • Same algorithm as Firefox Reader View
  • Accurate on most article-style pages

Fallback: trafilatura (Python)

  • Web scraping library that originated in academic research
  • Specialized for text extraction
  • Works when reader-cli fails
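
The primary-then-fallback arrangement described above might look like the following sketch. The names `cli_extractor`, `trafilatura_extractor`, and `extract_with_fallback` are hypothetical. The command that invokes the primary tool is passed in by the caller, and the sketch assumes that tool accepts a URL as its last argument and prints the article to standard output, which this document does not confirm.

```python
import subprocess
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[str]]  # URL -> article text, or None


def cli_extractor(command: Sequence[str], timeout: float = 60.0) -> Extractor:
    """Wrap a command-line extractor that prints the article to stdout.

    Assumption: the tool takes the URL as its final argument.
    """
    def run(url: str) -> Optional[str]:
        result = subprocess.run(
            [*command, url], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout if result.returncode == 0 else None

    return run


def trafilatura_extractor(url: str) -> Optional[str]:
    """Fallback: fetch and extract with trafilatura (imported lazily)."""
    import trafilatura  # requires `pip install trafilatura`

    html = trafilatura.fetch_url(url)
    return trafilatura.extract(html) if html else None


def extract_with_fallback(
    url: str, extractors: Sequence[tuple[str, Extractor]]
) -> Optional[tuple[str, str]]:
    """Try each extractor in order; return (tool name, text) for the first hit."""
    for name, extractor in extractors:
        try:
            text = extractor(url)
        except Exception:
            continue  # missing binary, timeout, network error: try the next tool
        if text and text.strip():
            return name, text.strip()
    return None
```

Returning the tool name alongside the text makes it easy to see which extractor produced a given result when comparing quality across sites.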

Usage Examples

Extract from URL:

Extract the article from https://example.com/blog/post

Process multiple articles:

Extract articles from these URLs:
- https://site1.com/article
- https://site2.com/post
- https://site3.com/blog
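
When several URLs are processed in one request, it helps to treat each one independently so that a single failure does not stop the rest. A minimal sketch, assuming some per-URL extraction function is available; `extract_many` and `extract_one` are hypothetical names:

```python
from typing import Callable, Iterable, Optional


def extract_many(
    urls: Iterable[str], extract_one: Callable[[str], Optional[str]]
) -> tuple[dict[str, str], list[str]]:
    """Extract each URL independently; return (articles by URL, failed URLs)."""
    articles: dict[str, str] = {}
    failed: list[str] = []
    for url in urls:
        try:
            text = extract_one(url)
        except Exception:
            text = None  # treat any extractor error as a failed URL
        if text:
            articles[url] = text
        else:
            failed.append(url)
    return articles, failed
```

Collecting the failed URLs separately makes it straightforward to retry them later with a different tool.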

Output Format

Generated filename: Article title with special characters cleaned

Content structure:

[Article Headline]

[Clean article body text with paragraphs preserved]
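
The exact filename cleaning rules are not spelled out here, so the following is only one plausible sketch. It replaces characters that are invalid on common filesystems with spaces, collapses whitespace, and lays the content out as headline, blank line, body. `safe_filename`, `format_article`, the 80-character limit, the `.txt` extension, and the `article` fallback name are all assumptions.

```python
import re


def safe_filename(title: str, max_length: int = 80) -> str:
    """Turn an article title into a filesystem-friendly .txt filename."""
    # Replace path separators, reserved punctuation, and control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', " ", title)
    # Collapse runs of whitespace and trim stray spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    cleaned = cleaned[:max_length].rstrip(" .")
    return f"{cleaned or 'article'}.txt"


def format_article(headline: str, body: str) -> str:
    """Lay out the headline, a blank line, then paragraphs separated by blank lines."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return headline.strip() + "\n\n" + "\n\n".join(paragraphs) + "\n"
```

For example, `safe_filename("Hello: World?")` yields `Hello World.txt`, and a title made up entirely of special characters falls back to `article.txt`.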

Installation

Option 1: npm (recommended)

BASH
npm install -g @mozilla/readability-cli

Option 2: Python

BASH
pip install trafilatura
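
After installing, a quick check can confirm which tools are actually reachable. This is a small sketch with hypothetical names. The executable name installed by the npm package is not stated here, so it is passed in as `cli_command` rather than assumed.

```python
import importlib.util
import shutil


def available_tools(cli_command: str) -> dict[str, bool]:
    """Report which extraction tools appear to be installed."""
    return {
        # True if the command-line tool is found on PATH.
        "cli": shutil.which(cli_command) is not None,
        # True if the trafilatura package can be imported.
        "trafilatura": importlib.util.find_spec("trafilatura") is not None,
    }
```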

Best Practices

Do:

  • Respect website terms of service
  • Use for personal reading and research
  • Verify extraction quality on preview
  • Keep extracted content private

Don't:

  • Republish extracted content without permission
  • Use for bulk commercial scraping
  • Assume 100% accuracy on all sites
  • Extract from paywalled content you don't have access to

Common Use Cases

Research & Learning:

  • Save academic articles for offline study
  • Build personal knowledge base
  • Extract tutorials for later reference

Content Curation:

  • Clean up articles for sharing with team
  • Build reading lists
  • Archive important content

Accessibility:

  • Convert cluttered pages to screen-reader-friendly text
  • Remove distracting elements
  • Create plain text versions

Limitations

May struggle with:

  • Heavily JavaScript-rendered content
  • Paywalled articles
  • Content behind login walls
  • Sites with aggressive anti-scraping measures

Solutions:

  • Try both reader-cli and trafilatura
  • Ensure you're logged in if accessing subscription content
  • Use your browser's Reader View as an alternative
  • Contact site owner for API access if available
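
JavaScript-rendered pages, paywalls, and login walls often show up as an empty or suspiciously short extraction. A rough heuristic such as the sketch below can flag those cases before saving. The function name and the 150-word threshold are arbitrary assumptions and would need tuning for short-form content.

```python
def extraction_warnings(text: str, min_words: int = 150) -> list[str]:
    """Return human-readable warnings for extractions that look incomplete."""
    warnings: list[str] = []
    words = len(text.split())
    if words == 0:
        warnings.append("no text extracted")
    elif words < min_words:
        warnings.append(
            f"only {words} words extracted (expected at least {min_words})"
        )
    return warnings
```

An empty list means nothing looked wrong by this measure. It does not guarantee that the extraction is complete.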

Part of Tapestry Suite

Tapestry Workflow

Article Extractor is one component of the Tapestry Skills Suite:

Full workflow:

  1. Article Extractor - Clean content from web
  2. YouTube Transcript - Extract video transcripts
  3. Ship-Learn-Next - Turn content into action plans
  4. Tapestry - Orchestrate all skills together

About This Skill

This skill was created by michalparkola as part of the Tapestry Skills for Claude Code collection.

Philosophy: "Learning = doing better, not knowing more." The Tapestry suite emphasizes turning passive content consumption into active implementation.

