wechat-article-extractor

Overview

Skill Key: chunhualiao/wechat-article-extractor
Author: chunhualiao
Source Repo: openclaw/skills
Version: -
Source Path: skills/chunhualiao/wechat-article-extractor
Latest Commit SHA: eb46cc32205396ccb0a500f3517058a93ecaf2f9

Extracted Content

SKILL.md excerpt

# WeChat Article Extractor

Extract WeChat public account articles to clean Markdown. WeChat blocks headless browsers with a 环境异常 ("environment abnormal") CAPTCHA, and `web_fetch` usually returns an empty JS-rendered page, so the reliable approach is to find a mirror of the article on an aggregator site and extract the content from there.

## Scope & Boundaries

**This skill handles:**
- Extracting article text, images, and metadata from WeChat article URLs
- Finding mirror copies when direct access is blocked
- Converting HTML to clean Markdown
- Saving output as `.md` files

**This skill does NOT handle:**
- Publishing or syncing to note-taking apps (that's the user's workflow)
- Batch extraction of multiple articles (handle one at a time)
- WeChat login, authentication, or account management
- Translating article content

## Inputs

| Input | Required | Description |
|-------|----------|-------------|
| WeChat URL | Yes | An `mp.weixin.qq.com` link |
| Output filename | No | Defaults to kebab-case of article title |
| Save location | No | Defaults to `/tmp/` |
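
The default filename rule can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `kebab_case` and `default_output_path` and the `wechat-article` fallback are hypothetical, and keeping CJK characters in the filename is an assumption, since the source does not say how non-ASCII titles are handled.

```python
import re
import unicodedata


def kebab_case(title: str, fallback: str = "wechat-article") -> str:
    """Turn an article title into a kebab-case file stem (hypothetical helper)."""
    # Fold full-width and compatibility characters, then lowercase.
    normalized = unicodedata.normalize("NFKC", title).lower()
    # Collapse every run of punctuation, whitespace, or underscores into one hyphen.
    # \W is Unicode-aware for str patterns, so CJK characters are kept.
    slug = re.sub(r"[\W_]+", "-", normalized).strip("-")
    # Assumption: fall back to a fixed stem when nothing usable remains.
    return slug or fallback


def default_output_path(title: str, save_dir: str = "/tmp") -> str:
    """Combine the default save location with the kebab-case file name."""
    return f"{save_dir.rstrip('/')}/{kebab_case(title)}.md"
```

For example, `default_output_path("Hello, World! 2024")` yields `/tmp/hello-world-2024.md`.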

## Outputs

- A Markdown file with full article content, images, and metadata header
- Console confirmation with file path and character count

## Workflow

### Step 1 — Try direct fetch (fast path)

```
web_fetch(url, extractMode="markdown", maxChars=50000)
```

**Success check:** If result `rawLength > 500` AND content has real paragraphs (not just nav/footer text) → skip to Step 4 Option B.

**Failure indicators:** `rawLength <= 500`, content is navigation/boilerplate only, or contains "环境异常" → go to Step 2.
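
The success and failure checks could be expressed roughly as below. This is a sketch under stated assumptions: the skill describes "real paragraphs" only informally, so the paragraph heuristic (at least two blocks of 40 or more characters) is invented for illustration, as is treating a `rawLength` of exactly 500 as a failure. The function name is hypothetical.

```python
BLOCK_MARKER = "环境异常"  # text shown on WeChat's anti-bot page
MIN_RAW_LENGTH = 500


def direct_fetch_succeeded(
    raw_length: int,
    content: str,
    min_paragraphs: int = 2,        # assumption, not from the skill
    min_paragraph_chars: int = 40,  # assumption, not from the skill
) -> bool:
    """Return True when a direct fetch looks like a real article body."""
    # Too short, or the anti-bot page was served: go to Step 2.
    if raw_length <= MIN_RAW_LENGTH or BLOCK_MARKER in content:
        return False
    # Nav/footer text tends to be short fragments; count only longer blocks.
    paragraphs = [p.strip() for p in content.split("\n\n")]
    substantial = [p for p in paragraphs if len(p) >= min_paragraph_chars]
    return len(substantial) >= min_paragraphs
```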

### Step 2 — Extract article metadata

From the URL or any partial content, identify:
- Article title (from `<title>` or og:title)
- Author / account name (from og:description or page content)

If neither the URL nor the partial content yields a title, ask the user for the article title.
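
One way to pull these fields out of partial HTML, using only the standard library. This is an illustrative sketch rather than the skill's own code: the class and function names are hypothetical, and preferring `og:title` over `<title>` is an assumption. The account name still has to be picked out of the returned description by the caller.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <title> text and Open Graph <meta> tags."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:"):
                self.og[prop] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_metadata(html: str) -> dict:
    """Return the title and og:description, or None where missing."""
    parser = MetaExtractor()
    parser.feed(html)
    # Assumption: og:title wins over <title> when both are present.
    title = parser.og.get("og:title") or "".join(parser.title_parts).strip()
    return {
        "title": title or None,
        "description": parser.og.get("og:description") or None,
    }
```

A `None` title is the signal to ask the user for it.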

### Step 3 — Search for mirrors

```
web_search("<article title> <author/account name>")
```

**Mirror site priority** (ranked by content quality and reliability):
1. **53ai.com** — full content, re...

README excerpt

# wechat-article-extractor

Extract WeChat public account (微信公众号) articles to clean Markdown files with images and metadata.

## Problem

WeChat articles are notoriously difficult to archive:
- Direct scraping is blocked by bot detection (环境异常 CAPTCHA)
- `web_fetch` gets empty JavaScript-rendered shells
- Headless browsers trigger anti-bot measures

This skill works around these limitations by automatically finding mirror copies on aggregator sites, then extracting clean content.

## How It Works

1. Attempts direct fetch (works ~10% of the time)
2. If blocked, searches for mirror copies on aggregator sites (53ai.com, ofweek.com, juejin.cn, etc.)
3. Downloads mirror HTML and extracts article content, images, and metadata
4. Outputs clean Markdown with proper formatting

If no mirror exists, as is common for very new or niche articles, the skill falls back to the Chrome Extension Relay.
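
Choosing a mirror from search results (step 2 above) might look like the sketch below. The domain list comes from this README, but the ordering after 53ai.com is an assumption, and `mirror_rank` and `pick_mirror` are hypothetical helper names.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Domains named in the README. Order after the first entry is assumed.
PREFERRED_MIRRORS = ["53ai.com", "ofweek.com", "juejin.cn"]


def mirror_rank(url: str) -> Optional[int]:
    """Return the preference rank of a URL's host, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    for rank, domain in enumerate(PREFERRED_MIRRORS):
        # Match the domain itself or any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return rank
    return None


def pick_mirror(result_urls: List[str]) -> Optional[str]:
    """Pick the best-ranked known mirror; ties keep search-result order."""
    known = [
        (rank, index, url)
        for index, url in enumerate(result_urls)
        for rank in [mirror_rank(url)]
        if rank is not None
    ]
    # None means no known mirror was found, so use the fallback path.
    return min(known)[2] if known else None
```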

## Installation

Copy the skill directory to your OpenClaw skills folder:

```bash
cp -r wechat-article-extractor ~/.openclaw/<workspace>/skills/
```

### Requirements

- Python 3.8+
- `curl` (for downloading mirror pages)
- OpenClaw tools: `web_fetch`, `web_search`, `exec`
- Optional: `browser` tool (for Chrome Relay fallback)

## Usage

Share a WeChat article URL with your agent:

> "Save this article: https://mp.weixin.qq.com/s/example123"

The skill triggers automatically on `mp.weixin.qq.com` URLs.

### Trigger Phrases

- Any `mp.weixin.qq.com` URL
- "extract wechat article"
- "save wechat article"
- "archive wechat"
- "提取公众号文章" ("extract official account article")
- "保存公众号文章" ("save official account article")
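
The triggers listed above can be approximated with a small matcher. This is only a sketch of the matching logic. How OpenClaw actually dispatches skills is not described here, and the function names are hypothetical.

```python
import re
from typing import List
from urllib.parse import urlparse

TRIGGER_PHRASES = (
    "extract wechat article",
    "save wechat article",
    "archive wechat",
    "提取公众号文章",
    "保存公众号文章",
)
# Rough URL matcher: stops at whitespace, quotes, and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_wechat_urls(message: str) -> List[str]:
    """Return every mp.weixin.qq.com URL found in a chat message."""
    urls = []
    for candidate in URL_PATTERN.findall(message):
        # Drop sentence punctuation that trails the URL.
        candidate = candidate.rstrip(".,;:!?)。，")
        if (urlparse(candidate).hostname or "").lower() == "mp.weixin.qq.com":
            urls.append(candidate)
    return urls


def should_trigger(message: str) -> bool:
    """True when the message has a WeChat URL or a trigger phrase."""
    lowered = message.lower()
    return bool(find_wechat_urls(message)) or any(
        phrase in lowered for phrase in TRIGGER_PHRASES
    )
```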

## Output Format

```markdown
# Article Title

**作者：** Author Name
**来源：** 微信公众号「Account Name」
**日期：** 2024-01-15
**原文：** https://mp.weixin.qq.com/s/...

---

Full article content with images preserved...
```
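
A minimal sketch of assembling this layout from already-extracted fields. The function name and parameters are hypothetical, and the bundled script may build the header differently.

```python
def render_article(
    title: str,
    author: str,
    account: str,
    date: str,
    source_url: str,
    body: str,
) -> str:
    """Assemble the metadata header and body in the format shown above."""
    lines = [
        f"# {title}",
        "",
        f"**作者：** {author}",              # author
        f"**来源：** 微信公众号「{account}」",  # source account
        f"**日期：** {date}",                # publication date
        f"**原文：** {source_url}",          # original URL
        "",
        "---",
        "",
        body.strip(),
    ]
    # End the file with exactly one trailing newline.
    return "\n".join(lines) + "\n"
```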

## Extraction Script

The included Python script handles HTML-to-Markdown conversion:

```bash
# Extract from downloaded HTML
python3 scripts/extract_wechat.py article.html output.md

# With source URL for metadata
python3 scr...
```

