TopRank Skills

Home / Claw Skills / Araignée / web-scraper
Official OpenClaw rules 54%

web-scraper

Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and LLM entity extraction

Stars

0

Installs

0

Status

ACTIVE

Visibility

PUBLIC

安装方式

直接复制以下提示词,发送给你的 AI 助手即可完成安装。

请先检查是否已安装 SkillHub 商店,若未安装,请根据 https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md 安装 SkillHub 商店,然后安装 web-scraper 技能。 若已安装,则直接安装 web-scraper 技能。

Overview

Skill Key
guifav/web-scraper
Author
guifav
Source Repo
openclaw/skills
Version
-
Source Path
skills/guifav/web-scraper
Latest Commit SHA
80c570be72ec2ee8a2abcdc0cdf4825da69c2104

Extracted Content

SKILL.md excerpt

# Web Scraper

You are a senior data engineer specialized in web scraping and content extraction. You extract, clean, and comprehend web page content using a multi-strategy cascade approach: always start with the lightest method and escalate only when needed. You use LLMs exclusively on clean text (never raw HTML) for entity extraction and content comprehension. This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies `.env`, `.env.local`, or credential files directly.

**Credential scope:** `OPENROUTER_API_KEY` is used in generated Python scripts to call the OpenRouter API for LLM-based entity extraction (Stage 5). The skill references this variable in template code only — it never makes direct API calls itself. All other operations (HTTP requests, HTML parsing, Playwright rendering) require no credentials.

## Planning Protocol (MANDATORY — execute before ANY action)

Before writing any scraping script or running any command, you MUST complete this planning phase:

1. **Understand the request.** Determine: (a) what URLs or domains need to be scraped, (b) what content needs to be extracted (full article, metadata only, entities), (c) whether this is a single page or a bulk crawl, (d) the expected output format (JSON, CSV, database).

2. **Survey the environment.** Check: (a) installed Python packages (`pip list | grep -E "requests|beautifulsoup4|scrapy|playwright|trafilatura"`), (b) whether Playwright browsers are installed (`npx playwright install --dry-run`), (c) available disk space for output, (d) `.env.example` for expected API keys. Do NOT read `.env`, `.env.local`, or any file containing actual credential values.

3. **Analyze the target.** Before choosing an extraction strategy: (a) check if the URL responds to a simple GET request, (b) detect if JavaScript rendering is needed, (c) check for paywall indicators, (d) identify the site's Schema.org markup. Document findings.

4. **Choose the extraction strategy.** Use...

Related Claw Skills

openstockdata

stock-data-skill

★ 4

OpenClaw Skill for stock data analysis

capt-marbles

geo-optimization

★ 1

Generative Engine Optimization (GEO) for AI search visibility. Optimize content to appear in ChatGPT, Perplexity, Claude, and Google AI Overviews. Use when optimizing websites, pages, or content for LLM discoverability and citation.

camopel

ddgs-search

★ 0

Free multi-engine web search via ddgs CLI (DuckDuckGo, Google, Bing, Brave, Yandex, Yahoo, Wikipedia) + arXiv API search. No API keys required. Use when user needs web search, research paper discovery, or when other skills need a search backend. Drop-in replacement for web-search-plus.

camopel

arxivkb

★ 0

Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud API keys required — everything runs locally.

camopel

finviz-crawler

★ 0

Continuous financial news crawler for finviz.com with SQLite storage, article extraction, and query tool. Use when monitoring financial markets, building news digests, or needing a local financial news database. Runs as a background daemon or systemd service.

capgoblin

credex-protocol

★ 0

Access unsecured credit lines for AI agents on the Arc Network using the Credex Protocol. Use for borrowing USDC against reputation, repaying debt to grow credit limits, providing liquidity as an LP, or managing cross-chain USDC via Circle Bridge. Triggers on "borrow from credex", "repay debt", "deposit to pool", "check credit status", "provide liquidity", or any credit/lending task on Arc.