Overview
- Skill Key: guifav/web-scraper
- Author: guifav
- Source Repo: openclaw/skills
- Version: -
- Source Path: skills/guifav/web-scraper
- Latest Commit SHA: 80c570be72ec2ee8a2abcdc0cdf4825da69c2104
Web scraping and content comprehension agent — multi-strategy extraction with cascade fallback, news detection, boilerplate removal, structured metadata, and LLM entity extraction
Stars: 0
Installs: 0
Status: ACTIVE
Visibility: PUBLIC
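Copy the prompt below and send it to your AI assistant to complete the installation.
First check whether the SkillHub store is installed. If it is not, install it by following https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md, then install the web-scraper skill. If it is already installed, install the web-scraper skill directly.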
# Web Scraper

You are a senior data engineer specialized in web scraping and content extraction. You extract, clean, and comprehend web page content using a multi-strategy cascade approach: always start with the lightest method and escalate only when needed. You use LLMs exclusively on clean text (never raw HTML) for entity extraction and content comprehension.

This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies `.env`, `.env.local`, or credential files directly.

**Credential scope:** `OPENROUTER_API_KEY` is used in generated Python scripts to call the OpenRouter API for LLM-based entity extraction (Stage 5). The skill references this variable in template code only; it never makes direct API calls itself. All other operations (HTTP requests, HTML parsing, Playwright rendering) require no credentials.

## Planning Protocol (MANDATORY: execute before ANY action)

Before writing any scraping script or running any command, you MUST complete this planning phase:

1. **Understand the request.** Determine: (a) what URLs or domains need to be scraped, (b) what content needs to be extracted (full article, metadata only, entities), (c) whether this is a single page or a bulk crawl, (d) the expected output format (JSON, CSV, database).
2. **Survey the environment.** Check: (a) installed Python packages (`pip list | grep -E "requests|beautifulsoup4|scrapy|playwright|trafilatura"`), (b) whether Playwright browsers are installed (`npx playwright install --dry-run`), (c) available disk space for output, (d) `.env.example` for expected API keys. Do NOT read `.env`, `.env.local`, or any file containing actual credential values.
3. **Analyze the target.** Before choosing an extraction strategy: (a) check if the URL responds to a simple GET request, (b) detect if JavaScript rendering is needed, (c) check for paywall indicators, (d) identify the site's Schema.org markup. Document findings.
4. **Choose the extraction strategy.** Use...
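The "start with the lightest method and escalate only when needed" rule can be sketched as a small driver loop. This is a minimal illustration, not the skill's actual implementation: the listing is truncated before it names the concrete stages, so `extract_with_cascade`, the `MIN_CHARS` threshold, and the strategy names used below are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A strategy takes a URL and returns extracted text, or None if it found nothing.
Strategy = Callable[[str], Optional[str]]

MIN_CHARS = 200  # hypothetical cutoff for "enough content to stop escalating"


def extract_with_cascade(
    url: str,
    strategies: List[Tuple[str, Strategy]],
    min_chars: int = MIN_CHARS,
) -> dict:
    """Try strategies in order (lightest first) and stop at the first adequate result.

    Returns a dict with the winning strategy name, the text, and a log of the
    cheaper attempts that failed or returned too little content.
    """
    attempts = []
    for name, strategy in strategies:
        try:
            text = strategy(url)
        except Exception as exc:  # a failed strategy triggers escalation, not a crash
            attempts.append((name, f"error: {exc}"))
            continue
        if text and len(text.strip()) >= min_chars:
            return {"strategy": name, "text": text.strip(), "attempts": attempts}
        attempts.append((name, "insufficient content"))
    # Every strategy was tried and none produced enough text.
    return {"strategy": None, "text": None, "attempts": attempts}
```

In practice the list might be ordered as a plain HTTP fetch first, a readability-style extractor second, and a headless browser last, so the expensive path only runs when the cheaper ones return too little text. The `attempts` log gives the "document findings" step something concrete to record.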
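The target-analysis step (JavaScript rendering, paywall indicators, Schema.org markup) can be approximated with a few heuristics over the HTML returned by a simple GET. The sketch below uses only the Python standard library and works on an HTML string, so it makes no network calls. The thresholds, the `PAYWALL_HINTS` phrases, and the `analyze_html` name are assumptions for illustration; the source does not specify how the skill performs these checks.

```python
import json
from html.parser import HTMLParser

MIN_VISIBLE_CHARS = 200  # hypothetical: less visible text than this suggests a JS-rendered shell
PAYWALL_HINTS = ("subscribe to continue", "subscribers only", "paywall")  # hypothetical phrases


class _PageProbe(HTMLParser):
    """Counts visible text and scripts, and collects JSON-LD blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._in_jsonld = False
        self._jsonld_buf = []
        self.script_count = 0
        self.text_chars = 0
        self.jsonld_blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag == "script":
            self.script_count += 1
            if (dict(attrs).get("type") or "").lower() == "application/ld+json":
                self._in_jsonld = True
                self._jsonld_buf = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._jsonld_buf))
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)
        elif not self._skip_depth:
            self.text_chars += len(data.strip())


def analyze_html(html: str) -> dict:
    """Return heuristic findings about a page; every field is a hint, not a verdict."""
    probe = _PageProbe()
    probe.feed(html)
    probe.close()

    schema_types = []
    for raw in probe.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is common; skip it
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and "@type" in item:
                t = item["@type"]
                schema_types.extend(t if isinstance(t, list) else [t])

    lowered = html.lower()
    return {
        "visible_text_chars": probe.text_chars,
        "script_count": probe.script_count,
        "likely_needs_js": probe.text_chars < MIN_VISIBLE_CHARS and probe.script_count > 0,
        "paywall_hint": any(hint in lowered for hint in PAYWALL_HINTS),
        "schema_types": schema_types,
    }
```

A page that returns almost no visible text but ships script tags is flagged as likely needing a browser render, and a `NewsArticle` entry in `schema_types` is one cheap signal for news detection. These are coarse signals, so it is safer to record them as findings than to treat them as decisions.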
openstockdata
OpenClaw Skill for stock data analysis
capt-marbles
Generative Engine Optimization (GEO) for AI search visibility. Optimize content to appear in ChatGPT, Perplexity, Claude, and Google AI Overviews. Use when optimizing websites, pages, or content for LLM discoverability and citation.
camopel
Free multi-engine web search via ddgs CLI (DuckDuckGo, Google, Bing, Brave, Yandex, Yahoo, Wikipedia) + arXiv API search. No API keys required. Use when user needs web search, research paper discovery, or when other skills need a search backend. Drop-in replacement for web-search-plus.
camopel
Local arXiv paper manager with semantic search. Crawls arXiv categories, downloads PDFs, chunks content, and indexes with FAISS + Ollama embeddings. No cloud API keys required — everything runs locally.
camopel
Continuous financial news crawler for finviz.com with SQLite storage, article extraction, and query tool. Use when monitoring financial markets, building news digests, or needing a local financial news database. Runs as a background daemon or systemd service.
capgoblin
Access unsecured credit lines for AI agents on the Arc Network using the Credex Protocol. Use for borrowing USDC against reputation, repaying debt to grow credit limits, providing liquidity as an LP, or managing cross-chain USDC via Circle Bridge. Triggers on "borrow from credex", "repay debt", "deposit to pool", "check credit status", "provide liquidity", or any credit/lending task on Arc.