Overview
- Skill Key: 1kalin/afrexai-web-scraping-engine
- Author: 1kalin
- Source Repo: openclaw/skills
- Version: -
- Source Path: skills/1kalin/afrexai-web-scraping-engine
- Latest Commit SHA: 6d8483d0bcfc4a8a21077afbb88d292f662b3a32
Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrapers, extracting web data, monitoring competitors, or automating data collection at scale.
- Stars: 0
- Installs: 0
- Status: ACTIVE
- Visibility: PUBLIC
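Copy the prompt below and send it to your AI assistant to complete the installation.
First check whether the SkillHub store is installed. If it is not, install it by following https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md, then install the Web Scraping & Data Extraction Engine skill. If it is already installed, install the Web Scraping & Data Extraction Engine skill directly.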
# Web Scraping & Data Extraction Engine
## Quick Health Check (Run First)
Score your scraping operation (2 points each):
| Signal | Healthy | Unhealthy |
|--------|---------|-----------|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |
**Score: /16** → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign
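
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the signal identifiers and the `health_score` function are hypothetical names, and it assumes each signal is scored all-or-nothing (2 points if healthy, 0 otherwise).

```python
# Hypothetical identifiers for the eight signals in the table above.
SIGNALS = [
    "legal_compliance", "architecture", "anti_detection", "data_quality",
    "error_handling", "monitoring", "storage", "scheduling",
]


def health_score(healthy: set[str]) -> tuple[int, str]:
    """Return (score out of 16, verdict) for the set of healthy signals."""
    unknown = healthy - set(SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    score = 2 * len(healthy)  # assumption: 2 points per healthy signal, no partial credit
    if score >= 12:
        verdict = "Production-ready"
    elif score >= 8:
        verdict = "Needs work"
    else:
        verdict = "Stop and redesign"
    return score, verdict


if __name__ == "__main__":
    print(health_score({"legal_compliance", "architecture", "storage", "monitoring"}))
```

A scraper with only four healthy signals scores 8 and falls in the "Needs work" band.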
---
## Phase 1: Legal & Ethical Foundation
### Pre-Scrape Compliance Checklist
```yaml
compliance_brief:
  target_domain: ""
  date_assessed: ""
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: "" # Many sites now block AI crawlers specifically
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
  data_classification:
    type: "" # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: [] # name, email, phone, address, photo
    gdpr_applies: false # EU residents' data
    ccpa_applies: false # California residents' data
  legal_risk: "" # low | medium | high | do-not-scrape
  decision: "" # proceed | use-api | request-permission | abandon
  justification: ""
```
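
As one possible way to fill in the `robots_txt` portion of the brief, the sketch below uses Python's standard-library `urllib.robotparser`. The function name `robots_brief`, the sample robots.txt text, and the `my-bot` user agent are illustrative assumptions. Fetching the file is left to the caller so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def robots_brief(robots_txt: str, user_agent: str, target_url: str) -> dict:
    """Fill the robots_txt fields of the compliance brief from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return {
        "checked": True,
        "target_paths_allowed": parser.can_fetch(user_agent, target_url),
        "crawl_delay_specified": "" if delay is None else str(delay),
    }


# Made-up robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

if __name__ == "__main__":
    print(robots_brief(SAMPLE, "my-bot", "https://example.com/products"))
    print(robots_brief(SAMPLE, "my-bot", "https://example.com/admin/users"))
```

The remaining fields (`ai_bot_rules`, the ToS review, data classification) still need a human read. A parser can only report what robots.txt says, not whether scraping is permitted overall.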
### Legal Landscape Quick Reference
| Scenario | Risk Level | Key Case Law |
|----------|-----------|--------------|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public...
# Web Scraping & Data Extraction Engine 🕸️

Complete web scraping methodology for AI agents and developers, from legal compliance to production-scale data pipelines.

## What This Skill Does

Turns your AI agent into a web scraping expert that:

- **Assesses legality** before touching any site (robots.txt, ToS, GDPR/CCPA)
- **Selects the right tool** (HTTP clients → Scrapy → Playwright → managed services)
- **Defeats anti-bot detection** with proxy rotation, fingerprint diversity, and stealth patterns
- **Builds data pipelines** with validation, deduplication, and structured storage
- **Monitors health** with breakage detection, success rate tracking, and alerting
- **Scales efficiently** from single-site to millions of pages

## Install

```bash
clawhub install afrexai-web-scraping-engine
```

## Quick Start

Tell your agent:

- "Check if I can scrape example.com/products"
- "Build a price monitoring scraper for 3 competitor sites"
- "My scraper keeps getting blocked — help"
- "Extract product data from this URL"

## What's Inside

- **Legal compliance framework** with case law references and decision rules
- **Tool selection matrix** (8 tools compared across 5 dimensions)
- **Anti-detection strategies** (proxy tiers, stealth configs, Cloudflare bypass)
- **Code patterns** for pagination, JS rendering, authentication, change detection
- **Data pipeline** with validation, deduplication, cleaning, and storage
- **5 complete scraping patterns** (e-commerce, jobs, news, social, real estate)
- **Production operations**: monitoring dashboard, breakage detection, runbook
- **100-point quality scoring** rubric

## ⚡ Level Up

This free skill covers methodology. For industry-specific data extraction strategies:

- **[SaaS Context Pack](https://afrexai-cto.github.io/context-packs/)**: Competitor monitoring, pricing intelligence
- **[Ecommerce Context Pack](https://afrexai-cto.github.io/context-packs/)**: Product data, price tracking at scale
- **[Real Estate Context Pack...