Web Scraping & Data Extraction Engine

Overview

Skill Key: 1kalin/afrexai-web-scraping-engine
Author: 1kalin
Source Repo: openclaw/skills
Version: -
Source Path: skills/1kalin/afrexai-web-scraping-engine
Latest Commit SHA: 6d8483d0bcfc4a8a21077afbb88d292f662b3a32

Extracted Content

SKILL.md excerpt

# Web Scraping & Data Extraction Engine

## Quick Health Check (Run First)

Score your scraping operation: 2 points for each signal that matches the Healthy column.

| Signal | Healthy | Unhealthy |
|--------|---------|-----------|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |

**Score: /16** → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign
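The tally above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the signal keys and the `health_score` function are hypothetical names, and it assumes all-or-nothing scoring (2 points for a healthy signal, 0 otherwise), which the table implies but does not state.

```python
from typing import Dict, Tuple

# Hypothetical keys, one per row of the health-check table.
SIGNALS = (
    "legal_compliance", "architecture", "anti_detection", "data_quality",
    "error_handling", "monitoring", "storage", "scheduling",
)


def health_score(healthy: Dict[str, bool]) -> Tuple[int, str]:
    """Award 2 points per healthy signal and map the total to a verdict."""
    missing = [name for name in SIGNALS if name not in healthy]
    if missing:
        raise ValueError(f"missing signals: {missing}")

    total = sum(2 for name in SIGNALS if healthy[name])

    # Thresholds taken from the score line: 12+, 8-11, <8.
    if total >= 12:
        verdict = "Production-ready"
    elif total >= 8:
        verdict = "Needs work"
    else:
        verdict = "Stop and redesign"
    return total, verdict


if __name__ == "__main__":
    answers = {name: True for name in SIGNALS}
    answers["monitoring"] = False
    answers["scheduling"] = False
    print(health_score(answers))
```

With two of the eight signals unhealthy, the total is 12, which still lands in the production-ready band.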

---

## Phase 1: Legal & Ethical Foundation

### Pre-Scrape Compliance Checklist

```yaml
compliance_brief:
  target_domain: ""
  date_assessed: ""
  
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""  # Many sites now block AI crawlers specifically
    
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
    
  data_classification:
    type: ""  # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []  # name, email, phone, address, photo
    gdpr_applies: false  # EU residents' data
    ccpa_applies: false  # California residents' data
    
  legal_risk: ""  # low | medium | high | do-not-scrape
  decision: ""  # proceed | use-api | request-permission | abandon
  justification: ""
```
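The `robots_txt` portion of the brief can be filled in programmatically. The sketch below uses Python's standard `urllib.robotparser`; the `check_robots` helper, the `example-bot` user agent, and the sample robots.txt body are assumptions made for illustration. It parses text you have already fetched, so it runs without network access. The `ai_bot_rules` field is left out because it needs a manual read of the file.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, user_agent: str, url: str) -> dict:
    """Return values for the robots_txt block of the compliance brief."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return {
        "checked": True,
        "target_paths_allowed": parser.can_fetch(user_agent, url),
        "crawl_delay_specified": "" if delay is None else str(delay),
    }


# Hypothetical robots.txt content, used only to exercise the helper.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

if __name__ == "__main__":
    print(check_robots(SAMPLE_ROBOTS, "example-bot", "https://example.com/products"))
    print(check_robots(SAMPLE_ROBOTS, "example-bot", "https://example.com/private/x"))
```

A passing robots.txt check covers only one field of the brief; the ToS and data-classification fields still need human review.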

### Legal Landscape Quick Reference

| Scenario | Risk Level | Key Case Law |
|----------|-----------|--------------|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public...

README excerpt

# Web Scraping & Data Extraction Engine 🕸️

Complete web scraping methodology for AI agents and developers — from legal compliance to production-scale data pipelines.

## What This Skill Does

Turns your AI agent into a web scraping expert that:

- **Assesses legality** before touching any site (robots.txt, ToS, GDPR/CCPA)
- **Selects the right tool** (HTTP clients → Scrapy → Playwright → managed services)
- **Defeats anti-bot detection** with proxy rotation, fingerprint diversity, and stealth patterns
- **Builds data pipelines** with validation, deduplication, and structured storage
- **Monitors health** with breakage detection, success rate tracking, and alerting
- **Scales efficiently** from single-site to millions of pages
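As a rough idea of what the validation and deduplication step in such a pipeline might look like, here is a small standard-library sketch. The record schema (`url`, `title`, `price`) and the function names are hypothetical; the skill's own pipeline is not shown in this excerpt and may differ.

```python
import hashlib
import json
from typing import Iterable, Iterator

# Hypothetical schema: fields every scraped record must carry.
REQUIRED_FIELDS = ("url", "title", "price")


def is_valid(record: dict) -> bool:
    """Reject records with missing or empty required fields or a bad price."""
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            return False
    price = record["price"]
    return isinstance(price, (int, float)) and price >= 0


def fingerprint(record: dict) -> str:
    """Stable hash over the required fields, used as the dedup key."""
    canonical = json.dumps({f: record[f] for f in REQUIRED_FIELDS}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Yield valid records, skipping any whose fingerprint was already seen."""
    seen = set()
    for record in records:
        if not is_valid(record):
            continue
        key = fingerprint(record)
        if key in seen:
            continue
        seen.add(key)
        yield record


if __name__ == "__main__":
    rows = [
        {"url": "https://example.com/a", "title": "Widget", "price": 9.99},
        {"url": "https://example.com/a", "title": "Widget", "price": 9.99},
        {"url": "https://example.com/b", "title": "Gadget", "price": None},
    ]
    print(list(clean(rows)))
```

Hashing only the required fields means two records that differ in incidental fields, such as a scrape timestamp, still count as duplicates.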

## Install

```bash
clawhub install afrexai-web-scraping-engine
```

## Quick Start

Tell your agent:
- "Check if I can scrape example.com/products"
- "Build a price monitoring scraper for 3 competitor sites"
- "My scraper keeps getting blocked — help"
- "Extract product data from this URL"

## What's Inside

- **Legal compliance framework** with case law references and decision rules
- **Tool selection matrix** (8 tools compared across 5 dimensions)
- **Anti-detection strategies** (proxy tiers, stealth configs, Cloudflare bypass)
- **Code patterns** for pagination, JS rendering, authentication, change detection
- **Data pipeline** with validation, deduplication, cleaning, and storage
- **5 complete scraping patterns** (e-commerce, jobs, news, social, real estate)
- **Production operations** — monitoring dashboard, breakage detection, runbook
- **100-point quality scoring** rubric
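Of the code patterns listed, change detection is the simplest to illustrate. The sketch below is one common approach, assumed here rather than taken from the skill: hash the whitespace-normalized page text and compare it with the hash stored from the previous run. Both function names are hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def content_hash(text: str) -> str:
    """Hash page text after collapsing whitespace, so layout noise is ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash: Optional[str], text: str) -> Tuple[bool, str]:
    """Return (changed, new_hash); a missing previous hash counts as changed."""
    current = content_hash(text)
    return current != previous_hash, current


if __name__ == "__main__":
    changed, stored = has_changed(None, "Price:  $10\n")
    print(changed)  # first run, nothing stored yet
    print(has_changed(stored, "Price: $10")[0])
    print(has_changed(stored, "Price: $12")[0])
```

In practice the hash would be computed over the extracted fields rather than the raw HTML, since ads and session tokens change on every request.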

## ⚡ Level Up

This free skill covers methodology. For industry-specific data extraction strategies:

- **[SaaS Context Pack](https://afrexai-cto.github.io/context-packs/)** — Competitor monitoring, pricing intelligence
- **[Ecommerce Context Pack](https://afrexai-cto.github.io/context-packs/)** — Product data, price tracking at scale
- **[Real Estate Context Pack...
