Overview
- Skill Key: felixopt17/bbccrawlermaxclaw
- Author: felixopt17
- Source Repo: openclaw/skills
- Version: -
- Source Path: skills/felixopt17/bbccrawlermaxclaw
- Latest Commit SHA: f9e9b1ee653c2ad8bf6294be8602f748f92b7297
- Stars: 0
- Installs: 0
- Status: ACTIVE
- Visibility: PUBLIC
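Copy the prompt below and send it to your AI assistant to complete the installation.
First check whether the SkillHub store is installed. If it is not, install it by following https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md, then install the Bbccrawlermaxclaw skill. If it is already installed, install the Bbccrawlermaxclaw skill directly.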
# BBC Crawler MaxClaw
## Description
A powerful, universal web crawler optimized for BBC News but capable of crawling other sites. It integrates advanced scraping technologies including Crawl4AI and Playwright to handle dynamic content and anti-bot protections.
## Features
- **Multi-Method Extraction**:
  - `crawl4ai`: Primary method using AsyncWebCrawler for high performance and accuracy.
  - `playwright`: Full browser rendering fallback for complex dynamic pages.
  - `requests`: Fast fallback for static content.
  - `auto`: Automatically detects the best method (prioritizes Crawl4AI).
- **Hierarchical Storage**: Saves content in a structured format: `YYYY-MM-DD/Category/Title.md`.
- **Local Image Archiving**: Downloads images locally, names them by MD5 hash, and updates Markdown references.
- **Content Filtering**: Intelligently extracts main article content and relevant images using CSS selectors.
## Requirements
- Python 3.9+
- See `requirements.txt` for Python packages.
## Installation
```bash
# 1. Install dependencies
# Note: install.py supports passing arguments to pip, e.g., --break-system-packages
python install.py
# Example for environments requiring --break-system-packages:
python install.py --break-system-packages
```
## Usage
### Basic Usage
```bash
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --max-pages 50
```
### Advanced Usage
```bash
# Force Crawl4AI
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method crawl4ai
# Force Playwright
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method playwright
# Control depth and delay
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --depth 3 --delay 2.5
# Specify output directory
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --output ./my_data
```
## Troubleshooting
- **Import Errors**: If you see "No module named 'crawl4ai'" or similar, run `python install.py` again.
- **Empty Responses**: Ensure you have the...
---
AIGC:
ContentProducer: Minimax Agent AI
ContentPropagator: Minimax Agent AI
Label: AIGC
ProduceID: "00000000000000000000000000000000"
PropagateID: "00000000000000000000000000000000"
ReservedCode1: 3045022074c546b58330a07da35f0e0d94c0ad51cc6d4e445b235ef033071d1b4da4a11f022100b661c9e283570d245986240a3688ea7cfc0107c8f8d80d9985d4b35a77cfa953
ReservedCode2: 3045022100f1f6d58d0a92215eab415fba9da1ec4c2525d726e00458bb0456467ff4e572550220151c90d6d8f68eb711764b45a38831c0f5ba52612215d97fcc3e218db6ed5873
---
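# BBC Crawler Skill
An OpenClaw-based crawler Skill for BBC and general websites.
## Features
- Precise content extraction for BBC sections such as BBC News and BBC Sport
- Generic website crawling (Generic Mode)
- Markdown output
- Automatic extraction of body text, title, author, publication time, and other metadata
- Smart deduplication and depth control
- **Local image download**: automatically downloads article images to local storage and updates the Markdown links, so articles can be viewed offline.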
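The local image download feature can be illustrated with a short sketch. This is not the Skill's actual code: `localize_images` and the injected `fetch` callable are hypothetical names, and hashing the image bytes (rather than the URL) is an assumption, since the description only says images are named by MD5 hash. The `images/` folder next to the article follows the output layout documented for this Skill.

```python
import hashlib
import re
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse

# Matches Markdown image syntax with a remote URL: ![alt](https://...)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")


def localize_images(markdown: str, article_dir: Path,
                    fetch: Callable[[str], bytes]) -> str:
    """Save each remote image under article_dir/images and rewrite its link.

    `fetch` is a hypothetical callable returning the raw bytes for a URL;
    it could wrap something like requests.get(url).content.
    """
    images_dir = article_dir / "images"
    images_dir.mkdir(parents=True, exist_ok=True)

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        try:
            data = fetch(url)
        except Exception:
            # Leave the remote link untouched if the download fails.
            return match.group(0)
        # Assumption: the file name is the MD5 of the image bytes,
        # so identical images end up sharing a single local file.
        digest = hashlib.md5(data).hexdigest()
        suffix = Path(urlparse(url).path).suffix or ".jpg"
        target = images_dir / f"{digest}{suffix}"
        if not target.exists():
            target.write_bytes(data)
        return f"![{alt}](images/{target.name})"

    return IMAGE_PATTERN.sub(replace, markdown)
```

Passing the downloader in as a parameter keeps the rewrite logic testable without network access; a failed download simply keeps the original remote link.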
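## Standalone Installation Guide
This Skill runs on Windows, Linux, and macOS.
### 1. Prepare the environment
Make sure Python 3.8+ is installed.
### 2. Quick install and run (recommended)
**Windows users:**
1. Double-click `install_dependencies.bat` to install the dependencies.
2. Double-click `run_bbc_crawler.bat` to start the demo crawler.
**Linux / macOS users:**
1. Run the install script in a terminal: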
```bash
chmod +x install_dependencies.sh
./install_dependencies.sh
```
2. Run the demo crawler:
```bash
chmod +x run_bbc_crawler.sh
./run_bbc_crawler.sh
```
### 3. Manual install (advanced users)
Run the following in the Skill directory:
```bash
pip install -r requirements.txt
```
### 4. Run the crawler
#### Basic usage
```bash
python universal_crawler_v2.py --url https://www.bbc.com/news --max-pages 50
```
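#### Common parameters
- `--url`: Start URL (required)
- `--max-pages`: Maximum number of pages to crawl (default 50)
- `--depth`: Crawl depth (default 5)
- `--output`: Output directory (defaults to the `data` folder under the current directory)
- `--delay`: Interval between requests (seconds, default 3.0)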
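To show how `--max-pages`, `--depth`, and `--delay` might interact with deduplication, here is a minimal breadth-first sketch. It is an illustration under stated assumptions, not the logic of `universal_crawler_v2.py`: `crawl`, `normalize`, and the injected `get_links` callable are hypothetical names, the start page is treated as depth 0, and staying on the start URL's domain is an assumption the README does not state.

```python
import time
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlparse


def normalize(url: str) -> str:
    """Drop the #fragment and trailing slash so equivalent URLs compare equal."""
    clean, _fragment = urldefrag(url)
    return clean.rstrip("/")


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_pages: int = 50, depth: int = 5, delay: float = 3.0) -> List[str]:
    """Visit each URL at most once, breadth-first, within page and depth limits.

    `get_links` is a hypothetical callable that fetches a page and returns
    the hrefs found on it. Defaults mirror the documented CLI defaults.
    """
    start = normalize(start_url)
    domain = urlparse(start).netloc
    seen = {start}                  # deduplication: every URL queued so far
    queue = deque([(start, 0)])     # (url, depth level); start page is level 0
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        if visited and delay > 0:
            time.sleep(delay)       # pause between consecutive requests
        visited.append(url)
        if level >= depth:
            continue                # depth limit reached: do not follow links
        for href in get_links(url):
            link = normalize(urljoin(url, href))
            # Assumption: only follow links on the start URL's domain.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited
```

Recording URLs in `seen` when they are queued, rather than when they are visited, prevents the same page from being enqueued twice through different parents.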
### 5. Output
Crawl results are saved in the specified output directory, structured as follows:
```
data/
└── {date}/
    └── {category}/
        ├── images/          # image folder
        │   ├── {hash}.jpg
        │   └── ...
        └── {domain...
```