Overview
- Skill Key: baokui/pdf-ocr-layout
- Author: baokui
- Source Repo: openclaw/skills
- Version: -
- Source Path: skills/baokui/pdf-ocr-layout
- Latest Commit SHA: be0a5a9a63f12fcc33fc20238d237551bbe0136d
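Copy the prompt below and send it to your AI assistant to complete the installation.
First check whether the SkillHub store is installed. If it is not, install it by following https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md, then install the Pdf Ocr Layout skill. If it is already installed, install the Pdf Ocr Layout skill directly.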
# GLM-OCR Multimodal Deep Analysis
This tool builds a high-precision document parsing pipeline: **GLM-OCR** extracts layout elements, **GLM-4.7** interprets the logic of table data, and **GLM-4.6V** performs multimodal visual interpretation of images and charts.
## Pipeline Implementation Architecture
This Skill consists of two core script stages, orchestrated through `glm_ocr_pipeline.py`:
### 1. Extraction Stage (`scripts/glm_ocr_extract.py`)
- **Core Model**: GLM-OCR
- **Function**: Responsible for physical layout analysis of documents
- **Output**: Extracts table HTML and cleans it into Markdown, automatically crops each chart into a separate image file based on its bbox coordinates, and generates an intermediate JSON file containing the full-page reading order
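
The "table HTML cleaned to Markdown" step could look roughly like the sketch below. It is not the code in `scripts/glm_ocr_extract.py`; the class `TableToMarkdown` and the function `html_table_to_markdown` are hypothetical names, and the sketch assumes a single simple table without nested tables, `rowspan`, or `colspan`.

```python
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect the cell text of one simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so a cell cannot break its row.
            text = " ".join("".join(self._cell).split()).replace("|", "\\|")
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html: str) -> str:
    """Convert a simple HTML table to a Markdown table (first row = header)."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    width = max(len(r) for r in parser.rows)
    # Pad short rows so every line has the same number of columns.
    rows = [r + [""] * (width - len(r)) for r in parser.rows]
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ("<table><tr><th>Revenue</th><th>Q1</th></tr>"
              "<tr><td>Product A</td><td>120</td></tr></table>")
    print(html_table_to_markdown(sample))
```

The first row is treated as the header because Markdown tables require one; a real implementation would likely need to handle merged cells as well.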
### 2. Understanding Stage (`scripts/glm_understanding.py`)
- **Core Model**: GLM-4.7 (text) / GLM-4.6V (visual)
- **Function**: Responsible for deep semantic reasoning of content
- **Logic**:
  - **Tables**: Combines the full-text context and uses GLM-4.7 to analyze the business meaning of the Markdown table data
  - **Charts**: Combines the full-text context with the cropped images and uses GLM-4.6V for multimodal visual analysis
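
The routing described above (tables to the text model, charts to the vision model) can be sketched as follows. This is an illustration, not the contents of `scripts/glm_understanding.py`: the function `build_request`, the model identifier strings, the `"chart"` type value, the `image_path` field, and the prompt wording are all assumptions. Only `type` and `content_info` appear in the documented output. No API call is made; the function only describes what would be sent.

```python
TEXT_MODEL = "glm-4.7"      # assumed identifier for the text model
VISION_MODEL = "glm-4.6v"   # assumed identifier for the vision model


def build_request(element: dict, page_text: str) -> dict:
    """Return a hypothetical request description for one layout element."""
    if element["type"] == "table":
        # Tables: full-text context plus the Markdown table go to the text model.
        prompt = (
            "Document context:\n" + page_text
            + "\n\nExplain the business meaning of this table:\n"
            + element["content_info"]
        )
        return {"model": TEXT_MODEL, "prompt": prompt, "images": []}
    if element["type"] == "chart":
        # Charts: full-text context plus the cropped image go to the vision model.
        prompt = (
            "Document context:\n" + page_text
            + "\n\nDescribe what the attached chart shows."
        )
        return {"model": VISION_MODEL, "prompt": prompt,
                "images": [element["image_path"]]}  # image_path is a hypothetical field
    raise ValueError(f"unsupported element type: {element['type']!r}")


if __name__ == "__main__":
    table = {"type": "table", "content_info": "| Revenue | Q1 |"}
    print(build_request(table, "Quarterly report, page 3")["model"])
```

Keeping request construction separate from the network call makes the routing rule easy to check without credentials.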
## Invocation Methods
### Command Line Invocation
```bash
# Run complete pipeline: extraction -> cropping -> understanding analysis, supports input in .pdf, .jpg, .png and other formats
python scripts/glm_ocr_pipeline.py \
--file_path "/data/report_page.jpg" \
--output_dir "/data/output"
```
## API Parameter Description
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file_path | string | ✅ | Absolute path to input file (supports .pdf, .png, .jpg) |
| output_dir | string | ✅ | Result output directory (used to save cropped images and JSON reports) |
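
A minimal sketch of how the two parameters might be parsed and checked, assuming an `argparse`-based entry point. The function `parse_args` and the constant `SUPPORTED_EXTENSIONS` are hypothetical; the extension set mirrors the table above, and the actual script may accept more formats.

```python
import argparse
import os

# Extensions listed in the parameter table; the real script may accept others.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg"}


def parse_args(argv=None) -> argparse.Namespace:
    """Parse and validate --file_path and --output_dir (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="GLM-OCR pipeline arguments (sketch)")
    parser.add_argument("--file_path", required=True,
                        help="absolute path to the input file")
    parser.add_argument("--output_dir", required=True,
                        help="directory for cropped images and JSON reports")
    args = parser.parse_args(argv)

    # The table asks for an absolute path, so reject relative ones early.
    if not os.path.isabs(args.file_path):
        parser.error("--file_path must be an absolute path")

    ext = os.path.splitext(args.file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        parser.error(f"unsupported file type: {ext or '(none)'}")
    return args


if __name__ == "__main__":
    demo_path = os.path.abspath("report_page.jpg")
    args = parse_args(["--file_path", demo_path, "--output_dir", "output"])
    print(args.file_path, args.output_dir)
```

`parser.error` prints a usage message and exits with status 2, which matches the usual behavior of command-line tools on invalid input.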
## Return Result Structure (JSON)
The tool returns a list of layout elements, each paired with its analysis result:
```json
[
  {
    "type": "table",
    "bbox": [100, 200, 500, 600],
    "content_info": "| Revenue | Q1 |\n|-...
```