pdf-to-structured

Extract structured data from construction PDFs. Convert specifications, BOMs, schedules, and reports from PDF to Excel/CSV/JSON. Use OCR for scanned documents and pdfplumber for native PDFs.

View Source SKILL.md

Stars

Installs

Status

ACTIVE

Visibility

PUBLIC

安装方式

直接复制以下提示词，发送给你的 AI 助手即可完成安装。

请先检查是否已安装 SkillHub 商店，若未安装，请根据 https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md 安装 SkillHub 商店，然后安装 pdf-to-structured 技能。若已安装，则直接安装 pdf-to-structured 技能。

Overview

Skill Key: datadrivenconstruction/pdf-to-structured
Author: datadrivenconstruction
Source Repo: openclaw/skills
Version: -
Source Path: skills/datadrivenconstruction/pdf-to-structured
Latest Commit SHA: 709fad9ac7324fb535f740dffb1b9644af4b1ade

Extracted Content

SKILL.md excerpt

# PDF to Structured Data Conversion

## Overview

Based on DDC methodology (Chapter 2.4), this skill transforms unstructured PDF documents into structured formats suitable for analysis and integration. Construction projects generate vast amounts of PDF documentation - specifications, BOMs, schedules, and reports - that need to be extracted and processed.

**Book Reference:** "Преобразование данных в структурированную форму" / "Data Transformation to Structured Form"

> "Преобразование данных из неструктурированной в структурированную форму — это и искусство, и наука. Этот процесс часто занимает значительную часть работы инженера по обработке данных."
> — DDC Book, Chapter 2.4

## ETL Process Overview

The conversion follows the ETL pattern:
1. **Extract**: Load the PDF document
2. **Transform**: Parse and structure the content
3. **Load**: Save to CSV, Excel, or JSON

## Quick Start

```python
import pdfplumber
import pandas as pd

# Extract table from PDF
with pdfplumber.open("construction_spec.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()
    df = pd.DataFrame(table[1:], columns=table[0])
    df.to_excel("extracted_data.xlsx", index=False)
```

## Installation

```bash
# Core libraries
pip install pdfplumber pandas openpyxl

# For scanned PDFs (OCR)
pip install pytesseract pdf2image
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract

# For advanced PDF operations
pip install pypdf
```

## Native PDF Extraction (pdfplumber)

### Extract All Tables from PDF

```python
import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path):
    """Extract all tables from a PDF file"""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for table_num, table in enumerate(tables):
                if table and len(table) > 1:...

Related Claw Skills

openstockdata

stock-data-skill

★ 4

OpenClaw Skill for stock data analysis

capt-marbles

geo-optimization

★ 1

Generative Engine Optimization (GEO) for AI search visibility. Optimize content to appear in ChatGPT, Perplexity, Claude, and Google AI Overviews. Use when optimizing websites, pages, or content for LLM discoverability and citation.

capgoblin

credex-protocol

★ 0

Access unsecured credit lines for AI agents on the Arc Network using the Credex Protocol. Use for borrowing USDC against reputation, repaying debt to grow credit limits, providing liquidity as an LP, or managing cross-chain USDC via Circle Bridge. Triggers on "borrow from credex", "repay debt", "deposit to pool", "check credit status", "provide liquidity", or any credit/lending task on Arc.

capt-marbles

phantombuster

★ 0

Control PhantomBuster automation agents via API. List agents, launch automations, get output/results, check status, and abort running agents. Use when the user needs to run LinkedIn scraping, Twitter automation, lead generation phantoms, or any PhantomBuster workflow.

camelsprout

duckdb-en

★ 0

DuckDB CLI specialist for SQL analysis, data processing and file conversion. Use for SQL queries, CSV/Parquet/JSON analysis, database queries, or data conversion. Triggers on "duckdb", "sql", "query", "data analysis", "parquet", "convert data".

camohiddendj

ddg-search

★ 0

DuckDuckGo HTML search scraper CLI with JSON, CSV, OpenSearch, markdown, and compact outputs.