arxivkb

Overview

Skill Key: camopel/arxivkb
Author: camopel
Source Repo: openclaw/skills
Version: -
Source Path: skills/camopel/arxivkb
Latest Commit SHA: 16bd32a36d5182a59ea9aa2069b6f96db31e86bf

Extracted Content

SKILL.md excerpt

# ArXivKB — Science Knowledge Base

## Why This Skill?

🏠 **100% local** — crawls arXiv's free API, embeds with Ollama (nomic-embed-text), indexes in FAISS + SQLite. No cloud cost.

🔍 **Semantic search on paper content** — FAISS indexes PDF chunks (not just abstracts), so you find papers by what they contain.

📂 **arXiv category-based** — tracks official arXiv categories (155 available, 8 groups). No free-text queries.

🧹 **Auto-cleanup** — configurable expiry deletes old papers, PDFs, and chunks.

## Install

```bash
python3 scripts/install.py
```

Works on **macOS and Linux**. Installs Python deps (`faiss-cpu`, `pdfplumber`, `tiktoken`, `arxiv`, `numpy`), pulls `nomic-embed-text` via Ollama, creates data directories and DB.

### Prerequisites

- **Ollama** — must be installed and running (`ollama serve`)
- **Python 3.10+**

## Quick Start

```bash
# 1. Add arXiv categories to track
akb categories add cs.AI cs.CV cs.LG

# 2. Browse all available categories
akb categories browse

# 3. Ingest recent papers (last 7 days)
akb ingest

# 4. Check stats
akb stats
```

## Categories

```bash
akb categories list # Show enabled categories
akb categories browse # Browse all 155 arXiv categories
akb categories browse robotics # Filter by keyword
akb categories add cs.AI cs.RO # Enable categories
akb categories delete cs.AI # Disable a category
```

Categories are official arXiv codes (e.g. `cs.AI`, `eess.IV`, `q-fin.ST`). The full taxonomy is built in.
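
How ArXivKB queries arXiv internally is not shown in this excerpt. As a rough sketch of category-based crawling, the public arXiv API accepts `cat:` terms in `search_query`, so a request for the newest papers in a set of enabled categories could be built like this. The helper name `build_category_query` and the `max_results` default are assumptions made for illustration.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_category_query(categories, max_results=100):
    """Return an arXiv API URL listing the newest papers in the given categories.

    Hypothetical helper for illustration; not part of the akb CLI.
    """
    if not categories:
        raise ValueError("at least one category is required")
    # 'cat:' restricts a search term to the subject-category field.
    search = " OR ".join(f"cat:{code}" for code in categories)
    params = {
        "search_query": search,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_category_query(["cs.AI", "cs.RO"]))
```

Because only category codes go into the query, this matches the "no free-text queries" design: the tracked set is a list of codes rather than search strings.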

## Ingestion

```bash
akb ingest # Crawl, download PDFs, chunk, embed
akb ingest --days 14 # Look back 14 days
akb ingest --dry-run # Preview only
akb ingest --no-pdf # Index abstracts only (faster)
```

Pipeline: arXiv API → PDF download → text extraction (pdfplumber) → chunking (tiktoken, 500 tokens, 50 overlap) → embedding (Ollama nomic-embed-text) → FAISS + SQLite.
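
The chunking step (500 tokens, 50 overlap) can be pictured as a sliding window over token IDs. The sketch below is a minimal illustration, not the skill's actual code: `chunk_tokens` is a hypothetical name, and the input is assumed to be a list of token IDs already produced by a tokenizer such as tiktoken.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token sequence into windows of `size` that share `overlap` tokens.

    Hypothetical helper; the defaults mirror the numbers quoted in the pipeline.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts 450 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    # Stand-in token IDs; real input would be the tokenizer's output for a PDF's text.
    tokens = list(range(1200))
    for chunk in chunk_tokens(tokens):
        print(chunk[0], chunk[-1], len(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding about 10% more tokens.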

## Paper Details

```bash
akb paper 2401.12345...
```

README excerpt

# arxivkb

An arXiv paper crawler with local semantic search (FAISS), topic management, and optional LLM summarization. All embedding is done locally — no cloud APIs required.

Powers the **🔬 ArXiv** app in [PrivateApp](https://github.com/camopel/PrivateApp).

## Install

```bash
python3 scripts/install.py
```

This will:
- Install Python dependencies (`faiss-cpu`, `pdfplumber`, `arxiv`, `numpy`, `tiktoken`)
- Pull the default embedding model via Ollama (`nomic-embed-text`)
- Create the data directory at `~/workspace/arxivkb/`
- Set up a SQLite database with default arXiv categories
- Schedule a daily ingest job (systemd timer on Linux, launchd on macOS)

## Usage

### Manage topics (arXiv categories)

```bash
# Browse available categories
akb topics browse
akb topics browse "machine learning"

# List enabled categories
akb topics list

# Enable categories
akb topics add cs.AI cs.CV cs.RO stat.ML

# Disable a category
akb topics delete cs.AI
```

### Ingest papers

```bash
# Ingest papers from the last 7 days
akb ingest --days 7

# Dry run (show what would be fetched)
akb ingest --days 3 --dry-run

# Expire old papers
akb expire --days 30
```
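
To make the expiry step concrete, here is a minimal sketch of what deleting papers older than a cutoff could look like in SQLite. The schema is hypothetical: the table and column names (`papers.arxiv_id`, `papers.published`, `chunks.arxiv_id`) are assumptions, and the real `arxivkb.db` layout is not shown in this excerpt. Removing the matching PDFs and FAISS vectors would be separate steps.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def create_schema(conn):
    """Create a hypothetical two-table layout (assumed, not the real arxivkb.db schema)."""
    conn.execute("CREATE TABLE papers (arxiv_id TEXT PRIMARY KEY, published TEXT)")
    conn.execute(
        "CREATE TABLE chunks (id INTEGER PRIMARY KEY, arxiv_id TEXT, text TEXT)"
    )


def expire_papers(conn, days, now=None):
    """Delete papers published more than `days` ago, plus their chunks.

    Returns the removed arXiv IDs so the caller can also delete PDFs and vectors.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%d")
    # ISO dates sort lexicographically, so a string comparison is enough here.
    ids = [row[0] for row in conn.execute(
        "SELECT arxiv_id FROM papers WHERE published < ?", (cutoff,))]
    with conn:  # single transaction: chunks first, then the papers themselves
        conn.executemany("DELETE FROM chunks WHERE arxiv_id = ?", [(i,) for i in ids])
        conn.executemany("DELETE FROM papers WHERE arxiv_id = ?", [(i,) for i in ids])
    return ids
```

Returning the IDs rather than a count keeps the database step independent of file cleanup, which can then be retried on its own if it fails.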

### Search papers

```bash
# Semantic search (requires embedding model)
python3 scripts/search.py "transformer attention mechanism" --top 10

# Paper details
akb paper 2310.00001
```
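
The internals of `scripts/search.py` are not shown in this excerpt. As a conceptual stand-in, semantic search over chunk embeddings amounts to ranking chunks by similarity to the query embedding. The brute-force, pure-Python version below illustrates that ranking with cosine similarity; an exact inner-product index over unit-normalized vectors would produce the same ordering, only faster. The names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, chunk_vectors, k=10):
    """Return (chunk_id, score) pairs for the k chunks most similar to the query.

    `chunk_vectors` maps a chunk ID to its embedding (assumed layout).
    """
    q = normalize(query)
    scored = [
        (chunk_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for chunk_id, vec in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], toy, k=2))
```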

### Stats

```bash
akb stats
```

## Data Directory

Papers are stored in `~/workspace/arxivkb/`:
- `arxivkb.db` — SQLite database (papers, chunks, categories)
- `pdfs/` — Downloaded PDF files
- `faiss/` — FAISS vector index files
- `config.json` — Per-user configuration

## Embedding Models

By default, ArXivKB uses `nomic-embed-text` via [Ollama](https://ollama.ai). Make sure Ollama is running:

```bash
ollama serve
ollama pull nomic-embed-text
```

Alternative models can be configured in `~/workspace/arxivkb/config.json`.
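
For orientation, requesting an embedding from a local Ollama server is a single HTTP call. The sketch below uses Ollama's `/api/embeddings` endpoint on its default port. Whether ArXivKB calls Ollama this way is an assumption, and the function names `build_embed_request` and `embed` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default local port


def build_embed_request(text, model="nomic-embed-text"):
    """Build (but do not send) a POST request asking Ollama to embed `text`."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, model="nomic-embed-text", timeout=60):
    """Send the request and return the embedding as a list of floats.

    Requires `ollama serve` to be running and the model to be pulled.
    """
    request = build_embed_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["embedding"]
```

Swapping the embedding model changes the vector dimension, so an index built with one model generally cannot be queried with vectors from another; re-embedding the stored chunks would be needed.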

## Background Service

The installer schedules daily paper ingestion:

```bash
# Linux — systemd timer
systemctl --user...
```

