evaluating-llms-harness

4.7k stars · 380 forks · MIT License

Evaluates LLMs across 60+ academic benchmarks, including MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag. Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Widely used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and API-based model backends.

Key Features

Comprehensive skill evaluation and performance tracking
Community-driven ratings and reviews
Easy integration with Claude Code
Regular updates and maintenance

Quick Start

TopRank Skills install Orchestra-Research/lm-evaluation-harness
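
This page does not show the skill's own commands, so the following is only a minimal sketch of what a benchmark run might look like. It assumes the skill wraps EleutherAI's lm-evaluation-harness and its `lm_eval` command-line entry point. The helper name `build_lm_eval_command`, the example model `EleutherAI/pythia-160m`, and the chosen tasks are illustrative assumptions, not part of the skill. The helper only assembles the argument list and does not run anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_lm_eval_command(
    pretrained: str,
    tasks: Sequence[str],
    backend: str = "hf",
    num_fewshot: Optional[int] = None,
    batch_size: str = "auto",
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble an `lm_eval` invocation as an argument list (hypothetical helper).

    `backend` is the harness model type, e.g. "hf" for HuggingFace
    transformers or "vllm" for vLLM.
    """
    if not tasks:
        raise ValueError("at least one task is required")

    cmd = [
        "lm_eval",
        "--model", backend,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    # Optional flags are appended only when the caller sets them.
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


if __name__ == "__main__":
    # Example model and tasks are assumptions chosen for illustration.
    command = build_lm_eval_command(
        "EleutherAI/pythia-160m",
        ["hellaswag", "gsm8k"],
        num_fewshot=5,
    )
    # Print a shell-safe string; pass `command` to subprocess.run to execute.
    print(shlex.join(command))
```

Returning a list instead of a single string keeps the command safe to hand to `subprocess.run` without shell quoting issues, and makes it easy to generate one command per model when comparing several checkpoints.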

Skill Details

GitHub Stars 4.7k

GitHub Forks 380

Created Mar 2026

Last Updated 4 months ago

Tags: tools, debugging

Related Skills

fabric (danielmiessler), 9.7k stars
typescript-expert (vudovn), 4.2k stars
break-loop (mindfold-ai), 3.3k stars
burp-suite (trailofbits), 2.4k stars
page-behavior-audit (openclaw), 2.4k stars
