track-models

maintained by flashinfer-ai

196 stars · 28 forks · MIT License

name: track-models
description: Track popular/new open-source LLMs and update docs/model_coverage.mdx with their kernel support status. Use when discovering new models to add to the coverage tracker, checking if a specific model is covered, or refreshing model coverage documentation.

Track Models

Discover popular or newly released open-source LLMs, determine their kernel coverage in FlashInfer-Bench, and update docs/model_coverage.mdx with summary and detailed per-kernel status tables.

Usage

# Auto-discover new popular models not yet tracked
/track-models --discover

# Add a specific model by name
/track-models --model-name mistral-7b
/track-models --model-name gemma-3-27b --hf-repo-id google/gemma-3-27b-it

# Refresh coverage status for all already-tracked models
/track-models --refresh-status

# Do everything: discover new models + refresh existing ones
/track-models --discover --refresh-status

Parameters

  • --discover (optional): Auto-discover new popular models not yet in model_coverage.mdx
  • --model-name (optional): Specific model to add or refresh (e.g., "mistral-7b", "gemma-3-27b")
  • --hf-repo-id (optional): Override the HuggingFace repo ID (e.g., "google/gemma-3-27b-it")
  • --refresh-status (optional): Re-check definition files for all already-tracked models and update ✅/❌ status

Prerequisites

  • docs/model_coverage.mdx must exist (it does — managed by this skill)
  • flashinfer_trace/definitions/ must exist with current definition JSON files
  • Run /clone-repos first if you need SGLang or sgl-cookbook configs for a model that isn't in the existing patterns

What This Skill Does

Phase 1: Model Discovery

When --discover is set, find popular open-source LLMs not yet tracked in model_coverage.mdx.

Discovery Sources (in order)

  1. SGLang supported models list:

    ls tmp/sglang/python/sglang/srt/models/
    

    SGLang models are a curated list of production-quality LLMs. Every model in SGLang is a candidate.

  2. sgl-cookbook (models with recommended serving configs):

    ls tmp/sgl-cookbook/data/models/generated/v0.5.6/
    

    Models with a YAML config are actively deployed — highest priority.

  3. Inference API provider model lists (strong popularity signal): A model served by two or more commercial inference APIs is almost certainly in production use. Fetch or browse the following provider pages and extract the open-weight model list:

    | Provider | URL | Notes |
    |----------|-----|-------|
    | Together AI | https://www.together.ai/pricing | Full model catalog with pricing tiers |
    | Fireworks AI | https://fireworks.ai/pricing | Open-source models section |
    | Groq | https://groq.com/pricing | Smaller curated list, very high-traffic models |
    | OpenRouter | https://openrouter.ai/models | Largest aggregator; filter by free or sort by throughput |
    | DeepInfra | https://deepinfra.com/models | Broad open-source catalog |
    | Hyperbolic | https://app.hyperbolic.xyz/models | GPU-native provider |
    | Cerebras | https://inference.cerebras.ai | Wafer-scale inference, limited but popular subset |

    Heuristic: Any model listed by ≥ 2 providers should be added to coverage tracking. Models on ≥ 3 providers are the highest priority (most widely deployed in production).

    Known high-priority models from provider crawls (update this list over time):

    • Llama 4 Scout (17Bx16E) — Together, Groq, sgl-cookbook
    • Llama 4 Maverick (17Bx128E MoE) — Together, Groq
    • Llama 3.1 405B — Together, Fireworks
    • gpt-oss 120B / gpt-oss 20B — Together, Fireworks, Groq (OpenAI open-source)
    • GLM-4.5 / GLM-4.6 / GLM-5 — Together, Fireworks; sgl-cookbook has glm46.yaml
    • MiniMax M2 / M2.5 — Together, Fireworks
    • Kimi K2 / K2.5 — Together, Fireworks, Groq
    • DeepSeek-V3.1 / R1-0528 — Together, Fireworks (updated DeepSeek variants)
    • Qwen3 235B A22B — Together, Fireworks, Groq
    • Mistral Small 3 (24B) — Together; sgl-cookbook has mistral.yaml
  4. HuggingFace trending (optional, web search): Search for "huggingface trending LLM open source {current_year}" to find newly popular models. Focus on models that are:

    • Open-weight (not API-only)
    • LLM/VLM with transformer-based decoder
    • Have ≥ 1K downloads/week or are from a major lab (Meta, Google, Mistral, Alibaba, etc.)
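The provider-count heuristic from source 3 can be sketched as below. This is a minimal illustration, not part of the skill itself: the function name `prioritize`, the shape of `provider_listings`, and the placeholder provider and model names are all assumptions, and no real crawl results are shown.

```python
from collections import Counter


def prioritize(provider_listings):
    """Bucket models by how many providers list them.

    provider_listings: dict of provider name -> iterable of model names
    (hypothetical input; in practice it would be extracted from provider pages).
    """
    counts = Counter()
    for models in provider_listings.values():
        # De-duplicate within a provider so each provider counts once per model.
        counts.update(set(models))
    highest = sorted(m for m, n in counts.items() if n >= 3)
    track = sorted(m for m, n in counts.items() if n == 2)
    return {"highest_priority": highest, "track": track}


# Placeholder data for illustration only.
listings = {
    "provider_a": ["model-x", "model-y", "model-z"],
    "provider_b": ["model-x", "model-y"],
    "provider_c": ["model-x"],
}
result = prioritize(listings)
```

Models seen on a single provider are left out of both buckets, which matches the "≥ 2 providers" threshold; they can still enter tracking through the other discovery sources.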

Filtering Already-Tracked Models

Read the current docs/model_coverage.mdx and extract all model names from the ## Summary table. Skip any model whose name is already listed.
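One way to extract the tracked names is sketched below. It assumes the Summary table is a Markdown pipe table whose first column holds the model name and whose header cell is literally "Model"; both are assumptions about the file layout, and `tracked_models` is a hypothetical helper name.

```python
import re


def tracked_models(mdx_text):
    """Return the model names in the first column of the ## Summary table."""
    names = set()
    in_summary = False
    for line in mdx_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Enter the Summary section; any other level-2 heading ends it.
            in_summary = stripped == "## Summary"
            continue
        if not in_summary or not stripped.startswith("|"):
            continue
        cells = [c.strip() for c in stripped.strip("|").split("|")]
        first = cells[0]
        # Skip the header row (assumed to be "Model") and the |---| separator row.
        if not first or first.lower() == "model" or re.fullmatch(r":?-+:?", first):
            continue
        names.add(first)
    return names


# Illustrative document fragment, not the real file contents.
sample = """## Summary

| Model | Architecture | Coverage |
|-------|--------------|:--------:|
| Llama 3.1 8B | GQA + Dense | ✅ Fully covered |
| Qwen3 30B A3B | GQA + MoE | 🟡 Partial |

## Llama 3.1 8B
| `rmsnorm_h4096` | rmsnorm | ✅ |
"""
already_tracked = tracked_models(sample)
```

A discovered candidate is then skipped when its display name is in `already_tracked`.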

Architecture Priority

Prefer to discover models that use kernel types already supported by FlashInfer-Bench:

  • GQA attention (most LLMs)
  • MLA attention (DeepSeek family)
  • GDN linear attention (Qwen3-Next family)
  • MoE FFN (Mixtral, Qwen-MoE, DeepSeek, etc.)
  • Mamba/SSM (NemotronH, GraniteMoe-Hybrid)
  • Dense FFN (Llama, Gemma, Mistral)

Phase 2: Architecture Extraction

For each new model, extract architecture details. Use the same approach as the extract-kernel-definitions skill.

Step 1: Get config.json from HuggingFace

from huggingface_hub import hf_hub_download
import json

config_path = hf_hub_download(repo_id="meta-llama/Llama-3.1-8B", filename="config.json")
with open(config_path) as f:
    config = json.load(f)

Key fields to extract:

| Field | Used For |
|-------|----------|
| hidden_size | RMSNorm definition names, GEMM K dim |
| num_hidden_layers | Layer count |
| num_attention_heads | GQA/MLA q_heads |
| num_key_value_heads | GQA kv_heads |
| head_dim or hidden_size / num_attention_heads | Attention head dim |
| intermediate_size | GEMM N dim for MLP |
| vocab_size | Sampling definition names |
| architectures | Model class name → infer op types |
| num_experts / num_local_experts | MoE expert count |
| num_experts_per_tok | MoE topk |
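The field extraction, including the head_dim fallback, might look like the sketch below. The helper name `extract_arch_fields` and the output key names are hypothetical, and the sample dict is only an illustrative Llama-style config; treating a missing num_key_value_heads as plain multi-head attention is an assumption.

```python
def extract_arch_fields(config):
    """Pull the kernel-relevant fields out of a parsed config.json dict."""
    hidden = config["hidden_size"]
    q_heads = config["num_attention_heads"]
    # Fall back to hidden_size / num_attention_heads when head_dim is absent or null.
    head_dim = config.get("head_dim") or hidden // q_heads
    return {
        "hidden_size": hidden,
        "num_layers": config["num_hidden_layers"],
        "q_heads": q_heads,
        # Assumption: no num_key_value_heads means kv_heads == q_heads.
        "kv_heads": config.get("num_key_value_heads", q_heads),
        "head_dim": head_dim,
        "intermediate_size": config.get("intermediate_size"),
        "vocab_size": config["vocab_size"],
        "architectures": config.get("architectures", []),
        # Either key may carry the expert count; None means a dense model.
        "num_experts": config.get("num_experts", config.get("num_local_experts")),
        "moe_topk": config.get("num_experts_per_tok"),
    }


# Illustrative Llama-style config, not fetched from the Hub.
sample_config = {
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "intermediate_size": 14336,
    "vocab_size": 128256,
    "architectures": ["LlamaForCausalLM"],
}
fields = extract_arch_fields(sample_config)
```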

Step 2: Determine attention type from architecture

Use the architectures field in config.json or the SGLang model class:

| Architecture String | Attention Type | Notes |
|---------------------|----------------|-------|
| LlamaForCausalLM, MistralForCausalLM, GemmaForCausalLM, Qwen2ForCausalLM | GQA | Standard |
| DeepseekV2ForCausalLM, DeepseekV3ForCausalLM | MLA | Has q_lora_rank, kv_lora_rank, qk_rope_head_dim |
| Qwen3NextForCausalLM | GDN + GQA hybrid | Has gdn_* config keys |
| NemotronHForCausalLM | GQA + Mamba2 hybrid | Has ssm_* config keys |
| MixtralForCausalLM | GQA + MoE | Has num_local_experts |
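The table above translates directly into a lookup. The sketch below is one possible encoding; the names `ATTENTION_BY_ARCH` and `attention_type` are hypothetical, and returning `None` for an unknown class is an assumption that lines up with the "Architecture not recognized" case under Error Handling.

```python
# Architecture class name -> attention type, copied from the table above.
ATTENTION_BY_ARCH = {
    "LlamaForCausalLM": "GQA",
    "MistralForCausalLM": "GQA",
    "GemmaForCausalLM": "GQA",
    "Qwen2ForCausalLM": "GQA",
    "DeepseekV2ForCausalLM": "MLA",
    "DeepseekV3ForCausalLM": "MLA",
    "Qwen3NextForCausalLM": "GDN + GQA hybrid",
    "NemotronHForCausalLM": "GQA + Mamba2 hybrid",
    "MixtralForCausalLM": "GQA + MoE",
}


def attention_type(config):
    """Return the attention type for the first recognized architecture, else None."""
    for arch in config.get("architectures", []):
        if arch in ATTENTION_BY_ARCH:
            return ATTENTION_BY_ARCH[arch]
    return None  # unknown pattern: handle as "architecture not recognized"


known = attention_type({"architectures": ["DeepseekV3ForCausalLM"]})
unknown = attention_type({"architectures": ["SomeNewForCausalLM"]})
```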

Step 3: Find sgl-cookbook TP/EP config

find tmp/sgl-cookbook/data/models/generated/v0.5.6/ -name "*.yaml" | xargs grep -l "{model_keyword}"

Parse the YAML to extract all unique tp and ep values. Use these to compute per-TP head counts.

If no sgl-cookbook config exists, use TP=1 (single GPU baseline).
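Once the TP values are known, the per-TP head counts follow from integer division. The sketch below assumes the TP values have already been collected into a list (YAML parsing is not shown because the cookbook schema is not described here); `per_tp_heads` is a hypothetical helper, and it defaults to TP=1 as stated above.

```python
def per_tp_heads(q_heads, kv_heads, tp_values=(1,)):
    """Map each unique TP value to its (q_heads // TP, kv_heads // TP) pair."""
    # Duplicates are collapsed so each TP config is reported once.
    return {tp: (q_heads // tp, kv_heads // tp) for tp in sorted(set(tp_values))}


# Example: 32 query heads, 8 KV heads, TP values as they might appear in configs.
heads = per_tp_heads(32, 8, tp_values=[1, 2, 4, 2])
baseline = per_tp_heads(32, 8)  # no cookbook config: TP=1 only
```

A result of 0 KV heads for some TP value signals that the config needs a manual check rather than a generated definition name.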


Phase 3: Map Kernels to Definitions

For each model, compute the full list of expected definition names, then check which ones exist in flashinfer_trace/definitions/.

3a: Compute expected definitions

Follow these rules (same as CLAUDE.md):

RMSNorm (not TP-dependent):

  • rmsnorm_h{hidden_size}
  • fused_add_rmsnorm_h{hidden_size}
  • If MLA: also rmsnorm_h{q_lora_rank} and rmsnorm_h{kv_lora_rank}

GEMM (not TP-dependent):

  • gemm_n{intermediate_size}_k{hidden_size} (gate/up proj)
  • gemm_n{hidden_size}_k{intermediate_size} (down proj)
  • gemm_n{hidden_size}_k{hidden_size} (o_proj, if square)
  • For MLA: gemm_n{q_lora_rank * num_heads}_k{hidden_size} etc. (check SGLang impl)

GQA (TP-dependent, per TP value):

  • gqa_paged_prefill_causal_h{q//TP}_kv{kv//TP}_d{head_dim}_ps1
  • gqa_paged_prefill_causal_h{q//TP}_kv{kv//TP}_d{head_dim}_ps64
  • gqa_paged_decode_h{q//TP}_kv{kv//TP}_d{head_dim}_ps1
  • gqa_paged_decode_h{q//TP}_kv{kv//TP}_d{head_dim}_ps64
  • gqa_ragged_prefill_causal_h{q//TP}_kv{kv//TP}_d{head_dim}
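The five GQA names for one TP value can be generated mechanically from the patterns above. This is a small sketch; `gqa_definition_names` is a hypothetical helper name and the head counts in the example are illustrative.

```python
def gqa_definition_names(q_heads, kv_heads, head_dim, tp=1):
    """Build the five expected GQA definition names for one TP value."""
    base = f"h{q_heads // tp}_kv{kv_heads // tp}_d{head_dim}"
    names = []
    for page_size in (1, 64):
        names.append(f"gqa_paged_prefill_causal_{base}_ps{page_size}")
    for page_size in (1, 64):
        names.append(f"gqa_paged_decode_{base}_ps{page_size}")
    # The ragged variant has no page size suffix.
    names.append(f"gqa_ragged_prefill_causal_{base}")
    return names


names_tp2 = gqa_definition_names(q_heads=32, kv_heads=8, head_dim=128, tp=2)
```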

MLA (TP-dependent, per TP value):

  • mla_paged_prefill_causal_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps1
  • mla_paged_prefill_causal_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps64
  • mla_paged_decode_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps1
  • mla_paged_decode_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_ps64

Where ckv_dim = kv_lora_rank (the compressed KV dimension) and kpe_dim = qk_rope_head_dim. For DeepSeek V3 these are 512 and 64, giving ckv512_kpe64.

DSA (for DeepSeek V3.2-style, TP-dependent):

  • dsa_topk_indexer_fp8_h{num_heads//TP}_d{head_dim}_topk{topk}_ps64
  • dsa_sparse_attention_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_topk{topk}_ps1
  • dsa_sparse_attention_h{num_heads//TP}_ckv{ckv_dim}_kpe{kpe_dim}_topk{topk}_ps64

GDN (for Qwen3-Next-style, TP-dependent, per TP value):

  • gdn_prefill_qk{q_heads//TP}_v{v_heads//TP}_d{head_dim}_k_last
  • gdn_decode_qk{q_heads//TP}_v{v_heads//TP}_d{head_dim}_k_last
  • gdn_mtp_qk{q_heads//TP}_v{v_heads//TP}_d{head_dim}_k_last

Mamba2 SSU (for NemotronH-style, TP-dependent):

  • mamba_ssu_decode_h{nheads//TP}_d{head_dim}_s{dstate}_ng{ngroups//TP}
  • Constraints: head_dim ∈ [64,128,256], dstate ∈ [64,128,256], nheads/ngroups ∈ [1,8,16]
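The constraint check can be sketched as a small predicate. It reads "nheads/ngroups" as the ratio of heads to groups, which is an interpretation of the line above rather than something confirmed against the FlashInfer source; `mamba_ssu_supported` and the example values are hypothetical.

```python
ALLOWED_HEAD_DIM = {64, 128, 256}
ALLOWED_DSTATE = {64, 128, 256}
ALLOWED_HEADS_PER_GROUP = {1, 8, 16}  # assumed to mean nheads // ngroups


def mamba_ssu_supported(nheads, ngroups, head_dim, dstate):
    """Return True when the shape satisfies the listed Mamba2 SSU constraints."""
    if ngroups <= 0 or nheads % ngroups:
        return False
    return (
        head_dim in ALLOWED_HEAD_DIM
        and dstate in ALLOWED_DSTATE
        and nheads // ngroups in ALLOWED_HEADS_PER_GROUP
    )


ok = mamba_ssu_supported(nheads=64, ngroups=8, head_dim=64, dstate=128)
too_many_heads_per_group = mamba_ssu_supported(nheads=128, ngroups=1, head_dim=64, dstate=128)
```

Running the check before generating a `mamba_ssu_decode_*` name avoids listing a definition that could never be satisfied.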

MoE (EP-dependent):

  • moe_fp8_block_scale_ds_routing_topk{topk}_ng{n_group}_kg{topk_group}_e{num_experts//EP}_h{hidden_size}_i{intermediate_size}

Sampling (not TP-dependent):

  • top_k_sampling_from_probs_v{vocab_size}
  • top_k_top_p_sampling_from_probs_v{vocab_size}
  • top_p_sampling_from_probs_v{vocab_size}

3b: Check definition existence

For each expected definition name, check:

find flashinfer_trace/definitions/ -name "{definition_name}.json"

Assign status:

  • ✅: JSON file exists in flashinfer_trace/definitions/
  • ❌: Name computed from model config, but no JSON file found (missing, needs to be created)
  • —: Module exists in the model architecture but definition is not computed/mapped (unmapped)
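The existence check can also be done in one pass over the definitions directory. The sketch below is an alternative to calling `find` per name; `definition_status` is a hypothetical helper, and the demo uses a throwaway directory with an assumed `rmsnorm/` subfolder rather than the real repository layout.

```python
import tempfile
from pathlib import Path


def definition_status(names, definitions_dir="flashinfer_trace/definitions"):
    """Map each expected definition name to ✅ (file found) or ❌ (missing)."""
    root = Path(definitions_dir)
    # Search recursively, like `find`, so nested op-type folders are covered.
    existing = {p.stem for p in root.rglob("*.json")} if root.is_dir() else set()
    return {name: ("✅" if name in existing else "❌") for name in names}


# Demo against a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "rmsnorm").mkdir()
    (Path(tmp) / "rmsnorm" / "rmsnorm_h4096.json").write_text("{}")
    status = definition_status(
        ["rmsnorm_h4096", "fused_add_rmsnorm_h4096"], definitions_dir=tmp
    )
```

Unmapped modules never pass through this function, since they have no computed name; they are written with the dash status directly.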

3c: Compute coverage summary

coverage = (count of ✅) / (count of ✅ + count of ❌)
  • ✅ Fully covered — all expected definitions exist (0 ❌)
  • 🟡 Partial — some definitions exist, some missing
  • ❌ Not covered — no definitions exist (all ❌ or —)
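The coverage formula and the three labels can be expressed as a short function. This is a sketch under the rules above; `coverage_summary` is a hypothetical name, and unmapped entries are excluded from the ratio because only ✅ and ❌ appear in the formula.

```python
def coverage_summary(statuses):
    """Return (coverage ratio, label) for a list of per-definition statuses."""
    statuses = list(statuses)
    present = statuses.count("✅")
    missing = statuses.count("❌")
    total = present + missing  # unmapped "—" rows are not counted
    ratio = present / total if total else 0.0
    if present and not missing:
        label = "✅ Fully covered"
    elif present:
        label = "🟡 Partial"
    else:
        label = "❌ Not covered"
    return ratio, label


partial = coverage_summary(["✅", "✅", "❌", "—"])
full = coverage_summary(["✅", "—"])
none = coverage_summary(["—"])
```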

Phase 4: Update docs/model_coverage.mdx

4a: Add row to Summary table

Find the ## Summary section and add (or update) the model row:

| {Model Display Name} | {Architecture description} | {Coverage emoji + label} |

Architecture description examples:

  • GQA + Dense (standard transformer)
  • GQA + MoE (standard attention + mixture of experts)
  • MLA + Dense/MoE (multi-head latent attention)
  • GDN + GQA + MoE (hybrid linear + standard attention)
  • GQA + Mamba2 (hybrid attention + SSM)

4b: Add detailed section

Append (or update) a new ## {Model Display Name} section before the closing ---:

## {Model Display Name}

**Architecture**: {num_layers} decoder layers, {attention_type} attention, {ffn_type} FFN{extra details}

{If multiple TP configs, add note}: Standard serving configuration: **TP={N}** (or TP={N} / TP={M}).

| Definition | Op Type | Status |
|-----------|---------|:------:|
| `rmsnorm_h{hidden_size}` | rmsnorm | ✅ |
| `fused_add_rmsnorm_h{hidden_size}` | rmsnorm | ✅ |
| ... | ... | ... |
| MoE gate / topk / experts | moe | — |

**Coverage**: {N} / {M} definitions present. {Optional: note about missing definitions.}

---

4c: Preserve existing sections

Do NOT overwrite existing model sections unless --refresh-status is set. When refreshing:

  • Re-check all ✅/❌ in the existing table
  • Update the coverage count line
  • Update the summary table row

Documentation Standards

Model Display Names

Use consistent naming matching the summary table:

  • DeepSeek V3 / R1 (joint entry for same-architecture variants)
  • Llama 3.1 8B (include size)
  • Qwen3 30B A3B (include size + variant)
  • Mistral 7B v0.3
  • Gemma 3 27B

Architecture Descriptions

Keep brief (5–8 words), following existing patterns:

  • {layer_count} decoder layers, {attention} attention, {ffn} FFN
  • Add hybrid prefix for mixed architectures (e.g., 48 layers, hybrid GDN+GQA attention, MoE FFN)

TP/EP Notes

When multiple TP configs exist, list every config in the table as a separate row, with the TP label in the Op Type column:

| `gdn_prefill_qk16_v32_d128_k_last` | gdn TP=1 | ❌ |
| `gdn_prefill_qk8_v16_d128_k_last` | gdn TP=2 | ✅ |
| `gdn_prefill_qk4_v8_d128_k_last` | gdn TP=4 | ✅ |

For MoE with EP:

| `moe_..._e256_...` | moe EP=1 | ❌ |
| `moe_..._e32_...` | moe EP=8 | ✅ |
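Rows in this format can be generated from the per-TP names and their statuses. The sketch below covers the TP case only; `tp_rows` is a hypothetical helper, and defaulting an unchecked name to ❌ is an assumption.

```python
def tp_rows(op_type, names_by_tp, status_by_name):
    """Render one table row per definition, labelled with its TP value."""
    rows = []
    for tp in sorted(names_by_tp):
        for name in names_by_tp[tp]:
            # Assumption: a name with no recorded status is treated as missing.
            status = status_by_name.get(name, "❌")
            rows.append(f"| `{name}` | {op_type} TP={tp} | {status} |")
    return rows


# Inputs mirror the GDN example rows above.
names_by_tp = {
    1: ["gdn_prefill_qk16_v32_d128_k_last"],
    2: ["gdn_prefill_qk8_v16_d128_k_last"],
    4: ["gdn_prefill_qk4_v8_d128_k_last"],
}
status_by_name = {
    "gdn_prefill_qk8_v16_d128_k_last": "✅",
    "gdn_prefill_qk4_v8_d128_k_last": "✅",
}
rows = tp_rows("gdn", names_by_tp, status_by_name)
```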

Unmapped Modules

For modules that exist in the architecture but don't have definitions mapped, use — in the Status column:

| MoE gate / topk / experts | moe | — |

Output

After running, report:

  1. Models added: List of new models added to model_coverage.mdx
  2. Models refreshed: List of models whose coverage status was updated
  3. Missing definitions: List of ❌ definitions that need to be created (suggest running /extract-kernel-definitions)
  4. Summary delta: What changed in the summary table (newly fully covered, newly partial, etc.)

Common Model Architectures Reference

GQA Standard (Llama / Mistral / Gemma / Qwen2.5)

Expected definitions:

  • 2× RMSNorm
  • 4× GEMM (qkv_proj, o_proj, gate_up, down)
  • 5× GQA (paged decode ps1, paged decode ps64, paged prefill ps1, paged prefill ps64, ragged prefill)
  • 3× Sampling (top_k, top_k_top_p, top_p)

GQA + MoE (Mixtral / Qwen2-MoE / Qwen3-MoE)

Expected definitions: same as GQA Standard, plus:

  • 1× MoE (per EP config)

MLA + Dense/MoE (DeepSeek V2/V3/R1)

Expected definitions:

  • 4× RMSNorm (hidden, q_lora, kv_lora, plus sometimes kv_lora variants)
  • N× GEMM (MLA has more projections: qkv_a, q_b, kv_b, o_proj, gate/up/down)
  • 4× MLA (paged decode ps1, paged decode ps64, paged prefill ps1, paged prefill ps64)
  • 1× MLA ragged (prefill)
  • 1× MoE (per EP config)
  • 3× Sampling

GDN + GQA + MoE (Qwen3-Next)

Expected definitions:

  • 2× RMSNorm
  • 3× GDN per TP config (prefill, decode, mtp)
  • 5× GQA per TP config
  • 1× MoE
  • (Sampling: if tracked)

GQA + Mamba2 (NemotronH / GraniteMoe-Hybrid)

Expected definitions:

  • 2× RMSNorm
  • N× GEMM
  • 5× GQA
  • 1× Mamba SSU per TP config
  • 3× Sampling

Note: Check FlashInfer Mamba SSU constraints before adding: nheads/ngroups ∈ [1, 8, 16]. Models that violate this (e.g., FalconH1 ngroups=1 nheads=128) cannot use FlashInfer SSU and should be marked ❌ for Mamba kernel with a note.


Integration with Other Skills

# Discover new models and see what's missing
/track-models --discover

# Create missing kernel definitions for a new model
/extract-kernel-definitions --model-name mistral_7b

# Add reference tests for new definitions
/add-reference-tests --op-type gqa_paged

# Full workflow for a brand new model
/clone-repos
/track-models --model-name gemma-3-27b --hf-repo-id google/gemma-3-27b-it
/extract-kernel-definitions --model-name gemma3
/add-reference-tests --op-type gqa_paged

Error Handling

HuggingFace config not accessible

  • Cause: Private model or network error
  • Handling: Note the model as ❌ Not covered with a "config unavailable" note in the detailed section. Proceed with other models.

Architecture not recognized

  • Cause: Novel architecture not in the known patterns
  • Handling: Add model to summary as ❌ Not covered, add a detailed section with — (architecture not yet mapped) for all kernel rows and a note explaining the unknown pattern.

Definition file naming ambiguity

  • Cause: Computed name doesn't match any definition (off-by-one in dims, different naming convention)
  • Handling: Mark as ❌ and list the computed name in the detailed section. Cross-check with actual files in flashinfer_trace/definitions/ before marking as missing.

Model not in SGLang

  • Cause: Model isn't implemented in SGLang yet
  • Handling: Still add to coverage doc using HuggingFace config.json. Note "SGLang implementation pending" in the architecture description.

Maintaining This Document

Update this file when:

  • New op_types are added to FlashInfer-Bench (add to Phase 3 mapping and Common Architectures)
  • Definition naming conventions change
  • New major model families emerge (add to Common Architectures)

Skill Details

GitHub Stars 196
GitHub Forks 28
Created Mar 2026
Last Updated 3 months ago