Methodology

How we collect data, compute rankings, and score models for use-case fit.

Data Sources

We sync model data hourly from two primary sources:

  • OpenRouter API — Provides model metadata including context window, max output tokens, supported parameters (vision, tool use, function calling, extended thinking, web search, prompt caching), and current pricing from multiple inference providers.
  • LiteLLM — Provides additional pricing data, cached input pricing, batch discount rates, and capability flags like supports_reasoning.

Where the two sources conflict, we prefer OpenRouter for context window and max output values (more up-to-date), and prefer LiteLLM for pricing (broader provider coverage). Both are merged into a single canonical record per model.
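
As a rough illustration of this merge policy, here is a minimal sketch in TypeScript. The record shape and field names are simplified assumptions, not our actual schema.

    // Simplified source record; real records carry many more fields.
    interface SourceRecord {
      id: string;
      contextWindow?: number;
      maxOutputTokens?: number;
      inputCostPerMTok?: number;  // USD per million input tokens
      outputCostPerMTok?: number; // USD per million output tokens
    }

    function mergeRecords(openRouter: SourceRecord, liteLLM: SourceRecord): SourceRecord {
      return {
        id: openRouter.id,
        // Prefer OpenRouter for structural specs, falling back to LiteLLM when missing.
        contextWindow: openRouter.contextWindow ?? liteLLM.contextWindow,
        maxOutputTokens: openRouter.maxOutputTokens ?? liteLLM.maxOutputTokens,
        // Prefer LiteLLM for pricing (broader provider coverage).
        inputCostPerMTok: liteLLM.inputCostPerMTok ?? openRouter.inputCostPerMTok,
        outputCostPerMTok: liteLLM.outputCostPerMTok ?? openRouter.outputCostPerMTok,
      };
    }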

Deduplication

Many models appear under multiple IDs across providers (e.g. openai/gpt-4o and gpt-4o). We normalize model IDs by stripping provider prefixes and version suffixes, then group into canonical families. For catalog display, we show only the most-capable variant of each model family (highest context window), filtered to models with at least one public inference provider.
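
For illustration, a sketch of the normalization and family-grouping step. The prefix and suffix patterns shown here are simplified assumptions; the real lists are broader.

    interface ModelRecord {
      id: string;
      contextWindow: number;
    }

    // Illustrative normalization only; real handling covers more prefixes and suffixes.
    function canonicalFamily(modelId: string): string {
      return modelId
        .toLowerCase()
        .replace(/^[a-z0-9-]+\//, "")       // strip provider prefix, e.g. "openai/"
        .replace(/[-:]\d{4}[-\d]*$/, "")    // strip date-style version suffixes, e.g. "-2024-08-06"
        .replace(/-(latest|preview)$/, ""); // strip common release tags
    }

    // Within a family, display the most-capable variant (highest context window).
    function pickDisplayVariant(family: ModelRecord[]): ModelRecord {
      return family.reduce((best, m) => (m.contextWindow > best.contextWindow ? m : best));
    }

    // canonicalFamily("openai/gpt-4o") and canonicalFamily("gpt-4o") both yield "gpt-4o".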

Capability Inference

Providers don't always declare every capability in their API metadata. Where explicit signals are absent, we infer them using heuristics like the following (sketched in code after the list):

  • Speed tier — Inferred from model name: flash, mini, haiku → fast; o3, r1, thinking → deep; all others → balanced.
  • Batch API — Inferred from well-known providers that publish batch APIs (OpenAI, Anthropic, Google).
  • Web search — Inferred from model IDs containing search or online, and from OpenRouter's supported_parameters.
  • Best-for tags — Assigned based on capability combination: coding (tool use + function calling), agents (tool use + large context), reasoning (extended thinking), multimodal (vision).
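
A minimal sketch of these heuristics in TypeScript. The keyword patterns and the large-context threshold used for the agents tag are illustrative assumptions, not exact values.

    type SpeedTier = "fast" | "balanced" | "deep";

    function inferSpeedTier(modelId: string): SpeedTier {
      const id = modelId.toLowerCase();
      if (/flash|mini|haiku/.test(id)) return "fast";
      if (/o3|r1|thinking/.test(id)) return "deep";
      return "balanced";
    }

    interface Capabilities {
      toolUse: boolean;
      functionCalling: boolean;
      extendedThinking: boolean;
      vision: boolean;
      contextWindow: number;
    }

    function inferBestForTags(c: Capabilities): string[] {
      const tags: string[] = [];
      if (c.toolUse && c.functionCalling) tags.push("coding");
      // 128K as the "large context" cutoff is an assumption made for this sketch.
      if (c.toolUse && c.contextWindow >= 128_000) tags.push("agents");
      if (c.extendedThinking) tags.push("reasoning");
      if (c.vision) tags.push("multimodal");
      return tags;
    }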

Rankings

Each ranking category uses a scoring function over the normalized model data:

  • Largest Context — Sorted by context_window descending.
  • Best for RAG — Context window score × retrieval capability bonus (tool use, streaming).
  • Best for Agents — Requires tool use + function calling. Scored by context × capability richness.
  • Best for Documents — Requires ≥ 100K context. Scored by context window size.
  • Cheapest per Context Token — Score = context_window / (input_cost + ε). Higher = more tokens per dollar (see the sketch after this list).
  • Best Multimodal — Requires vision. Scored by context × output capacity.
  • Best Reasoning — Requires extended thinking or "reasoning" model tag. Scored by reasoning capability + context.
  • Best for Chatbots — Balanced score across context, speed tier, and price.
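
As a concrete example, the Cheapest per Context Token score can be computed as sketched below. The value of ε and the sample figures are placeholders; only the formula itself comes from the list above.

    interface PricedModel {
      id: string;
      contextWindow: number;
      inputCostPerMTok: number; // USD per million input tokens
    }

    const EPSILON = 1e-6; // guards against division by zero for free models

    // score = context_window / (input_cost + ε); higher = more tokens per dollar
    function costEfficiencyScore(m: PricedModel): number {
      return m.contextWindow / (m.inputCostPerMTok + EPSILON);
    }

    // Placeholder data, not live catalog values.
    const sample: PricedModel[] = [
      { id: "model-a", contextWindow: 1_000_000, inputCostPerMTok: 1.25 },
      { id: "model-b", contextWindow: 200_000, inputCostPerMTok: 0.15 },
    ];

    const ranked = [...sample].sort((a, b) => costEfficiencyScore(b) - costEfficiencyScore(a));
    // model-b (~1.33M context tokens per dollar) ranks above model-a (0.8M).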

Use-Case Scoring

Each use case defines hard gates (minimum context, required capabilities) and a soft scoring function. Models that fail the hard gates are excluded; the remaining models are scored and ranked for that use case. Scoring weights vary by use case: RAG weights context heavily, Agents weights tool use and function calling, and Document Processing weights very large context.
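
The gate-then-score pattern this describes looks roughly like the sketch below; the specific thresholds and weights are illustrative assumptions, not our production values.

    interface CatalogModel {
      id: string;
      contextWindow: number;
      toolUse: boolean;
      functionCalling: boolean;
    }

    interface UseCase {
      gate: (m: CatalogModel) => boolean;  // hard requirements; failing models are excluded
      score: (m: CatalogModel) => number;  // soft score over the models that pass
    }

    // Illustrative definitions; real gates and weights differ.
    const useCases: Record<string, UseCase> = {
      rag: {
        gate: (m) => m.contextWindow >= 32_000, // assumed minimum
        score: (m) => 0.8 * Math.log2(m.contextWindow) + 0.2 * (m.toolUse ? 1 : 0),
      },
      documentProcessing: {
        gate: (m) => m.contextWindow >= 100_000,
        score: (m) => Math.log2(m.contextWindow),
      },
    };

    function rankForUseCase(models: CatalogModel[], useCase: UseCase): CatalogModel[] {
      return models.filter(useCase.gate).sort((a, b) => useCase.score(b) - useCase.score(a));
    }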

Pricing

All prices are in USD per million tokens ($/M). We display the cheapest available price across all inference providers tracked for that model. Prices shown are standard-tier (not batch or cached). Batch pricing and cached input pricing are shown separately on model detail pages where available.
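
A sketch of how the displayed price can be selected from the tracked providers. Comparing offers by input cost is an assumption made for this example.

    interface ProviderPrice {
      provider: string;
      inputCostPerMTok: number;  // USD per million tokens, standard tier
      outputCostPerMTok: number;
    }

    // Pick the cheapest standard-tier offer across tracked providers.
    function cheapestOffer(offers: ProviderPrice[]): ProviderPrice | undefined {
      return offers.reduce<ProviderPrice | undefined>(
        (best, offer) =>
          best === undefined || offer.inputCostPerMTok < best.inputCostPerMTok ? offer : best,
        undefined
      );
    }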

Catalog Freshness

Data syncs run hourly via scheduled jobs, and pages revalidate at most once per hour. When a provider updates a model's context window, pricing, or capabilities, the change propagates to all pages within the next sync cycle. Deprecated or unlisted models are hidden from rankings but remain accessible via direct URL for historical reference.
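
For illustration, assuming a Next.js-style setup with incremental static regeneration (an assumption, not something stated above), the freshness and visibility rules translate roughly to:

    // Regenerate each page at most once per hour (Next.js App Router convention).
    export const revalidate = 3600;

    // Deprecated or unlisted models stay resolvable by direct URL but are
    // excluded from ranking queries.
    export function visibleInRankings(m: { deprecated: boolean; unlisted: boolean }): boolean {
      return !m.deprecated && !m.unlisted;
    }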

What We Don't Include

We deliberately exclude subjective quality metrics like benchmark scores (MMLU, HumanEval, BIG-Bench) because they are often self-reported, vary by prompt format, and don't translate directly to performance on your specific task. We focus on verifiable, structural specifications.

Questions about our methodology? Contact the Mem0 team. Found an error? See our changelog.