The Model Landscape

The major AI model families, documented from the developers' own pages. For rankings, we link to the live leaderboards: rankings change weekly, and we would rather send you to the source than publish stale numbers.

55 model families, all primary-sourced

Model families

Durable facts only: who makes it, what it is, how it is distributed. Each entry notes the lineup as of its verification date.

model verified June 7, 2026

AlphaCode

DeepMind's competition-level code generation system, which ranked at about the level of the median competitor in real programming contests.

model verified June 7, 2026

AlphaCode 2

Gemini-powered successor to AlphaCode that reached the 85th percentile among human competitive programmers.

model verified June 7, 2026

Aya (Cohere for AI multilingual model)

Aya is a 2024 open multilingual instruction-tuned model from Cohere for AI covering 101 languages, over half of them lower-resourced.

model verified June 7, 2026

BLOOM (BigScience multilingual LLM)

BLOOM is a 176B-parameter open multilingual language model built by the volunteer BigScience workshop, covering 46 languages plus 13 programming languages.

model verified June 7, 2026

Chronos

Amazon's Chronos, a time-series foundation model that tokenizes numeric series and trains a language model on them for zero-shot forecasting.
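A minimal sketch of the "tokenize a numeric series" idea, assuming mean scaling followed by uniform binning. The bin count, value range, and function names are illustrative assumptions, not Chronos's actual configuration or API.

```python
from typing import List, Tuple


def tokenize_series(values: List[float], num_bins: int = 10,
                    low: float = -5.0, high: float = 5.0) -> Tuple[List[int], float]:
    """Scale a non-empty series by its mean absolute value, then map each point to a bin id."""
    # Scaling lets series of very different magnitudes share one small vocabulary.
    # An all-zero series falls back to a scale of 1.0 to avoid dividing by zero.
    scale = sum(abs(v) for v in values) / len(values) or 1.0
    width = (high - low) / num_bins
    tokens = []
    for v in values:
        # Clamp the scaled value to the covered range, then locate its bin.
        clamped = min(max(v / scale, low), high)
        tokens.append(min(int((clamped - low) / width), num_bins - 1))
    return tokens, scale


def detokenize(tokens: List[int], scale: float, num_bins: int = 10,
               low: float = -5.0, high: float = 5.0) -> List[float]:
    """Map bin ids back to bin-center values in the original units (lossy)."""
    width = (high - low) / num_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]


tokens, scale = tokenize_series([10.0, 20.0, 30.0])
print(tokens, scale)
print(detokenize(tokens, scale))
```

Once a series is a sequence of integer ids, an ordinary language model can be trained to predict the next id, and forecasts are read back through the inverse mapping. The round trip is lossy because every value in a bin decodes to the same bin center.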

model verified June 6, 2026

Claude (Anthropic model family)

Anthropic's family of Claude assistant models, organized into Opus, Sonnet, and Haiku tiers and delivered via API and apps.

model verified June 7, 2026

Claude Opus 4.8

Anthropic's May 2026 flagship, an Opus 4.7 upgrade with sharper agentic judgment, faster fast mode, and dynamic workflows.

model verified June 7, 2026

Code Llama

Meta's family of open code models built on Llama 2, with infilling and long-context support up to 100k tokens.

model verified June 7, 2026

CodeGen

Salesforce's open code model family that framed program synthesis as a multi-turn conversation.

model verified June 7, 2026

Codestral

Mistral AI's first code model, a 22B open-weight system trained on more than 80 programming languages.

model verified June 7, 2026

CodeT5

Salesforce's encoder-decoder code model that uses identifier-aware pretraining for both understanding and generation.

model verified June 7, 2026

Command (Cohere model family)

Cohere's enterprise language models, including Command R and Command R+, built for retrieval-augmented generation, tool use, and multilingual business tasks.

model verified June 7, 2026

Command R (Cohere)

Cohere's family of models built for enterprise retrieval-augmented generation, tool use, and long-context multilingual tasks.

model verified June 7, 2026

DBRX (Databricks open MoE model)

Databricks' open mixture-of-experts model, released March 2024 with 132B total and 36B active parameters, beating earlier open models on standard benchmarks.

model verified June 7, 2026

DeepSeek-Coder

Open code models from DeepSeek trained on project-level code that beat Codex and GPT-3.5 on coding benchmarks.

model verified June 7, 2026

Evo (DNA foundation models)

Arc Institute's family of genomic foundation models that read and generate DNA, RNA, and protein sequences from raw nucleotides.

model verified June 7, 2026

Falcon (TII open-weight model family)

The UAE Technology Innovation Institute's open-weight language models, including Falcon-40B and the 180-billion-parameter Falcon-180B trained on web data.

model verified June 7, 2026

FLUX.1 (Black Forest Labs image models)

Black Forest Labs' 12-billion-parameter text-to-image models, released in August 2024 by much of the original Stable Diffusion team in open and closed tiers.

model verified June 6, 2026

Gemma (Google open-weight model family)

Google's family of open-weight models, built from the same research as Gemini and released for developers to download, fine-tune, and run themselves.

model verified June 7, 2026

Geneformer

A transformer pretrained on about 30 million single-cell transcriptomes that transfers to network-biology tasks with little extra data.

model verified June 6, 2026

GPT (OpenAI model family)

OpenAI's family of general-purpose generative pre-trained transformer models, delivered mainly through the OpenAI API and ChatGPT.

model verified June 7, 2026

Granite (IBM open enterprise model family)

IBM's Apache 2.0 open model family for enterprise use, including the Granite code models trained on 116 programming languages and released in May 2024.

model verified June 6, 2026

Grok (xAI model family)

The Grok family of AI assistant models from Elon Musk's xAI, integrated with the X platform.

model verified June 7, 2026

InCoder

A generative code model from Meta and collaborators that can fill in missing code using context on both sides.

model verified June 7, 2026

Jamba (AI21 hybrid model family)

AI21 Labs' open models built on a hybrid Transformer-Mamba mixture-of-experts architecture, with a 256K-token context window for long-document work.

model verified June 6, 2026

Midjourney (image-generation model family)

An independent lab's text-to-image generator known for a distinctive painterly aesthetic, run first through Discord and later on its own website.

model verified June 6, 2026

Mistral (Mistral AI model family)

French lab Mistral AI's models, mixing open-weight releases with commercial API models across general, code, and audio tasks.

model verified June 7, 2026

Moirai

Salesforce's Moirai, a universal time-series forecasting transformer trained on a 27-billion-observation archive for strong zero-shot performance.

model verified June 7, 2026

NLLB-200 (No Language Left Behind)

Meta's NLLB-200 is a single open model that translates directly between 200 languages, including 150 low-resource ones.

model verified June 8, 2026

NVIDIA Cosmos 3

NVIDIA's open foundation model for physical AI that unifies vision reasoning, world generation, and action across robots and vehicles.

model verified June 6, 2026

o-series (OpenAI reasoning models)

OpenAI's line of reasoning models, beginning with o1, that think through problems step by step before answering.

model verified June 7, 2026

OLMo (Ai2 fully-open model family)

The Allen Institute's family of fully open language models, first released in February 2024 with not just weights but also the training data, code, and checkpoints.

model verified June 6, 2026

Phi (Microsoft small-model family)

Microsoft's Phi family of small language models designed to deliver strong capability at sizes that can run locally.

model verified June 7, 2026

Pythia (EleutherAI research model suite)

EleutherAI's suite of 16 language models from 70M to 12B parameters, all trained on the same data in the same order with 154 checkpoints each.

model verified June 6, 2026

Qwen (Alibaba model family)

Alibaba's Qwen family of language, vision, and image models, many released publicly through Hugging Face and GitHub.

model verified June 7, 2026

Runway Gen-3 Alpha

Runway's 2024 video model that improved fidelity and motion over Gen-2 and was framed as a step toward general world models.

model verified June 7, 2026

scGPT

A generative foundation model for single-cell biology, pretrained on over 33 million cells for tasks like cell typing and batch integration.

model verified June 7, 2026

Snowflake Arctic (Dense-MoE open model)

Snowflake's Apache 2.0 enterprise LLM, released April 2024 with a dense-MoE design of 480B total and 17B active parameters, trained for under $2 million.

model verified June 6, 2026

Sora (OpenAI video-generation model family)

OpenAI's text-to-video model line, from the February 2024 research preview to general availability in December 2024 and the Sora 2 release in 2025.

model verified June 6, 2026

Stable Diffusion (Stability AI model family)

Stability AI's family of open text-to-image diffusion models, from the original 1.x release through SDXL and Stable Diffusion 3, runnable on consumer hardware.

model verified June 7, 2026

StarCoder

BigCode's open 15B code model trained on permissively licensed GitHub repositories with an opt-out process.

model verified June 7, 2026

TimeGPT

Nixtla's TimeGPT, presented as the first foundation model for time series, producing zero-shot forecasts on data it never saw in training.

model verified June 7, 2026

TimesFM

Google's TimesFM, a decoder-only foundation model for forecasting that gives strong zero-shot accuracy across many public time-series datasets.

model verified June 7, 2026

Unitree H1

The Unitree H1 is a full-size general-purpose humanoid robot from Chinese firm Unitree, notable for fast bipedal running and a relatively low price.

Live leaderboards

Where current rankings actually live. These are the authoritative sources we point to instead of freezing scores that go stale.

live

Epoch AI

Structured data on notable models, compute, and hardware

live

Hugging Face

Open model hub with downloads, evals, and trending models

How models are measured

The benchmarks behind the headlines: what each one actually tests.

benchmark

AgentBench

A 2023 benchmark spanning eight environments that measured how well LLMs act as agents, exposing a wide commercial-vs-open gap.

benchmark

AgentHarm

A benchmark of malicious agent tasks that tests whether tool-using LLM agents refuse harmful requests and resist jailbreaks.

benchmark

AGIEval

A benchmark that grades foundation models on real human exams like the SAT, LSAT, and college entrance tests.

benchmark

Aider Polyglot Benchmark

Tests whether an LLM can edit existing code correctly across six languages, using 225 hard Exercism exercises.

benchmark

API-Bank

A 2023 benchmark that runs real API calls to test whether LLMs can plan, retrieve, and correctly invoke tools.

benchmark

ARC (AI2 Reasoning Challenge)

A 2018 set of 7,787 grade-school science questions, split into Easy and Challenge sets that defeated retrieval methods.

benchmark

ARC-AGI-2

The 2025 successor to ARC-AGI, a harder reasoning benchmark on which frontier AI systems initially scored in the single digits.

benchmark

BIG-Bench Hard (BBH)

The 23 hardest BIG-Bench tasks where models trailed humans, used to show chain-of-thought prompting unlocks hidden ability.

benchmark

BigCodeBench

A code benchmark of 1,140 practical tasks that require calling many real libraries and following complex instructions.

benchmark

Brown Corpus

The Brown Corpus, compiled in 1964, was the first million-word computer-readable sample of American English and seeded modern corpus linguistics.

benchmark

BrowseComp

OpenAI benchmark of 1,266 questions that force a web agent to dig persistently for hard-to-find facts.

benchmark

ChestX-ray14 (ChestX-ray8)

NIH's 2017 release of 108,948 chest X-rays from 32,717 patients with NLP-mined disease labels became a standard medical-imaging benchmark.

benchmark

CIFAR-10 and CIFAR-100

Two small labeled image datasets from Krizhevsky and Hinton that became standard testbeds for deep learning research.

benchmark

COCO (Common Objects in Context)

A large image dataset with per-object segmentation that became the standard benchmark for detection, segmentation, and captioning.

benchmark

Conceptual Captions

Google's dataset of about 3.3 million image-caption pairs harvested and cleaned from web alt-text, far larger than hand-curated caption sets like COCO.

benchmark

Cybench

A benchmark of 40 professional capture-the-flag tasks that measures how well AI agents can perform real cybersecurity work.

benchmark

DataComp

A 2023 benchmark that fixes the training code and competes on the data instead, using a pool of 12.8 billion image-text pairs to test dataset curation.

benchmark

DeepMind Control Suite

The DeepMind Control Suite is a standardized set of MuJoCo-based continuous control tasks widely used to benchmark RL agents.

benchmark

DROP

A reading-comprehension benchmark of 96,000 questions requiring discrete reasoning like counting, addition, and sorting.

benchmark

DS-1000

A benchmark of 1,000 real data-science coding problems across seven Python libraries, drawn from StackOverflow.

benchmark

Ego-Exo4D

A 1,286-hour dataset pairing first-person and third-person video of skilled activities like sports, music, and dance.

benchmark

Ego4D

A 3,670-hour dataset of first-person daily-life video from 9 countries, built to teach machines egocentric perception.

benchmark

FActScore

An evaluation that breaks long text into atomic facts and scores the share supported by a reliable source.
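A minimal sketch of the scoring step only, assuming the text has already been split into atomic facts. The `is_supported` callback and the toy knowledge set are hypothetical stand-ins for whatever verifier and source the real evaluation uses.

```python
from typing import Callable, List


def fact_score(atomic_facts: List[str], is_supported: Callable[[str], bool]) -> float:
    """Return the share of atomic facts that the verifier marks as supported."""
    if not atomic_facts:
        raise ValueError("need at least one atomic fact to score")
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Toy knowledge source: a fact counts as supported only if it appears verbatim.
knowledge = {"Paris is in France.", "The Seine flows through Paris."}
facts = [
    "Paris is in France.",
    "The Seine flows through Paris.",
    "Paris is the capital of Spain.",
    "Paris has 40 million residents.",
]
print(fact_score(facts, lambda fact: fact in knowledge))
```

Exact string lookup is far too strict for real text; it is used here only to make the arithmetic visible.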

benchmark

FineWeb

Hugging Face's openly documented 15-trillion-token web dataset that set a new bar for transparent large-scale pre-training data.

benchmark

FrontierMath

Epoch AI's benchmark of original research-level math problems, of which leading models initially solved less than two percent.

benchmark

GDPval

OpenAI benchmark scoring AI on real economically valuable work across 44 occupations in nine GDP sectors.

benchmark

Global-MMLU

A Cohere-led rebuild of MMLU across 42 languages that flags culturally biased and Western-centric questions.

benchmark

GLUE and SuperGLUE

GLUE (2018) bundled nine language tasks into one score and became the BERT-era scoreboard, then saturated within months, prompting SuperGLUE in 2019.

benchmark

GSM-Symbolic

A template-based variant of GSM8K showing model math accuracy drops when numbers change or irrelevant clauses are added.
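A minimal sketch of what a template-based variant could look like: the same problem skeleton is re-instantiated with fresh names and numbers, optionally with an irrelevant clause added. The template, name list, and distractor sentence are invented for illustration and are not taken from the benchmark.

```python
import random
from typing import Tuple

# Hypothetical template; the answer is always a + b regardless of surface details.
SETUP = "{name} has {a} apples and buys {b} more."
DISTRACTOR = "Five of the apples are slightly smaller than average."
QUESTION = "How many apples does {name} have now?"


def make_variant(rng: random.Random, distractor: bool = False) -> Tuple[str, int]:
    """Instantiate the template with random values and return (question, answer)."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    parts = [SETUP.format(name=name, a=a, b=b)]
    if distractor:
        # The extra clause changes the wording but not the correct answer.
        parts.append(DISTRACTOR)
    parts.append(QUESTION.format(name=name))
    return " ".join(parts), a + b


plain, answer = make_variant(random.Random(0))
noisy, same_answer = make_variant(random.Random(0), distractor=True)
print(plain, answer)
print(noisy, same_answer)
```

Comparing a model's accuracy on `plain` and `noisy` variants that share an answer isolates sensitivity to surface changes from actual arithmetic ability.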

benchmark

GSM8K (Grade School Math 8K)

A dataset of about 8,500 grade-school math word problems that tests a model's multi-step arithmetic reasoning.

benchmark

Gymnasium

Gymnasium is the maintained successor to OpenAI Gym, providing the de facto standard API for RL environments.

benchmark

HalluLens

A Meta hallucination benchmark built on a clear taxonomy, with test sets that are regenerated dynamically to resist data leakage.

benchmark

HarmBench

A standardized framework for automated red teaming, comparing attacks that try to make models produce harmful content.

benchmark

HellaSwag

A 2019 commonsense benchmark built by adversarial filtering, where humans scored over 95% but top models under 48%.

benchmark

HumanEval

A 164-problem test that checks whether a model can write working Python code from a natural-language description, graded by unit tests.
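A minimal sketch of unit-test grading, assuming each problem ships a candidate solution and a block of assertions as source strings. The function name and the sample problem are hypothetical; a real harness would run untrusted code in a sandbox with timeouts, which this sketch does not do.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run the candidate, then the tests, in one namespace; any exception is a failure."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code. Never use this on untrusted model output
        # outside an isolated sandbox.
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
print(passes_tests(good, tests), passes_tests(bad, tests))
```

A problem counts as solved only if every assertion passes, so a syntactically valid but wrong program scores the same as one that does not run at all.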

benchmark

Humanity's Last Exam (HLE)

A 2,500-question expert-level exam across many subjects, built to stay hard for frontier AI as easier benchmarks get saturated.

benchmark

KITTI Vision Benchmark Suite

KITTI is a 2012 real-world driving benchmark with camera and lidar data that became the standard test for autonomous-vehicle perception.

benchmark

Labeled Faces in the Wild (LFW)

LFW, released by UMass in 2007, holds 13,233 web photos of 5,749 people and became the standard face verification benchmark.

benchmark

LiveBench

A contamination-limited benchmark with monthly fresh questions and objective scoring across six task categories.

benchmark

LiveCodeBench

A code benchmark that keeps collecting new contest problems over time so models cannot have memorized the answers.

benchmark

LMArena (Chatbot Arena)

A live leaderboard that ranks AI chatbots by anonymous head-to-head human preference votes.
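A minimal sketch of turning head-to-head votes into a ranking, using a plain Elo update as an assumed rating scheme. The entry above only says that votes are pairwise; the K-factor, starting rating, and model names here are illustrative and not the leaderboard's actual method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def elo_ratings(votes: Iterable[Tuple[str, str]], k: float = 32.0,
                base: float = 1000.0) -> Dict[str, float]:
    """Apply one Elo update per (winner, loser) vote, in order."""
    ratings: Dict[str, float] = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        # An upset (low expected) moves both ratings more than a predictable win.
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for name, rating in sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]):
    print(name, round(rating, 1))
```

Sequential Elo depends on vote order; fitting all votes at once with a Bradley-Terry model is a common alternative when order should not matter.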

benchmark

MASK Benchmark (Honesty)

A benchmark that separates honesty from accuracy, measuring whether language models lie under pressure rather than just whether they are correct.

benchmark

MATH (Competition Mathematics Dataset)

A dataset of 12,500 challenging competition math problems, each with a full worked solution, used to measure mathematical problem-solving in AI.

benchmark

MathVista

A benchmark of 6,141 problems that test whether models can do math reasoning over images, charts, and diagrams.

benchmark

MGSM

A multilingual grade-school math benchmark of 250 problems translated into ten diverse languages.

benchmark

Mind2Web

A 2023 benchmark of 2,000+ real tasks across 137 websites for testing agents that follow instructions on any site.

benchmark

MLPerf

An industry benchmark suite from MLCommons that measures how fast computing systems can train and run AI models.

benchmark

MMLU-Pro

A harder version of MMLU with ten answer choices and reasoning-focused questions, where scores dropped 16 to 33 percent.

benchmark

MMMU-Pro

A hardened version of MMMU that strips out questions text-only models can guess and adds a vision-only mode.

benchmark

MNIST (handwritten digit database)

A dataset of 70,000 handwritten digit images that became the default first benchmark for computer vision and machine learning.

benchmark

MS MARCO

Microsoft's reading-comprehension and retrieval dataset built from a million real Bing search queries with human-written answers and millions of passages.

benchmark

MT-Bench

A set of multi-turn questions graded by a strong model acting as judge, used to score chat assistants on open-ended replies.

benchmark

MultiMedQA

MultiMedQA, introduced with Med-PaLM in 2023, bundles seven medical question-answering sets to test how well LLMs encode clinical knowledge.

benchmark

Needle In A Haystack

Greg Kamradt's simple test that plants a fact in long text and checks if a model can retrieve it.
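A minimal sketch of the plant-and-retrieve setup, assuming the long text is a list of filler sentences and the check is a substring match. The function names, the depth parameter, and the sample sentences are illustrative assumptions rather than the original test's code.

```python
from typing import List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def retrieved(model_answer: str, expected: str) -> bool:
    """Case-insensitive check that the planted fact appears in the model's answer."""
    return expected.lower() in model_answer.lower()


filler = ["The weather was mild.", "Trains ran on time.", "Nobody noticed.", "It rained later."]
context = build_haystack(filler, "The vault code is 7421.", depth=0.5)
print(context)
print(retrieved("I believe the vault code is 7421.", "7421"))
```

Sweeping `depth` and the filler length produces the grid of results the test is usually reported as, with one cell per (context length, needle position) pair.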

benchmark

nuScenes

nuScenes is a 2019 driving dataset with the full sensor suite, 6 cameras, 5 radars, and a lidar, across 1,000 annotated scenes.

benchmark

Open LLM Leaderboard

Hugging Face's public ranking that scored open models on the same reproducible benchmarks, drawing over two million visitors.

benchmark

OSWorld

A 2024 benchmark of 369 real computer tasks across Ubuntu, Windows, and macOS where humans scored 72% and the best agent only 12%.

benchmark

Penn Treebank

A Penn-built corpus of syntactically annotated English that became the standard training and test set for parsing and language modeling.

benchmark

PlanBench

A planning benchmark from the automated-planning community that exposes how poorly LLMs generate valid plans.

benchmark

RepoBench

A benchmark for repository-level code completion that tests retrieval and prediction across multiple files.

benchmark

RULER

A synthetic long-context benchmark that finds most models can effectively use far less context than they advertise.

benchmark

SimpleQA

OpenAI benchmark of 4,326 short factual questions that measures whether a model knows what it knows.

benchmark

SuperGLUE

A harder successor to GLUE, created after models surpassed non-expert humans on the original language-understanding benchmark.

benchmark

SWE-bench

A benchmark that tests whether AI systems can resolve real GitHub issues by editing real codebases, graded by the projects' own tests.

benchmark

SWE-bench Verified

A human-validated 500-task subset of SWE-bench, built with OpenAI, that became the headline measure of AI coding agents.

benchmark

TheAgentCompany

A 2024 CMU benchmark of 175 real workplace tasks in a simulated software company, where the best agent finished only about 30% autonomously.

benchmark

TruthfulQA

A 2021 benchmark of 817 questions where the largest models were often the least truthful, mimicking human misconceptions.

benchmark

Visual Genome

A dataset of 108,077 images with dense crowdsourced annotations of objects, attributes, and relationships, aimed at visual reasoning, not just recognition.

benchmark

VSI-Bench

A visual-spatial benchmark testing whether multimodal models can understand and recall spaces from video.

benchmark

Waymo Open Dataset

Waymo's 2019 open dataset released high-quality lidar and camera data from 1,150 driving scenes for autonomous-driving perception research.

benchmark

WebArena

A 2023 CMU benchmark of realistic, self-hosted websites that exposed how poorly LLM agents perform real web tasks.

benchmark

WebLI

Google's web-scale multilingual image-text dataset, around 10 billion images with text in over 100 languages, built to train the PaLI vision-language model.

benchmark

WebVoyager

A 2024 benchmark and agent that uses screenshots and multimodal models to complete tasks on 15 real websites.

benchmark

Winograd Schema Challenge

The Winograd Schema Challenge tests commonsense reasoning with pronoun puzzles, proposed in 2012 as an alternative to the Turing test.

benchmark

WinoGrande

A 2019 benchmark of 44,000 Winograd-style pronoun puzzles, scaled up and debiased to test real commonsense reasoning.

benchmark

WMDP (Weapons of Mass Destruction Proxy)

A benchmark of 3,668 questions that proxies hazardous biology, cyber, and chemistry knowledge in language models, paired with an unlearning method.

benchmark

WordNet

Princeton's hand-built lexical database of English that organized words into concept sets and underpinned decades of NLP, as well as the category hierarchy of ImageNet.