Reading Android Bench: scores, cost, and the efficiency trap

Google quietly shipped something useful in March 2026: Android Bench, a leaderboard that scores large language models on real Android engineering tasks instead of generic coding puzzles. As an Android developer, this is the first benchmark I actually care about — it's measuring the thing I'd use these models for. So I pulled the June 9th 2026 leaderboard apart to see what it really says.

TL;DR

- The best model (GPT 5.5) resolves 74% of tasks, and even the top four are a statistical tie once you read the confidence intervals.
- On raw score-per-dollar, Gemini 3.1 Pro Preview is the standout of the top tier: the same 72.4% as GPT 5.4 at roughly half the cost.
- The cost and latency columns are seductive but misleading: by Google's own admission, models that fail fast look cheap.
- For Android specifically, no model is "done." Even the leader leaves roughly a quarter of the tasks unresolved.

## What Android Bench actually measures

Most coding benchmarks lean on web and Python repositories. Android development has its own shape — Kotlin, Jetpack Compose, the View system, Gradle, the lifecycle — and Google built the benchmark to reflect it.¹

The dataset is 100 tasks distilled from 38,989 pull requests across open-source Android projects, weighted to mirror real codebases:

- 71% Kotlin, 25% Java
- 41% Jetpack Compose, 59% View-based

Each model gets a real issue and must produce a patch; a Patch Verifier compiles the project and runs the test suite to decide whether the issue is actually resolved. The headline score is the average percentage of the 100 tasks solved across 10 runs, with a 95% confidence interval (p < 0.05) attached. To guard against contamination, Google embeds the standard BIG-BENCH canary string; to rule out reward hacking, it manually reviewed agent trajectories.¹

That last part matters: this isn't "did the text look right," it's "did the patch compile and pass tests." For Android, where a plausible-looking change can still break the build, that's the right bar.
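The scoring arithmetic described above can be sketched in a few lines. This is a minimal illustration, not Google's harness: the `TOY_RESULTS` matrix is made-up data (3 runs of 4 tasks), and the percentile bootstrap over tasks is only one common way to get an interval. The article's sources do not say which method the leaderboard uses.

```python
import random

# Made-up illustration data: TOY_RESULTS[r][t] is True when run r resolved task t.
TOY_RESULTS = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, True],
]

def headline_score(results):
    """Average percentage of tasks resolved, taken across runs."""
    per_run = [100.0 * sum(run) / len(run) for run in results]
    return sum(per_run) / len(per_run)

def bootstrap_interval(results, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over tasks (an assumed method, not the board's)."""
    rng = random.Random(seed)
    n_tasks = len(results[0])
    # Resolve rate of each task, averaged over runs.
    task_rates = [sum(run[t] for run in results) / len(results) for t in range(n_tasks)]
    means = []
    for _ in range(n_resamples):
        sample = [task_rates[rng.randrange(n_tasks)] for _ in range(n_tasks)]
        means.append(100.0 * sum(sample) / n_tasks)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

score = headline_score(TOY_RESULTS)
low, high = bootstrap_interval(TOY_RESULTS)
print(f"score={score:.1f}% interval=({low:.1f}, {high:.1f})")
```

With only a handful of tasks the interval is very wide, which is the same effect, at smaller scale, that makes a 100-task board hard to rank by a point or two.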

## The scoreboard

Top of the June 9th 2026 board:²

| # | Model | Score | 95% CI | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
| 1 | GPT 5.5 | 74.0% | 66.9–80.6 | 15.7h | 64.7M | $134.2 |
| 2 | GPT 5.4 | 72.4% | 65.4–79.0 | 21.2h | 64.2M | $91.7 |
| 3 | Gemini 3.1 Pro Preview | 72.4% | 65.2–79.0 | 11.1h | 73.3M | $47.9 |
| 4 | Claude Opus 4.7 | 68.7% | 61.0–76.0 | 11.6h | 90.0M | $124.3 |
| 5 | Claude Opus 4.6 | 66.6% | 59.2–74.0 | 9.9h | 69.5M | $84.4 |
| 6 | Gemini 3.5 Flash | 63.7% | 56.3–70.7 | 14.2h | 355.9M | $147.1 |
| 7 | GLM 5.1 | 59.7% | 52.1–67.4 | 33.4h | 80.2M | $46.7 |
| 8 | Kimi K2.6 | 58.6% | 51.3–66.1 | 29.9h | 94.3M | $42.5 |
| 9 | Claude Sonnet 4.6 | 58.4% | 50.3–66.3 | 8.2h | 47.9M | $40.4 |
| 10 | DeepSeek V4 Pro | 55.4% | 47.9–63.5 | 35.8h | 132.7M | $13.7 |

The first thing to sit with: the best score in the world on real Android tasks is 74%. Even the leader fails roughly a quarter of these issues. AI assistance is genuinely strong here, but "it'll handle my backlog" is not where we are.

## The top is a tie — read the intervals, not the rank

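It's tempting to crown GPT 5.5 the "best Android model." But look at the confidence intervals: GPT 5.5's 66.9–80.6 overlaps GPT 5.4's 65.4–79.0, Gemini 3.1 Pro Preview's 65.2–79.0, and Claude Opus 4.7's 61.0–76.0.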

Rank ≠ better

On a 100-task set with 10 runs, a 1.6-point gap is noise. Treat the top cluster (≈68–74%) as one tier and choose within it on the things that actually differ — cost, latency, and how the model fits your workflow.
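The "read the intervals" habit can be turned into a quick check. The sketch below copies five rows from the board and asks which models have a 95% interval that overlaps the leader's. Interval overlap is a rough reading aid, not a formal significance test, and the function names are my own.

```python
# (score %, CI low, CI high) copied from the June 9th 2026 board; a subset of rows.
BOARD = {
    "GPT 5.5": (74.0, 66.9, 80.6),
    "GPT 5.4": (72.4, 65.4, 79.0),
    "Gemini 3.1 Pro Preview": (72.4, 65.2, 79.0),
    "Claude Opus 4.7": (68.7, 61.0, 76.0),
    "Claude Sonnet 4.6": (58.4, 50.3, 66.3),
}

def intervals_overlap(a, b):
    """True when two (score, low, high) entries have overlapping intervals."""
    _, a_low, a_high = a
    _, b_low, b_high = b
    return a_low <= b_high and b_low <= a_high

def overlapping_with(leader, board):
    """Names of the other models whose interval overlaps the leader's."""
    return [
        name
        for name, entry in board.items()
        if name != leader and intervals_overlap(board[leader], entry)
    ]

print(overlapping_with("GPT 5.5", BOARD))
```

On these rows, the three models ranked just below GPT 5.5 overlap it, while Claude Sonnet 4.6 (upper bound 66.3) falls below GPT 5.5's lower bound of 66.9.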

That reframing is the whole game, because once you stop chasing the #1 badge, the economics get interesting.

## Cost per point: the real ranking

Score alone ignores what you pay for it. Dividing cost by score gives a rough "dollars per point of capability" — a crude but honest way to compare. Using the board's own numbers:²

| Model | Score | Cost | Cost / point |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 72.4% | $47.9 | $0.66 |
| Claude Sonnet 4.6 | 58.4% | $40.4 | $0.69 |
| GPT 5.4 | 72.4% | $91.7 | $1.27 |
| Claude Opus 4.7 | 68.7% | $124.3 | $1.81 |
| GPT 5.5 | 74.0% | $134.2 | $1.81 |
| Qwen 3.6 Max Preview | 51.4% | $222.4 | $4.33 |

Read across the top tier and the story flips. Gemini 3.1 Pro Preview ties GPT 5.4 on score at roughly half the cost, and lands within a hair of GPT 5.5 for about a third of the spend. GPT 5.5 buys you ~1.6 extra points (inside the noise) for roughly 2.7× the cost per point. Meanwhile Qwen 3.6 Max Preview is the cautionary tale: a mid-table score at the highest price on the board.
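The cost-per-point column is plain division, so it is easy to reproduce and re-run when the board changes. A minimal sketch using the same six rows, with function names of my own choosing:

```python
# (model, score %, cost USD) copied from the cost-per-point table.
ROWS = [
    ("GPT 5.5", 74.0, 134.2),
    ("GPT 5.4", 72.4, 91.7),
    ("Gemini 3.1 Pro Preview", 72.4, 47.9),
    ("Claude Opus 4.7", 68.7, 124.3),
    ("Claude Sonnet 4.6", 58.4, 40.4),
    ("Qwen 3.6 Max Preview", 51.4, 222.4),
]

def cost_per_point(score_pct, cost_usd):
    """Dollars spent on the suite per percentage point of score."""
    return cost_usd / score_pct

def rank_by_cost_per_point(rows):
    """Cheapest cost per point first; values rounded to cents for display."""
    ranked = sorted(rows, key=lambda row: cost_per_point(row[1], row[2]))
    return [(name, round(cost_per_point(score, cost), 2)) for name, score, cost in ranked]

for name, value in rank_by_cost_per_point(ROWS):
    print(f"{name}: ${value:.2f} per point")
```

Sorting on the unrounded ratio matters here: Claude Opus 4.7 and GPT 5.5 both display as $1.81, but Opus 4.7 is fractionally cheaper per point.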

If you want maximum capability per dollar and can tolerate ~55% and long runtimes, DeepSeek V4 Pro (55.4% at $13.7) and V4 Flash (52.7% at $8.4) are far down-board on raw score but dominate on price — which leads straight to the catch.

## The trap in the cost and latency columns

Cheap and fast look like virtues until you read Google's stated limitations.¹ The benchmark explicitly flags a bias toward incomplete runs: a model that fails a task early consumes fewer tokens and less time, so it appears more efficient.

Failing fast looks like efficiency

Low token counts and low cost can be a symptom of giving up, not of being lean. Always read the cost and latency columns next to the score — never on their own.

You can see the shape of it on the board. Gemma 4 31B IT posts the lowest token use (29.5M) and a tiny $2.5 cost — but only solves 33.2%. Some of that "efficiency" is just shorter, failed attempts. Conversely, DeepSeek V4 Pro's value is more believable precisely because it isn't cheap on effort: 35.8h and 132.7M tokens say it's actually grinding through tasks, not bailing.
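One way to keep the cost column honest is to apply a score floor before ranking on price. The sketch below uses figures quoted in this article; the `min_score` threshold is a hypothetical knob of mine, not something the benchmark defines.

```python
# (model, score %, cost USD) as quoted in this article.
ROWS = [
    ("Gemma 4 31B IT", 33.2, 2.5),
    ("DeepSeek V4 Flash", 52.7, 8.4),
    ("DeepSeek V4 Pro", 55.4, 13.7),
    ("Claude Sonnet 4.6", 58.4, 40.4),
    ("Gemini 3.1 Pro Preview", 72.4, 47.9),
]

def rank_by_value(rows, min_score=0.0):
    """Rank by cost per point, keeping only models at or above min_score.

    min_score is a hypothetical floor: pick whatever resolve rate you
    consider the minimum for a model to be worth running at all.
    """
    eligible = [row for row in rows if row[1] >= min_score]
    return [name for name, score, cost in sorted(eligible, key=lambda r: r[2] / r[1])]

print(rank_by_value(ROWS))                # naive: cheapest per point first
print(rank_by_value(ROWS, min_score=50))  # same ranking, after a score floor
```

Without a floor, Gemma 4 31B IT tops the list despite solving only a third of the tasks. With a floor of 50 it drops out, and the ranking reflects models that actually finish most of the work.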

Latency has its own spread worth noting: Claude Sonnet 4.6 finishes the suite in 8.2h while DeepSeek V4 Pro takes 35.8h. That is a gap of more than 4×, which matters a lot if a model sits in your inner-loop tooling rather than a nightly batch.

## Open weights are closing, not closed

The proprietary frontier still leads, but the open-weight field is no longer a rounding error. DeepSeek V4 Pro (55.4%) and V4 Flash (52.7%) are within striking distance of mid-tier proprietary models, and Alibaba's Qwen and Google's own Gemma round out a credible open lineup further down.² The very bottom — GPT OSS 20B at 2.4% — is a reminder that "open" spans a huge capability range; small open models are not yet useful for autonomous Android patching.

## So what should an Android dev actually use?

Reading the board as a practitioner, not a leaderboard-watcher:

- Best capability, cost no object: GPT 5.5 / GPT 5.4 / Gemini 3.1 Pro Preview. Pick on ecosystem fit; they're a statistical tie.
- Best value at the frontier: Gemini 3.1 Pro Preview. Top-tier score at roughly half the cost of GPT 5.4, and the second-fastest of the top five (11.1h).
- Cost-sensitive or self-hosted: DeepSeek V4 Pro/Flash, if ~55% and long runtimes are acceptable.
- Inner-loop / latency-sensitive: Claude Sonnet 4.6. Not the top score, but the fastest credible option (8.2h) at a low price.
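The recommendations above amount to "best score within your constraints," which can be sketched as a small filter over the top ten. The budget values in the usage lines are hypothetical, and the latency column is the time to run the whole suite, so it is only a proxy for how responsive a model feels in an editor.

```python
# (model, score %, suite latency in hours, cost USD) from the June 9th 2026 top ten.
BOARD = [
    ("GPT 5.5", 74.0, 15.7, 134.2),
    ("GPT 5.4", 72.4, 21.2, 91.7),
    ("Gemini 3.1 Pro Preview", 72.4, 11.1, 47.9),
    ("Claude Opus 4.7", 68.7, 11.6, 124.3),
    ("Claude Opus 4.6", 66.6, 9.9, 84.4),
    ("Gemini 3.5 Flash", 63.7, 14.2, 147.1),
    ("GLM 5.1", 59.7, 33.4, 46.7),
    ("Kimi K2.6", 58.6, 29.9, 42.5),
    ("Claude Sonnet 4.6", 58.4, 8.2, 40.4),
    ("DeepSeek V4 Pro", 55.4, 35.8, 13.7),
]

def pick(board, max_latency_h=float("inf"), max_cost_usd=float("inf")):
    """Highest-scoring model within the given budgets, or None if none fit."""
    candidates = [
        row for row in board
        if row[2] <= max_latency_h and row[3] <= max_cost_usd
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[1])[0]

print(pick(BOARD))                                     # no constraints
print(pick(BOARD, max_cost_usd=50))                    # hypothetical cost cap
print(pick(BOARD, max_latency_h=10, max_cost_usd=50))  # hypothetical cap on both
```

Because the top scores are a statistical tie, treat the unconstrained pick as "one of the top tier" rather than a winner; the constrained picks are where the filter earns its keep.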

A snapshot, not a verdict

This is the June 9th 2026 board, and several entries are Preview builds. Provider pricing shifts, models get revised, and Google rotates older versions to an archive. Re-check the live leaderboard before making a call — and treat every number here as "true on that date."

The bigger takeaway is the one Google is clearly fishing for by publishing this at all: Android-specific evaluation is now a thing models are optimized against. That's good for us. The methodology, dataset, and harness are open on GitHub if you want to reproduce a run or scrutinize a task — which, given how much rides on these numbers, is exactly the right thing for Google to have done.

¹ Task selection (100 tasks from 38,989 pull requests), the Kotlin/Java/Compose/View breakdown, the Patch Verifier, the BIG-BENCH canary string, the trajectory review, and the stated limitations (including the incomplete-run efficiency bias) are from the Android Bench methodology and Google's launch post, Elevating AI-assisted Android development (March 2026).

² Leaderboard figures (score, confidence interval, latency, tokens, cost) from the Android Bench leaderboard, "Latest results as of June 9th 2026." Cost-per-point values are my own arithmetic (cost ÷ score) on those figures.