LLM Infrastructure Brief · Non-Thinking Mode

The low-cost tier in non-thinking mode — what beats Gemini 3.1 Flash-Lite?

All models run with reasoning disabled (non-thinking). Eligible = models that support turning thinking off, or running at low/minimal effort (which we treat as the non-thinking equivalent — this keeps gpt-oss in). Cost is effective spend under your workload — 80% input-cache hit rate, 20:1 input/output — and intelligence is measured non-thinking. The frontier below tests one thing: for a given non-thinking intelligence, are open models cheaper than Gemini?

Prepared July 1, 2026 · Prices $/M tokens, on-demand list · Intelligence = Artificial Analysis Intelligence Index. Non-thinking scores shown as estimates alongside published reasoning-mode scores — see the methodology caveat.

Best Flash-Lite alternative

DeepSeek V4 Flash

~2× cheaper ($0.07 vs $0.15/unit) and higher non-thinking intelligence. Cleanly dominates Flash-Lite.

Cost floor

GPT-5 Nano

~$0.03/unit (minimal effort). Nova Lite (~$0.04) is the cheapest true thinking-off option. Both low intelligence.

Frontier owned by

Open models

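DeepSeek V4 Flash → MiniMax M3 → Kimi K2.6 → DeepSeek V4 Pro hold the entire mid-frontier. Proprietary models appear only at the ends: GPT-5 Nano at the cost floor and Gemini 3.5 Flash at the very top.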

US-origin open option

gpt-oss-120B

At low effort (≈ non-thinking): cheap (~$0.08) but below DeepSeek V4 Flash on intelligence. Edge = fully open + US-origin.

Direct answer: In non-thinking mode, Gemini 3.1 Flash-Lite is Pareto-dominated. DeepSeek V4 Flash costs less per unit (lower input and output prices, with cached input roughly on par) and scores higher; MiniMax M3 is a rung smarter for a hair more. Gemini's low/mid tier (3.1 Flash-Lite, 3 Flash) sits inside the frontier, so under this cost model neither is the rational pick on price-for-intelligence. Gemini only re-earns the frontier at the top (3.5 Flash).

Methodology caveat — read this. Intelligence here is measured at low effort / non-thinking, not reasoning mode. Artificial Analysis publishes mostly reasoning-mode scores, so the Low-effort intel (est.) column is derived from each model's published reasoning score minus a documented delta (hybrids drop ~30% on this reasoning-heavy index at low effort; gpt-oss uses its reasoning_effort: low setting; Kimi K2.6 and Nova are natively non-thinking, so small/no delta). Relative ordering is robust; treat absolute values as directional and confirm per model on AA.
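One way to sanity-check the estimated column is to compute the drop each estimate implies relative to its published reasoning score. The sketch below is illustrative only: `implied_drop` is a hypothetical helper, and the score pairs are copied from the numbers table in section 2, restricted to rows whose published score is not itself marked approximate.

```python
def implied_drop(reasoning_score, low_effort_estimate):
    """Fractional drop from the published reasoning score to the low-effort estimate."""
    return 1.0 - low_effort_estimate / reasoning_score


# (published reasoning score, low-effort estimate) from the numbers table
SCORE_PAIRS = {
    "DeepSeek V4 Flash": (47, 29),
    "Gemini 3.5 Flash": (50, 35),
    "Gemini 3.1 Flash-Lite": (25, 22),
    "gpt-oss-120B": (24, 18),
}

for model, (published, estimate) in SCORE_PAIRS.items():
    print(f"{model}: {implied_drop(published, estimate):.0%} below published")
```

For these four rows the implied drops span roughly 12% to 38%, which is why the absolute values should be read as directional.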

Cost model

Effective $/unit = 0.2 × Input + 0.8 × CachedInput + 0.05 × Output (1M input + 50K output; 80% of input cached)

Running non-thinking actually makes the 20:1 ratio realistic — no reasoning-token blow-up on output, plus lower latency. Cache-read discount depth still dominates: Together / Google / Bedrock-native ≈ 90% off cached input · DeepInfra / Fireworks / Baseten ≈ 50% off · Groq = none.
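The cost model can be written as a small function. This is a minimal sketch, assuming prices are quoted in $/M tokens; `effective_cost` and its parameter names are illustrative, not part of any vendor SDK. The two example calls use the Gemini 3.1 Flash-Lite and DeepSeek V4 Flash prices from the numbers table.

```python
def effective_cost(input_price, cached_price, output_price,
                   cache_hit=0.80, output_ratio=0.05):
    """Effective $ per unit, where one unit is 1M input tokens plus 50K output tokens.

    cache_hit    : share of input tokens billed at the cached rate (0.80 here)
    output_ratio : output tokens per input token (20:1 input/output -> 0.05)
    """
    uncached = (1.0 - cache_hit) * input_price
    cached = cache_hit * cached_price
    output = output_ratio * output_price
    return uncached + cached + output


flash_lite = effective_cost(0.25, 0.025, 1.50)
v4_flash = effective_cost(0.14, 0.03, 0.28)
print(f"Gemini 3.1 Flash-Lite: ${flash_lite:.3f}/unit")
print(f"DeepSeek V4 Flash:     ${v4_flash:.3f}/unit")
print(f"Ratio: {flash_lite / v4_flash:.1f}x")
```

Changing `cache_hit` or `output_ratio` lets you re-run the comparison for a different workload mix.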

1 · Non-thinking Pareto frontier — intelligence vs effective cost

Price axis reversed: cheaper is higher. The ideal corner is therefore top-right (cheap + smart), and the efficient frontier reads as a convex-up curve. Each model is plotted at its cheapest US-available host, with effective cost (80% cache hit, 20:1 input/output) on the y-axis and non-thinking intelligence on the x-axis.

Legend: Open-weight (OW) · Proprietary (Nova/Claude/GPT) · Gemini · ● filled = true off · ◯ hollow = low/min effort · Pareto frontier · ✕ dominated

Reading the convex frontier: cheaper is up, so the efficient curve bows toward the top-right. It runs GPT-5 Nano → DeepSeek V4 Flash → MiniMax M3 → Kimi K2.6 → DeepSeek V4 Pro → Gemini 3.5 Flash. The whole mid-frontier is open-weight; only the cheap floor (GPT-5 Nano; Nova Lite is the cheapest true thinking-off) and the intelligence ceiling (Gemini 3.5 Flash) are proprietary. Gemini 3.1 Flash-Lite, Gemini 3 Flash, and even the cheapest Claude (Haiku 4.5) fall below the curve — each dominated by an open model that's both cheaper and smarter.

Apples-to-oranges guard (● vs ◯): filled points run truly thinking-off (open hybrids + native Nova/Kimi). Hollow points can only reach a low/minimal-effort floor: gpt-oss (reasoning_effort: low), GPT-5 Nano (minimal effort), and all Gemini 3.x (thinking_level: minimal, no true zero). Gemini's points therefore still carry some residual reasoning; if anything that flatters Gemini here, and its low/mid tier still doesn't reach the frontier.
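The filled/hollow split can be kept alongside the data so that comparisons across the two groups are flagged. This is a sketch only: the setting strings are quoted from this brief as labels, not as verified request payloads for any API, and `marker` and `comparable` are hypothetical helpers.

```python
# mode "off"   = true thinking-off (filled marker)
# mode "floor" = lowest available effort, not a true zero (hollow marker)
THINKING_CONTROL = {
    "DeepSeek V4 Flash": ("off", None),
    "MiniMax M3": ("off", None),
    "Kimi K2.6": ("off", None),         # natively non-thinking per the brief
    "Amazon Nova Lite": ("off", None),  # natively non-thinking per the brief
    "gpt-oss-120B": ("floor", "reasoning_effort: low"),
    "GPT-5 Nano": ("floor", "minimal effort"),
    "Gemini 3.1 Flash-Lite": ("floor", "thinking_level: minimal"),
    "Gemini 3.5 Flash": ("floor", "thinking_level: minimal"),
}


def marker(model):
    """Chart marker for a model: filled for true off, hollow for an effort floor."""
    mode, _setting = THINKING_CONTROL[model]
    return "●" if mode == "off" else "◯"


def comparable(model_a, model_b):
    """True when both models run in the same mode, so scores compare like for like."""
    return THINKING_CONTROL[model_a][0] == THINKING_CONTROL[model_b][0]


print(marker("Kimi K2.6"), marker("Gemini 3.5 Flash"))
print(comparable("DeepSeek V4 Flash", "Gemini 3.1 Flash-Lite"))
```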

2 · The numbers (non-thinking)

Model	Type	Reason. (pub.)	Low-effort intel (est.)	Input	Cached in	Output	Eff. $/unit	Frontier?
GPT-5 Nano min	Prop	~22	~16	$0.05	$0.005	$0.40	$0.03	✓ cost floor
Amazon Nova Lite off	Prop	n/a	~12	$0.06	$0.015	$0.24	$0.04	✓ cheapest true-off
gpt-oss-20B low eff	OW	~15	~11	$0.05	$0.025	$0.20	$0.04	near floor
DeepSeek V4 Flash off	OW	47	~29	$0.14	$0.03	$0.28	$0.07	✓ ★ best value
gpt-oss-120B low eff	OW	24	~18	$0.09	$0.045	$0.45	$0.08	dominated
Gemini 3.1 Flash-Lite min	Gemini	25	~22	$0.25	$0.025	$1.50	$0.145	dominated
MiniMax M3 off	OW	~44	~30	$0.30	$0.06	$1.20	$0.17	✓ frontier
Gemini 3 Flash min	Gemini	27	~19	$0.50	$0.05	$3.00	$0.29	dominated
Qwen 3.6 Max off	OW	40	~27	~$0.60	$0.06	$3.00	$0.32	dominated
Amazon Nova Pro off	Prop	n/a	~20	$0.80	$0.08	$3.20	$0.38	dominated
Kimi K2.6 off native	OW	54	~33	$0.75	$0.125	$3.50	$0.43	✓ frontier
GLM-5 off	OW	~40	~27	$1.05	$0.105	$3.50	$0.47	dominated
Claude Haiku 4.5 off	Prop	~44	~30	$1.00	$0.10	$5.00	$0.53	dominated
DeepSeek V4 Pro off	OW	52	~34	$2.10	$0.20	$4.40	$0.80	✓ frontier
Gemini 3.5 Flash min	Gemini	50	~35	$1.50	$0.15	$9.00	$0.87	✓ intel. leader
Gemini 3.1 Pro* min	Gemini	~46	~33	~$1.50	$0.15	~$12.00	~$1.02	dominated

Reasoning scores are published (default/max-effort per AA). Non-thinking = estimate (see caveat). Eff. cost uses each model's cheapest US-available host and that host's cache discount; DeepSeek V4 Pro is also available first-party at ~$0.44/$0.87 (eff ~$0.22) but that routes data to DeepSeek's servers — a residency trade-off. Kimi K2.6 & Nova are natively non-thinking; Claude Haiku 4.5 is non-thinking by default (extended thinking is opt-in); GPT-5 Nano runs at minimal effort (not fully off). Claude/GPT non-thinking scores are estimates. *Gemini 3.1 Pro output price estimated.
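The frontier membership in the table can be reproduced mechanically. The sketch below assumes the table's rounded effective costs and estimated non-thinking scores; `pareto_frontier` is an illustrative helper, and Gemini 3.1 Pro is left out because its output price is itself an estimate.

```python
# (model, estimated non-thinking intelligence, effective $/unit) from the table
MODELS = [
    ("GPT-5 Nano", 16, 0.03),
    ("Amazon Nova Lite", 12, 0.04),
    ("gpt-oss-20B", 11, 0.04),
    ("DeepSeek V4 Flash", 29, 0.07),
    ("gpt-oss-120B", 18, 0.08),
    ("Gemini 3.1 Flash-Lite", 22, 0.145),
    ("MiniMax M3", 30, 0.17),
    ("Gemini 3 Flash", 19, 0.29),
    ("Qwen 3.6 Max", 27, 0.32),
    ("Amazon Nova Pro", 20, 0.38),
    ("Kimi K2.6", 33, 0.43),
    ("GLM-5", 27, 0.47),
    ("Claude Haiku 4.5", 30, 0.53),
    ("DeepSeek V4 Pro", 34, 0.80),
    ("Gemini 3.5 Flash", 35, 0.87),
]


def pareto_frontier(models):
    """Keep a model only if it is smarter than every model that costs the same or less."""
    frontier = []
    best_intel = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, the smarter model goes first.
    for name, intel, cost in sorted(models, key=lambda m: (m[2], -m[1])):
        if intel > best_intel:
            frontier.append(name)
            best_intel = intel
    return frontier


print(" → ".join(pareto_frontier(MODELS)))
```

On these inputs the walk keeps GPT-5 Nano, DeepSeek V4 Flash, MiniMax M3, Kimi K2.6, DeepSeek V4 Pro, and Gemini 3.5 Flash, in that order.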

3 · Spotlight: cheaper than Gemini 3.1 Flash-Lite, similar-or-better non-thinking intelligence

This is the crux of your question. Flash-Lite (non-thinking): intelligence ~22, ~$0.145/unit. Only one eligible model is strictly cheaper AND at least as smart: DeepSeek V4 Flash. Nova Lite is cheaper still but a clear step down in intelligence, and MiniMax M3 costs a hair more for a real intelligence jump.

Model	Non-think intel	Eff. $/unit	vs Flash-Lite
DeepSeek V4 Flash OW off	~29 (higher)	$0.07	2.1× cheaper + smarter — the answer ★
Amazon Nova Lite Prop off	~12 (lower)	$0.04	3.6× cheaper but a clear step down in intelligence; AWS-native
MiniMax M3 OW off	~30 (higher)	$0.17	~same price, meaningfully smarter — the "step up" pick

Everything else eligible is either dominated by DeepSeek V4 Flash (Gemini 3 Flash, Qwen 3.6 Max, GLM-5, Nova Pro, gpt-oss-120B) or a paid step up in intelligence (Kimi K2.6 $0.43, DeepSeek V4 Pro $0.80). gpt-oss-120B/20B (run at low effort ≈ non-thinking) come in cheap but land below DeepSeek V4 Flash on low-effort intelligence. Their real edge is being fully open-weight and US-origin, if Chinese-model provenance is a concern for you.
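The comparison against Flash-Lite reduces to a two-axis check. This is a minimal sketch, assuming the baseline of ~22 intelligence and $0.145/unit stated above; `classify` and its labels are hypothetical names chosen for illustration.

```python
def classify(intel, cost, base_intel=22, base_cost=0.145):
    """Place a candidate relative to the Flash-Lite baseline on both axes."""
    if cost < base_cost and intel >= base_intel:
        return "dominates baseline"
    if cost < base_cost:
        return "cheaper, less capable"
    if intel > base_intel:
        return "smarter, costs more"
    return "dominated by baseline"


# (estimated non-thinking intelligence, effective $/unit) from the tables above
CANDIDATES = {
    "DeepSeek V4 Flash": (29, 0.07),
    "Amazon Nova Lite": (12, 0.04),
    "MiniMax M3": (30, 0.17),
    "gpt-oss-120B": (18, 0.08),
    "Gemini 3 Flash": (19, 0.29),
}

for model, (intel, cost) in CANDIDATES.items():
    print(f"{model}: {classify(intel, cost)}")
```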

4 · Where to host each Pareto-frontier model

The six frontier models, plus Nova Lite as the cheapest true thinking-off option, and their hosting options ($/M, input / cached-in / output). Cheapest US-available host is listed first for each model. The DeepSeek and MiniMax models are cheapest (or tied) on their own first-party API, but those route data to China; the US hosts (DeepInfra, Together, Fireworks, Baseten) cost the same or more and keep data out. Kimi K2.6 is the exception: Moonshot's API lists above DeepInfra. Nova / GPT-5 Nano / Gemini are single-vendor.

Model	Host	Input	Cached in	Output	Notes
GPT-5 Nano Prop	OpenAI API	$0.05	$0.005	$0.40	single vendor; 90% cache
	Azure OpenAI	$0.05	$0.005	$0.40	enterprise / compliance
DeepSeek V4 Flash OW	DeepInfra US	$0.14	$0.07	$0.28	cheapest US host; 50% cache
	DeepSeek API / OpenRouter	$0.14	$0.003	$0.28	cheapest overall; data→China
	Together US	~$0.35	~$0.035	~$0.50	~90% cache (best for cache-heavy)
	Fireworks US	~$0.40	~$0.20	~$0.55	high-RPM SLAs
MiniMax M3 OW	DeepInfra US	~$0.30	$0.15	~$1.20	cheapest US host
	MiniMax API	$0.30	—	$1.20	cheapest overall; data→China
	Together US	~$0.35	$0.06	~$1.30	deep cache
Kimi K2.6 OW	DeepInfra (FP4) US	$0.75	$0.375	$3.50	cheapest US host; ~$1.44 blended
	OpenRouter	$0.55	—	$3.20	cheapest per-token (routed)
	Moonshot API	$0.95	—	$4.00	first-party; data→China
	Together / Fireworks / Parasail US	~$1.15–1.71 blended			also available
DeepSeek V4 Pro OW	DeepInfra US	$1.30	$0.65	$2.60	cheapest US host; 50% cache
	DeepSeek API	$0.44	~$0.04	$0.87	cheapest overall; data→China
	Together US	$2.10	$0.20	$4.40	~90% cache — best cache-heavy
	Baseten US	$1.74	~$0.87	$3.48	dedicated deployments
Nova Lite Prop	AWS Bedrock (only)	$0.06	$0.015	$0.24	AWS-native; 75% cache at listed prices
Gemini 3.5 Flash Gemini	Google AI Studio	$1.50	$0.15	$9.00	90% cache
	Vertex AI	$1.50	$0.15	$9.00	enterprise / compliance

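Host takeaway: for the DeepSeek and MiniMax frontier models, the first-party API is the cheapest option but sends data to China; for Kimi K2.6, Moonshot's API lists above DeepInfra. For US data-residency, DeepInfra has the lowest sticker price (50% cache discount), while Together's ~90% cache-read discount can win on your cache-heavy 80%-hit workload despite a higher sticker: at the listed prices it does for MiniMax M3 and DeepSeek V4 Pro, but not for DeepSeek V4 Flash. Fireworks/Baseten are the picks for SLAs and dedicated deployments. Nova is Bedrock-only, GPT-5 Nano is OpenAI/Azure, Gemini is Google/Vertex.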
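To see how sticker price and cache discount trade off, the listed US host prices for DeepSeek V4 Pro can be run through the cost model. This is a sketch under the brief's workload assumptions (80% cache hit, 20:1 input/output); `host_cost` is an illustrative helper, and the Baseten cached price is approximate in the table.

```python
def host_cost(input_price, cached_price, output_price,
              cache_hit=0.80, output_ratio=0.05):
    """Effective $ per unit (1M input + 50K output tokens) for one host."""
    return ((1.0 - cache_hit) * input_price
            + cache_hit * cached_price
            + output_ratio * output_price)


# DeepSeek V4 Pro, US hosts: (input, cached input, output) in $/M tokens
V4_PRO_HOSTS = {
    "DeepInfra": (1.30, 0.65, 2.60),
    "Together": (2.10, 0.20, 4.40),
    "Baseten": (1.74, 0.87, 3.48),
}

costs = {host: host_cost(*prices) for host, prices in V4_PRO_HOSTS.items()}
for host, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{host}: ${cost:.2f}/unit")

cheapest = min(costs, key=costs.get)
print(f"Cheapest at 80% cache hit: {cheapest}")
```

At an 80% hit rate the deeper cache discount outweighs Together's higher sticker for this model; passing a lower `cache_hit` shifts the ranking back toward DeepInfra.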

5 · What to do

Replace Gemini 3.1 Flash-Lite with DeepSeek V4 Flash. Non-thinking, ~2× cheaper effective ($0.07 vs $0.145), and higher intelligence. It's the single clearest win in this analysis.
Want the cheapest true thinking-off model and don't need the intelligence? Nova Lite (~$0.04): natively non-thinking and AWS-native, so it adds little compliance friction if you already run on AWS. Good for bulk/simple classification where ~12 intelligence suffices.
Need one notch more capability? MiniMax M3 (~$0.17) for roughly Flash-Lite's price, or step to Kimi K2.6 ($0.43, natively non-thinking) for the strongest non-thinking open model in the mid band.
Need US-origin / non-Chinese open weights? gpt-oss-120B at low effort (~$0.08) is the pick — it's dominated on raw intelligence by DeepSeek V4 Flash, but it's fully open, self-hostable, and avoids Chinese-model provenance concerns.
Reserve Gemini 3.5 Flash for the genuine top of the band — it leads non-thinking intelligence but costs ~12× DeepSeek V4 Flash, driven by its $9 output.
Cheapest possible tokens? GPT-5 Nano (~$0.03) at minimal effort is the floor — below Nova Lite — but at low intelligence (~16). Fine for trivial routing/extraction; use Nova Lite if you need a genuinely thinking-off model.
Don't reach for the cheapest Claude for this workload. Claude Haiku 4.5 (~$0.53) is dominated: MiniMax M3 matches its intelligence at ~⅓ the cost, and Kimi K2.6 beats it on both. The Claude premium buys reliability and ecosystem, not price-for-intelligence here.
Host = price lever. Put open models on DeepInfra (lowest sticker price) or Together (deepest ~90% cache discount, which can win on your cache-heavy workload; check per model). Keep Bedrock for Nova (its cache discount doesn't extend to third-party open models).
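Which host wins in the last recommendation depends on the cache-hit rate, and the crossover can be solved in closed form from the cost model. The sketch below assumes the DeepInfra and Together list prices from the hosting table and the 20:1 input/output ratio; `breakeven_cache_hit` is a hypothetical helper, not a vendor tool.

```python
def breakeven_cache_hit(shallow, deep, output_ratio=0.05):
    """Cache-hit rate above which the deep-discount host becomes cheaper.

    Each host is (input, cached input, output) in $/M tokens. With hit rate h,
    cost(h) = input + output_ratio * output - h * (input - cached).
    Returns None if the deep-discount host never catches up; a result above 1.0
    means the crossover lies outside any achievable hit rate.
    """
    in_s, cached_s, out_s = shallow
    in_d, cached_d, out_d = deep
    base_gap = (in_d + output_ratio * out_d) - (in_s + output_ratio * out_s)
    slope_gap = (in_d - cached_d) - (in_s - cached_s)
    if slope_gap <= 0:
        return None
    return max(0.0, base_gap / slope_gap)


# (DeepInfra prices, Together prices) per model, from the hosting table
HOST_PAIRS = {
    "DeepSeek V4 Flash": ((0.14, 0.07, 0.28), (0.35, 0.035, 0.50)),
    "MiniMax M3": ((0.30, 0.15, 1.20), (0.35, 0.06, 1.30)),
    "DeepSeek V4 Pro": ((1.30, 0.65, 2.60), (2.10, 0.20, 4.40)),
}

for model, (deepinfra, together) in HOST_PAIRS.items():
    h = breakeven_cache_hit(deepinfra, together)
    print(f"{model}: Together is cheaper above a {h:.0%} cache-hit rate")
```

Several Together prices in the table are approximate, so treat the crossover points as rough guides and recompute with live prices.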

Bottom line: With non-thinking mode enforced, open models still win the price-for-intelligence frontier everywhere except the extreme top. Gemini 3.1 Flash-Lite is not on the frontier — DeepSeek V4 Flash beats it on both axes. You'd only stay on Gemini for managed-API reliability, multimodality, or data-residency reasons, not for cost or non-thinking intelligence.

Sources. Intelligence: Artificial Analysis Intelligence Index · Gemini 3.1 Flash-Lite (AA, index 25) · DeepSeek V4 Flash/Pro (reasoning 47/52) · BenchLM aggregate. Non-thinking support: Gemini thinking levels · Hybrid thinking toggle study. Pricing: Price Per Token · Together · DeepInfra · Nova · Gemini.

Caveats. Non-thinking Intelligence Index values are estimates derived from published reasoning scores (hybrids ≈ −30%; Kimi/Nova native non-thinking) — AA publishes mostly reasoning-mode; verify per model. Qwen 3.6 Max input, Gemini 3.1 Pro output, and MiniMax M3 cache rate are approximate. Prices are on-demand list and move fast — confirm on live pages before committing.