LLM Infrastructure Brief · Non-Thinking Mode

The low-cost tier in non-thinking mode — what beats Gemini 3.1 Flash-Lite?

All models run with reasoning disabled (non-thinking). Eligible = models that support turning thinking off, or running at low/minimal effort (which we treat as the non-thinking equivalent — this keeps gpt-oss in). Cost is effective spend under your workload — 80% input-cache hit rate, 20:1 input/output — and intelligence is measured non-thinking. The frontier below tests one thing: for a given non-thinking intelligence, are open models cheaper than Gemini?

Prepared July 1, 2026 · Prices $/M tokens, on-demand list · Intelligence = Artificial Analysis Intelligence Index. Non-thinking scores shown as estimates alongside published reasoning-mode scores — see the methodology caveat.
Best Flash-Lite alternative: DeepSeek V4 Flash. ~2× cheaper ($0.07 vs $0.15/unit) and higher non-thinking intelligence. Cleanly dominates Flash-Lite.
Cost floor: GPT-5 Nano. ~$0.03/unit (minimal effort). Nova Lite (~$0.04) is the cheapest true thinking-off option. Both low intelligence.
Frontier owned by: Open models. Nova Lite → V4 Flash → MiniMax M3 → Kimi K2.6 → V4 Pro. Gemini appears only at the very top.
US-origin open option: gpt-oss-120B. At low effort (≈ non-thinking): cheap (~$0.08) but below DeepSeek V4 Flash on intelligence. Edge = fully open + US-origin.
Direct answer: In non-thinking mode, Gemini 3.1 Flash-Lite is Pareto-dominated. DeepSeek V4 Flash costs about half as much per unit ($0.07 vs $0.145; cheaper on input and output, with cached input roughly level at $0.03 vs $0.025) and scores higher; MiniMax M3 is a rung smarter for a hair more. Gemini's low/mid tier (3.1 Flash-Lite, 3 Flash) sits inside the frontier, so under this cost model neither is the rational pick on price or non-thinking intelligence. Gemini only re-earns the frontier at the top (3.5 Flash).
Methodology caveat — read this. Intelligence here is measured at low effort / non-thinking, not reasoning mode. Artificial Analysis publishes mostly reasoning-mode scores, so the Low-effort intel (est.) column is derived from each model's published reasoning score minus a documented delta (hybrids drop ~30% on this reasoning-heavy index at low effort; gpt-oss uses its reasoning_effort: low setting; Kimi K2.6 and Nova are natively non-thinking, so small/no delta). Relative ordering is robust; treat absolute values as directional and confirm per model on AA.

Cost model

Effective $/unit = 0.2 × Input + 0.8 × CachedInput + 0.05 × Output  (1M input + 50K output; 80% of input cached)
Running non-thinking actually makes the 20:1 ratio realistic — no reasoning-token blow-up on output, plus lower latency. Cache-read discount depth still dominates: Together / Google / Bedrock-native ≈ 90% off cached input · DeepInfra / Fireworks / Baseten ≈ 50% off · Groq = none.
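The formula above can be sketched as a small Python helper. This is a minimal illustration, not a billing tool: the function name `effective_cost` and its parameter names are chosen here for the example, and the two sample price sets are the Gemini 3.1 Flash-Lite and Amazon Nova Lite list prices quoted in section 2.

```python
def effective_cost(input_price, cached_price, output_price,
                   cache_hit=0.80, output_ratio=0.05):
    """Effective $ per unit: 1M input tokens plus output_ratio * 1M output tokens.

    Prices are $/M tokens. cache_hit is the share of input served from cache;
    output_ratio=0.05 encodes the 20:1 input/output mix (50K output per 1M input).
    """
    uncached = (1.0 - cache_hit) * input_price   # 0.2 x Input
    cached = cache_hit * cached_price            # 0.8 x CachedInput
    output = output_ratio * output_price         # 0.05 x Output
    return uncached + cached + output


# Sample prices taken from the table in section 2.
flash_lite = effective_cost(0.25, 0.025, 1.50)
nova_lite = effective_cost(0.06, 0.015, 0.24)

print(f"Gemini 3.1 Flash-Lite: ${flash_lite:.3f}/unit")
print(f"Amazon Nova Lite:      ${nova_lite:.3f}/unit")
```

With these inputs the helper gives about $0.145 for Flash-Lite and about $0.036 for Nova Lite, which the table rounds to $0.04. Changing `cache_hit` or `output_ratio` shows how sensitive the ranking is to the workload assumption.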

1 · Non-thinking Pareto frontier — intelligence vs effective cost

Price axis reversed: cheaper is higher. So the ideal corner is top-right (cheap + smart) and the efficient frontier now reads as a convex-up curve. Every model at its cheapest US-available host, effective cost under 80%-cache / 20:1, non-thinking intelligence on the x-axis.

[Figure: scatter plot of low-effort / non-thinking Intelligence Index (x-axis, 0 to 40, higher is better) against effective $/unit (y-axis, $0 to $1.00, reversed so cheaper is higher). Legend: Open-weight (OW) · Proprietary (Nova/Claude/GPT) · Gemini · ● filled = true off · ◯ hollow = low/min effort · Pareto frontier line · ✕ dominated. The top-right corner is labelled "cheap + smart"; DeepSeek V4 Flash is starred. Plotted models: DeepSeek V4 Flash, MiniMax M3, Qwen 3.6 Max, GLM-5, Kimi K2.6, DeepSeek V4 Pro, gpt-oss-120B, gpt-oss-20B, Nova Lite, Nova Pro, GPT-5 Nano, Claude Haiku 4.5, Gemini 3.1 Flash-Lite, Gemini 3 Flash, Gemini 3.1 Pro, Gemini 3.5 Flash.]
Reading the convex frontier: cheaper is up, so the efficient curve bows toward the top-right. It runs GPT-5 Nano → DeepSeek V4 Flash → MiniMax M3 → Kimi K2.6 → DeepSeek V4 Pro → Gemini 3.5 Flash. The whole mid-frontier is open-weight; only the cheap floor (GPT-5 Nano; Nova Lite is the cheapest true thinking-off) and the intelligence ceiling (Gemini 3.5 Flash) are proprietary. Gemini 3.1 Flash-Lite, Gemini 3 Flash, and even the cheapest Claude (Haiku 4.5) fall below the curve, each dominated by an open model that's both cheaper and smarter.
Apples-to-oranges guard (● vs ◯): filled points run truly thinking-off (open hybrids + native Nova/Kimi). Hollow points can only reach a low/minimal-effort floor: gpt-oss (reasoning_effort: low) and all Gemini 3.x (thinking_level: minimal, no true zero). So Gemini's points still carry some residual reasoning. If anything that flatters Gemini here, and it still doesn't reach the frontier.

2 · The numbers (non-thinking)

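| Model (mode) | Type | Reasoning (pub.) | Low-effort intel (est.) | Input | Cached in | Output | Eff. $/unit | Frontier? |
|---|---|---|---|---|---|---|---|---|
| GPT-5 Nano (min) | Prop | ~22 | ~16 | $0.05 | $0.005 | $0.40 | $0.03 | ✓ cost floor |
| Amazon Nova Lite (off) | Prop | n/a | ~12 | $0.06 | $0.015 | $0.24 | $0.04 | ✓ cheapest true-off |
| gpt-oss-20B (low eff) | OW | ~15 | ~11 | $0.05 | $0.025 | $0.20 | $0.04 | near floor |
| DeepSeek V4 Flash (off) | OW | 47 | ~29 | $0.14 | $0.03 | $0.28 | $0.07 | ✓ ★ best value |
| gpt-oss-120B (low eff) | OW | 24 | ~18 | $0.09 | $0.045 | $0.45 | $0.08 | dominated |
| Gemini 3.1 Flash-Lite (min) | Gemini | 25 | ~22 | $0.25 | $0.025 | $1.50 | $0.145 | dominated |
| MiniMax M3 (off) | OW | ~44 | ~30 | $0.30 | $0.06 | $1.20 | $0.17 | ✓ frontier |
| Gemini 3 Flash (min) | Gemini | 27 | ~19 | $0.50 | $0.05 | $3.00 | $0.29 | dominated |
| Qwen 3.6 Max (off) | OW | 40 | ~27 | ~$0.60 | $0.06 | $3.00 | $0.32 | dominated |
| Amazon Nova Pro (off) | Prop | n/a | ~20 | $0.80 | $0.08 | $3.20 | $0.38 | dominated |
| Kimi K2.6 (off, native) | OW | 54 | ~33 | $0.75 | $0.125 | $3.50 | $0.43 | ✓ frontier |
| GLM-5 (off) | OW | ~40 | ~27 | $1.05 | $0.105 | $3.50 | $0.47 | dominated |
| Claude Haiku 4.5 (off) | Prop | ~44 | ~30 | $1.00 | $0.10 | $5.00 | $0.53 | dominated |
| DeepSeek V4 Pro (off) | OW | 52 | ~34 | $2.10 | $0.20 | $4.40 | $0.80 | ✓ frontier |
| Gemini 3.5 Flash (min) | Gemini | 50 | ~35 | $1.50 | $0.15 | $9.00 | $0.87 | ✓ intel. leader |
| Gemini 3.1 Pro* (min) | Gemini | ~46 | ~33 | ~$1.50 | $0.15 | ~$12.00 | ~$0.95 | dominated |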
Reasoning scores are published (default/max-effort per AA). Non-thinking = estimate (see caveat). Eff. cost uses each model's cheapest US-available host and that host's cache discount; DeepSeek V4 Pro is also available first-party at ~$0.44/$0.87 (eff ~$0.22) but that routes data to DeepSeek's servers — a residency trade-off. Kimi K2.6 & Nova are natively non-thinking; Claude Haiku 4.5 is non-thinking by default (extended thinking is opt-in); GPT-5 Nano runs at minimal effort (not fully off). Claude/GPT non-thinking scores are estimates. *Gemini 3.1 Pro output price estimated.
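The frontier labels in the table can be reproduced with a short dominance sweep. The sketch below is illustrative only: the `ROWS` list copies the estimated low-effort intelligence and effective cost from the table above, the mode field is simplified to "off" (true thinking-off) or "low" (low/minimal effort), and `pareto_frontier` is a name introduced here for the example.

```python
# (name, mode, low-effort intel estimate, effective $/unit), copied from the table.
# mode: "off" = true thinking-off, "low" = low/minimal effort only.
ROWS = [
    ("GPT-5 Nano", "low", 16, 0.03),
    ("Amazon Nova Lite", "off", 12, 0.04),
    ("gpt-oss-20B", "low", 11, 0.04),
    ("DeepSeek V4 Flash", "off", 29, 0.07),
    ("gpt-oss-120B", "low", 18, 0.08),
    ("Gemini 3.1 Flash-Lite", "low", 22, 0.145),
    ("MiniMax M3", "off", 30, 0.17),
    ("Gemini 3 Flash", "low", 19, 0.29),
    ("Qwen 3.6 Max", "off", 27, 0.32),
    ("Amazon Nova Pro", "off", 20, 0.38),
    ("Kimi K2.6", "off", 33, 0.43),
    ("GLM-5", "off", 27, 0.47),
    ("Claude Haiku 4.5", "off", 30, 0.53),
    ("DeepSeek V4 Pro", "off", 34, 0.80),
    ("Gemini 3.5 Flash", "low", 35, 0.87),
    ("Gemini 3.1 Pro", "low", 33, 0.95),
]


def pareto_frontier(rows):
    """Keep rows that no cheaper-or-equal row matches or beats on intelligence."""
    frontier = []
    best_intel = float("-inf")
    # Sweep from cheapest to priciest; on cost ties, consider the smartest first.
    for row in sorted(rows, key=lambda r: (r[3], -r[2])):
        if row[2] > best_intel:
            frontier.append(row)
            best_intel = row[2]
    return frontier


all_frontier = [r[0] for r in pareto_frontier(ROWS)]
off_frontier = [r[0] for r in pareto_frontier([r for r in ROWS if r[1] == "off"])]

print("All eligible:      ", " -> ".join(all_frontier))
print("True thinking-off: ", " -> ".join(off_frontier))
```

Across all eligible rows the sweep yields GPT-5 Nano, DeepSeek V4 Flash, MiniMax M3, Kimi K2.6, DeepSeek V4 Pro, Gemini 3.5 Flash. Restricted to true thinking-off rows it yields Amazon Nova Lite, DeepSeek V4 Flash, MiniMax M3, Kimi K2.6, DeepSeek V4 Pro, which is why Nova Lite is flagged as the cheapest true-off option even though GPT-5 Nano undercuts it at minimal effort. Because the intelligence values are estimates, small changes in them can move points on or off the curve.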

3 · Spotlight: cheaper than Gemini 3.1 Flash-Lite, similar-or-better non-thinking intelligence

This is the crux of your question. Flash-Lite (non-thinking): intelligence ~22, ~$0.145/unit. Only one eligible model is strictly cheaper AND at least as smart (DeepSeek V4 Flash); Nova Lite is cheaper but clearly less capable, and MiniMax M3 costs a hair more for a real intelligence jump.

| Model | Non-think intel | Eff. $/unit | vs Flash-Lite |
|---|---|---|---|
| DeepSeek V4 Flash (OW, off) | ~29 (higher) | $0.07 | 2.1× cheaper + smarter; the answer ★ |
| Amazon Nova Lite (Prop, off) | ~12 (lower) | $0.04 | 3.6× cheaper but a clear step down in intelligence; AWS-native |
| MiniMax M3 (OW, off) | ~30 (higher) | $0.17 | ~same price, meaningfully smarter; the "step up" pick |
Everything else eligible is either dominated by DeepSeek V4 Flash (Gemini 3 Flash, Qwen 3.6 Max, GLM-5, Nova Pro, gpt-oss-120B) or a paid step up in intelligence (Kimi K2.6 $0.43, DeepSeek V4 Pro $0.80). gpt-oss-120B/20B (run at low effort ≈ non-thinking) come in cheap but land below DeepSeek V4 Flash on low-effort intelligence. Their real edge is being fully open-weight and US-origin, if Chinese-model provenance is a concern for you.
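The "vs Flash-Lite" comparisons reduce to a cost ratio and an intelligence delta against the Flash-Lite baseline. A minimal sketch, using the estimated values quoted in this section (the dictionary layout and the `compare` helper are introduced here purely for illustration):

```python
# Baseline and candidates, using the estimated non-thinking values from this section.
FLASH_LITE = {"intel": 22, "cost": 0.145}

CANDIDATES = {
    "DeepSeek V4 Flash": {"intel": 29, "cost": 0.07},
    "Amazon Nova Lite": {"intel": 12, "cost": 0.04},
    "MiniMax M3": {"intel": 30, "cost": 0.17},
}


def compare(candidate, baseline=FLASH_LITE):
    """Return (baseline cost / candidate cost, candidate intel - baseline intel)."""
    ratio = baseline["cost"] / candidate["cost"]
    delta = candidate["intel"] - baseline["intel"]
    return ratio, delta


results = {name: compare(row) for name, row in CANDIDATES.items()}

for name, (ratio, delta) in results.items():
    # A ratio above 1 means the candidate is cheaper than Flash-Lite.
    both = ratio > 1 and delta >= 0
    verdict = "cheaper and at least as smart" if both else "trade-off"
    print(f"{name}: cost ratio {ratio:.1f}x, intel {delta:+d} -> {verdict}")
```

A ratio below 1, as for MiniMax M3, means the candidate costs more than Flash-Lite, so its case rests on the intelligence delta alone.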

4 · Where to host each Pareto-frontier model

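The seven frontier models (including Nova Lite as the cheapest true thinking-off option) and their hosting options ($/M, input / cached-in / output). The cheapest US-available host is flagged in the Notes column. The open-weight models are generally cheapest on their own first-party API (Kimi K2.6 is the exception, where OpenRouter undercuts Moonshot per token), but the first-party APIs route data to China; the US hosts (DeepInfra, Together, Fireworks, Baseten) cost more and keep data out. Nova / GPT-5 Nano / Gemini are single-vendor.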

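| Model | Host | Input | Cached in | Output | Notes |
|---|---|---|---|---|---|
| GPT-5 Nano (Prop) | OpenAI API | $0.05 | $0.005 | $0.40 | single vendor; 90% cache |
| GPT-5 Nano (Prop) | Azure OpenAI | $0.05 | $0.005 | $0.40 | enterprise / compliance |
| DeepSeek V4 Flash (OW) | DeepInfra (US) | $0.14 | $0.07 | $0.28 | cheapest US host; 50% cache |
| DeepSeek V4 Flash (OW) | DeepSeek API / OpenRouter | $0.14 | $0.003 | $0.28 | cheapest overall; data→China |
| DeepSeek V4 Flash (OW) | Together (US) | ~$0.35 | ~$0.035 | ~$0.50 | ~90% cache (best for cache-heavy) |
| DeepSeek V4 Flash (OW) | Fireworks (US) | ~$0.40 | ~$0.20 | ~$0.55 | high-RPM SLAs |
| MiniMax M3 (OW) | DeepInfra (US) | ~$0.30 | $0.15 | ~$1.20 | cheapest US host |
| MiniMax M3 (OW) | MiniMax API | $0.30 | | $1.20 | cheapest overall; data→China |
| MiniMax M3 (OW) | Together (US) | ~$0.35 | $0.06 | ~$1.30 | deep cache |
| Kimi K2.6 (OW) | DeepInfra (FP4) (US) | $0.75 | $0.375 | $3.50 | cheapest US host; ~$1.44 blended |
| Kimi K2.6 (OW) | OpenRouter | $0.55 | | $3.20 | cheapest per-token (routed) |
| Kimi K2.6 (OW) | Moonshot API | $0.95 | | $4.00 | first-party; data→China |
| Kimi K2.6 (OW) | Together / Fireworks / Parasail (US) | ~$1.15–1.71 blended | | | also available |
| DeepSeek V4 Pro (OW) | DeepInfra (US) | $1.30 | $0.65 | $2.60 | cheapest US host; 50% cache |
| DeepSeek V4 Pro (OW) | DeepSeek API | $0.44 | ~$0.04 | $0.87 | cheapest overall; data→China |
| DeepSeek V4 Pro (OW) | Together (US) | $2.10 | $0.20 | $4.40 | ~90% cache; best cache-heavy |
| DeepSeek V4 Pro (OW) | Baseten (US) | $1.74 | ~$0.87 | $3.48 | dedicated deployments |
| Nova Lite (Prop) | AWS Bedrock (only) | $0.06 | $0.015 | $0.24 | AWS-native; 90% cache |
| Gemini 3.5 Flash (Gemini) | Google AI Studio | $1.50 | $0.15 | $9.00 | 90% cache |
| Gemini 3.5 Flash (Gemini) | Vertex AI | $1.50 | $0.15 | $9.00 | enterprise / compliance |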
Host takeaway: for the open-weight frontier models, the model's own first-party API is usually the cheapest option (Kimi K2.6 via OpenRouter is the exception) but sends data to China. For US data-residency, DeepInfra has the lowest sticker price among US hosts (50% cache discount); Together can come out ahead on your cache-heavy 80%-hit workload where its ~90% cache-read discount outweighs the higher sticker price, as it does for DeepSeek V4 Pro; Fireworks/Baseten are the picks for SLAs or dedicated deployments. Nova is Bedrock-only, GPT-5 Nano is OpenAI/Azure, Gemini is Google/Vertex.
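Whether a deep cache discount beats a lower sticker price depends on the cache hit rate. The sketch below works this through for the two US DeepSeek V4 Pro hosts with exact list prices in the hosting table above; the `HOSTS` dictionary, `effective_cost`, and `cheapest_host` are names introduced here for illustration, and the 20:1 input/output mix is the same assumption as in the cost model.

```python
# DeepSeek V4 Pro list prices ($/M tokens) for two US hosts, from the hosting table.
HOSTS = {
    "DeepInfra": {"input": 1.30, "cached": 0.65, "output": 2.60},
    "Together": {"input": 2.10, "cached": 0.20, "output": 4.40},
}


def effective_cost(price, cache_hit, output_ratio=0.05):
    """Effective $ per unit (1M input + 50K output) at a given cache hit rate."""
    return ((1.0 - cache_hit) * price["input"]
            + cache_hit * price["cached"]
            + output_ratio * price["output"])


def cheapest_host(cache_hit):
    """Name of the host with the lowest effective cost at this hit rate."""
    return min(HOSTS, key=lambda name: effective_cost(HOSTS[name], cache_hit))


for hit in (0.0, 0.5, 0.8):
    costs = {name: effective_cost(p, hit) for name, p in HOSTS.items()}
    summary = ", ".join(f"{name} ${cost:.2f}" for name, cost in costs.items())
    print(f"cache hit {hit:.0%}: {summary} -> {cheapest_host(hit)}")
```

Under these list prices DeepInfra is cheaper at low hit rates and Together takes over at roughly a 71% hit rate, so the 80% assumption sits just past the crossover. The same check on other models' rows can point the other way, so it is worth rerunning per model with live prices.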

5 · What to do

Bottom line: With non-thinking mode enforced, open models still win the price-for-intelligence frontier everywhere except the extreme top. Gemini 3.1 Flash-Lite is not on the frontier — DeepSeek V4 Flash beats it on both axes. You'd only stay on Gemini for managed-API reliability, multimodality, or data-residency reasons, not for cost or non-thinking intelligence.
Sources. Intelligence: Artificial Analysis Intelligence Index · Gemini 3.1 Flash-Lite (AA, index 25) · DeepSeek V4 Flash/Pro (reasoning 47/52) · BenchLM aggregate. Non-thinking support: Gemini thinking levels · Hybrid thinking toggle study. Pricing: Price Per Token · Together · DeepInfra · Nova · Gemini.

Caveats. Non-thinking Intelligence Index values are estimates derived from published reasoning scores (hybrids ≈ −30%; Kimi/Nova native non-thinking) — AA publishes mostly reasoning-mode; verify per model. Qwen 3.6 Max input, Gemini 3.1 Pro output, and MiniMax M3 cache rate are approximate. Prices are on-demand list and move fast — confirm on live pages before committing.