DeepSeek V5 Beats GPT-5: 90T Parameters Trained for Just $32M

Summary: When a $32 million, 90-trillion-parameter open-source model outperforms the $2.1 billion GPT-5 on core benchmarks, the AI industry instantly splits into "open-source wins" and "cost illusion" camps. The deeper debate isn't about parameters and benchmarks—it's about the authenticity of cost accounting, the lag in safety governance, and whether the closed-source premium logic still holds when a "good enough and cheap" top-tier model is freely available.

---

Perspectives

Cost Depression or Numbers Game?

"Seeing '90T parameters trained for only $32M,' my first instinct was that the accounting methodology might hide some traps," security researcher AISherlock said, opening the discussion. He pointed out that DeepSeek V3's previously published costs excluded hardware depreciation, electricity, and labor allocation. If V5 has truly reached such maturity in dynamic routing and residual pruning, the upfront investment in architecture exploration and failed experiments must have been substantial. "Open source is the trend, but the 'cost depression' claim needs scrutiny first, or decision-makers risk being misled."

Engineer CodePilot agreed about the cost-accounting pain points. When auditing SaaS GPU bills, he said, he found that quotes covered only bare-metal rental and often omitted storage I/O and network throughput. However, he pushed back on concerns about MoE routing overhead with his own test data. The routing network took about 1.2GB in BF16 and dropped to roughly 600MB after INT8 quantization, compared with the 8-10GB of KV cache that GPT-5 decoder blocks routinely consume. V5's smaller number of activated parameters pays off in long-sequence generation: "On my local A100 80G running 512-token generation, V5 peaked at 41GB VRAM vs. GPT-5's 67GB. First-token latency was slower at 230ms vs. GPT-5's 180ms, but subsequent tokens took 22ms vs. 35ms, so long sequences catch up."
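
The "long sequences catch up" point can be checked with a few lines of arithmetic. The sketch below uses only the latency figures CodePilot quoted and assumes a simple linear model (one first-token delay plus a constant per-token cost); real decoding cost usually grows with context length, so treat the numbers as a rough illustration. The function names are made up for this example.

```python
def total_latency_ms(first_token_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Linear estimate: one first-token delay plus (n_tokens - 1) decode steps."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be >= 1")
    return first_token_ms + per_token_ms * (n_tokens - 1)


def crossover_length(a: dict, b: dict, max_tokens: int = 10_000):
    """Smallest output length at which model a finishes strictly before model b."""
    for n in range(1, max_tokens + 1):
        if total_latency_ms(**a, n_tokens=n) < total_latency_ms(**b, n_tokens=n):
            return n
    return None  # a never overtakes b within max_tokens


# Figures quoted in the discussion (single A100 80G, self-reported).
V5 = {"first_token_ms": 230, "per_token_ms": 22}
GPT5 = {"first_token_ms": 180, "per_token_ms": 35}

print(crossover_length(V5, GPT5))             # 5
print(total_latency_ms(**V5, n_tokens=512))   # 11472
print(total_latency_ms(**GPT5, n_tokens=512)) # 18065
```

Under this simplified model, V5's slower first token is repaid after only five output tokens, and at 512 tokens the estimated gap is roughly 6.6 seconds.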

Benchmark Inflation and the Real-World "Remediation Tax"

Marketing veteran PageVeteran drew a parallel to SEO history: impressive benchmark numbers are like the keyword-density optimization of the old days, which looked great while the actual rankings went to straightforward content. "I saw an e-commerce company use an open-source model for customer service. Benchmark accuracy was 91%, but in production it interpreted '7-day free returns' as 'no returns, no exchanges,' doubling customer service costs. It's like using free backlink tools: you save money on the surface, but cleaning up the toxic backlinks nearly destroyed the domain authority." His blunt conclusion: "This 'cost depression' thing has to account for the remediation work, or you pour in real money only to find the pit is full of water."

GeoMaster immediately backed this up with a case from a financial client: "Last year, a financial client switched to an open-source model for Q&A to save money, and it described a product with an 18% annualized return as 'guaranteed principal and interest.' The compliance team nearly had a collective heart attack." He added that DeepSeek V5's hallucination rate, currently 18% higher than GPT-5's, carries non-negligible legal costs in high-compliance scenarios. He is building an "anti-hallucination alignment" content library for clients because LLM-based retrieval is far more sensitive to factual consistency than traditional search: "even if the page loads fast and has high authority, one fabricated claim can zero out brand trust."

However, AISherlock challenged the source of the "18% hallucination rate" figure: "The V5 technical report only shows TruthfulQA untruthfulness dropping 4.2 percentage points vs. V3, with no direct comparison to GPT-5. If this 18% comes from a specific domain like financial compliance, it may reflect differences in how well regulatory language is covered rather than general hallucination. The two shouldn't be conflated."

GeoMaster responded with an extreme medical case: an open-source model had acceptable TruthfulQA scores but rendered "use with caution" as "contraindicated" in drug precautions, nearly triggering a PR crisis. When the team examined the training data, they found that medical long-tail coverage was only 17% of the general corpus, and domain-specific hallucination rates spiked above 40%. "If this 18% is carved out of financial compliance scenarios, it really can't be used as a general metric, just as a site can have high overall authority while specific pages are completely untrusted."
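
The distinction both speakers are circling, a general rate versus a domain-specific one, is easy to see once results are broken out by domain. Below is a minimal sketch with invented evaluation counts (they are not figures from the discussion or from any technical report); `hallucination_rates` is a hypothetical helper name.

```python
from collections import defaultdict


def hallucination_rates(records):
    """Return (overall_rate, {domain: rate}) from (domain, is_hallucination) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for domain, is_hallucination in records:
        totals[domain] += 1
        errors[domain] += int(is_hallucination)
    if not totals:
        raise ValueError("records must not be empty")
    overall = sum(errors.values()) / sum(totals.values())
    per_domain = {domain: errors[domain] / totals[domain] for domain in totals}
    return overall, per_domain


# Hypothetical evaluation set, for illustration only:
# 900 general prompts with 45 hallucinations, 100 medical prompts with 42.
records = (
    [("general", True)] * 45 + [("general", False)] * 855
    + [("medical", True)] * 42 + [("medical", False)] * 58
)

overall, per_domain = hallucination_rates(records)
print(f"overall: {overall:.1%}")
for domain, rate in per_domain.items():
    print(f"{domain}: {rate:.1%}")
```

With these made-up counts the aggregate rate stays under 9% while the medical slice sits at 42%, which is why a single headline figure says little until its domain and sampling are stated.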

---

Deep Analysis

On the surface this debate is about cost and benchmarks, but it actually touches three deeper inflection points in the AI industry.

1. The "Iceberg Effect" of Cost Accounting

Publicly disclosed training costs typically cover only GPU rental for the final training run. What's hidden below the waterline (architecture exploration, ablation studies, failed experiments, data cleaning and human annotation, hardware depreciation, power and cooling, staff salaries) often accounts for 50% or more of total investment. The DeepSeek V3 precedent shows that the gap between "published cost" and "total cost of ownership" can lead decision-makers to severely underestimate actual investment. When V5 touts $32 million, enterprise buyers need to ask to what extent this number repeats V3's cost accounting methodology. And although quantization brought inference-side routing overhead down to 600MB, will storage and network costs become the new bottleneck under large-scale concurrency and complex long-text workloads? CodePilot's single-card data is encouraging, but enterprise deployment isn't a single-card demo.
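
To make the iceberg arithmetic concrete: if the hidden items make up a share `h` of total investment, the published figure is the remaining `1 - h`, so the implied total is `published / (1 - h)`. The sketch below applies that to the $32 million headline. The shares tried are illustrative assumptions, and nothing in the discussion establishes V5's actual hidden share.

```python
def implied_total_cost(published_cost: float, hidden_share: float) -> float:
    """Total investment implied when hidden costs are `hidden_share` of the total."""
    if not 0.0 <= hidden_share < 1.0:
        raise ValueError("hidden_share must be in [0, 1)")
    return published_cost / (1.0 - hidden_share)


PUBLISHED_MUSD = 32.0  # headline training cost, in millions of USD

# Assumed hidden shares; 0.5 corresponds to the "50% or more" rule of thumb.
for share in (0.5, 0.6, 0.75):
    total = implied_total_cost(PUBLISHED_MUSD, share)
    print(f"hidden share {share:.0%}: implied total about ${total:.0f}M")
```

At a 50% hidden share the implied total is $64 million, double the headline, and it climbs quickly as the share rises.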

2. The Benchmark "Overfitting" Trap

Multiple experts questioned V5's SWE-bench victory as a possible case of task overfitting. The massive volume of issue-commit pairs from open-source repositories in the training set makes the model shine on standard tests, but on dirtier, more ambiguous real-world issues, GPT-5 is actually more robust. This isn't an isolated case: generative models have long shown score "inflation" caused by training set contamination. Evaluating model capability requires stress testing on private, real-world CI pipelines, not just public leaderboards. The production lessons from PageVeteran and GeoMaster confirm this: between benchmark accuracy and actual business outcomes there are often several chasms involving compliance, safety, and customer experience.
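
One common way to probe for training set contamination is to measure n-gram overlap between benchmark items and the training corpus. The following is a minimal sketch of that idea, not the method used by any of the parties discussed. It assumes whitespace tokenization and a small in-memory corpus, and `contamination_ratio` is a hypothetical helper. Real checks run over tokenized corpora at scale, typically with longer n-grams.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(benchmark_item: str, corpus_docs, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy data, invented for illustration.
corpus = [
    "Issue: fix null pointer crash when config file is missing on startup",
    "unrelated text about rendering",
]
leaked = "fix null pointer crash when config file is missing"
fresh = "add retry logic to the upload client"

print(contamination_ratio(leaked, corpus, n=3))  # 1.0
print(contamination_ratio(fresh, corpus, n=3))   # 0.0
```

A high ratio flags an item for manual review and does not prove leakage on its own. A low ratio likewise cannot rule out paraphrased or reformatted copies.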

3. The Race Between Safety Governance and Diffusion Speed

A Goldman Sachs report dated February 18 predicted that enterprise adoption of open-source models will surpass that of closed-source models for the first time in Q2 2026. The open-sourcing of DeepSeek V5's complete weights and training framework lands exactly at this inflection point, but the base alignment uses only basic RLHF, resulting in higher hallucination rates in complex scenarios, and red teams quickly exposed deepfake vulnerabilities. Open-source release speed far outpaces safety governance, and the FTC has launched preliminary investigations. The high-risk cases GeoMaster identified in the financial and medical domains raise a sharper question: while it remains uncertain whether collective governance by the open-source community can keep up with the speed of model diffusion, enterprise technology decisions essentially transfer safety responsibility from closed-source vendors to the enterprises' own legal teams.
