Huawei Ascend 950PR — 2.8x H20 FP4, and ByteDance + Alibaba Are Already Stockpiling It
1.56 PFLOPS FP4, 112GB HiBL 1.0 HBM, $6,900 per card. Made on SMIC N+2, targeting 750K units shipped in 2026. What US sanctions accidentally built.

$6,900 per card, and Huawei claims 2.8x an NVIDIA H20
Unveiled at the China Partner Conference on March 20, the Ascend 950PR has quickly become the biggest AI chip story of Q2 2026. Single-card specs: 1.56 PFLOPS FP4, 1 PFLOPS FP8, 112 GB of memory. The DDR version is priced at about 50,000 yuan ($6,900), the HBM version at about 70,000 yuan ($9,600). Against NVIDIA's H20, the closest chip in its class, Huawei claims 2.8x the FP4 compute.
One more number. On March 27, Reuters reported ByteDance and Alibaba had placed bulk orders. ByteDance's total commitment to Huawei silicon reportedly reaches $5.6B.
SMIC is producing it on the N+2 process (roughly equivalent to 7nm), with a 750,000 unit target for 2026. If that volume ships, China's domestic AI inference market can effectively run without NVIDIA.
What this actually is — three differentiators
The Ascend 950PR is "the Chinese-made AI inference chip you can use when you can't buy an NVIDIA H20." Three things define the positioning.
First, FP4-native architecture. The 950PR is the first Chinese AI accelerator to support FP4 inference at scale. FP4 is one precision step below FP8 and two below BF16, the formats the H20 is optimized for. It has become the preferred format for recent model releases (DeepSeek V3/V4, Qwen3, GLM-4) because it halves weight memory relative to FP8 with little accuracy loss. Huawei timed this chip for that shift.
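To make the memory claim concrete, here is a minimal back-of-the-envelope sketch. It assumes a dense model, counts weights only (no KV cache, activations, or quantization scale factors), and uses a 72B-parameter model as an illustrative size; the function names are hypothetical.

```python
# Bits stored per weight in each format. Per-block scale factors that real
# FP4 schemes add on top are ignored here (assumption).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(params_billions: float, fmt: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a dense model."""
    bytes_per_weight = BITS_PER_WEIGHT[fmt] / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


def fits_on_card(params_billions: float, fmt: str, card_gb: float = 112.0) -> bool:
    """True if the weights alone fit in card memory (KV cache not counted)."""
    return weight_footprint_gb(params_billions, fmt) <= card_gb


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        gb = weight_footprint_gb(72, fmt)
        print(f"72B @ {fmt}: {gb:.0f} GB, fits in 112 GB: {fits_on_card(72, fmt)}")
```

Under these assumptions a 72B model needs about 144 GB in BF16, 72 GB in FP8, and 36 GB in FP4, so only the two lower-precision formats fit on a single 112 GB card.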
Second, HiBL 1.0, Huawei's own HBM. HBM from SK Hynix, Samsung, and Micron is increasingly restricted for delivery to China. Huawei responded with HiBL (High-Bandwidth Low-power) 1.0, an in-house HBM shipping at 112 GB capacity and 1.4 TB/s bandwidth. The NVIDIA H20 runs at 4.0 TB/s, so bandwidth is still a weakness. But raw capacity, at 112 GB, exceeds the H20's 96 GB.
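Why bandwidth matters for inference can be sketched with a simple roofline-style bound. The sketch assumes single-stream decoding of a dense model, where every generated token has to stream the full set of weights from memory once. It ignores KV cache traffic, batching, and compute limits, so it gives a ceiling, not a benchmark. The 36 GB weight size is an illustrative figure (a 72B model at 4 bits per weight), applied to both cards only to isolate the bandwidth difference.

```python
def decode_tokens_per_s_ceiling(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed when memory-bound.

    Each token reads all weights once, so tokens/s <= bandwidth / weight bytes.
    """
    bandwidth_gb_s = bandwidth_tb_s * 1000  # TB/s -> GB/s (decimal units)
    return bandwidth_gb_s / weight_gb


# Bandwidth figures as quoted in the article.
ASCEND_950PR_TB_S = 1.4
H20_TB_S = 4.0
WEIGHT_GB = 36.0  # illustrative assumption, not a vendor figure

if __name__ == "__main__":
    ascend = decode_tokens_per_s_ceiling(ASCEND_950PR_TB_S, WEIGHT_GB)
    h20 = decode_tokens_per_s_ceiling(H20_TB_S, WEIGHT_GB)
    print(f"950PR ceiling: {ascend:.1f} tok/s")
    print(f"H20 ceiling:   {h20:.1f} tok/s")
    print(f"H20 / 950PR:   {h20 / ascend:.2f}x")
```

With these inputs the ceiling is roughly 39 tokens/s for the 950PR and 111 tokens/s for the H20. Batching many requests amortizes the weight reads, which is where higher compute and larger capacity can win back ground.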
Third, CUDA-compatible CANN Next. Huawei's CANN Next SDK mirrors CUDA's thread block, warp, and kernel launch semantics. About 80% of standard PyTorch inference code runs with config changes only, no rewrites. This software portability is the main reason ByteDance and Alibaba moved fast.
Core specs — lined up against NVIDIA H20 and H100
The competitor Huawei actually targets is the H20 — the China-spec, performance-reduced NVIDIA chip. The wins and losses are specific.
| Metric | Huawei Ascend 950PR | NVIDIA H20 | NVIDIA H100 |
|---|---|---|---|
| FP4 compute | 1.56 PFLOPS | ~0.56 PFLOPS | N/A |
| FP8 compute | 1 PFLOPS | ~0.30 PFLOPS | 3.96 PFLOPS |
| Memory capacity | 112 GB HiBL 1.0 | 96 GB HBM3 | 80 GB HBM3 |
| Memory bandwidth | 1.4 TB/s | 4.0 TB/s | 3.35 TB/s |
| Interconnect | LingQu, 2.0 TB/s | NVLink, 900 GB/s | NVLink, 900 GB/s |
| TDP | 600 W | 400 W | 700 W |
| Process | SMIC N+2 (7nm class) | TSMC 4N (5nm class) | TSMC 4N (5nm class) |
| Price per card | $6,900–$9,600 | ~$12,000 | ~$30,000 |
| Legal to sell in China? | ✓ | ✗ (further restricted) | ✗ |
Huawei wins on FP4, memory capacity, interconnect, and price. It loses on bandwidth and process density. But inside China, there's no legal way to buy H20 or H100 at volume — so the relevant question shifts from "does it win" to "can you actually get it."
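The headline ratios fall straight out of the table. A short sketch of the arithmetic follows. The `Card` type and function names are hypothetical, the H20 FP4 figure is the approximate value from the table, and the sketch assumes both 950PR price tiers deliver the same 1.56 PFLOPS, which the table implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    fp4_pflops: float
    price_usd: float


# Figures copied from the comparison table; the H20 FP4 value is approximate.
ASCEND_DDR = Card("Ascend 950PR (DDR)", 1.56, 6_900)
ASCEND_HBM = Card("Ascend 950PR (HBM)", 1.56, 9_600)
H20 = Card("NVIDIA H20", 0.56, 12_000)


def fp4_ratio(a: Card, b: Card) -> float:
    """How many times more FP4 compute card a has than card b."""
    return a.fp4_pflops / b.fp4_pflops


def usd_per_fp4_pflops(card: Card) -> float:
    """List price divided by peak FP4 throughput."""
    return card.price_usd / card.fp4_pflops


if __name__ == "__main__":
    print(f"FP4 ratio vs H20: {fp4_ratio(ASCEND_HBM, H20):.2f}x")
    for card in (ASCEND_DDR, ASCEND_HBM, H20):
        print(f"{card.name}: ${usd_per_fp4_pflops(card):,.0f} per FP4 PFLOPS")
```

On these numbers the 2.8x claim checks out (1.56 / 0.56 is about 2.79), and the 950PR lands at roughly $4,400 to $6,200 per peak FP4 PFLOPS against roughly $21,400 for the H20. Peak figures say nothing about sustained throughput, so this is a comparison of spec sheets only.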
Feature breakdown
Atlas 350 card and LingQu fabric
The 950PR chip ships on the Atlas 350 accelerator card. 600W TDP — lower than H100's 700W, higher than H20's 400W. Data centers can plan around an H100-class power envelope. For scale-out, Huawei built LingQu, an in-house interconnect at 2.0 TB/s. Nominally that beats NVLink's 900 GB/s, but the NVSwitch-scale fabric for 256-GPU rack-level networking is still something Huawei hasn't fully matched.
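For capacity planning, the TDP figures translate into efficiency and density numbers. The sketch below uses the article's TDP and FP4 values; the 40 kW rack budget is a hypothetical planning number, and host CPUs, networking, and cooling overhead are not counted.

```python
import math

# TDP and peak FP4 figures as quoted in the article.
TDP_WATTS = {"Ascend 950PR": 600, "NVIDIA H20": 400}
FP4_PFLOPS = {"Ascend 950PR": 1.56, "NVIDIA H20": 0.56}


def fp4_tflops_per_watt(card: str) -> float:
    """Peak FP4 TFLOPS delivered per watt of TDP."""
    return FP4_PFLOPS[card] * 1000 / TDP_WATTS[card]


def cards_per_rack(card: str, rack_budget_watts: float) -> int:
    """Whole cards that fit in a power budget, counting accelerator TDP only."""
    return math.floor(rack_budget_watts / TDP_WATTS[card])


if __name__ == "__main__":
    RACK_BUDGET_WATTS = 40_000  # hypothetical budget, not from the article
    for card in TDP_WATTS:
        print(
            f"{card}: {fp4_tflops_per_watt(card):.1f} TFLOPS/W, "
            f"{cards_per_rack(card, RACK_BUDGET_WATTS)} cards per rack"
        )
```

On paper the 950PR delivers about 2.6 peak FP4 TFLOPS per watt against about 1.4 for the H20, so fewer cards fit in a given power budget but each card carries more compute.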
CANN Next and CUDA portability
The software stack is what matters for engineers. CANN Next exposes thread block, warp, and kernel launch primitives that map closely onto CUDA. PyTorch, vLLM, and TensorRT-LLM backend plugins are rolling out fast. Huawei publicly pushes MindSpore, but ByteDance's production benchmarks are reportedly running on PyTorch. The "80% portable" figure translates to "the other 20% is CUDA-specific kernel code that has to be rewritten" — and engineers note that 20% is where 80% of LLM throughput lives.
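What "config changes only" might look like in practice is a device string chosen at startup. The sketch below is an assumption, not Huawei's documented CANN Next workflow: it presumes the existing Ascend PyTorch adapter conventions (the `torch_npu` package and the `"npu"` device string) carry over, which the article does not confirm. `detect_backends` and `pick_device` are hypothetical helper names.

```python
import importlib.util

# Preference order is a deployment choice (assumption): try Ascend first.
PREFERENCE = ("npu", "cuda", "cpu")


def detect_backends() -> set:
    """Report which device backends look usable in this environment."""
    found = {"cpu"}
    # torch_npu is the adapter package current Ascend hardware uses with
    # PyTorch; whether CANN Next keeps it is an assumption.
    if importlib.util.find_spec("torch_npu") is not None:
        found.add("npu")
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            found.add("cuda")
    return found


def pick_device(available: set, preference: tuple = PREFERENCE) -> str:
    """Return the first preferred device string that is available."""
    for name in preference:
        if name in available:
            return name
    raise ValueError("no usable backend found")


if __name__ == "__main__":
    device = pick_device(detect_backends())
    print(f"Selected device string: {device}")
    # Model code would then use this string, e.g. model.to(device).
```

Code written this way moves between backends by changing one string. Hand-written CUDA kernels are exactly what such a switch cannot cover, which is the 20% the paragraph above describes.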
Pricing + launch timeline
| Item | Date / condition |
|---|---|
| Official unveil | 2026-03-20, China Partner Conference |
| Mass production starts | April 2026 ("next month" per reports) |
| Volume shipments | 2H 2026 |
| 2026 unit target | 750,000 cards |
| DDR version price | 50,000 yuan ($6,900) |
| HBM version price | 70,000 yuan ($9,600) |
| Engineering samples delivered | January 2026, to ByteDance and Alibaba |
| Roadmap | 950DT → 951 → 960 → 970 (sequential release) |
Reuters reports ByteDance and Alibaba received engineering samples in January 2026 and ran production-grade inference benchmarks. The March unveil was a formal launch of a product already validated at customer sites.
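The pricing rows allow a few sanity checks. The sketch below derives the exchange rate implied by each yuan/dollar pair, the revenue implied by the 750,000-unit target, and how many cards a given budget would buy. All inputs are the article's figures and the function names are hypothetical. The $5.6B ByteDance figure reportedly covers Huawei silicon in general, so treating it as a pure 950PR budget is an assumption made only for scale.

```python
UNIT_TARGET_2026 = 750_000

# (price in yuan, price in USD) as quoted in the article.
PRICES = {"DDR": (50_000, 6_900), "HBM": (70_000, 9_600)}


def implied_cny_per_usd(price_cny: int, price_usd: int) -> float:
    """Exchange rate implied by a yuan/dollar price pair."""
    return price_cny / price_usd


def revenue_at_target_usd(price_usd: int, units: int = UNIT_TARGET_2026) -> int:
    """Revenue if every unit in the target sold at one list price."""
    return price_usd * units


def cards_for_budget(budget_usd: int, price_usd: int) -> int:
    """Whole cards a budget buys at list price."""
    return budget_usd // price_usd


if __name__ == "__main__":
    for version, (cny, usd) in PRICES.items():
        rate = implied_cny_per_usd(cny, usd)
        revenue = revenue_at_target_usd(usd)
        cards = cards_for_budget(5_600_000_000, usd)
        print(
            f"{version}: {rate:.2f} yuan/USD, "
            f"${revenue / 1e9:.2f}B at target, {cards:,} cards for $5.6B"
        )
```

Both price pairs imply a rate of roughly 7.2 to 7.3 yuan per dollar, so the conversions are internally consistent. At list price the 2026 target is worth between about $5.2B and $7.2B, the same order of magnitude as the reported ByteDance commitment.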
Who this is for
ByteDance, Alibaba, Tencent, Baidu, and other Chinese hyperscalers: They are the target. US export controls block large-scale H100/H200/B200 buys, and Huawei is the only credible domestic alternative at scale. ByteDance's $5.6B commitment means TikTok and Douyin's recommendation models and the Doubao LLM will increasingly run on Huawei silicon.
Mid-size Chinese AI startups: With H20 grey-market prices running $25,000-$35,000, a $6,900 alternative is real. The chip is optimized for exactly the kind of FP4-friendly models Chinese startups deploy — DeepSeek R1/V4, Qwen3 32B-72B.
Developers in the US, Europe, and Korea: You can't buy one directly. Indirect access only, via rented instances on Alibaba Cloud, Tencent Cloud, or Huawei Cloud. But as a benchmark reference, the 950PR's public spec sheet is the most detailed look in years at how close China's 7nm fabrication has gotten.
Competitive response and market position
NVIDIA has not commented publicly as of April 15, but leaked internal memos suggest an H20 successor called B20 is being fast-tracked for the China market. B20 would scale H20's performance down further to stay below US export control thresholds.
AMD, Broadcom, and other Western chipmakers have effectively ceded China. AMD MI300X is export-restricted. Broadcom is focused on Google, Meta, and other US hyperscalers.
The question in China's AI chip market is no longer "when can we buy NVIDIA again." It's "how fast can we internalize Huawei."
Other Chinese AI chip startups — Cambricon, Hygon, Biren — don't have Huawei's scale or software ecosystem. Huawei is settling into the de facto standard position for Chinese AI infrastructure.
The bigger picture — sanctions are building a separate ecosystem
The US Commerce Department (BIS) has been restricting chip exports to China in phases since 2022: H100 first, then H800/A800, then H20 follow-ons. The Netherlands has blocked exports of ASML's EUV equipment, and Japan has restricted its own chipmaking equipment. After roughly four years of these measures, China has accelerated toward full-stack self-sufficiency.
SMIC's N+2 process (7nm-class) demonstrated it could yield multi-billion-transistor chips when the Kirin 9000S smartphone SoC shipped in the Mate 60 Pro in 2023. The Ascend 950PR confirms that the same process can scale to mass-production AI accelerators about two and a half years later. During the same window TSMC moved from 3nm to 2nm, so the node gap actually widened. But the 950PR's existence shows node density isn't decisive for defending the Chinese domestic market.
Geopolitically, the dual system is now visible. NVIDIA and AMD set the standard in the West. Huawei sets it in China. The two ecosystems are splitting at every layer — software stack (CUDA vs CANN), memory (HBM3e vs HiBL), interconnect (NVLink vs LingQu). The longer this divergence runs, the more expensive it becomes to reunify.
So what actually changes
NVIDIA shareholders and US policymakers: The thesis "sanctions slow China down" is fraying. Whether the B20 successor defends China market share is the next big question. If Commerce restricts B20 too, the irony is that it would hand Huawei a clean monopoly — the opposite of the intended outcome.
AI companies outside China: You can't use Huawei directly, but Chinese companies running LLM infrastructure much more cheaply creates competitive pressure. Expect a repeat of the early-2025 DeepSeek moment, when R1 matched OpenAI o1 at a fraction of the training cost. Accelerated Chinese open-weight releases translate directly into global price pressure.
Western engineers in practice: Chinese models on Hugging Face — Qwen3, DeepSeek V4, GLM-4 — are increasingly high-quality. The fact that they're trained and served on Huawei silicon raises governance questions. Enterprise RAG and fine-tuning pipelines that use these weights need a longer risk checklist, not a shorter one.
Korean semiconductor industry: SK Hynix and Samsung HBM exports to China are already restricted. Huawei's move to in-house HiBL means China is a permanently lost HBM customer. HBM demand from NVIDIA, AMD, and Google's TPUs remains, so short-term impact is contained. Mid-term, "Chinese AI demand = inaccessible" has to become the planning assumption.
References
- Huawei's new AI chip finds favor with ByteDance, Alibaba (CNBC/Reuters)
- Huawei Ascend 950PR: The 1.56 PFLOP AI Chip vs Nvidia (Tech Insider)
- Huawei Ascend 950PR: Atlas 350 AI Chip Challenges NVIDIA (Nerd Level Tech)
- Ascend 950PR Secures Major Orders from ByteDance and Alibaba (Technetbook)
- Huawei 950PR: Data Center CRE Impact (AI Consulting Network)
