Parallel-Processing Titans: How NVIDIA’s H100 and Huawei’s Ascend 910C Execute Matrix Math and Dot-Product Workloads in Modern AI



1 Introduction

Matrix multiplication and its elemental dot products sit at the heart of every deep-learning operation—from the Q · Kᵀ attention scores in a Transformer to the weight–activation products in a convolution. Modern AI accelerators therefore live or die by the speed with which they can turn thousands of tiny A × B + C fused-multiply-adds (FMAs) into petaflop-scale throughput.
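To make that concrete, here is a minimal NumPy sketch (shapes are illustrative, not taken from any particular model) showing that every entry of a Q · Kᵀ score matrix is simply a dot product, i.e. a chain of FMAs:

```python
import numpy as np

# Toy attention-score computation: every entry of Q @ K^T is one dot product,
# i.e. a chain of multiply-accumulate (FMA) operations over the head dimension.
rng = np.random.default_rng(0)
n_tokens, d_head = 8, 64                      # illustrative sizes
Q = rng.standard_normal((n_tokens, d_head)).astype(np.float32)
K = rng.standard_normal((n_tokens, d_head)).astype(np.float32)

scores = Q @ K.T                              # one GEMM: n_tokens x n_tokens dot products
manual = np.array([[np.dot(q, k) for k in K] for q in Q], dtype=np.float32)
assert np.allclose(scores, manual, atol=1e-4)
```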

NVIDIA’s Hopper-generation H100 Tensor Core GPU and Huawei’s dual-chiplet Ascend 910C NPU represent two divergent engineering philosophies aimed at the same goal: flood the chip with dense systolic-like matrix engines, surround them with gargantuan on-package HBM stacks, and wire everything together with exceptionally wide on-chip and inter-chip fabrics. This 3 000-word essay dissects how each device parallelises matrix math in both training and inference, illustrating the hardware concepts with concrete, real-world workloads.


2 Why Parallel Matrix Math Dominates AI Silicon

  1. Arithmetic intensity – The ratio of flops to bytes moved is far higher for batched GEMM than for element-wise ops, so hardware spends more time computing and less time waiting on memory.
  2. Algorithmic fit – Back-propagation for dense layers, attention, and most convolution kernels can all be expressed as GEMM or batched dot products.
  3. Tiling opportunity – Matrices can be subdivided into independent tiles that map naturally onto warps, cubes, or slices of an NPU, yielding enormous data-parallelism with modest scheduling overhead.

Because of these properties, both H100 and 910C dedicate the majority of their die area to specialised matrix engines, while relegating scalar/vector units to supervisory roles.
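The tiling idea itself fits in a few lines of NumPy; the tile size of 16 below is chosen only to mirror the 16 × 16 engines discussed later:

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Reference tiled GEMM: each (i, j) output tile is an independent unit of
    work, which is what lets hardware spread tiles across warps or cube
    engines with minimal scheduling overhead."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):            # independent output tiles
            acc = np.zeros((tile, tile), dtype=np.float32)
            for k in range(0, K, tile):        # the K loop accumulates partial sums
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(64, 48).astype(np.float32)
B = np.random.rand(48, 32).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```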


3 H100 Hopper Architecture: Fourth-Gen Tensor Cores + Transformer Engine

3.1 Tile-Level Parallelism

Each Streaming Multiprocessor (SM) in H100 contains four fourth-generation Tensor Cores. These units execute warp-level Matrix-Multiply-Accumulate (WMMA) and Hopper's new warp-group MMA (WGMMA) instructions on FP8/FP16/BF16 tiles such as 16 × 16 × 16, producing 16 × 16 output blocks that accumulate in FP32 or FP16. With FP8 operands the four Tensor Cores of an SM deliver roughly 4 096 FMAs per clock, and since each fused multiply-add counts as two floating-point operations, the full chip reaches close to 2 PFLOP/s of dense FP8 math at boost clocks (NVIDIA Developer).
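As a rough feel for how much tile-level parallelism a single layer exposes, the sketch below counts the 16 × 16 × 16 MMA operations a dense GEMM decomposes into (illustrative only; real CUTLASS/cuBLASLt kernels also split work across thread blocks, warps, and K-splits):

```python
import math

def mma_tile_count(M, N, K, tile=(16, 16, 16)):
    """Number of 16x16x16 tile-level MMA operations in an M x N x K GEMM."""
    tm, tn, tk = tile
    return math.ceil(M / tm) * math.ceil(N / tn) * math.ceil(K / tk)

# Example: attention scores for one head of a 2048-token sequence, head_dim = 128
print(mma_tile_count(2048, 2048, 128))   # 131072 tile-level MMAs for a single head
```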

3.2 Dataflow and Double-Buffering

The Hopper SM is built around a dual-ported register-file/L1 path that lets one set of tiles stream into the Tensor Cores while the previous set drains its results, achieving near-100 % utilisation when the compiler (via CUTLASS or cuBLASLt) selects the right pipeline depth. Hopper's Tensor Memory Accelerator (TMA) and asynchronous copy hardware stream tiles out of HBM3 (about 3.35 TB/s on SXM5) into up to 228 KB of shared memory per SM, carved out of a 256 KB combined L1/shared array, so warps rarely stall on global-memory latency (Dolphin Studios).
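A software analogue of this double-buffering pattern can be sketched with two CUDA streams in PyTorch. This is a minimal ping-pong sketch assuming a CUDA-capable GPU, not a description of Hopper's actual hardware pipeline:

```python
import torch

# Ping-pong double buffering: while the default stream multiplies the current
# tile, a copy stream prefetches the next one, so compute and data movement overlap.
assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()

host_tiles = [torch.randn(16, 16, dtype=torch.float16).pin_memory() for _ in range(8)]
W = torch.randn(16, 16, dtype=torch.float16, device="cuda")
bufs = [torch.empty(16, 16, dtype=torch.float16, device="cuda") for _ in range(2)]

with torch.cuda.stream(copy_stream):                 # prefetch the very first tile
    bufs[0].copy_(host_tiles[0], non_blocking=True)

outputs = []
for i in range(len(host_tiles)):
    torch.cuda.current_stream().wait_stream(copy_stream)      # tile i has landed
    if i + 1 < len(host_tiles):
        copy_stream.wait_stream(torch.cuda.current_stream())  # don't overwrite a buffer still in use
        with torch.cuda.stream(copy_stream):                  # prefetch tile i+1 into the other buffer
            bufs[(i + 1) % 2].copy_(host_tiles[i + 1], non_blocking=True)
    outputs.append(bufs[i % 2] @ W)                           # compute on tile i
torch.cuda.synchronize()
```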

3.3 Transformer Engine (TE)

Hopper’s secret sauce is the Transformer Engine: a combination of Tensor Core hardware and library logic that dynamically casts tensors to the FP8 E4M3/E5M2 formats when the quantisation error is acceptable, while falling back to BF16 for out-of-range channels. Because FP8 halves memory traffic and doubles Tensor Core throughput, TE can deliver up to 9 × faster training and 30 × faster inference on large LLMs versus A100 (Advanced Clustering Technologies).
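Frameworks reach TE through NVIDIA's transformer_engine library; the following is a minimal sketch modelled on its PyTorch quick-start pattern, assuming transformer_engine is installed and an FP8-capable GPU is present (layer sizes are illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

in_features, out_features, batch = 768, 3072, 2048

# HYBRID = E4M3 for forward activations/weights, E5M2 for backward gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(batch, in_features, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)      # GEMM executes on FP8 Tensor Cores, accumulating at higher precision

out.sum().backward()      # backward GEMMs reuse the FP8 recipe for gradients
```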


4 Case Study #1 – Training a 172 B-Parameter Japanese LLM on DGX H100

Researchers from NICT and Tohoku University trained LLM-JP-172B on 2.1 T tokens using Google Cloud A3 (8 × H100-80 GB) nodes and Megatron-LM with TE. Hybrid FP8/BF16 reduced memory by 42 % and improved step time by 1.65 × versus pure BF16; the team sustained 690 TFLOP/s per GPU, roughly 70 % of the device’s dense BF16 peak (NVIDIA Developer).

Scaling to 512 GPUs over NVSwitch and RDMA-over-Converged-Ethernet delivered 54 PFLOP/s of sustained math with 88 % parallel-efficiency—numbers that would have required >2 000 A100s.


5 Case Study #2 – Low-Latency Inference for Llama-2-70B

NVIDIA’s TensorRT-LLM team demonstrated that an 8-GPU DGX H100 can serve 5 QPS at ≤2.5 s total latency for Llama-2-70B with continuous batching. A single-query run returns tokens in ~1.7 s thanks to:

  • FP8/INT8 mixed-precision kernels for GEMMs in the attention and MLP blocks
  • Kernel fusion of rotary + matmul to minimise memory round-trips
  • Asynchronous copy engines overlapping weights-fetch with computation

Overall throughput reaches >1.2 k tokens s⁻¹ per GPU, bottlenecked by network bandwidth, not math capability (Dolphin Studios).
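A back-of-envelope check, using only the figures quoted above, shows how the throughput and query-rate numbers fit together:

```python
# Back-of-envelope consistency check using the figures quoted above.
gpus           = 8        # GPUs in one DGX H100
tokens_per_gpu = 1200     # >1.2 k tokens/s per GPU (from the text)
qps            = 5        # sustained queries per second

aggregate_tokens_per_s  = gpus * tokens_per_gpu          # ~9600 tokens/s per node
tokens_per_query_budget = aggregate_tokens_per_s / qps   # ~1920 generated tokens per query
print(aggregate_tokens_per_s, tokens_per_query_budget)
```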


6 Ascend 910C: Da Vinci Cores and 16³ Cube Engines

6.1 Cube-Based Matrix Units

Each Da Vinci Max core contains a “Cube” engine: 4 096 FP16 MACs arranged as a 16 × 16 × 16 systolic cube. Two FIFOs stream A and B sub-tiles through orthogonal faces while accumulators build partial sums; pipeline latency is hidden by double-buffering and a hierarchical 256 KB on-core SRAM (CMC Microsystems).
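Functionally, a single cube step is equivalent to the loop nest below, which makes the 4 096-MAC count explicit (a conceptual NumPy emulation, not Huawei's microarchitecture):

```python
import numpy as np

# One 16x16x16 cube step: every C[i, j] consumes 16 multiply-accumulates,
# so a full step performs 16 * 16 * 16 = 4096 MACs.
A = np.random.rand(16, 16).astype(np.float16)
B = np.random.rand(16, 16).astype(np.float16)
C = np.zeros((16, 16), dtype=np.float32)       # accumulate at higher precision

macs = 0
for i in range(16):
    for j in range(16):
        for k in range(16):
            C[i, j] += np.float32(A[i, k]) * np.float32(B[k, j])
            macs += 1

assert macs == 4096
assert np.allclose(C, A.astype(np.float32) @ B.astype(np.float32), atol=1e-3)
```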

A 910C package integrates 32 Da Vinci Max cores on a dual-chiplet mesh, yielding ≈780 TFLOP/s of dense BF16 throughput (dual-chip peak) when all cubes operate every cycle (Tom’s Hardware).

6.2 Memory and Interconnect

Five HBM2E stacks feed each chiplet, giving roughly 1.3 TB/s of aggregate bandwidth per die. On chip, a 1 024-bit mesh provides about 128 GB/s of bisection bandwidth per node to keep the Da Vinci cores saturated (Tom’s Hardware). Huawei’s “Task Scheduler” firmware maps graph segments (GE ops in MindSpore) onto cores, choosing cube or vector paths and inserting L0/L1 DMA bursts to overlap compute with data movement.


7 Case Study #3 – PanGu Ultra 135B on an Ascend Cluster

The PanGu Ultra project trained a 135 B-parameter dense Transformer on 8 192 Ascend NPUs (≈256 server racks). Authors report 41 % less wall-clock time vs. their earlier 200 B model on 16 384 A800 GPUs, attributing gains to:

  • Cube-aligned 16 × 16 kernel sizes eliminating edge padding
  • “Depth-Scaled Sandwich Norm” to stabilise FP16 training at lower precision
  • Mesh-aware ZeRO partitioning that reduced cross-chip all-reduce by 27 %

The cluster sustained 4.3 PFLOP/s (FP16) during the attention pass, equal to 65 % of theoretical peak, which is remarkable given power and export constraints (Medium).
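The first of those bullets is easy to quantify. The hypothetical helper below (not PanGu's actual tooling) estimates the fraction of MAC work wasted when GEMM dimensions must be padded up to the cube size:

```python
import math

def padding_waste(M, N, K, tile=16):
    """Fraction of MAC work wasted when each GEMM dimension is padded up to a
    multiple of the cube/tile size (illustrative helper only)."""
    padded = math.ceil(M / tile) * math.ceil(N / tile) * math.ceil(K / tile) * tile**3
    return 1.0 - (M * N * K) / padded

print(f"{padding_waste(4096, 4096, 4096):.1%}")  # cube-aligned sizes: 0.0% waste
print(f"{padding_waste(4100, 4100, 4100):.1%}")  # misaligned sizes pay a small padding tax
```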


8 Case Study #4 – CloudMatrix 384 Inference Super-Cluster

Unable to import GB200 NVL72 systems, Huawei built CloudMatrix 384: sixteen racks housing 384 Ascend 910C processors linked by full-optical 800 G transceivers. The machine delivers about 300 PFLOP/s of dense BF16 compute, roughly 1.7 × a GB200 NVL72, at a total draw of 559 kW (≈2.3 × lower energy efficiency per unit of compute) (Tom’s Hardware).

For GPT-style inference the optical mesh slashes end-to-end latency compared with earlier copper Atlas nodes: a 70-B parameter ChatGLM responds in <3 s at 256-token context, competitive with DGX H100 performance (albeit at higher power).


9 Comparative Analysis of the Math Engines

  • Tile shape: 16 × 16 × 16 on both the H100 Tensor Core and the 910C Cube. Both match the Transformer K-dimension (a multiple of 16), so neither wastes work on padding.
  • Datatypes: H100 supports FP32/TF32/BF16/FP16/FP8/INT8; 910C supports FP32/FP16/BF16/INT8. FP8 gives H100 roughly a 2 × edge over FP16 cubes in newer TE-aware frameworks.
  • Peak throughput: H100 reaches 3 958 TFLOP/s FP8 with 2:4 sparsity, or ≈1 979 TFLOP/s dense (SXM) (megware.com); 910C reaches ≈780 TFLOP/s dense BF16 (dual-chip) (Tom’s Hardware). H100 leads per device; 910C makes up ground via larger clusters.
  • On-chip SRAM: H100 provides 256 KB of combined L1/shared memory per SM (up to 228 KB usable as shared memory); 910C provides 256 KB of unified buffer per core (~8 MB total). Similar buffering strategy; 910C relies more on compiler-managed DMA.
  • Interconnect: H100 uses NVLink 4 (900 GB/s per GPU) plus NVSwitch; 910C uses an optical mesh (6 × 800 G per rack). 910C’s optics help at ≥256 devices; NVSwitch gives lower latency at ≤256.
  • Software stack: H100 runs CUDA, cuBLASLt, Transformer Engine, and TensorRT-LLM; 910C runs CANN, MindSpore, TBE/TIK, and AscendCL. CUDA ecosystem maturity is still decisive outside China.

10 Dot-Product Execution Paths in Training and Inference

  1. Forward pass – Multiply activation matrix A (batch × K) by weight matrix Wᵀ (N × K):
    • H100 issues warp-level MMA instructions (mma.sync/WGMMA) on FP8 tiles, writes FP32 accumulators, and lets TE convert the results back to FP8 or BF16 for storage (see the shape sketch after this list).
    • 910C dispatches “cube_matmul” with double-buffered A/B streams; compiler emits pipelined L0/L1 DMAs to hide HBM latency.
  2. Backward weight gradient – Compute dW = dOᵀ · A, identical algorithm but larger accumulation range; both devices usually switch to BF16 (H100) or FP16 accumulate (910C) to maintain numerical stability.
  3. Optimizer update – Adam or LAMB vector math largely executes on vector/SFU units; cost is <5 % of step time, so architectural differences matter less.
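The shapes involved in the forward and backward GEMMs above, plus the companion activation-gradient GEMM, can be summarised in a short NumPy sketch (sizes are illustrative):

```python
import numpy as np

# Shape-level view of the GEMMs described above (sizes are illustrative).
tokens, K, N = 32, 1024, 4096
A  = np.random.rand(tokens, K).astype(np.float32)   # activations
W  = np.random.rand(N, K).astype(np.float32)        # weights
dO = np.random.rand(tokens, N).astype(np.float32)   # upstream gradient

O  = A @ W.T            # forward pass:        (tokens x K) @ (K x N)
dW = dO.T @ A           # weight gradient:     (N x tokens) @ (tokens x K)
dA = dO @ W             # activation gradient: (tokens x N) @ (N x K)

assert O.shape == (tokens, N) and dW.shape == W.shape and dA.shape == A.shape
```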

11 Energy, Cost, and Ecosystem Considerations

  • Perf/Watt – H100 SXM5 delivers roughly 2.7 TFLOP/s per watt at FP8 peak; 910C delivers roughly 1.4 TFLOP/s per watt at BF16 peak. Cheap mainland electricity narrows the TCO gap in Huawei’s domestic deployments.
  • CapEx – Export-restricted H100 retails at >$30 k, whereas Huawei prices 910C below ¥120 k (~$16 k) to seed adoption.
  • Framework lock-in – CUDA’s maturity means 90 % of new open-source ops appear first on NVIDIA; Huawei compensates by upstreaming MindSpore+Ascend operators to PyTorch/XLA forks.

12 Future Directions

  • H100 successors (Blackwell B100/B200) extend FP4 and 2:4 structured-sparsity support, potentially pushing dense GEMM beyond 8 PFLOP/s per GPU.
  • Ascend 920 (road-mapped for 2026) will switch to HBM3 and 5 nm, adding FP8 tile modes and upgrading the cube to 32-deep pipelines—narrowing the per-chip gap.
  • Algorithmic trends like Flash-Attention 3 and minimal-updates AdamW will further favour chips with high on-chip SRAM and fast register-file exchange—advantages both architectures are evolving toward.

13 Conclusion

Matrix multiplication dominates AI workloads because it converts all the statistical complexity of deep learning into pure linear algebra. NVIDIA’s H100 and Huawei’s Ascend 910C attack that fact with two different hardware grammars—warp-centric Tensor Cores plus dynamic precision on one hand, cube-centric systolic arrays with compiler-driven scheduling on the other.

Real-world evidence shows H100 winning on absolute throughput and energy efficiency per device, exemplified by record MLPerf runs and sub-2 s Llama-2 inference latencies. Yet 910C’s dense packing and optical clustering let Huawei reach competitive aggregate flops for domestic LLM training at a favourable cost structure—even if power budgets are larger.

In short, H100 excels when your workload, power envelope, and supply chain favour “less hardware, more capability,” while Ascend 910C shines in environments that trade electrical wattage and machine-count for geostrategic independence. Understanding their distinct parallel matrix engines—and how those engines devour dot products—is therefore crucial for architects deciding where to place the next trillion-parameter model.

