Binary Computing
&
Ternary AI
From individual bits to intelligent systems. How {-1, 0, +1} ternary weights transform matrix multiplication into simple addition.
From Binary to Ternary
Traditional computing uses binary (base-2): every value is represented by 0s and 1s. A standard FP16 weight uses 16 bits. BitNet reduces this to just 2 stored bits encoding 3 possible values, a ternary (base-3) representation layered on top of binary hardware. Since log2(3) ≈ 1.58, each ternary weight carries about 1.58 bits of information.
Interactive: Number Systems
Ternary Weight Matrix: Live Visualization
Below is a simulated 16x16 layer of a BitNet model. Each cell represents one weight: green (+1), dark (0), red (-1). Click "Regenerate" to see a new random weight distribution. Notice the ~33% sparsity (zeros).
16x16 Weight Matrix
Matrix Multiply: FP16 vs BitNet
for i in 0..M:
  for j in 0..N:
    acc = 0.0f16
    for k in 0..K:
      acc += W[i][k] * X[k][j]   // FP16 MULTIPLY
    Y[i][j] = acc

Operations: M * N * K multiplies
Energy: 1.1 pJ per multiply
Total: M * N * K * 1.1 pJ
for i in 0..M:
  for j in 0..N:
    acc = 0
    for k in 0..K:
      if W[i][k] == +1: acc += X[k][j]
      if W[i][k] == -1: acc -= X[k][j]
      // W[i][k] == 0: skip entirely!
    Y[i][j] = acc

Operations: ~M * N * K * 0.67 add/sub
Energy: 0.03 pJ per add
Total: M * N * K * 0.67 * 0.03 pJ
Result: ~55x Less Energy Per Layer
By eliminating all floating-point multiplications and leveraging ~33% weight sparsity (zeros that skip computation entirely), BitNet cuts each layer's arithmetic energy by roughly 55x. The saving does not carry over one-for-one to the whole model, because arithmetic is only part of the budget: memory traffic, activation handling, and normalization still cost energy. That is why end-to-end benchmarks report a 55-82% total energy reduction rather than the full ~55x.
ARM NEON Assembly: BitNet Kernel
This is a detailed walkthrough of BitNet's ternary matrix kernel on ARM processors using NEON SIMD instructions. The listing below shows the GEMV (matrix-vector multiply) inner routine, the building block of the full GEMM (General Matrix Multiply). Each NEON register holds 16 bytes, processing 16 elements simultaneously.
; ============================================
; BitNet Ternary GEMV — ARM64 NEON
; Processes 16 elements per SIMD iteration
; No floating-point instructions used
; ============================================
        .global bitnet_ternary_gemv
        .text
        .align 4
bitnet_ternary_gemv:
        ; x0 = output ptr, x1 = activation ptr
        ; x2 = packed weight ptr, x3 = K (inner dim)
        ; x4 = M (rows) — outer row loop omitted here

        ; Initialize accumulator registers to zero
        MOVI    v30.4s, #0              ; v30 = positive accumulator
        MOVI    v31.4s, #0              ; v31 = negative accumulator

        ; Prepare mask constants
        MOVI    v28.16b, #0x01          ; mask for bit 0 (positive)
        MOVI    v29.16b, #0x02          ; mask for bit 1 (negative)

.loop_k:
        ; Load 16 INT8 activations
        LD1     {v0.16b}, [x1], #16

        ; Load 16 ternary weights, one byte each
        ; Low 2 bits encode the value: 00 = zero, 01 = +1, 10 = -1
        LD1     {v1.16b}, [x2], #16

        ; Extract positive mask (bit 0 set = weight is +1)
        AND     v2.16b, v1.16b, v28.16b
        CMEQ    v2.16b, v2.16b, v28.16b ; all 1s where w = +1

        ; Extract negative mask (bit 1 set = weight is -1)
        AND     v3.16b, v1.16b, v29.16b
        CMEQ    v3.16b, v3.16b, v29.16b ; all 1s where w = -1

        ; Conditional accumulate: add activations where w = +1
        AND     v4.16b, v0.16b, v2.16b  ; zero out non-positive lanes
        SADDLP  v5.8h, v4.16b           ; pairwise add to 16-bit
        SADALP  v30.4s, v5.8h           ; accumulate to 32-bit

        ; Conditional accumulate: add activations where w = -1
        AND     v6.16b, v0.16b, v3.16b  ; zero out non-negative lanes
        SADDLP  v7.8h, v6.16b
        SADALP  v31.4s, v7.8h

        ; w = 0 weights: NO operation (free sparsity!)
        SUBS    x3, x3, #16
        B.GT    .loop_k

        ; Final: result = positives - negatives
        SUB     v30.4s, v30.4s, v31.4s
        ADDV    s0, v30.4s              ; horizontal sum across lanes
        STR     s0, [x0]                ; store result
        RET
NEON Register State Visualization
x86 AVX-512: Server-Grade BitNet
; Process 64 elements per iteration with AVX-512
        vmovdqu8    zmm0, [rsi]             ; load 64 INT8 activations
        vmovdqu8    zmm1, [rdx]             ; load 64 ternary weight bytes

        ; Decode ternary weights into opmask registers
        ; (zmm_mask_pos / zmm_mask_neg hold the 0x01 / 0x02 byte constants)
        vptestmb    k1, zmm1, zmm_mask_pos  ; k1 = lanes where w = +1
        vptestmb    k2, zmm1, zmm_mask_neg  ; k2 = lanes where w = -1

        ; Masked accumulate (512 bits wide!)
        vpaddb      zmm4 {k1}, zmm4, zmm0   ; add activations where w = +1
        vpaddb      zmm5 {k2}, zmm5, zmm0   ; collect negatives, subtracted at the end

        ; 64 elements processed in ~1 clock cycle
        ; Throughput: ~200 GOPS at 3 GHz