INTERACTIVE TECHNICAL GUIDE

Binary Computing
& Ternary AI

From individual bits to intelligent systems. How {-1, 0, +1} ternary weights transform matrix multiplication into simple addition.

From Binary to Ternary

Traditional computing uses binary (base-2): every value is represented by 0s and 1s. A standard FP16 weight uses 16 bits. BitNet restricts each weight to one of three values, {-1, 0, +1}, a ternary (base-3) system. Storing three states takes 2 bits in practice, but each weight carries only log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" name comes from.
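As a sketch of what this means in storage terms (an illustrative 2-bit packing, not necessarily BitNet's exact on-disk format): each weight in {-1, 0, +1} maps to a 2-bit code, so four weights fit in one byte, even though the information content is only log2(3) ≈ 1.58 bits per weight. The 2-bit codes below (01 = +1, 10 = -1, 00 = 0) match the encoding used in the kernel listings later in this guide.

```c
#include <stdint.h>
#include <math.h>

/* Illustrative 2-bit codes: 00 = 0, 01 = +1, 10 = -1. */
static uint8_t encode_ternary(int8_t w) {
    return (w == 1) ? 0x01 : (w == -1) ? 0x02 : 0x00;
}

/* Pack four ternary weights into one byte, two bits each. */
uint8_t pack4(const int8_t w[4]) {
    uint8_t b = 0;
    for (int i = 0; i < 4; i++)
        b |= (uint8_t)(encode_ternary(w[i]) << (2 * i));
    return b;
}

/* Unpack slot i (0..3) back to {-1, 0, +1}. */
int8_t unpack(uint8_t b, int i) {
    uint8_t code = (b >> (2 * i)) & 0x03;
    return (code == 0x01) ? 1 : (code == 0x02) ? -1 : 0;
}

/* Information content per weight: log2(3) ≈ 1.585 bits. */
double bits_per_weight(void) { return log2(3.0); }
```

The gap between 2 stored bits and 1.58 information bits is why some implementations pack more aggressively (e.g. five ternary weights into one byte, since 3^5 = 243 < 256).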

Interactive: Number Systems

[Interactive widget: number-system explorer with tabs for Binary (base-2), Ternary (base-3), and BitNet Encoding]

Ternary Weight Matrix: Live Visualization

Below is a simulated 16x16 layer of a BitNet model. Each cell represents one weight: green (+1), dark (0), red (-1). Click "Regenerate" to see a new random weight distribution. Notice the ~33% sparsity (zeros).

[Interactive widget: 16x16 weight matrix with live counts of +1 / 0 / -1 cells — 1.58 bits/weight]
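The distribution behind this visualization can be reproduced in a few lines of C (a sketch: a real model's weights come from quantization-aware training, not a uniform draw, but a uniform draw shows the same ~1/3 zero rate):

```c
#include <stdint.h>
#include <stdlib.h>

/* Fill a 16x16 matrix with uniform random ternary weights and
 * return the number of zeros (expected ~256/3 ≈ 85, i.e. ~33%). */
int fill_and_count_zeros(int8_t m[16][16], unsigned seed) {
    srand(seed);
    int zeros = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++) {
            m[i][j] = (int8_t)((rand() % 3) - 1);  /* -1, 0, or +1 */
            if (m[i][j] == 0) zeros++;
        }
    return zeros;
}
```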

Matrix Multiply: FP16 vs BitNet

// FP16: Multiply-Accumulate (slow, energy-hungry)
for i in 0..M:
  for j in 0..N:
    acc = 0.0f16
    for k in 0..K:
      acc += W[i][k] * X[k][j]  // FP16 MULTIPLY
    Y[i][j] = acc

Operations: M * N * K multiplies
Energy: 1.1 pJ per multiply
Total:  M*N*K * 1.1 pJ

// BitNet: Add/Sub only (fast, efficient)
for i in 0..M:
  for j in 0..N:
    acc = 0
    for k in 0..K:
      if W[i][k]==+1: acc += X[k][j]
      if W[i][k]==-1: acc -= X[k][j]
    Y[i][j] = acc  // skip zeros!

Operations: ~M*N*K*0.67 add/sub
Energy: 0.03 pJ per add
Total:  M*N*K * 0.67 * 0.03 pJ
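The two loops above translate directly to C (a scalar sketch: `float` stands in for FP16 for portability, and the ternary weights are assumed unpacked to one `int8_t` each):

```c
#include <stdint.h>
#include <stddef.h>

/* FP16-style path (float stands in for f16): one multiply per element. */
void gemm_fp(const float *W, const float *X, float *Y,
             size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += W[i * K + k] * X[k * N + j];
            Y[i * N + j] = acc;
        }
}

/* BitNet-style path: weights are in {-1, 0, +1}, so each multiply
 * collapses to an add, a subtract, or nothing at all. */
void gemm_ternary(const int8_t *W, const int32_t *X, int32_t *Y,
                  size_t M, size_t N, size_t K) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; k++) {
                int8_t w = W[i * K + k];
                if (w == 1)       acc += X[k * N + j];
                else if (w == -1) acc -= X[k * N + j];
                /* w == 0: skipped entirely (free sparsity) */
            }
            Y[i * N + j] = acc;
        }
}
```

Note that the ternary path never touches floating-point hardware at all: the inner loop is a branch and an integer add.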

Result: ~55x Less Energy Per Layer

By eliminating all floating-point multiplications and leveraging ~33% weight sparsity (zeros that skip computation entirely), BitNet achieves dramatic energy savings at each layer. Across an entire model with hundreds of layers, this compounds into the 55-82% total energy reduction observed in benchmarks.
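The ~55x figure falls directly out of the per-operation costs quoted above (taking those energy numbers as given): FP16 spends 1.1 pJ on every one of the M·N·K multiplies, while the ternary path performs only ~67% as many operations at 0.03 pJ each.

```c
/* Per-layer energy ratio, using the per-op costs quoted above. */
double energy_ratio(void) {
    double fp16_per_op   = 1.1;   /* pJ per FP16 multiply */
    double int_add_per_op = 0.03; /* pJ per integer add   */
    double density       = 0.67;  /* ~33% of weights are zero and cost nothing */
    return fp16_per_op / (density * int_add_per_op);
}
/* 1.1 / (0.67 * 0.03) ≈ 54.7 → roughly 55x less energy per layer */
```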

ARM NEON Assembly: BitNet Kernel

This is a detailed walkthrough of how BitNet's ternary matrix kernel works on ARM processors using NEON SIMD instructions (the listing below shows a GEMV-style inner loop, the building block of the full GEMM). Each 128-bit NEON register holds 16 bytes, so every iteration processes 16 elements simultaneously.

; Complete ARM64 NEON BitNet kernel with commentary
; ============================================
; BitNet Ternary GEMM — ARM64 NEON
; Processes 16 elements per SIMD iteration
; No floating-point instructions used
; ============================================

.global bitnet_ternary_gemv
.text
.align 4

bitnet_ternary_gemv:
    ; x0 = output ptr, x1 = activation ptr
    ; x2 = ternary weight ptr (one weight per byte)
    ; x3 = K (inner dim)
    ; Simplified listing: computes one output element
    ; (a single row's dot product); the full GEMV loops over rows

    ; Initialize accumulator registers to zero
    MOVI    v30.4s, #0          ; v30 = positive accumulator
    MOVI    v31.4s, #0          ; v31 = negative accumulator

    ; Prepare mask constants
    MOVI    v28.16b, #0x01     ; mask for bit 0 (positive)
    MOVI    v29.16b, #0x02     ; mask for bit 1 (negative)

.loop_k:
    ; Load 16 INT8 activations
    LD1     {v0.16b}, [x1], #16

    ; Load 16 ternary weights, one per byte
    ; Low 2 bits encode the value: 00=zero, 01=+1, 10=-1
    ; (a denser 4-weights-per-byte packing would need an unpack step)
    LD1     {v1.16b}, [x2], #16

    ; Extract positive mask (bit 0 set = weight is +1)
    AND     v2.16b, v1.16b, v28.16b
    CMEQ    v2.16b, v2.16b, v28.16b  ; all 1s where w=+1

    ; Extract negative mask (bit 1 set = weight is -1)
    AND     v3.16b, v1.16b, v29.16b
    CMEQ    v3.16b, v3.16b, v29.16b  ; all 1s where w=-1

    ; Conditional accumulate: add activations where w=+1
    AND     v4.16b, v0.16b, v2.16b   ; zero out non-positive
    SADDLP  v5.8h, v4.16b            ; pairwise add to 16-bit
    SADALP  v30.4s, v5.8h            ; accumulate to 32-bit

    ; Conditional accumulate: add activations where w=-1
    AND     v6.16b, v0.16b, v3.16b   ; zero out non-negative
    SADDLP  v7.8h, v6.16b
    SADALP  v31.4s, v7.8h

    ; w=0 weights: NO operation (free sparsity!)

    SUBS    x3, x3, #16
    B.GT    .loop_k

    ; Final: result = positives - negatives
    SUB     v30.4s, v30.4s, v31.4s
    ADDV    s0, v30.4s       ; horizontal sum
    STR     s0, [x0]            ; store result
    RET
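The mask trick at the heart of the kernel can be modeled per lane in scalar C (a sketch of what the AND/CMEQ/SADALP/SUB sequence computes, assuming one weight byte per activation byte):

```c
#include <stdint.h>

/* Per-lane model of the NEON masking: bit 0 of the weight byte marks +1,
 * bit 1 marks -1; the activation is kept, negated, or dropped. */
int32_t lane_contribution(uint8_t w, int8_t x) {
    uint8_t pos = (uint8_t)-((w & 0x01) == 0x01); /* 0xFF where w=+1 (CMEQ) */
    uint8_t neg = (uint8_t)-((w & 0x02) == 0x02); /* 0xFF where w=-1 (CMEQ) */
    int32_t acc_pos = (int8_t)(x & pos);          /* AND, then widen (SADALP) */
    int32_t acc_neg = (int8_t)(x & neg);
    return acc_pos - acc_neg;                     /* final SUB v30, v30, v31 */
}
```

A w=0 byte has neither bit set, so both masks are zero and the lane contributes nothing, which is exactly the "free sparsity" the kernel comment points out.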

NEON Register State Visualization

v0 — Activations (INT8 x 16)
v1 — Packed Weights (ternary x 16)
v2 — Positive Mask (+1 positions)
v3 — Negative Mask (-1 positions)
Result — Dot Product

x86 AVX-512: Server-Grade BitNet

; x86-64 AVX-512 BitNet kernel sketch (data center deployment)
; Processes 64 elements per iteration; zmm30/zmm31 hold the
; broadcast byte constants 0x01 and 0x02
vmovdqu8  zmm0, [rsi]          ; load 64 INT8 activations
vmovdqu8  zmm1, [rdx]          ; load 64 ternary weights (one per byte)

; Decode ternary weights into opmask registers
vptestmb  k1, zmm1, zmm30      ; k1: lanes where bit 0 set (w = +1)
vptestmb  k2, zmm1, zmm31      ; k2: lanes where bit 1 set (w = -1)

; Masked accumulate (512-bit wide!)
vpaddb    zmm4{k1}, zmm4, zmm0 ; add activations where w = +1
vpsubb    zmm5{k2}, zmm5, zmm0 ; sub activations where w = -1
; (byte accumulators saturate quickly; a real kernel widens
;  to 16/32-bit lanes before accumulating)

; 64 elements processed in ~1 clock cycle
; Throughput: ~200 GOPS at 3 GHz
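The throughput estimate in the final comment is simple arithmetic, taking the "one 64-lane iteration per cycle" assumption at face value:

```c
/* Back-of-envelope throughput: lanes per cycle times clock in GHz
 * gives giga-operations per second. */
double gops(int lanes_per_cycle, double clock_ghz) {
    return lanes_per_cycle * clock_ghz;
}
/* gops(64, 3.0) = 192 → the "~200 GOPS at 3 GHz" in the comment */
```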
