WebGL Performance Power-Up: Three.js, WASM, SIMD, and Lock-Free Concurrency
This post dives deep into optimizing WebGL performance using a powerful combination of technologies: Three.js, WebAssembly (WASM), Single Instruction, Multiple Data (SIMD), and lock-free concurrency techniques using atomic operations for thread-safe data sharing. We'll explore how each contributes to a faster and more efficient rendering pipeline.
Introduction
WebGL brings hardware-accelerated 3D graphics to the web browser. However, complex scenes and demanding calculations can quickly become performance bottlenecks. This is where leveraging technologies like WASM, SIMD, and efficient data structures becomes crucial. We'll explore how to use Three.js as a framework, WASM for performance-critical calculations, SIMD for parallel data processing, and lock-free techniques for thread-safe data sharing, including Lock Striping for high-contention scenarios.
Three.js: Your 3D Scene Orchestrator
Three.js is a popular JavaScript library that simplifies WebGL development. It provides a higher-level API for creating and manipulating 3D scenes, handling camera controls, lighting, and material properties.
// Example: Creating a basic Three.js scene
import * as THREE from 'three';
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera( 75, window.innerWidth / window.innerHeight, 0.1, 1000 );
const renderer = new THREE.WebGLRenderer();
renderer.setSize( window.innerWidth, window.innerHeight );
document.body.appendChild( renderer.domElement );
const geometry = new THREE.BoxGeometry( 1, 1, 1 );
const material = new THREE.MeshBasicMaterial( { color: 0x00ff00 } );
const cube = new THREE.Mesh( geometry, material );
scene.add( cube );
camera.position.z = 5;
function animate() {
requestAnimationFrame( animate );
cube.rotation.x += 0.01;
cube.rotation.y += 0.01;
renderer.render( scene, camera );
}
animate();
This example demonstrates the basic setup of a Three.js scene. We create a scene, camera, renderer, a cube geometry, a material, and then animate the cube's rotation. While Three.js handles many low-level WebGL details, performance-critical sections can benefit significantly from WASM optimization.
WebAssembly (WASM): Bringing Near-Native Performance to the Browser
WASM is a binary instruction format for a stack-based virtual machine. It allows you to run code written in languages like Rust in the browser at near-native speed. This is achieved by compiling Rust to WASM, which can then be loaded and executed by the browser's JavaScript engine.
Why WASM for WebGL?
- Performance: WASM code executes significantly faster than JavaScript, especially for computationally intensive tasks like physics simulations, complex calculations, and data processing.
- Memory Management: WASM provides more control over memory management compared to JavaScript's garbage collection, allowing for more efficient memory usage and reduced garbage collection pauses.
Example: A Simple WASM Module (Rust)
// Example: Simple Rust function to be compiled to WASM
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub fn add(a: f32, b: f32) -> f32 {
a + b
}
This Rust code defines a simple add function that can be compiled to WASM using wasm-bindgen.
Compiling to WASM with wasm-pack
wasm-pack is the recommended toolchain for building Rust-generated WebAssembly packages.
# Build the Rust project for WASM target
wasm-pack build --target web
This command compiles your Rust project to a WASM module and generates JavaScript bindings in the pkg/ directory. The --target web flag optimizes the output for use in web browsers.
Using WASM in JavaScript
// Example: Loading and using the WASM module in JavaScript
import init, { add } from './pkg/my_wasm_project.js';
async function run() {
try {
// Initialize the WASM module
await init();
// Call the add function
const result = add(5.0, 3.0);
console.log("Result from WASM:", result); // Output: Result from WASM: 8
} catch (error) {
console.error('Failed to initialize WASM module:', error);
// Fallback to JavaScript implementation
const result = 5.0 + 3.0;
console.log("Fallback result:", result);
}
}
run();
This code imports the generated JavaScript bindings from wasm-pack, initializes the WASM module, and then calls the add function. The wasm-bindgen library handles all the complexity of interfacing with WASM.
Architecture Diagram: WASM Integration
This diagram illustrates how JavaScript (using Three.js) interacts with a WASM module. The WASM loader in JavaScript fetches and instantiates the WASM module, which then executes native code compiled from Rust.
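To make that interaction concrete, here is a minimal sketch of a Three.js render loop that delegates its per-frame math to WASM. The rotate_angles export (and its signature) is a hypothetical stand-in for whatever hot path you have moved into Rust; only the init/import pattern matches the wasm-pack usage shown above.
// A minimal sketch: a Three.js render loop that calls into a wasm-pack module for
// its per-frame math. rotate_angles() is hypothetical; substitute your own hot path.
import * as THREE from 'three';
import init, { rotate_angles } from './pkg/my_wasm_project.js';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 0.1, 1000);
camera.position.z = 5;
const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);
const cube = new THREE.Mesh(new THREE.BoxGeometry(1, 1, 1), new THREE.MeshBasicMaterial({ color: 0x00ff00 }));
scene.add(cube);

await init(); // instantiate the WASM module once, before the render loop starts

function animate() {
  requestAnimationFrame(animate);
  // WASM computes the new rotation; JavaScript only applies it to the scene graph.
  const [rx, ry] = rotate_angles(cube.rotation.x, cube.rotation.y, 0.016);
  cube.rotation.x = rx;
  cube.rotation.y = ry;
  renderer.render(scene, camera);
}
animate();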
SIMD: Parallel Data Processing
SIMD (Single Instruction, Multiple Data) is a type of parallel processing that allows a single instruction to operate on multiple data elements simultaneously. This can significantly improve performance for tasks that involve processing large amounts of data, such as vertex manipulation, pixel processing, and physics simulations.
WASM SIMD
WASM supports SIMD instructions, enabling you to write code that takes advantage of parallel processing capabilities. This is particularly useful for vector and matrix operations common in 3D graphics.
Example: Using SIMD in WASM (Rust)
use wasm_bindgen::prelude::*;
use std::arch::wasm32::*;
#[wasm_bindgen]
pub fn add_vectors(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "input slices must have the same length");
    let mut result = Vec::with_capacity(a.len());
    // Process 4 floats at a time using SIMD
    let mut i = 0;
    while i + 4 <= a.len() {
        unsafe {
            let va = v128_load(a.as_ptr().add(i) as *const v128);
            let vb = v128_load(b.as_ptr().add(i) as *const v128);
            let vr = f32x4_add(va, vb);
            let temp: [f32; 4] = std::mem::transmute(vr);
            result.extend_from_slice(&temp);
        }
        i += 4;
    }
    // Handle the remaining tail elements (fewer than four) one at a time
    for j in i..a.len() {
        result.push(a[j] + b[j]);
    }
    result
}
This Rust code uses WASM SIMD intrinsics to add two arrays of floats in parallel. The v128_load function loads 128-bit vectors (containing four 32-bit floats), and f32x4_add performs a parallel addition of the vectors.
Compiling with SIMD Support
To enable SIMD support in Rust, add this to your Cargo.toml:
[package]
name = "my_wasm_project"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["cdylib"]
[dependencies]
wasm-bindgen = "0.2"
[profile.release]
opt-level = 3
lto = true
Then build with:
RUSTFLAGS="-C target-feature=+simd128" wasm-pack build --target web --release
The target-feature=+simd128 flag enables SIMD support.
Using SIMD in JavaScript
// Example: Calling the SIMD function from JavaScript with wasm-bindgen
import init, { add_vectors } from './pkg/my_wasm_project.js';
async function run() {
// Initialize the WASM module
await init();
// Input arrays
const a = new Float32Array([1, 2, 3, 4, 5, 6, 7, 8]);
const b = new Float32Array([9, 10, 11, 12, 13, 14, 15, 16]);
// Call the WASM SIMD function - wasm-bindgen handles memory automatically!
const result = add_vectors(a, b);
console.log("SIMD Result:", result);
// Output: SIMD Result: [10, 12, 14, 16, 18, 20, 22, 24]
}
run();
Important: wasm-bindgen handles the memory management for you: it copies the input typed arrays into WASM linear memory, runs the function, and copies the result back out. That is much easier than manual memory management, though the copies are not free, so keep an eye on data sizes in very hot paths.
Performance Considerations for SIMD
- Data Alignment: SIMD instructions often require data to be aligned in memory. Ensure that your data is properly aligned to maximize performance. Production Tip: in Rust, #[repr(align(16))] forces 16-byte alignment of a type (note that it aligns the struct itself; a Vec's heap buffer needs an aligned allocation or fixed-size chunks):
#[repr(align(16))]
struct AlignedVec {
    data: Vec<f32>,
}
- Vectorization: Not all code can be easily vectorized. Carefully analyze your code to identify sections that can benefit from SIMD.
- Browser Support: While WASM SIMD is widely supported, it's always a good idea to check for browser compatibility and provide fallback mechanisms if necessary.
Browser Compatibility for WASM SIMD
| Browser | WASM SIMD Support | Minimum Version |
|---|---|---|
| Chrome | Yes | 91+ (May 2021) |
| Firefox | Yes | 89+ (June 2021) |
| Safari | Yes | 16.4+ (March 2023) |
| Edge | Yes | 91+ (May 2021) |
Feature Detection:
The most reliable way to detect WASM SIMD support is using the wasm-feature-detect library:
npm install wasm-feature-detect
import { simd } from 'wasm-feature-detect';
async function checkSIMDSupport() {
const simdSupported = await simd();
if (!simdSupported) {
console.warn('WASM SIMD not supported, falling back to standard WASM');
// Load non-SIMD version of your code
return false;
}
console.log('WASM SIMD is supported! 🚀');
return true;
}
checkSIMDSupport();
Alternative: Manual Detection (without libraries):
async function detectWasmSIMD() {
try {
// Tiny WASM module whose body uses SIMD instructions; compilation throws if SIMD is unsupported
const simdModule = new WebAssembly.Module(
new Uint8Array([
0, 97, 115, 109, 1, 0, 0, 0, // WASM header
1, 5, 1, 96, 0, 1, 123, // Type section (function returns v128)
3, 2, 1, 0, // Function section
10, 10, 1, 8, 0, // Code section
65, 0, // i32.const 0
253, 15, // SIMD instruction (0xfd-prefixed opcode)
253, 98, 11 // SIMD instruction, end
])
);
return true;
} catch (e) {
return false;
}
}
const simdSupported = await detectWasmSIMD();
if (!simdSupported) {
console.warn('WASM SIMD not supported');
}
Multithreading with SharedArrayBuffer and Atomics
In multithreaded WebGL applications (e.g., using Web Workers), efficient and thread-safe data sharing is crucial. SharedArrayBuffer provides shared memory between workers, and atomic operations ensure thread-safe access to that memory.
SharedArrayBuffer: The Foundation
SharedArrayBuffer creates a block of memory that can be accessed by multiple Web Workers simultaneously. This is essential for parallel WebGL processing.
// Create shared memory (16 integers)
const sharedBuffer = new SharedArrayBuffer(16 * Int32Array.BYTES_PER_ELEMENT);
const sharedArray = new Int32Array(sharedBuffer);
// Share with workers
worker1.postMessage({ buffer: sharedBuffer });
worker2.postMessage({ buffer: sharedBuffer });
JavaScript Atomics API: Simple Synchronization
For most applications, JavaScript's built-in Atomics API is sufficient for thread-safe operations:
// In Worker 1: Safely increment a counter
Atomics.add(sharedArray, 0, 1);
// In Worker 2: Wait for a signal
Atomics.wait(sharedArray, 1, 0); // Sleep while index 1 still holds 0 (wakes on notify)
// In Main Thread: Send signal
Atomics.store(sharedArray, 1, 1);
Atomics.notify(sharedArray, 1); // Wake up waiting workers
This is what you should use for 90% of use cases! It's simple, safe, and doesn't require WASM.
Practical Example: Parallel Particle Physics
Here's a realistic example of using SharedArrayBuffer with Web Workers for WebGL particle updates:
// Main Thread: Setup
const particleCount = 10000;
const floatsPerParticle = 6; // x, y, z, vx, vy, vz
const sharedBuffer = new SharedArrayBuffer(
particleCount * floatsPerParticle * Float32Array.BYTES_PER_ELEMENT
);
const particles = new Float32Array(sharedBuffer);
// Initialize particles
for (let i = 0; i < particleCount; i++) {
const offset = i * floatsPerParticle;
particles[offset] = Math.random() * 100; // x
particles[offset + 1] = Math.random() * 100; // y
particles[offset + 2] = Math.random() * 100; // z
// ... velocities
}
// Spawn workers to update different particle ranges
const workerCount = 4;
const particlesPerWorker = Math.floor(particleCount / workerCount);
for (let i = 0; i < workerCount; i++) {
const worker = new Worker('particle-worker.js');
worker.postMessage({
buffer: sharedBuffer,
startIndex: i * particlesPerWorker,
endIndex: (i + 1) * particlesPerWorker,
});
}
// In your render loop
function animate() {
// Copy particles to WebGL buffer
gl.bindBuffer(gl.ARRAY_BUFFER, particleBuffer);
gl.bufferSubData(gl.ARRAY_BUFFER, 0, particles);
// Render particles
gl.drawArrays(gl.POINTS, 0, particleCount);
requestAnimationFrame(animate);
}
// particle-worker.js: Worker updates its particle range
self.onmessage = (e) => {
const { buffer, startIndex, endIndex } = e.data;
const particles = new Float32Array(buffer);
const floatsPerParticle = 6;
// Update loop
setInterval(() => {
for (let i = startIndex; i < endIndex; i++) {
const offset = i * floatsPerParticle;
// Update position based on velocity
particles[offset] += particles[offset + 3] * 0.016; // x += vx * dt
particles[offset + 1] += particles[offset + 4] * 0.016; // y += vy * dt
particles[offset + 2] += particles[offset + 5] * 0.016; // z += vz * dt
// Simple boundary check
if (particles[offset] > 100) particles[offset + 3] *= -1;
}
}, 16); // ~60 FPS
};
Key Benefits:
- No locks needed (each worker updates different particles)
- No WASM required
- Simple and maintainable
- Scales to multiple workers easily
When you need atomics: If multiple workers need to access the same data (e.g., a shared collision grid), use Atomics.compareExchange:
// Atomic increment for collision counter using CAS (Compare-And-Swap)
let oldValue, newValue;
do {
oldValue = Atomics.load(sharedArray, collisionCountIndex);
newValue = oldValue + 1;
} while (Atomics.compareExchange(sharedArray, collisionCountIndex, oldValue, newValue) !== oldValue);
// Or simpler: use Atomics.add for increment
Atomics.add(sharedArray, collisionCountIndex, 1);
Advanced: Lock Striping for High-Contention Scenarios
⚠️ Most applications don't need this! Only consider Lock Striping if you have many workers with high contention on shared resources.
Lock Striping is an advanced technique that uses atomic compare-and-swap (CAS) operations to manage fine-grained locks efficiently.
Load-Linked/Store-Conditional (LL/SC) and WASM Atomics
LL/SC are a pair of atomic instructions found in some CPU architectures. Load-Linked loads a value from memory, and Store-Conditional attempts to store a new value to the same memory location. The Store-Conditional succeeds only if the memory location has not been modified since the Load-Linked. This allows for atomic updates without explicit locking.
Important Note: WASM doesn't directly expose LL/SC instructions. Instead, WASM provides atomic operations through its atomics proposal, which includes:
- i32.atomic.load / i64.atomic.load - Atomic reads
- i32.atomic.store / i64.atomic.store - Atomic writes
- i32.atomic.rmw.cmpxchg - Compare-and-swap (CAS), which provides similar semantics to LL/SC
- Other atomic read-modify-write operations
The Compare-And-Swap (CAS) operation can be used to implement lock-free data structures with similar properties to LL/SC-based approaches.
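As a concrete illustration of that equivalence, here is a small sketch using the JavaScript Atomics API (the same retry pattern applies to WASM's i32.atomic.rmw.cmpxchg). Atomics has no built-in "atomic max", so the helper reads the current value, computes the replacement, and commits it only if the slot has not changed in the meantime, which is exactly the role LL/SC plays on hardware that has it. The atomicMax helper is hypothetical, not a built-in.
// Sketch: a CAS retry loop standing in for LL/SC. atomicMax() atomically raises
// a shared slot to a new maximum.
function atomicMax(int32View, index, candidate) {
  while (true) {
    const current = Atomics.load(int32View, index);          // "load-linked": observe the value
    if (candidate <= current) return current;                 // nothing to do
    const witnessed = Atomics.compareExchange(int32View, index, current, candidate);
    if (witnessed === current) return candidate;              // "store-conditional" succeeded
    // Another thread modified the slot first; loop and retry with the fresh value.
  }
}

// Usage: several workers recording the largest batch size they have seen.
const stats = new Int32Array(new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT));
atomicMax(stats, 0, 42);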
What is Lock Striping?
Lock Striping is a concurrency pattern that uses a table of fine-grained locks (or "stripes"), each protecting a subset of the shared data. When a thread needs to access a shared resource, it acquires the lock associated with that resource's partition using atomic CAS operations. The partitioning reduces contention compared to a single global lock.
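The core of the pattern is the mapping from a resource to its stripe. The sketch below uses the JavaScript Atomics API for brevity (the Rust/WASM lock table later in this post follows the same idea); the hash function and stripe count are illustrative choices.
// Sketch: choosing a stripe for a resource. Each stripe guards a subset of resources,
// so two resources that land on different stripes never contend with each other.
const STRIPE_COUNT = 16; // power of two keeps the index mask cheap
const stripeLocks = new Int32Array(new SharedArrayBuffer(STRIPE_COUNT * Int32Array.BYTES_PER_ELEMENT));

function stripeFor(resourceId) {
  // Any hash that spreads ids evenly works; this is a simple integer mix.
  const h = Math.imul(resourceId ^ (resourceId >>> 16), 0x45d9f3b);
  return (h ^ (h >>> 16)) & (STRIPE_COUNT - 1);
}

// A worker locks only the stripe that covers its resource:
const stripe = stripeFor(42);
if (Atomics.compareExchange(stripeLocks, stripe, 0, 1) === 0) { // CAS try-lock: 0 = free, 1 = held
  try {
    // ... read or modify the resources guarded by this stripe ...
  } finally {
    Atomics.store(stripeLocks, stripe, 0); // release the stripe
  }
}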
Why Lock Striping for WebGL?
- Thread Safety: Ensures that multiple Web Workers can access and modify shared WebGL resources (e.g., vertex buffers, textures) without data corruption.
- Reduced Contention: Lock Striping reduces contention compared to traditional locks by dividing the shared data into smaller, independently protected regions.
- Low Overhead: CAS-based locks avoid the overhead associated with traditional locks, such as context switching and mutex operations.
When to Use Lock Striping (Practical Considerations)
Lock Striping is beneficial when:
- You have high contention (many workers accessing shared resources frequently)
- You need fine-grained locking across many independent resources
- Your application has complex multi-threaded WebGL workloads
Simpler alternatives may be better for:
- Single-threaded or simple dual-threaded applications → Use standard JavaScript
- Low contention scenarios → JavaScript's Atomics API with SharedArrayBuffer is sufficient
- Coarse-grained data sharing → Message passing between workers may be simpler and safer (see the sketch at the end of this subsection)
Reality Check: Most WebGL applications don't need Lock Striping. Consider using:
- OffscreenCanvas for simple worker-based rendering
- Atomics.wait/notify for basic synchronization
- Lock-free message queues for coordination
Only implement Lock Striping if profiling shows lock contention is a bottleneck.
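For comparison, here is what the message-passing alternative mentioned above looks like: a plain transferable ArrayBuffer handed back and forth between the main thread and a worker. It needs no SharedArrayBuffer, no cross-origin isolation, and no locks; the worker file name and the toy update are illustrative.
// main.js: hand the buffer to the worker, get it back with results.
const worker = new Worker('physics-worker.js');
const positions = new Float32Array(10000 * 3);
worker.postMessage({ positions }, [positions.buffer]); // transfer ownership (zero-copy)
worker.onmessage = (e) => {
  const updated = e.data.positions;                     // ownership returns with the result
  // ... upload `updated` to your WebGL buffer or Three.js attribute ...
};

// physics-worker.js: do the work, transfer the buffer back.
self.onmessage = (e) => {
  const positions = e.data.positions;
  for (let i = 0; i < positions.length; i += 3) {
    positions[i + 1] -= 0.1;                            // toy update: gravity on y
  }
  self.postMessage({ positions }, [positions.buffer]);
};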
Architecture Diagram: Lock Striping in a Multithreaded WebGL Context
This diagram shows how Lock Striping is used to protect shared WebGL resources accessed by multiple Web Workers. Each worker interacts with the striped lock table to acquire a lock before accessing a specific resource. The lock table uses CAS operations to ensure atomic updates.
Example: Lock Striping Implementation with WASM Atomics
Here's a more realistic implementation using WASM atomics and SharedArrayBuffer:
Rust WASM Module (lib.rs):
use wasm_bindgen::prelude::*;
use std::sync::atomic::{AtomicI32, Ordering};
#[wasm_bindgen]
pub struct LockTable {
locks: Vec<AtomicI32>,
}
#[wasm_bindgen]
impl LockTable {
#[wasm_bindgen(constructor)]
pub fn new(size: usize) -> Self {
let mut locks = Vec::with_capacity(size);
for _ in 0..size {
locks.push(AtomicI32::new(0));
}
Self { locks }
}
/// Try to acquire lock using Compare-And-Swap (CAS)
/// Returns true if successful (lock acquired)
#[wasm_bindgen(js_name = tryAcquire)]
pub fn try_acquire(&self, index: usize) -> bool {
if index >= self.locks.len() {
return false;
}
// Atomic compare-and-exchange: swap 0 (unlocked) to 1 (locked)
self.locks[index]
.compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst)
.is_ok()
}
/// Release the lock
#[wasm_bindgen]
pub fn release(&self, index: usize) {
if index < self.locks.len() {
self.locks[index].store(0, Ordering::SeqCst);
}
}
}
JavaScript Usage:
// Setup: Initialize WASM module with wasm-bindgen
import init, { LockTable } from './pkg/my_wasm_project.js';
async function setupStripedLocks() {
await init();
const LOCK_TABLE_SIZE = 16;
const lockTable = new LockTable(LOCK_TABLE_SIZE);
return lockTable;
}
// Usage in Web Worker
async function workerMain() {
const stripedLocks = await setupStripedLocks();
const resourceIndex = 5;
const maxRetries = 100;
// Try to acquire lock with exponential backoff
let acquired = false;
for (let i = 0; i < maxRetries && !acquired; i++) {
acquired = stripedLocks.tryAcquire(resourceIndex);
if (!acquired) {
// Exponential backoff with max delay
await new Promise(resolve =>
setTimeout(resolve, Math.min(10 * Math.pow(2, i), 100))
);
}
}
if (acquired) {
try {
// Access and modify shared WebGL resource safely
console.log("Resource accessed safely!");
// ... perform operations on shared resource ...
} finally {
// Always release the lock
stripedLocks.release(resourceIndex);
}
} else {
console.error("Failed to acquire lock after retries");
}
}
// Run in a Web Worker
workerMain();
Key Points:
- This uses Rust's atomic operations which compile to WASM atomic instructions
- AtomicI32 with compare-and-swap (CAS) provides thread-safe, lock-free synchronization
- wasm-bindgen makes it easy to expose Rust structs and methods to JavaScript
- For the lock table to actually be shared across workers, the WASM module must be built with threads and shared memory enabled so that every worker imports the same WebAssembly.Memory; otherwise each worker gets its own private copy of the locks
- Exponential backoff handles contention gracefully
SharedArrayBuffer Browser Compatibility
Critical Requirement: Lock Striping requires SharedArrayBuffer, which has strict security requirements:
| Browser | Support | Requirements |
|---|---|---|
| Chrome | 68+ | Cross-Origin Isolation |
| Firefox | 79+ | Cross-Origin Isolation |
| Safari | 15.2+ | Cross-Origin Isolation |
| Edge | 79+ | Cross-Origin Isolation |
Cross-Origin Isolation Headers Required:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Without these headers, SharedArrayBuffer will be unavailable!
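One way to serve those headers during local development is shown below, assuming a small Node/Express static server; the framework choice and port are illustrative, and any server or CDN that lets you set response headers works the same way.
// serve.mjs: a minimal static server that enables cross-origin isolation.
import express from 'express';

const app = express();
app.use((req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
});
app.use(express.static('dist')); // your built Three.js/WASM app
app.listen(8080, () => console.log('Cross-origin isolated app on http://localhost:8080'));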
Feature Detection with Helpful Error Messages:
// Check for SharedArrayBuffer support with detailed diagnostics
function checkSharedArrayBufferSupport() {
if (typeof SharedArrayBuffer === 'undefined') {
console.error('❌ SharedArrayBuffer not available!');
console.log('📋 Possible reasons:');
console.log(' 1. Missing Cross-Origin Isolation headers');
console.log(' 2. Check your server configuration for:');
console.log(' Cross-Origin-Opener-Policy: same-origin');
console.log(' Cross-Origin-Embedder-Policy: require-corp');
console.log(' 3. Some browsers disable it in incognito/private mode');
// Check if cross-origin isolated
if (typeof crossOriginIsolated !== 'undefined') {
console.log(` crossOriginIsolated: ${crossOriginIsolated}`);
}
return false;
}
console.log('✅ SharedArrayBuffer is available!');
return true;
}
// Use it before initializing workers
if (!checkSharedArrayBufferSupport()) {
// Fall back to message passing or Web Workers without shared memory
console.warn('Falling back to message passing between workers');
}
Using wasm-feature-detect for comprehensive checks:
import { simd, threads, bulkMemory } from 'wasm-feature-detect';
async function checkAllWasmFeatures() {
const features = {
simd: await simd(),
threads: await threads(),
bulkMemory: await bulkMemory(),
sharedArrayBuffer: typeof SharedArrayBuffer !== 'undefined',
crossOriginIsolated: typeof crossOriginIsolated !== 'undefined'
? crossOriginIsolated
: false
};
console.table(features);
if (!features.sharedArrayBuffer || !features.crossOriginIsolated) {
console.error('Multi-threading not available: Missing SharedArrayBuffer or Cross-Origin Isolation');
return false;
}
return true;
}
Important Considerations for Lock Striping Implementation:
- WASM and Atomics: For hot loops that already run inside WASM, using WASM's atomic instructions avoids crossing the JS/WASM boundary; when the surrounding code is JavaScript, the built-in Atomics API is usually fast enough.
- Contention Handling: Implement a strategy for handling contention, such as retrying the lock acquisition after a short delay (exponential backoff).
- Memory Barriers: Use memory barriers to ensure proper memory ordering and prevent race conditions.
- Stripe Count: The number of stripes should be chosen carefully to balance contention and memory overhead.
Flowchart: Lock Striping Acquisition
This flowchart illustrates the Lock Striping acquisition process. A thread attempts to acquire a lock by checking its availability. If the lock is available, the thread attempts to set it using CAS (Compare-And-Swap). If the CAS succeeds, the thread has acquired the lock and can access the shared resource. If the CAS fails (another thread acquired it first), the thread retries.
Putting it all Together: A High-Performance WebGL Pipeline
By combining Three.js, WASM, SIMD, and Lock Striping, you can create a high-performance WebGL pipeline; a sketch tying the pieces together follows the list below.
- Scene Management (Three.js): Use Three.js to manage the overall scene structure, camera controls, and rendering loop.
- Performance-Critical Calculations (WASM): Offload computationally intensive tasks, such as physics simulations, vertex transformations, and custom shaders, to WASM.
- Parallel Data Processing (SIMD): Leverage SIMD instructions in WASM to accelerate vector and matrix operations, pixel processing, and other data-parallel tasks.
- Thread-Safe Data Sharing (Lock Striping): Use Lock Striping to protect shared WebGL resources (e.g., vertex buffers, textures) accessed by multiple Web Workers, ensuring thread safety and reducing contention.
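Here is the promised sketch of how these pieces fit together at runtime: a worker (plain JavaScript, or Rust/WASM with SIMD for the heavy math) writes particle positions into shared memory, and the main thread copies the latest values into a Three.js geometry each frame. The worker file name and the three-floats-per-particle layout are illustrative; copying into a non-shared Float32Array before upload is a defensive choice, since some WebGL implementations reject typed-array views backed by SharedArrayBuffer.
// Sketch: main thread of a combined pipeline (Three.js rendering + worker simulation).
import * as THREE from 'three';

const PARTICLE_COUNT = 10000;
const sharedBuffer = new SharedArrayBuffer(PARTICLE_COUNT * 3 * Float32Array.BYTES_PER_ELEMENT);
const sharedPositions = new Float32Array(sharedBuffer);        // written by the worker
const uploadPositions = new Float32Array(PARTICLE_COUNT * 3);  // copied here before GPU upload

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 0.1, 1000);
camera.position.z = 150;
const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

const geometry = new THREE.BufferGeometry();
geometry.setAttribute('position', new THREE.BufferAttribute(uploadPositions, 3));
scene.add(new THREE.Points(geometry, new THREE.PointsMaterial({ size: 0.5 })));

// The worker owns the simulation; internally it could call into WASM + SIMD.
const worker = new Worker('position-worker.js'); // hypothetical worker that fills sharedPositions
worker.postMessage({ buffer: sharedBuffer, count: PARTICLE_COUNT });

function animate() {
  requestAnimationFrame(animate);
  uploadPositions.set(sharedPositions);             // pull the latest simulation results
  geometry.attributes.position.needsUpdate = true;  // flag the attribute for re-upload
  renderer.render(scene, camera);
}
animate();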
When NOT to Use These Optimizations
WASM: Skip if
- Your code is I/O bound (waiting on network, disk) rather than CPU bound
- Operations are already fast enough in JavaScript (< 16ms per frame)
- Code is rarely executed (one-time initialization)
- Overhead of memory copying between JS and WASM exceeds performance gains
SIMD: Skip if
- Data sets are too small (< 1000 elements) – overhead dominates
- Operations can't be vectorized (heavy branching, unpredictable access patterns)
- Browser support is a concern and fallbacks add too much complexity
Lock Striping/Multithreading: Skip if
- Your app is simple enough for single-threaded execution
- Communication overhead between workers exceeds parallelization benefits
- You can achieve 60fps without it
- Debugging complexity outweighs performance gains
Golden Rule: Profile first, optimize later. Don't add complexity unless measurements prove it's necessary.
Performance Benchmarks (Typical Gains)
Based on real-world WebGL applications:
| Optimization | Scenario | Performance Gain | When to Use |
|---|---|---|---|
| WASM (no SIMD) | Physics simulation (10K particles) | 2-3x faster | CPU-intensive calculations |
| WASM + SIMD | Vertex transformations (100K vertices) | 4-8x faster | Large-scale parallel data |
| SIMD | Matrix operations (4x4, 10K ops/frame) | 3-5x faster | Linear algebra heavy loads |
| Web Workers | Particle system + rendering | 1.5-2x faster | Async compute without blocking render |
| Lock Striping | 4+ workers, high contention | 20-40% faster vs traditional locks | Complex multi-threaded apps |
Notes:
- Gains vary based on hardware, browser, and workload characteristics
- SIMD performance depends heavily on data alignment and memory access patterns
- Multithreading benefits plateau after 4-6 workers due to coordination overhead
Performance Considerations and Optimization Strategies
- Profiling First: Use browser developer tools (Chrome DevTools Performance tab) to identify bottlenecks before optimizing.
- Memory Management: Optimize memory usage in both JavaScript and WASM to reduce garbage collection pauses.
- Shader Optimization: Write efficient shaders that minimize the number of calculations performed per pixel.
- Level of Detail (LOD): Use LOD techniques to reduce the complexity of distant objects.
- Occlusion Culling: Cull objects that are not visible to the camera.
- Texture Compression: Use texture compression to reduce texture memory usage and improve loading times.
- Batching: Batch draw calls to reduce the overhead of WebGL API calls.
- Instancing: Use instancing to render multiple copies of the same object with different transformations.
- Budget Your Frame Time: Aim for 16ms per frame (60 FPS). If compute takes > 8-10ms, consider WASM (see the measurement sketch below).
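A quick way to apply that budgeting rule is to time the CPU-side work inside the render loop, as sketched below; updateSimulation() is a hypothetical stand-in for your per-frame computation, and renderer, scene, and camera come from the Three.js setup shown at the start of this post.
// Sketch: measuring the per-frame compute budget before reaching for WASM.
let totalMs = 0;
let frames = 0;

function updateSimulation() {
  // ... per-frame CPU work: physics, culling, animation blending ...
}

function animate() {
  requestAnimationFrame(animate);
  const t0 = performance.now();
  updateSimulation();
  totalMs += performance.now() - t0;
  frames++;
  if (frames === 120) { // report roughly every two seconds at 60 FPS
    const avg = totalMs / frames;
    console.log(`average compute: ${avg.toFixed(2)} ms/frame`);
    if (avg > 8) console.warn('Compute budget exceeded; consider a worker or WASM for this work.');
    totalMs = 0;
    frames = 0;
  }
  renderer.render(scene, camera);
}
animate();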
Key Takeaways
What We Covered
- Three.js provides a solid foundation for WebGL development with a high-level API
- Rust + WASM delivers 2-8x performance gains for CPU-intensive calculations (when used correctly)
- SIMD accelerates parallel data operations by 3-5x, especially effective for linear algebra
- SharedArrayBuffer + Atomics enables simple, effective multithreading (use this first!)
- Lock Striping offers advanced fine-grained synchronization (only for high-contention scenarios)
Critical Lessons
- Profile Before Optimizing: Don't add WASM/SIMD complexity unless profiling shows a clear need
- Start with JavaScript: SharedArrayBuffer + Atomics is sufficient for most parallel processing
- Memory Management Matters: Copying data between JS and WASM can negate performance gains
- Browser Compatibility: WASM SIMD is well-supported, but SharedArrayBuffer requires Cross-Origin Isolation
- Lock Striping is Rarely Needed: Only 5-10% of applications actually need advanced locking techniques
- Measure Everything: Actual performance gains vary widely based on hardware and workload
Practical Implementation Path
- Start: Build with Three.js and vanilla JavaScript
- Profile: Identify CPU-bound bottlenecks (> 8-10ms)
- Optimize Incrementally:
- Level 1: Use Web Workers + SharedArrayBuffer + Atomics (easiest, biggest wins)
- Level 2: Move hot paths to Rust + WASM (for CPU-intensive algorithms)
- Level 3: Add SIMD to WASM code (for vectorizable operations)
- Level 4: Only add Lock Striping if profiling shows lock contention (very rare!)
- Measure: Verify each optimization delivers measurable improvement
Technology Decision Matrix
| Technique | Use When | Complexity | Performance Gain |
|---|---|---|---|
| Web Workers + SharedArrayBuffer | Need parallelism | ⭐ Low | 2-4x (multi-core) |
| Rust + WASM | CPU-bound algorithms | ⭐⭐ Medium | 2-8x (vs JS) |
| SIMD | Vectorizable math | ⭐⭐⭐ High | 3-5x (vs scalar) |
| Lock Striping | High lock contention | ⭐⭐⭐⭐ Very High | 10-30% (vs locks) |
When You Actually Need This Stack
SharedArrayBuffer + Atomics (90% of cases):
- Multi-threaded particle systems
- Async physics simulations
- Worker-based procedural generation
Rust + WASM (10% of cases):
- Complex algorithms (pathfinding, fluid simulation)
- Large dataset processing
- Cryptography or compression
SIMD (5% of cases):
- Matrix/vector operations at scale
- Image processing pipelines
- Audio DSP
Lock Striping (1% of cases):
- 8+ workers with shared resource pools
- Real-time multiplayer game engines
- High-frequency trading visualization
For most WebGL applications, Web Workers + SharedArrayBuffer is all you need.
Remember: Complexity is a liability. Only add advanced optimizations when measurements justify the cost.