Benchmark Methodology¶

This document explains how Spikard benchmarks are designed, executed, and analyzed to provide meaningful, reproducible performance measurements.

Design Principles¶

1. Fairness and Consistency¶

All frameworks are tested under identical conditions:

Same hardware: All benchmarks run on the same machine during a single session
Same workloads: Identical HTTP requests (headers, bodies, parameters) across frameworks
Same configuration: Duration (30s default), concurrency (100 default), warmup (3s)
Same tooling: Single load generator (oha or bombardier) for all frameworks

2. Realistic Workloads¶

Benchmarks test real-world HTTP patterns rather than synthetic microbenchmarks:

Small JSON bodies (86 bytes): Typical API responses
Medium JSON bodies (5 KB): Detailed resource representations
Large and very large JSON bodies (52 KB and 150 KB): Bulk data transfers, file metadata
Path parameters: RESTful resource identifiers (/users/{id})
Query parameters: Filtering, pagination, search (?page=1&limit=20&sort=name)
Form data: Traditional web form submissions
Multipart uploads: File uploads from 1 KB to 100 KB

3. Statistical Rigor¶

Results undergo statistical analysis to distinguish meaningful differences from random variance:

Welch's t-test: Compares mean performance between frameworks
Significance threshold: p-value < 0.05 (configurable)
Effect size: Cohen's d measures practical significance
Sample size: 30+ second benchmarks provide thousands of data points

4. Reproducibility¶

Every result includes metadata for exact reproduction:

Git commit hash and branch
Host specifications (CPU model, core count, RAM)
Framework and runtime versions
Benchmark configuration parameters
Timestamp in ISO 8601 format
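As a rough illustration of the metadata listed above, the sketch below gathers most of these fields with the Python standard library. The function names and dictionary keys are hypothetical and not taken from Spikard's tooling. RAM is omitted because the standard library has no portable way to read it.

```python
import os
import platform
import subprocess
from datetime import datetime, timezone


def git_output(*args: str) -> str:
    """Return trimmed output of a git command, or "unknown" if it fails."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def collect_metadata(config: dict) -> dict:
    """Gather reproduction metadata (key names are illustrative only)."""
    return {
        "git_commit": git_output("rev-parse", "HEAD"),
        "git_branch": git_output("rev-parse", "--abbrev-ref", "HEAD"),
        "cpu_model": platform.processor() or platform.machine(),
        "cpu_cores": os.cpu_count(),
        "python_version": platform.python_version(),
        "config": config,
        # Timezone-aware UTC timestamp in ISO 8601 format.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the benchmark configuration alongside the host details keeps each result file self-describing.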

Workload Categories¶

JSON Bodies¶

Tests JSON serialization and deserialization performance across payload sizes.

Workload	Size	Description
`json-small`	86 bytes	Minimal API response: `{"id": 123, "name": "test", "active": true}`
`json-medium`	5 KB	User profile with nested objects and arrays
`json-large`	52 KB	Product catalog with multiple items
`json-very-large`	150 KB	Bulk data export or detailed analytics payload

What this measures: Serialization overhead, memory allocation patterns, GC pressure from large object creation.

Path Parameters¶

Tests URL path parsing and parameter extraction.

Workload	Pattern	Description
`path-simple`	`/users/{id}`	Single parameter extraction
`path-multiple`	`/orgs/{org}/repos/{repo}`	Multiple parameters
`path-deep`	`/a/{a}/b/{b}/c/{c}`	Deep nesting
`path-int`	`/numbers/{value}`	Integer type validation
`path-uuid`	`/items/{id}`	UUID format validation
`path-date`	`/events/{date}`	Date parsing

What this measures: Regex matching efficiency, parameter type conversion, routing table lookups.

Query Parameters¶

Tests query string parsing with varying parameter counts.

Workload	Parameters	Description
`query-few`	3	`?page=1&limit=20&sort=name`
`query-medium`	8	Typical filtering query
`query-many`	15+	Complex search with many filters

What this measures: Query string parser performance, parameter validation overhead, memory allocation from string splitting.
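The parsing and validation steps named above can be sketched with the standard library. The `parse_query` helper and its `types` argument are assumptions made for illustration. Real frameworks typically use their own parsers and schema layers.

```python
from urllib.parse import parse_qsl


def parse_query(query: str, types: dict[str, type]) -> dict:
    """Split a raw query string and coerce each value to its declared type."""
    params = {}
    # lstrip allows passing the string with or without the leading "?".
    for key, value in parse_qsl(query.lstrip("?"), keep_blank_values=True):
        caster = types.get(key, str)  # undeclared parameters stay strings
        params[key] = caster(value)
    return params


params = parse_query("?page=1&limit=20&sort=name", {"page": int, "limit": int})
```

Each parameter costs one string split and, when typed, one conversion. This is why the larger workloads stress both the parser and the allocator.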

Form Data¶

Tests URL-encoded form parsing (common in traditional web apps).

Workload	Fields	Description
`form-simple`	4	Basic login form
`form-complex`	12	User registration with address fields

What this measures: URL-encoded body parsing, percent-decoding, and field extraction. Multipart handling is covered by the multipart upload workloads.

Multipart Uploads¶

Tests file upload handling with different file sizes.

Workload	File Size	Description
`multipart-small`	1 KB	Small text file
`multipart-medium`	10 KB	Small image
`multipart-large`	100 KB	Document upload

What this measures: Streaming performance, memory buffering, temporary file handling.

Validation Overhead Analysis¶

A key goal of Spikard benchmarks is to measure the cost of runtime type validation.

Paired Implementations¶

For selected frameworks, we maintain two variants:

Validated: Full type checking with schema validation (msgspec, Zod, Pydantic)
Raw: Direct JSON parsing with no validation (json.loads(), JSON.parse())

Examples:

fastapi vs fastapi-raw
express vs express-raw
spikard-python vs spikard-raw

Overhead Calculation¶

validation_overhead = (raw_rps - validated_rps) / raw_rps * 100

Typical results:

FastAPI with Pydantic: ~40% overhead
Spikard with msgspec: ~15% overhead
Express with Zod: ~25% overhead

This quantifies the performance cost of type safety and helps users make informed tradeoffs between safety and speed.
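The overhead formula translates directly into a small helper. The RPS figures in the example are made up for illustration and are not measured results.

```python
def validation_overhead(raw_rps: float, validated_rps: float) -> float:
    """Percentage of throughput lost to validation, relative to the raw variant."""
    if raw_rps <= 0:
        raise ValueError("raw_rps must be positive")
    return (raw_rps - validated_rps) / raw_rps * 100


# Hypothetical pair: the raw variant serves 50,000 RPS, the validated one 42,500.
overhead = validation_overhead(raw_rps=50_000, validated_rps=42_500)
print(f"{overhead:.1f}% overhead")
```

A negative result would mean the validated variant was faster. That usually points to measurement noise rather than a real gain.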

Metrics Explained¶

Throughput¶

Requests per second (RPS): The primary performance indicator.

RPS = successful_requests / benchmark_duration_seconds

Higher is better. Typical ranges:

100k+ RPS: Native Rust/Go frameworks with minimal overhead
50k-100k RPS: Optimized Python/Node frameworks
10k-50k RPS: Standard Python frameworks (Django, Flask)
<10k RPS: Heavy frameworks or interpreted languages without optimizations

Bytes per second: Network throughput including headers and body.

Bytes/sec = (total_bytes_sent + total_bytes_received) / duration_seconds

Useful for comparing large payload workloads where network I/O dominates.

Success rate: Percentage of requests that completed successfully.

Success_rate = successful_requests / total_requests * 100

Should always be 100% for valid benchmarks. Lower values indicate framework errors or crashes.
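The three throughput formulas above can be computed together from raw counters. The `RunCounts` type and its field names are hypothetical and do not correspond to Spikard's result schema.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    """Raw counters from one benchmark run (field names are illustrative)."""
    successful_requests: int
    total_requests: int
    bytes_sent: int
    bytes_received: int
    duration_seconds: float


def summarize(run: RunCounts) -> dict:
    """Apply the RPS, bytes/sec, and success-rate formulas."""
    return {
        "rps": run.successful_requests / run.duration_seconds,
        "bytes_per_sec": (run.bytes_sent + run.bytes_received) / run.duration_seconds,
        "success_rate": run.successful_requests / run.total_requests * 100,
    }


# Made-up counters for a 30-second run.
summary = summarize(RunCounts(1_500_000, 1_500_000, 300_000_000, 600_000_000, 30.0))
```

Because RPS counts only successful requests, a run with failures reports lower throughput instead of hiding them.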

Latency Distribution¶

Latency percentiles answer different questions:

p50 (median): Typical user experience
p90: Most users' experience
p95: Nearly all users' experience
p99: Worst-case for 99% of users
p99.9: Extreme outliers

Why percentiles matter: Mean latency can be misleading when a few slow requests skew the average. p99 latency can be 10x the median or more.

Example interpretation:

median: 2.5ms
p99: 45ms

This means half of all requests complete within 2.5ms, but 1% take 45ms or longer, at least 18x the median. A gap this wide might indicate GC pauses or lock contention.
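The following sketch shows how a small slow tail separates the mean from the median. It uses the nearest-rank percentile definition, which is one of several common conventions and may differ from what the load generator reports. The sample data is made up.

```python
import math
import statistics


def percentile(sorted_samples: list[float], p: float) -> float:
    """Nearest-rank percentile of an ascending list (p between 0 and 100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


# Made-up sample: 980 fast requests and 20 slow ones, in milliseconds.
latencies_ms = sorted([2.5] * 980 + [45.0] * 20)

p50 = percentile(latencies_ms, 50)       # 2.5
p99 = percentile(latencies_ms, 99)       # 45.0
mean = statistics.fmean(latencies_ms)    # about 3.35
```

The mean lands near 3.35ms, which describes neither the typical request nor the slow tail. This is why the percentiles are reported separately.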

Resource Utilization¶

CPU percentage: Process CPU usage as a percentage of one core.

100% = fully utilizing one core
400% = utilizing four cores
Values >100% indicate multi-threaded execution

Memory (RSS): Resident Set Size in megabytes.

RSS is the physical memory the process currently holds, so it reflects heap allocations, object creation, and memory leaks. Sharp increases during benchmarks indicate allocation pressure; gradual increases suggest leaks.

Language-Specific Profiling¶

Python metrics (collected with py-spy):

GIL wait time: Time spent waiting for Python's Global Interpreter Lock
GIL contention: Percentage of time blocked on GIL (higher = more threading overhead)
FFI overhead: Time spent crossing Python/Rust boundary
GC collections/time: Garbage collector impact

Node.js metrics (collected with clinic.js):

V8 heap usage: Memory allocated by JavaScript objects
Event loop lag: Delay in processing new events (higher = slower async handling)
GC time: V8 garbage collection overhead

Ruby metrics (collected with stackprof):

GC count/time: Frequency and duration of garbage collection
Heap pages: Memory allocated by Ruby VM
Live objects: Number of objects in memory

Statistical Analysis¶

Welch's t-test¶

Tests whether two frameworks have significantly different mean performance.

Null hypothesis (H₀): Framework A and Framework B have equal mean RPS.

Alternative hypothesis (H₁): Frameworks have different mean RPS.

Result interpretation:

p < 0.05: Reject null hypothesis, difference is statistically significant
p ≥ 0.05: Cannot reject null hypothesis, difference may be random
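As a sketch of what the test computes, the function below derives Welch's t statistic and the Welch-Satterthwaite degrees of freedom from per-run RPS samples, using only the standard library. Turning these into a p-value requires the t-distribution CDF. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the statistic and p-value directly. The sample values are made up.

```python
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance test."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error of each mean: sample variance / sample size.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1)
    )
    return t, df


# Hypothetical RPS from five runs of each framework.
framework_a = [50_100, 49_800, 50_300, 49_900, 50_000]
framework_b = [45_200, 44_900, 45_100, 44_700, 45_000]
t_stat, dof = welch_t(framework_a, framework_b)
```

Unlike Student's t-test, this form does not assume the two frameworks have equal run-to-run variance.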

Cohen's d Effect Size¶

Measures the magnitude of performance difference in standard deviation units.

d = (mean_A - mean_B) / pooled_standard_deviation

Interpretation:

|d| < 0.2: Trivial difference
|d| = 0.2-0.5: Small difference
|d| = 0.5-0.8: Medium difference
|d| > 0.8: Large difference

Example:

Framework A: 50,000 RPS (σ = 1,000)
Framework B: 45,000 RPS (σ = 1,200)
Cohen's d = 4.5 (very large effect)
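The example can be reproduced from the summary statistics alone. This sketch assumes both frameworks were measured with the same number of runs, so the pooled standard deviation reduces to the root mean square of the two standard deviations.

```python
import math


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Cohen's d from summary statistics, assuming equal sample sizes."""
    pooled_sd = math.sqrt((sd_a**2 + sd_b**2) / 2)
    return (mean_a - mean_b) / pooled_sd


d = cohens_d(mean_a=50_000, sd_a=1_000, mean_b=45_000, sd_b=1_200)
print(f"Cohen's d = {d:.1f}")
```

With unequal sample sizes, each variance should be weighted by its degrees of freedom before pooling.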

Multiple Comparison Correction¶

When comparing many frameworks, we use baseline comparison to reduce false positives:

Instead of comparing all pairs (N(N-1)/2 comparisons), we compare each framework to a single baseline (N-1 comparisons).

This reduces the risk of finding "significant" differences by chance when running many statistical tests.
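The sketch below counts the tests under each strategy and estimates the chance of at least one false positive. The estimate assumes independent tests at a 0.05 threshold, which is a simplification. The choice of baseline is hypothetical.

```python
from itertools import combinations

frameworks = [
    "fastapi", "fastapi-raw", "express",
    "express-raw", "spikard-python", "spikard-raw",
]
baseline = "spikard-python"  # hypothetical choice of baseline

all_pairs = list(combinations(frameworks, 2))
baseline_pairs = [(baseline, name) for name in frameworks if name != baseline]


def false_positive_risk(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


print(len(all_pairs), round(false_positive_risk(len(all_pairs)), 3))
print(len(baseline_pairs), round(false_positive_risk(len(baseline_pairs)), 3))
```

For six frameworks this gives 15 pairwise tests against 5 baseline tests. Under the independence assumption, the risk of a spurious result falls from roughly 54% to roughly 23%.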

Benchmark Execution¶

Load Generator¶

Spikard uses oha (preferred) or bombardier for load generation:

oha -z 30s -c 100 --latency-correction --disable-keepalive http://localhost:8000/endpoint

Parameters:

-z 30s: Run for 30 seconds
-c 100: 100 concurrent connections
--latency-correction: Adjust for coordinated omission (per oha's help text, this takes effect only when a fixed request rate is set with -q)
--disable-keepalive: Force new connections (more realistic)
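A benchmark harness might assemble this command programmatically so that duration and concurrency stay configurable. The sketch below uses only the flags shown above. `build_oha_command` is a hypothetical helper, not part of Spikard or oha.

```python
import shlex


def build_oha_command(url: str, duration_s: int = 30, concurrency: int = 100) -> list[str]:
    """Build the oha argument list using the flags documented above."""
    return [
        "oha",
        "-z", f"{duration_s}s",
        "-c", str(concurrency),
        "--latency-correction",
        "--disable-keepalive",
        url,
    ]


command = build_oha_command("http://localhost:8000/endpoint")
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the URL contains query parameters.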

Warmup Period¶

Before measurement, a 3-second warmup period runs to:

Populate caches
Trigger JIT compilation
Stabilize CPU frequency
Initialize connection pools

Server Process Management¶

Each framework application:

Spawns in an isolated process
Waits for HTTP readiness (port listening)
Receives warmup traffic
Undergoes measured benchmark
Terminates cleanly

This ensures no cross-contamination between benchmarks.
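The lifecycle above can be sketched as a small wrapper. The function names, timeouts, and the `warmup` and `measure` callbacks are assumptions made for illustration. The actual harness may manage processes differently.

```python
import socket
import subprocess
import time
from typing import Callable


def wait_for_port(host: str, port: int, timeout_s: float = 10.0) -> bool:
    """Poll until a TCP connection succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def run_isolated(server_cmd: list[str], host: str, port: int,
                 warmup: Callable[[], None], measure: Callable[[], dict]) -> dict:
    """Spawn the server, wait for readiness, warm up, measure, then terminate."""
    server = subprocess.Popen(server_cmd)
    try:
        if not wait_for_port(host, port):
            raise RuntimeError("server did not become ready")
        warmup()
        return measure()
    finally:
        # Ask for a clean shutdown first; force-kill only if that stalls.
        server.terminate()
        try:
            server.wait(timeout=5)
        except subprocess.TimeoutExpired:
            server.kill()
            server.wait()
```

Putting the teardown in `finally` ensures a failed benchmark never leaves a server running to skew the next one.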

Common Pitfalls¶

1. Coordinated Omission¶

Problem: Load generators that pause when the server is slow underreport latency.

Solution: oha's --latency-correction flag compensates for this bias when requests are issued at a fixed rate (-q).

2. Insufficient Duration¶

Problem: Short benchmarks (< 10s) may not capture steady-state performance.

Solution: Default 30s duration ensures stabilization. Use 60s+ for production comparisons.

3. Ignoring Success Rate¶

Problem: Frameworks that drop or reject requests can appear faster, because failed requests finish quickly and do less work.

Solution: Always confirm the success rate is 100% before comparing RPS.

4. Single Run Results¶

Problem: Performance varies from run to run due to CPU throttling and background processes.

Solution: Run multiple iterations and report median RPS with standard deviation.
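A minimal way to summarize repeated runs is shown below. The RPS values are made up, and the coefficient of variation is an extra convenience, not a metric the methodology prescribes.

```python
import statistics


def aggregate_runs(rps_per_run: list[float]) -> dict:
    """Summarize repeated runs as median RPS with sample standard deviation."""
    if len(rps_per_run) < 2:
        raise ValueError("need at least two runs to estimate variation")
    stdev = statistics.stdev(rps_per_run)
    return {
        "median_rps": statistics.median(rps_per_run),
        "stdev_rps": stdev,
        # Relative spread; large values suggest an unstable environment.
        "cv_percent": stdev / statistics.fmean(rps_per_run) * 100,
    }


# Hypothetical RPS from five iterations of the same workload.
result = aggregate_runs([49_800, 50_100, 50_000, 50_300, 49_900])
```

The median resists a single disturbed run better than the mean does, which suits small iteration counts.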

5. Different Runtime Versions¶

Problem: Python 3.11 vs 3.12 may show 20%+ performance differences.

Solution: Lock runtime versions in metadata and compare within same environment.

Best Practices¶

Running Benchmarks¶

Close background applications: Quit browsers, IDEs, and other CPU-intensive processes
Disable CPU frequency scaling: Set governor to performance mode
Pin to physical cores: Use taskset to avoid hyperthreading variance
Run multiple iterations: Execute 3-5 runs and report median
Monitor temperature: Ensure CPU doesn't thermal throttle during benchmarks

Interpreting Results¶

Check success rate first: 100% successful requests required
Compare within language: Don't compare FastAPI (Python) to Express (Node.js)
Consider validation overhead: Raw variants show theoretical maximum
Look at percentiles: p99 latency matters for user experience
Read statistical significance: The p-value indicates whether a difference is likely due to chance; check Cohen's d for its practical size

Adding New Frameworks¶

Implement all workload endpoints matching existing apps
Use framework's recommended production configuration
Enable validation if framework supports it
Create a -raw variant for overhead comparison
Document framework version and runtime in app README

Future Enhancements¶

Automated regression detection in CI
Historical trend visualization
Per-endpoint profiling with flamegraphs
WebSocket and SSE streaming benchmarks
Database integration workloads (PostgreSQL queries)
Multi-region latency simulation