Performance Testing: Load, Stress, Spike, and Soak — What Each Does and When to Use Them
The short answer: Performance testing is not a single activity — it is four distinct approaches, each designed to find different failure modes. Load testing validates normal operating capacity. Stress testing finds your breaking point. Spike testing simulates sudden demand surges. Soak testing catches slow degradation over time. Most systems that fail under pressure do so because only one or two of these tests were ever run, if any.
Performance failures are expensive. Gartner research from 2023 estimated average downtime costs at £4,000–£9,000 per minute for enterprise applications. For e-commerce, a one-second increase in page load time reduces conversions by 7% (Akamai). For SaaS platforms, performance incidents drive churn and undermine renewal conversations. The investment in performance testing is small relative to the cost of getting it wrong in production.
The Four Types of Performance Test
Load Testing
Load testing validates that your application performs correctly under the expected concurrent user load it will experience in production. It answers the question: does the system work at normal scale?
To run a useful load test, you need three numbers: your expected peak concurrent users, your expected transaction mix (what percentage of users are doing what), and your acceptable response time thresholds (typically P95 response time under 2 seconds for web interactions).
What to measure:
- Response time at P50, P95, P99 (median and tail latency)
- Throughput (requests per second)
- Error rate (should be under 0.1% at expected load)
- Resource utilisation (CPU, memory, database connections) at target load
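A minimal k6 sketch of that setup, assuming a hypothetical peak of 200 concurrent users and a placeholder endpoint (the stage targets, durations, and URL are illustrative; the thresholds encode the targets above as pass/fail criteria):

```javascript
// load-test.js: k6 load test sketch with placeholder URL and user counts
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  // Ramp to the expected production peak, hold, then ramp down.
  stages: [
    { duration: '5m', target: 200 },  // ramp up to the assumed peak
    { duration: '30m', target: 200 }, // hold at expected load
    { duration: '5m', target: 0 },    // ramp down
  ],
  // Fail the run if tail latency or error rate breach the targets above.
  thresholds: {
    http_req_duration: ['p(95)<2000'], // P95 under 2 seconds
    http_req_failed: ['rate<0.001'],   // error rate under 0.1%
  },
};

export default function () {
  // Placeholder endpoint; replace with your real transaction mix.
  const res = http.get('https://test.example.com/');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between iterations
}
```

A breached threshold makes `k6 run load-test.js` exit with a non-zero code, which is what allows a load test to act as a gate in a CI pipeline.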
What load testing finds: Capacity shortfalls (the system can't handle expected traffic), slow database queries that aren't visible at low traffic, connection pool exhaustion, and memory utilisation that's higher than anticipated.
What it doesn't find: What happens when you exceed expected load, behaviour under sudden traffic spikes, or slow degradation over hours of operation.
Stress Testing
Stress testing deliberately pushes the system beyond its designed capacity to find the breaking point and understand failure behaviour. It answers: what happens when demand exceeds what we designed for?
The goal is not to prove the system doesn't break — it will. The goal is to understand how it breaks, at what threshold, and whether it recovers cleanly when load returns to normal.
Key questions stress testing answers:
- At what concurrent user count does performance degrade meaningfully?
- At what point do errors start appearing?
- At the absolute breaking point, does the system crash hard or degrade gracefully?
- After the stress is removed, does the system recover automatically and completely?
What stress testing reveals: The system's capacity ceiling, failure modes under overload (does it return errors gracefully or timeout silently?), auto-scaling trigger points, and whether recovery mechanisms work as designed.
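The load-test script above can be reused for a stress test by changing only the load shape: a staircase that climbs well past designed capacity, then drops back so recovery can be observed. An illustrative k6 stages block, with targets scaled off an assumed 200-user peak:

```javascript
// Stress-test load shape: staircase past designed capacity, then recovery.
// Reuses the load-test script's default function; targets are placeholders.
export const options = {
  stages: [
    { duration: '5m', target: 200 },  // assumed expected peak
    { duration: '5m', target: 400 },  // 2x peak
    { duration: '5m', target: 800 },  // 4x peak: expect degradation to begin
    { duration: '5m', target: 1200 }, // push towards the breaking point
    { duration: '10m', target: 200 }, // drop back: does the system recover cleanly?
    { duration: '5m', target: 0 },
  ],
};
```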
Graceful degradation under stress is as important as the capacity ceiling. A system that returns clear error messages and recovers automatically after stress is far better than one that hangs indefinitely and requires manual intervention.
Spike Testing
Spike testing simulates sudden, dramatic increases in load — the kind of traffic patterns caused by a marketing campaign going viral, a TV mention, a news event, or the release of a major product announcement. It answers: can the system handle sudden demand surges without degrading or failing?
The difference from stress testing is the shape of the load: spike testing involves a rapid ramp-up (often in seconds rather than minutes), a short period at peak, and then a rapid return to normal levels.
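Expressed as an illustrative k6 stages block (again reusing the load-test script's default function, with placeholder magnitudes):

```javascript
// Spike-test load shape: sudden surge, short peak, rapid return to normal.
export const options = {
  stages: [
    { duration: '2m', target: 100 },   // normal background load
    { duration: '30s', target: 2000 }, // sudden surge (placeholder magnitude)
    { duration: '3m', target: 2000 },  // short period at peak
    { duration: '30s', target: 100 },  // rapid return to normal
    { duration: '5m', target: 100 },   // observe recovery at normal load
  ],
};
```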
What spike testing finds: Auto-scaling latency (the time between traffic spiking and new capacity becoming available), connection handling under sudden surge, queue overflow behaviour, and cold-start performance issues in serverless or containerised architectures.
Particularly relevant for: E-commerce platforms (flash sales, product launches), ticketing systems (sale openings), media platforms (breaking news events), and any consumer-facing product with unpredictable demand patterns.
A common failure mode revealed by spike testing: auto-scaling is configured correctly but takes 3–4 minutes to provision new instances — long enough for the spike to cause errors and timeouts before capacity catches up. The fix is faster scaling, pre-warming capacity ahead of known events, or accepting the risk and designing graceful degradation for the spike window.
Soak Testing
Soak testing runs the system under normal or moderately elevated load for an extended period — typically 12–72 hours — looking for degradation that only manifests over time. It answers: does the system remain stable over sustained operation?
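The load shape itself is trivial; the value is in the duration and in watching memory, connections, and disk alongside the test. An illustrative k6 stages block with placeholder load and duration, reusing the load-test script's default function:

```javascript
// Soak-test load shape: steady, moderate load held for many hours.
export const options = {
  stages: [
    { duration: '10m', target: 150 }, // ramp to a normal operating load (placeholder)
    { duration: '24h', target: 150 }, // hold: watch memory, connections, disk over time
    { duration: '10m', target: 0 },
  ],
};
```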
The failure modes soak testing finds are fundamentally different from the other types:
- Memory leaks: Memory usage that grows incrementally over hours, eventually causing out-of-memory errors or garbage collection pauses
- Connection pool leaks: Database or API connections that are acquired but not properly released, eventually exhausting the pool
- Disk space accumulation: Log files, temporary files, or database bloat that consume disk space over time
- Thread exhaustion: Thread pool depletion under sustained load that causes requests to queue indefinitely
- Cache poisoning or stale caches: Caching behaviour that degrades over time as the cache fills or becomes inconsistent
Soak testing failures are particularly damaging in production because they typically manifest gradually and then fail suddenly — the application appears healthy for hours before collapsing under the weight of accumulated leaks.
Metrics That Actually Matter
Performance testing generates enormous amounts of data. The metrics that drive decisions are:
Response time percentiles — P50, P95, P99: The median (P50) tells you how most users experience the system. P95 and P99 tell you about tail latency — the worst 5% and 1% of requests. For user-facing applications, P95 under 2 seconds is the standard benchmark for acceptable web performance. P99 matters for API-dependent architectures where slow responses cascade.
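If you are using k6, the end-of-test summary can be asked for these percentiles explicitly via the summaryTrendStats option (available in reasonably recent k6 versions):

```javascript
// Report median and tail percentiles in k6's end-of-test summary.
export const options = {
  summaryTrendStats: ['med', 'p(95)', 'p(99)', 'max'],
};
```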
Error rate: The percentage of requests that return errors. Under normal load, error rate should be below 0.1%. Any non-zero error rate under expected load is a finding worth investigating.
Throughput: Requests per second (or transactions per second for business-level metrics). Throughput tells you how much work the system is doing — and whether it's increasing linearly with virtual users or plateauing (indicating a bottleneck).
Apdex score: The Application Performance Index is a standardised measure of user satisfaction based on response time. An Apdex of 1.0 is perfect; 0.94+ is excellent; below 0.5 indicates significant user experience impact.
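Apdex is simple to derive from raw response times if your tool does not report it directly. With target threshold T, requests at or under T count as satisfied, requests between T and 4T count as tolerating at half weight, and anything slower counts as frustrated. A small sketch:

```javascript
// Compute an Apdex score from raw response times (milliseconds).
// T is the target threshold, e.g. 500 ms; the tolerating band is T..4T.
function apdex(responseTimesMs, T) {
  let satisfied = 0;
  let tolerating = 0;
  for (const t of responseTimesMs) {
    if (t <= T) satisfied++;
    else if (t <= 4 * T) tolerating++;
    // anything slower than 4T is "frustrated" and contributes nothing
  }
  return (satisfied + tolerating / 2) / responseTimesMs.length;
}

// Example: three satisfied, one tolerating, one frustrated request.
console.log(apdex([120, 300, 450, 900, 2600], 500)); // => 0.7
```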
Tools
k6: The modern standard for developer-friendly performance testing. Scripts in JavaScript, integrates natively with CI/CD pipelines, has excellent Grafana dashboards, and supports cloud execution for high virtual user counts. Our first recommendation for any team without an existing performance testing stack.
Apache JMeter: Mature, widely used, GUI-based. Excellent for teams that prefer visual test design. More resource-intensive than k6 for the same virtual user count.
Gatling: Scala/Java-based, produces excellent HTML reports, performs well at high virtual user counts. Popular in enterprise Java environments.
Artillery: Node.js-based, good for HTTP and WebSocket testing, easy YAML configuration. A lighter alternative to JMeter for teams already in the Node ecosystem.
Key Takeaways
- Load, stress, spike, and soak testing each find different failure modes — running only one type gives incomplete coverage
- Soak testing is the most commonly skipped and finds the most insidious failures (memory leaks, connection pool exhaustion)
- Measure P95 and P99 response time, not just average — tail latency determines real user experience
- Graceful degradation under stress is as important as the capacity ceiling
- k6 is the recommended modern performance testing tool for CI/CD integration