YOLOv8s Performance Benchmarks: A Data-Driven GPU Comparison
When evaluating hardware for computer vision tasks, concrete data is often hard to come by. How much faster is a data center GPU like the H100 than a compact workstation card like the RTX 4000 SFF Ada for a real-world workload? Is it worth upgrading from a T4 to an L4? And what is the true performance gap between a GPU and a CPU?
This post aims to provide clear, data-driven answers to these questions. We've benchmarked a range of modern NVIDIA GPUs and CPU configurations on a common object detection task: running a YOLOv8s model with an image size of 320. Our goal is to provide a valuable reference point that is missing from the public domain, helping you make more informed hardware decisions.
All GPU benchmarks were conducted using the PyTorch version of YOLOv8s. The NVIDIA drivers used were version 570 for the EC2 instances (T4, A10G, L4), 565 for the H100, and 575 for the RTX 4000 SFF Ada workstation card. For the CPU, we compare the performance of PyTorch and OpenVINO.
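For reference, here is a minimal sketch of the kind of per-process measurement loop used throughout this post, assuming the standard Ultralytics predict API (which treats an in-memory list of frames as a single batch). The synthetic input, warm-up count, and iteration count are illustrative choices, not the exact harness:

```python
import time
import numpy as np
from ultralytics import YOLO

# Minimal sketch: time YOLOv8s inference at imgsz=320 on a batch of
# synthetic 320x320 frames. Warm-up iterations are excluded from timing.
model = YOLO("yolov8s.pt")
batch = [np.random.randint(0, 255, (320, 320, 3), dtype=np.uint8) for _ in range(32)]

for _ in range(10):                       # warm-up
    model(batch, imgsz=320, verbose=False)

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    model(batch, imgsz=320, verbose=False)
elapsed = time.perf_counter() - t0

print(f"{iters * len(batch) / elapsed:.1f}")  # frames per second
```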
Peak Performance: A Hardware Showdown
Let's start with a simple question: what's the fastest hardware for a typical YOLOv8s workload? We ran benchmarks across several NVIDIA GPUs and a multi-core CPU to find the maximum achievable Frames Per Second (FPS).
Note: Chart shows peak throughput. CPU: 4-vCPU w/ OpenVINO. RTX 4000: 4 concurrent runs. RTX 4060: 100W laptop GPU w/ 8 concurrent runs. H100: 32 concurrent runs. All GPUs use NVIDIA MPS.
The data reveals two crucial insights. First, even the slowest GPU we tested (the NVIDIA T4 on AWS g4dn) is over 30 times faster than a single vCPU running an optimized OpenVINO workload (360 FPS vs. ~11 FPS). Second, the data quantifies the massive performance gap between hardware tiers. A top-tier H100, when fully saturated with concurrent workloads, is ~6.7x faster than a highly capable RTX 4060 and ~11x faster than an L4, providing a clear picture of the performance you get for the money.
The Concurrency Advantage: More Throughput on Modern GPUs
Achieving high FPS with a single, batched process is a great start, but what happens when you need to serve multiple video streams or users at once? For smaller models like YOLO on large, modern GPUs, you can often achieve significantly higher total throughput by running multiple processes concurrently.
A single inference process, even with batching, often can't saturate a powerful GPU. The key to unlocking its full potential is to serve multiple workloads in parallel. However, simply running multiple processes that target the same GPU can lead to contention and unpredictable performance.
The solution is NVIDIA's Multi-Process Service (MPS). Unlike the default behavior, where the GPU time-slices between processes, MPS allows CUDA kernels from different processes to execute truly concurrently on the GPU's hardware. This cooperative multitasking avoids context-switching overhead, leading to higher overall throughput and more predictable performance.
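In practice, enabling MPS just means starting the control daemon before launching your inference processes. A minimal sketch, assuming a Linux host with the CUDA driver installed; the pipe/log directories and the worker script name (benchmark_yolov8s.py) are illustrative, not fixed:

```python
import os
import subprocess

# Illustrative sketch: start the NVIDIA MPS control daemon, then launch
# inference workers that will share the GPU cooperatively.
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"
env["CUDA_MPS_PIPE_DIRECTORY"] = "/tmp/nvidia-mps"       # illustrative path
env["CUDA_MPS_LOG_DIRECTORY"] = "/tmp/nvidia-mps-log"    # illustrative path

# Start the daemon (it backgrounds itself with -d).
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

# Any process launched with the same environment is now served by MPS.
subprocess.Popen(["python", "benchmark_yolov8s.py"], env=env)  # hypothetical worker

# To shut the daemon down later:
# subprocess.run(["nvidia-cuda-mps-control"], input=b"quit\n", env=env)
```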
Let's look at the NVIDIA H100. A single process achieves an impressive 1,184 FPS. But by using MPS to partition the GPU into independent "splits," we can serve more processes and drive the total throughput way up.
| Concurrent Runs (Splits) | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 1184 | 1184 |
| 2 | 1025 | 2050 |
| 4 | 870 | 3480 |
| 8 | 628 | 5024 |
| 16 | 370 | 5920 |
| 24 | 260 | 6240 |
| 32 | 210 | 6720 |
| 48 | 120 | 5760 |
As the data shows, total throughput on the H100 keeps climbing with concurrency, peaking at an impressive 6,720 FPS with 32 concurrent processes before falling off at 48. This is the key insight for deploying models like YOLO at scale: more concurrency means more throughput, as long as it's managed correctly with a tool like MPS.
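Reproducing this kind of scaling test follows a simple pattern: start MPS once, launch N identical benchmark processes, and sum the FPS each one reports. A rough sketch, assuming each worker prints its FPS as the last line of its output; the worker script name and output convention are our own, hypothetical choices:

```python
import subprocess

# Rough sketch: launch N concurrent benchmark processes (under MPS) and
# sum the per-process FPS they report on stdout.
def combined_fps(n_procs: int, worker: str = "benchmark_yolov8s.py") -> float:
    procs = [
        subprocess.Popen(["python", worker], stdout=subprocess.PIPE, text=True)
        for _ in range(n_procs)
    ]
    total = 0.0
    for p in procs:
        out, _ = p.communicate()
        # Each worker prints its FPS as the first token of its last line.
        total += float(out.strip().splitlines()[-1].split()[0])
    return total

for n in (1, 2, 4, 8, 16, 32):
    print(n, combined_fps(n))
```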
We saw similar, albeit less dramatic, benefits on the NVIDIA L4. Two concurrent runs without MPS caused contention, with each process achieving ~164 FPS. With MPS isolating the two processes into 50% compute slices, each achieved a stable ~176 FPS, improving total throughput and providing predictable performance.
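The 50% slices on the L4 were set with the per-process MPS environment variable that caps how many SMs a client may use. A minimal illustration, assuming the MPS daemon is already running (see the sketch above) and reusing the same hypothetical worker script:

```python
import os
import subprocess

# Give each of two workers roughly half of the GPU's SMs under MPS.
for _ in range(2):
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "50"  # ~50% compute slice
    subprocess.Popen(["python", "benchmark_yolov8s.py"], env=env)
```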
The MPS Advantage on Consumer GPUs: A 4060 Case Study
The benefits of MPS aren't limited to data center GPUs. We tested an RTX 4060 (laptop, 100W) with a batch size of 32 to see how MPS affects performance under concurrent loads.
Without MPS
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 370 | 740 |
| 4 | 197 | 788 |
| 8 | 101 | 808 |
With MPS
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 426 | 852 |
| 4 | 228 | 912 |
| 8 | 124 | 992 |
Without MPS, the processes contend for the GPU: per-run FPS collapses as concurrency rises, and total throughput barely improves beyond two runs. With MPS, the GPU handles concurrent processes much more gracefully. At 8 concurrent runs, MPS delivers a ~23% increase in total throughput (992 FPS vs. 808 FPS), demonstrating its value even on consumer-grade hardware.
Impact of Batch Size on Throughput
Batch size is another critical factor. A larger batch size generally leads to higher throughput, but the benefits diminish as the GPU becomes saturated. We tested an RTX 4060 (100W, with MPS) at batch sizes of 1, 4, and 32.
As expected, batch 32 provides the highest peak throughput. However, batch 4 is surprisingly competitive, reaching over 1000 FPS with 8 concurrent runs. For latency-sensitive applications, a smaller batch size like 4 might offer a better trade-off between throughput and response time. Batch 1 performance, while lower, still scales well with concurrency, making it a viable option if real-time processing of single frames is required.
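A quick back-of-the-envelope check, using the per-run figures from the raw benchmark data below and assuming all frames in a batch complete together, shows why batch 4 is attractive for latency-sensitive work:

```python
# Approximate per-batch latency = batch_size / per-run FPS
# (RTX 4060, 100W, with MPS, 8 concurrent runs; figures from the raw data below)
for batch_size, fps_per_run in [(32, 124), (4, 128), (1, 100)]:
    latency_ms = 1000 * batch_size / fps_per_run
    print(f"batch {batch_size:>2}: ~{latency_ms:.0f} ms per batch")
# batch 32: ~258 ms, batch 4: ~31 ms, batch 1: ~10 ms
```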
Power Limits: 100W vs. 120W
Does supplying more power to a GPU always translate to better performance? We compared the RTX 4060 at 100W and 120W power limits using a batch size of 4 with MPS.
The extra 20W provides a modest uplift that grows with concurrency: it is negligible at a single run, around 4% at 4 concurrent runs, and roughly 8% at 8 concurrent runs (1,104 vs. 1,024 FPS). While not a groundbreaking increase, it shows that in power-constrained environments even a small bump in wattage can yield a measurable improvement in throughput.
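For completeness, the board power limit itself is set through nvidia-smi (this requires administrative privileges and a board that allows it); a one-line sketch:

```python
import subprocess

# Cap GPU 0's board power limit at 120 W (requires root/admin privileges).
subprocess.run(["nvidia-smi", "-i", "0", "-pl", "120"], check=True)
```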
Framework Comparison: PyTorch vs. ONNX Runtime vs. TensorRT
The hardware you run on is only half the story; the software framework you use for inference can make a massive difference in performance. To quantify this, we benchmarked the same YOLOv8s model on an RTX 4060 (100W, batch size 1) using four different inference configurations (a short export/runtime sketch follows the list):
- PyTorch: The baseline, using the standard Torch-based inference.
- ONNX Runtime (CUDA): Using ONNX Runtime with its CUDA Execution Provider, which offers a good out-of-the-box speedup.
- ONNX Runtime (TensorRT): Using ONNX Runtime with the more optimized TensorRT Execution Provider.
- TensorRT (`trtexec`): The peak-performance baseline, using NVIDIA's `trtexec` tool to run a pure, highly optimized TensorRT engine.
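Here is a rough sketch of how these variants can be produced and run, assuming the standard Ultralytics export API and an ONNX Runtime GPU build; whether the TensorRT execution provider is available depends on your local TensorRT/CUDA install:

```python
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

# Export the PyTorch checkpoint to ONNX once (imgsz=320 to match the benchmark).
YOLO("yolov8s.pt").export(format="onnx", imgsz=320)

# ONNX Runtime: choose the execution provider to compare the CUDA and TensorRT paths.
sess = ort.InferenceSession(
    "yolov8s.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
x = np.random.rand(1, 3, 320, 320).astype(np.float32)
outputs = sess.run(None, {sess.get_inputs()[0].name: x})

# Pure TensorRT baseline (run from a shell):
#   trtexec --onnx=yolov8s.onnx --fp16
```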
The results are clear: moving from PyTorch to a more specialized inference framework yields significant gains. While ONNX Runtime with CUDA provides a solid boost, leveraging its TensorRT provider unlocks even more performance. For maximum throughput, a pure TensorRT implementation is the undisputed winner, delivering over 12% more FPS than the next best option (ONNX with TensorRT) at 8 concurrent runs and over 2.5x the throughput of PyTorch at a single run.
CPU Inference: Getting the Most from PyTorch and OpenVINO
While GPUs are king, sometimes you're limited to a CPU. We tested both PyTorch and OpenVINO on a 4-vCPU instance to see which framework performed better.
| Framework | vCPUs | Inference FPS |
|---|---|---|
| PyTorch | 4 | 22.73 |
| OpenVINO | 4 | 28.96 |
For CPU-bound inference, a properly configured OpenVINO is the clear winner, delivering a 27% performance improvement over native PyTorch. If you must run on a CPU, optimizing your software stack is critical, and OpenVINO is the right tool for the job.
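Moving a CPU deployment from PyTorch to OpenVINO is essentially a one-line export with Ultralytics. A minimal sketch; the exported directory name follows Ultralytics' default convention, and image.jpg is a placeholder input:

```python
from ultralytics import YOLO

# Export the PyTorch checkpoint to OpenVINO IR format (runs once).
YOLO("yolov8s.pt").export(format="openvino", imgsz=320)

# Load the exported model; Ultralytics dispatches inference to the
# OpenVINO runtime automatically when given the exported directory.
ov_model = YOLO("yolov8s_openvino_model/")
results = ov_model("image.jpg", imgsz=320)
```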
Conclusion & Key Takeaways
- The GPU Imperative: Even an older data center GPU like the T4 is over 30x faster than a single vCPU for YOLOv8s inference.
- Quantifying the Tiers: An H100 isn't just faster; it's a different class of machine, offering 7-11x more throughput than capable mid-range cards like the RTX 4000 SFF Ada and the L4.
- Concurrency is King: For smaller models like YOLO, maximizing the performance of modern GPUs requires running multiple workloads in parallel with a tool like NVIDIA MPS.
- CPU Choice Matters: If you must use a CPU, OpenVINO provides a significant ~27% performance lift over a standard PyTorch setup.
Raw Benchmark Data
NVIDIA H100 (with MPS)
| Concurrent Runs (Splits) | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 1184 | 1184 |
| 2 | 1025 | 2050 |
| 4 | 870 | 3480 |
| 8 | 628 | 5024 |
| 16 | 370 | 5920 |
| 24 | 260 | 6240 |
| 32 | 210 | 6720 |
| 48 | 120 | 5760 |
NVIDIA L4 (g6.xlarge)
| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 151.20 | 607.25 |
NVIDIA A10G (g5.xlarge)
| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 140.34 | 463.46 |
RTX 4000 SFF Ada
| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 145.28 | 527.93 |
RTX 4000 SFF Ada (Concurrency with MPS)
| Concurrent Runs | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 680 | 680 |
| 2 | 450 | 900 |
| 4 | 230 | 920 |
RTX 4060 (100W & 120W)
RTX 4060 100W Framework Comparison (Batch 1, with MPS)
PyTorch
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 250 | 250 |
| 2 | 235 | 470 |
| 4 | 169 | 676 |
| 8 | 100 | 800 |
ONNX Runtime (CUDA)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 330 | 330 |
| 2 | 272 | 544 |
| 4 | 155 | 620 |
| 8 | 85 | 680 |
ONNX Runtime (TensorRT)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 497 | 497 |
| 2 | 355 | 710 |
| 4 | 208 | 832 |
| 8 | 100 | 800 |
TensorRT (trtexec)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 646 | 646 |
| 2 | 392 | 784 |
| 4 | 220 | 880 |
| 8 | 113 | 904 |
RTX 4060 100W (Batch 32, No MPS)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 370 | 740 |
| 4 | 197 | 788 |
| 8 | 101 | 808 |
RTX 4060 100W (Batch 32, with MPS)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 426 | 852 |
| 4 | 228 | 912 |
| 8 | 124 | 992 |
RTX 4060 100W (Batch 4, with MPS)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 616 | 616 |
| 2 | 431 | 862 |
| 4 | 240 | 960 |
| 8 | 128 | 1024 |
RTX 4060 100W (Batch 1, with MPS)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 250 | 250 |
| 2 | 235 | 470 |
| 4 | 169 | 676 |
| 8 | 100 | 800 |
RTX 4060 120W (Batch 4, with MPS)
| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 620 | 620 |
| 2 | 440 | 880 |
| 4 | 250 | 1000 |
| 8 | 138 | 1104 |
NVIDIA T4 (g4dn.xlarge)
| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 130.20 | 360 |
CPU Baseline (c5.xlarge)
| Framework | vCPUs | Inference FPS |
|---|---|---|
| PyTorch | 1 | 11.37 |
| OpenVINO | 1 | 11.32 |
| PyTorch | 4 | 22.73 |
| OpenVINO | 4 | 28.96 |