YOLOv8s Performance Benchmarks: A Data-Driven GPU Comparison

Published on 2025-08-20 by The HoML Team

When evaluating hardware for computer vision tasks, concrete data is often hard to come by. How much faster is a data center GPU like an H100 compared to a compact workstation card like the RTX 4000 SFF Ada for a real-world workload? Is it worth upgrading from a T4 to an L4? And what is the true performance gap between a GPU and a CPU?

This post aims to provide clear, data-driven answers to these questions. We've benchmarked a range of modern NVIDIA GPUs and CPU configurations on a common object detection task: running a YOLOv8s model with an image size of 320. Our goal is to provide a reference point that is hard to find in publicly available benchmarks, helping you make more informed hardware decisions.

All GPU benchmarks were conducted using the PyTorch version of YOLOv8s. The NVIDIA drivers used were version 570 for the EC2 instances (T4, A10G, L4), 565 for the H100, and 575 for the RTX 4000 SFF Ada workstation card. For the CPU, we compare the performance of PyTorch and OpenVINO.
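For reference, the GPU measurements follow the pattern sketched below: load the model with Ultralytics, run a short warm-up, then time a fixed number of batched inferences. This is a minimal sketch rather than our exact benchmark harness; the random input tensor, batch size, and iteration counts are placeholder values.

```python
# Minimal sketch of the measurement loop (not the exact benchmark harness).
# Assumes the `ultralytics` package and a CUDA-capable PyTorch build.
import time
import torch
from ultralytics import YOLO

model = YOLO("yolov8s.pt")          # pretrained YOLOv8s weights
imgs = torch.rand(32, 3, 320, 320)  # placeholder batch: 32 random 320x320 images

# Warm-up so CUDA initialization and cuDNN autotuning don't skew the timing.
for _ in range(10):
    model(imgs, verbose=False)

iters = 100
start = time.perf_counter()
for _ in range(iters):
    model(imgs, verbose=False)
elapsed = time.perf_counter() - start

fps = iters * imgs.shape[0] / elapsed
print(f"{fps:.1f} FPS")
```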

Peak Performance: A Hardware Showdown

Let's start with a simple question: what's the fastest hardware for a typical YOLOv8s workload? We ran benchmarks across several NVIDIA GPUs and a multi-core CPU to find the maximum achievable Frames Per Second (FPS).

Note: Chart shows peak throughput. CPU: 4-vCPU w/ OpenVINO. RTX 4000: 4 concurrent runs. RTX 4060: 100W laptop GPU w/ 8 concurrent runs. H100: 32 concurrent runs. All GPUs use NVIDIA MPS.

The data reveals two crucial insights. First, even the slowest GPU we tested (the NVIDIA T4 on AWS g4dn) is over 30 times faster than a single vCPU running an optimized OpenVINO workload (360 FPS vs. ~11 FPS). Second, the data quantifies the massive performance gap between hardware tiers. A top-tier H100, when fully saturated with concurrent workloads, is ~6.7x faster than a highly capable RTX 4060 and ~11x faster than an L4, providing a clear picture of the performance you get for the money.

The Concurrency Advantage: More Throughput on Modern GPUs

Achieving high FPS with a single, batched process is a great start, but what happens when you need to serve multiple video streams or users at once? For smaller models like YOLO on large, modern GPUs, you can often achieve significantly higher total throughput by running multiple processes concurrently.

A single inference process, even with batching, often can't saturate a powerful GPU. The key to unlocking its full potential is to serve multiple workloads in parallel. However, simply running multiple processes that target the same GPU can lead to contention and unpredictable performance.

The solution is NVIDIA's Multi-Process Service (MPS). Unlike the default GPU behavior, which time-slices access between processes, MPS allows CUDA kernels from different processes to execute truly concurrently on the GPU's hardware. This cooperative multitasking avoids context-switching overhead, leading to higher overall throughput and more predictable performance.
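To illustrate what this looks like in practice (not our exact harness), the sketch below starts the MPS control daemon, gives each worker a share of the GPU's SMs via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE, and launches several benchmark processes in parallel. The worker script name and run count are placeholders.

```python
# Hedged sketch: launch N concurrent benchmark workers under NVIDIA MPS.
# Assumes `nvidia-cuda-mps-control` is on PATH and `bench_worker.py` is a
# hypothetical script that runs the YOLOv8s loop above and prints its FPS.
import os
import subprocess

N_WORKERS = 8  # number of concurrent runs ("splits")

# Start the MPS control daemon (harmless if it is already running).
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=False)

env = os.environ.copy()
# Evenly partition the SMs across workers (e.g. two workers -> 50% each).
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(100 // N_WORKERS)

procs = [
    subprocess.Popen(["python", "bench_worker.py"], env=env)
    for _ in range(N_WORKERS)
]
for p in procs:
    p.wait()

# Total combined FPS is simply the sum of each worker's reported FPS.
# To stop the daemon afterwards:  echo quit | nvidia-cuda-mps-control
```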

Let's look at the NVIDIA H100. A single process achieves an impressive 1,184 FPS. But by using MPS to partition the GPU into independent "splits," we can serve more processes and drive the total throughput way up.

| Concurrent Runs (Splits) | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 1184 | 1184 |
| 2 | 1025 | 2050 |
| 4 | 870 | 3480 |
| 8 | 628 | 5024 |
| 16 | 370 | 5920 |
| 24 | 260 | 6240 |
| 32 | 210 | 6720 |
| 48 | 120 | 5760 |

As the table shows, total throughput on the H100 keeps climbing well past the single-process figure, peaking at an incredible 6,720 FPS with 32 concurrent processes before dropping off at 48. This is the key insight for deploying models like YOLO at scale: more concurrency leads to more throughput, as long as it's managed correctly with a tool like MPS.

We saw similar, albeit less dramatic, benefits on the NVIDIA L4. Two concurrent runs without MPS caused contention, with each process achieving ~164 FPS. With MPS isolating the two processes into 50% compute slices, each achieved a stable ~176 FPS, improving total throughput and providing predictable performance.

The MPS Advantage on Consumer GPUs: A 4060 Case Study

The benefits of MPS aren't limited to data center GPUs. We tested an RTX 4060 (laptop, 100W) with a batch size of 32 to see how MPS affects performance under concurrent loads.

Without MPS

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 370 | 740 |
| 4 | 197 | 788 |
| 8 | 101 | 808 |

With MPS

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 426 | 852 |
| 4 | 228 | 912 |
| 8 | 124 | 992 |

Without MPS, performance quickly degrades due to context-switching overhead. With MPS, the GPU handles concurrent processes much more gracefully. At 8 concurrent runs, MPS delivers a 22% increase in total throughput (992 FPS vs. 808 FPS), demonstrating its value even on consumer-grade hardware.

Impact of Batch Size on Throughput

Batch size is another critical factor. A larger batch size generally leads to higher throughput, but the benefits diminish as the GPU becomes saturated. We tested an RTX 4060 (100W, with MPS) at batch sizes of 1, 4, and 32.
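In the measurement loop shown earlier, batch size simply changes the shape of the input tensor. The hedged sketch below shows how we would vary it while also reporting per-batch latency, which is the other side of the trade-off discussed next; the iteration counts are placeholders.

```python
# Hedged sketch: compare throughput and per-batch latency at different batch sizes.
# Reuses the timing pattern from the earlier snippet; values are placeholders.
import time
import torch
from ultralytics import YOLO

model = YOLO("yolov8s.pt")

for batch in (1, 4, 32):
    imgs = torch.rand(batch, 3, 320, 320)
    for _ in range(10):                      # warm-up
        model(imgs, verbose=False)
    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        model(imgs, verbose=False)
    elapsed = time.perf_counter() - start
    fps = iters * batch / elapsed
    latency_ms = 1000 * elapsed / iters
    print(f"batch {batch:2d}: {fps:7.1f} FPS, {latency_ms:6.1f} ms per batch")
```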

As expected, batch 32 provides the highest peak throughput. However, batch 4 is surprisingly competitive, reaching over 1000 FPS with 8 concurrent runs. For latency-sensitive applications, a smaller batch size like 4 might offer a better trade-off between throughput and response time. Batch 1 performance, while lower, still scales well with concurrency, making it a viable option if real-time processing of single frames is required.

Power Limits: 100W vs. 120W

Does supplying more power to a GPU always translate to better performance? We compared the RTX 4060 at 100W and 120W power limits using a batch size of 4 with MPS.
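Power limits like these are typically applied with nvidia-smi before a run. The snippet below is a hedged illustration of how such a cap could be set; it generally requires root privileges, and the range of supported limits depends on the card.

```python
# Hedged sketch: cap the GPU's power limit before a benchmark run.
# `nvidia-smi -pl <watts>` sets the limit; it usually requires root.
import subprocess

def set_power_limit(watts: int, gpu_index: int = 0) -> None:
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )

set_power_limit(100)    # run the 100W benchmarks...
# set_power_limit(120)  # ...then repeat at 120W
```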

The extra 20W provides a modest uplift that grows with concurrency: it is negligible at a single run and reaches roughly 8% at 8 concurrent runs (1,104 vs. 1,024 FPS). While not a groundbreaking increase, it shows that for power-constrained environments, even a small bump in wattage can yield a noticeable improvement in throughput.

Framework Comparison: PyTorch vs. ONNX Runtime vs. TensorRT

The hardware you run on is only half the story; the software framework you use for inference can make a massive difference in performance. To quantify this, we benchmarked the same YOLOv8s model on an RTX 4060 (100W, batch size 1) using four different inference configurations:

  • PyTorch: The baseline, using the standard Torch-based inference.
  • ONNX Runtime (CUDA): Using ONNX Runtime with its CUDA Execution Provider, which offers a good out-of-the-box speedup.
  • ONNX Runtime (TensorRT): Using ONNX Runtime with the more optimized TensorRT Execution Provider.
  • TensorRT (trtexec): The peak performance baseline, using NVIDIA's trtexec tool to run a pure, highly-optimized TensorRT engine.

The results are clear: moving from PyTorch to a more specialized inference framework yields significant gains. While ONNX Runtime with CUDA provides a solid boost, leveraging its TensorRT provider unlocks even more performance. For maximum throughput, a pure TensorRT implementation is the undisputed winner, delivering over 12% more FPS than the next best option (ONNX with TensorRT) at 8 concurrent runs and over 2.5x the throughput of PyTorch at a single run.
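For readers who want to reproduce these configurations, the hedged sketch below exports the model with Ultralytics and opens it with ONNX Runtime under the CUDA and TensorRT execution providers; the trtexec invocation is shown as a comment. File names and the trtexec flags are illustrative, not our exact commands.

```python
# Hedged sketch of the four configurations (file names are illustrative).
# Assumes the `ultralytics` and `onnxruntime-gpu` packages are installed.
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

# 1. PyTorch baseline: run the model directly (as in the earlier snippets).
model = YOLO("yolov8s.pt")

# 2./3. Export once to ONNX, then choose an execution provider at load time.
onnx_path = model.export(format="onnx", imgsz=320)

sess_cuda = ort.InferenceSession(onnx_path, providers=["CUDAExecutionProvider"])
sess_trt = ort.InferenceSession(onnx_path, providers=["TensorrtExecutionProvider",
                                                      "CUDAExecutionProvider"])

dummy = np.random.rand(1, 3, 320, 320).astype(np.float32)
input_name = sess_cuda.get_inputs()[0].name
outputs = sess_cuda.run(None, {input_name: dummy})

# 4. Pure TensorRT via trtexec (CLI, run outside Python), for example:
#    trtexec --onnx=yolov8s.onnx --fp16
```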

CPU Inference: Getting the Most from PyTorch and OpenVINO

While GPUs are king, sometimes you're limited to a CPU. We tested both PyTorch and OpenVINO on a 4-vCPU instance to see which framework performed better.

| Framework | vCPUs | Inference FPS |
|---|---|---|
| PyTorch | 4 | 22.73 |
| OpenVINO | 4 | 28.96 |

For CPU-bound inference, a properly configured OpenVINO is the clear winner, delivering a 27% performance improvement over native PyTorch. If you must run on a CPU, optimizing your software stack is critical, and OpenVINO is the right tool for the job.
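Ultralytics can export directly to OpenVINO, so switching frameworks on CPU is a small change. A hedged sketch follows; the test image name is illustrative.

```python
# Hedged sketch: export YOLOv8s to OpenVINO and run it on the CPU.
# Assumes the `ultralytics` and `openvino` packages are installed.
from ultralytics import YOLO

pt_model = YOLO("yolov8s.pt")
ov_dir = pt_model.export(format="openvino", imgsz=320)  # returns the export directory

ov_model = YOLO(ov_dir)                   # Ultralytics loads the OpenVINO IR directly
results = ov_model("bus.jpg", imgsz=320)  # any test image; "bus.jpg" is illustrative
```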

Conclusion & Key Takeaways

  • The GPU Imperative: Even an older data center GPU like the T4 is over 30x faster than a single vCPU for YOLOv8s inference.
  • Quantifying the Tiers: An H100 isn't just faster; it's a different class of machine, offering 7-11x more throughput than capable cards like the RTX 4000 SFF Ada and the L4.
  • Concurrency is King: For smaller models like YOLO, maximizing the performance of modern GPUs requires running multiple workloads in parallel with a tool like NVIDIA MPS.
  • CPU Choice Matters: If you must use a CPU, OpenVINO provides a significant ~27% performance lift over a standard PyTorch setup.

Raw Benchmark Data

NVIDIA H100 (with MPS)

| Concurrent Runs (Splits) | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 1184 | 1184 |
| 2 | 1025 | 2050 |
| 4 | 870 | 3480 |
| 8 | 628 | 5024 |
| 16 | 370 | 5920 |
| 24 | 260 | 6240 |
| 32 | 210 | 6720 |
| 48 | 120 | 5760 |

NVIDIA L4 (g6.xlarge)

| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 151.20 | 607.25 |

NVIDIA A10G (g5.xlarge)

| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 140.34 | 463.46 |

RTX 4000 SFF Ada

| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 145.28 | 527.93 |

RTX 4000 SFF Ada (Concurrency with MPS)

| Concurrent Runs | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 680 | 680 |
| 2 | 450 | 900 |
| 4 | 230 | 920 |

RTX 4060 (100W & 120W)

RTX 4060 100W Framework Comparison (Batch 1, with MPS)

PyTorch

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 250 | 250 |
| 2 | 235 | 470 |
| 4 | 169 | 676 |
| 8 | 100 | 800 |

ONNX Runtime (CUDA)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 330 | 330 |
| 2 | 272 | 544 |
| 4 | 155 | 620 |
| 8 | 85 | 680 |

ONNX Runtime (TensorRT)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 497 | 497 |
| 2 | 355 | 710 |
| 4 | 208 | 832 |
| 8 | 100 | 800 |

TensorRT (trtexec)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 646 | 646 |
| 2 | 392 | 784 |
| 4 | 220 | 880 |
| 8 | 113 | 904 |

RTX 4060 100W (Batch 32, No MPS)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 370 | 740 |
| 4 | 197 | 788 |
| 8 | 101 | 808 |

RTX 4060 100W (Batch 32, with MPS)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 700 | 700 |
| 2 | 426 | 852 |
| 4 | 228 | 912 |
| 8 | 124 | 992 |

RTX 4060 100W (Batch 4, with MPS)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 616 | 616 |
| 2 | 431 | 862 |
| 4 | 240 | 960 |
| 8 | 128 | 1024 |

RTX 4060 100W (Batch 1, with MPS)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 250 | 250 |
| 2 | 235 | 470 |
| 4 | 169 | 676 |
| 8 | 100 | 800 |

RTX 4060 120W (Batch 4, with MPS)

| Concurrency | Avg. FPS per Run | Total Combined FPS |
|---|---|---|
| 1 | 620 | 620 |
| 2 | 440 | 880 |
| 4 | 250 | 1000 |
| 8 | 138 | 1104 |

NVIDIA T4 (g4dn.xlarge)

| Metric | Batch 1 | Batch 32 |
|---|---|---|
| Inference FPS | 130.20 | 360 |

CPU Baseline (c5.xlarge)

| Framework | vCPUs | Inference FPS |
|---|---|---|
| PyTorch | 1 | 11.37 |
| OpenVINO | 1 | 11.32 |
| PyTorch | 4 | 22.73 |
| OpenVINO | 4 | 28.96 |