YOLOv8s Performance Benchmarks: A Data-Driven GPU Comparison

Published on 2025-08-20 by The HoML Team

When evaluating hardware for computer vision tasks, concrete data is often hard to come by. How much faster is a data center GPU like an H100 than a compact workstation card like the RTX 4000 for a real-world workload? Is it worth upgrading from a T4 to an L4? And what is the true performance gap between a GPU and a CPU?

This post aims to provide clear, data-driven answers to these questions. We've benchmarked a range of modern NVIDIA GPUs and CPU configurations on a common object detection task: running a YOLOv8s model with an image size of 320. Our goal is to provide a valuable reference point that is missing from the public domain, helping you make more informed hardware decisions.

All GPU benchmarks were conducted using the PyTorch version of YOLOv8s. The NVIDIA drivers used were version 570 for the EC2 instances (T4, A10G, L4), 565 for the H100, and 575 for the RTX 4000 SFF Ada workstation card. For the CPU, we compare the performance of PyTorch and OpenVINO.
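For reference, here is a minimal sketch of the kind of single-process timing loop behind these numbers, assuming the ultralytics Python package and a CUDA-capable GPU; it is illustrative, not our exact harness:

```python
# Minimal FPS timing sketch for YOLOv8s at imgsz=320 (illustrative, not the exact harness).
import time
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # PyTorch weights
frame = np.random.randint(0, 255, (320, 320, 3), dtype=np.uint8)  # dummy 320x320 image

# Warm up so model loading and CUDA initialization don't skew the timing.
for _ in range(10):
    model.predict(frame, imgsz=320, device=0, verbose=False)

n = 200
start = time.perf_counter()
for _ in range(n):
    # For batch-32 numbers, pass a list of 32 frames instead of a single one.
    model.predict(frame, imgsz=320, device=0, verbose=False)
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.1f} FPS")
```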

Peak Performance: A Hardware Showdown

Let's start with a simple question: what's the fastest hardware for a typical YOLOv8s workload? We ran benchmarks across several NVIDIA GPUs and a multi-core CPU to find the maximum achievable Frames Per Second (FPS).

The data reveals two crucial insights. First, even the slowest GPU we tested (the NVIDIA T4 on AWS g4dn) is over 30 times faster than a single vCPU running an optimized OpenVINO workload (360 FPS vs. ~11 FPS). Second, the data quantifies the massive performance gap between hardware tiers. A top-tier H100, when fully saturated with concurrent workloads, is ~7.3x faster than a highly capable RTX 4000 SFF Ada (a 70W workstation GPU) and ~11x faster than an L4, providing a clear picture of the performance you get for the money.

The Concurrency Advantage: More Throughput on Modern GPUs

Achieving high FPS with a single, batched process is a great start, but what happens when you need to serve multiple video streams or users at once? For smaller models like YOLO on large, modern GPUs, you can often achieve significantly higher total throughput by running multiple processes concurrently.

A single inference process, even with batching, often can't saturate a powerful GPU. The key to unlocking its full potential is to serve multiple workloads in parallel. However, simply running multiple processes that target the same GPU can lead to contention and unpredictable performance.

The solution is NVIDIA's Multi-Process Service (MPS). By default, the GPU time-slices access between processes; MPS instead lets CUDA kernels from different processes execute truly concurrently on the GPU's hardware. This cooperative multitasking avoids context-switching overhead, leading to higher overall throughput and more predictable performance.
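To make this concrete, here is a hedged sketch of how multiple inference workers can be launched under MPS. It assumes the MPS control daemon is already running (started with nvidia-cuda-mps-control -d), and worker.py is a hypothetical script wrapping a timing loop like the one shown earlier:

```python
# Sketch: launch N concurrent inference workers under an already-running MPS daemon.
import os
import subprocess

N = 8  # number of concurrent workers ("splits")

env = os.environ.copy()
# Optionally cap each MPS client at an equal share of the GPU's SMs;
# setting this to 50 yields the "50% compute slices" used on the L4 below.
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(100 // N)

# worker.py is a hypothetical stand-in for the timing loop shown earlier.
workers = [subprocess.Popen(["python", "worker.py"], env=env) for _ in range(N)]
for w in workers:
    w.wait()

# Total combined FPS is the sum of each worker's reported FPS.
```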

Let's look at the NVIDIA H100. A single process achieves an impressive 1,184 FPS. But by using MPS to partition the GPU into independent "splits," we can serve more processes and drive the total throughput way up.

| Concurrent Runs (Splits) | Avg. FPS per Run | Total Combined FPS |
| --- | --- | --- |
| 1 | 1184 | 1184 |
| 2 | 1025 | 2050 |
| 4 | 870 | 3480 |
| 8 | 628 | 5024 |
| 16 | 370 | 5920 |
| 24 | 260 | 6240 |
| 32 | 210 | 6720 |
| 48 | 120 | 5760 |

As the table shows, total throughput on the H100 climbs steadily as concurrency increases, peaking at an incredible 6,720 FPS with 32 concurrent processes before oversubscription drags it back down at 48. This is the key insight for deploying models like YOLO at scale: more concurrency leads to more throughput, as long as it's managed correctly with a tool like MPS.

We saw similar, albeit less dramatic, benefits on the NVIDIA L4. Two concurrent runs without MPS caused contention, with each process achieving ~164 FPS. With MPS isolating the two processes into 50% compute slices, each achieved a stable ~176 FPS, improving total throughput and providing predictable performance.

CPU Inference: Getting the Most from PyTorch and OpenVINO

While GPUs are king, sometimes you're limited to a CPU. We tested both PyTorch and OpenVINO on a c5.xlarge instance, at both one and four vCPUs, to see which framework performed better. The 4-vCPU results:

| Framework | vCPUs | Inference FPS |
| --- | --- | --- |
| PyTorch | 4 | 22.73 |
| OpenVINO | 4 | 28.96 |

For CPU-bound inference, a properly configured OpenVINO runtime is the clear winner, delivering a 27% performance improvement over native PyTorch on four vCPUs. Notably, the raw data below shows the two frameworks essentially tied on a single vCPU (11.37 vs. 11.32 FPS), so OpenVINO's advantage emerges as cores are added. If you must run on a CPU, optimizing your software stack is critical, and OpenVINO is the right tool for the job.
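To try the OpenVINO path yourself, a minimal sketch using the ultralytics export workflow looks like the following; image.jpg is a placeholder for your own test input:

```python
# Sketch: export YOLOv8s to OpenVINO via ultralytics and run it on the CPU.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")
model.export(format="openvino", imgsz=320)  # writes ./yolov8s_openvino_model/

ov_model = YOLO("yolov8s_openvino_model/")  # load the exported model
results = ov_model.predict("image.jpg", imgsz=320, device="cpu", verbose=False)
```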

Conclusion & Key Takeaways

  • The GPU Imperative: Even an older data center GPU like the T4 is over 30x faster than a single vCPU for YOLOv8s inference.
  • Quantifying the Tiers: An H100 isn't just faster; it's a different class of machine, offering 7-11x more throughput than capable mid-range cards like the RTX 4000 SFF Ada and the L4.
  • Concurrency is King: For smaller models like YOLO, maximizing the performance of modern GPUs requires running multiple workloads in parallel with a tool like NVIDIA MPS.
  • CPU Choice Matters: If you must use a CPU, OpenVINO provides a significant ~27% performance lift over a standard PyTorch setup.

Raw Benchmark Data

NVIDIA H100 (with MPS)

| Concurrent Runs (Splits) | Avg. FPS per Run | Total Combined FPS |
| --- | --- | --- |
| 1 | 1184 | 1184 |
| 2 | 1025 | 2050 |
| 4 | 870 | 3480 |
| 8 | 628 | 5024 |
| 16 | 370 | 5920 |
| 24 | 260 | 6240 |
| 32 | 210 | 6720 |
| 48 | 120 | 5760 |

NVIDIA L4 (g6.xlarge)

| Metric | Batch 1 | Batch 32 |
| --- | --- | --- |
| Inference FPS | 151.20 | 607.25 |

NVIDIA A10G (g5.xlarge)

| Metric | Batch 1 | Batch 32 |
| --- | --- | --- |
| Inference FPS | 140.34 | 463.46 |

RTX 4000 SFF Ada

| Metric | Batch 1 | Batch 32 |
| --- | --- | --- |
| Inference FPS | 145.28 | 527.93 |

RTX 4000 SFF Ada (Concurrency with MPS)

| Concurrent Runs | Avg. FPS per Run | Total Combined FPS |
| --- | --- | --- |
| 1 | 680 | 680 |
| 2 | 450 | 900 |
| 4 | 230 | 920 |

NVIDIA T4 (g4dn.xlarge)

| Metric | Batch 1 | Batch 32 |
| --- | --- | --- |
| Inference FPS | 130.20 | 360 |

CPU Baseline (c5.xlarge)

| Framework | vCPUs | Inference FPS |
| --- | --- | --- |
| PyTorch | 1 | 11.37 |
| OpenVINO | 1 | 11.32 |
| PyTorch | 4 | 22.73 |
| OpenVINO | 4 | 28.96 |