When you ask how fast a piece of code is, we might not be able to answer that question directly. It depends on the data and on the metric.
In this article I’ll give an overview of the different ways to describe speed and which metrics are used. I focus on two types of utilisation:
- Transfers. Data-movements through cables, interconnects, etc.
- Processors. Data-processing, with data in and data out.
Both are important for selecting the right hardware. When we help our customers select the best hardware for their software, an important part of our advice is based on these two types of utilisation.
Transfer utilisation: Throughput
How many bytes get processed per second, minute or hour? Often GB/s is used as the metric, but even MB/day is possible. Alternatively, items per second is used when relative speed is discussed. A related word is bandwidth, which describes the theoretical maximum instead of the actual number of bytes being transported.
The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.
It could be that all answers are computed at the end of the batch-process, or that results are given continuously. The throughput is the same, but the so-called latency is very different.
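To make that difference concrete, here is a minimal sketch in Python. The function names and the workload (8 items of 0.5 seconds each) are made up for illustration, and the model simply assumes that items are processed one after another at a constant speed.

```python
def batch_metrics(num_items, seconds_per_item):
    """All results are handed over when the whole batch has finished."""
    total_time = num_items * seconds_per_item
    throughput = num_items / total_time      # items per second
    first_result_after = total_time          # nothing is visible before the end
    return throughput, first_result_after


def streaming_metrics(num_items, seconds_per_item):
    """Each result is handed over as soon as its item has been processed."""
    total_time = num_items * seconds_per_item
    throughput = num_items / total_time      # items per second, same as batch
    first_result_after = seconds_per_item    # first result after one item
    return throughput, first_result_after


# Hypothetical workload: 8 items that each take 0.5 seconds to process.
batch_throughput, batch_latency = batch_metrics(8, 0.5)
stream_throughput, stream_latency = streaming_metrics(8, 0.5)

print(f"batch:     {batch_throughput} items/s, first result after {batch_latency} s")
print(f"streaming: {stream_throughput} items/s, first result after {stream_latency} s")
```

Both variants report the same number of items per second, but the batch variant only delivers its first result once everything is done.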
Transfer utilisation: Latency
What is the time between offering the data and getting the results? Or: what is the reaction time? It is measured in units of time, often nanoseconds (ns, a billionth of a second), microseconds (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, it’s still called latency, but more often it’s called “processing time”.
This is important in streaming applications – think of applications in broadcasting and networking.
There are three causes of latency:
- Reaction time: hardware/software noticing there is a job
- Transport time: it takes time to copy data, especially when we talk GBs
- Process time: computing the data takes time as well
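As a rough sketch, the three causes can be added up in a few lines of Python. The function name, the parameters and all numbers below are hypothetical. The sketch also assumes the three steps happen strictly one after another, while real systems often overlap transport and processing.

```python
def total_latency(reaction_s, data_bytes, bandwidth_bytes_per_s, process_s):
    """Sum the three latency causes, assuming they happen one after another."""
    transport_s = data_bytes / bandwidth_bytes_per_s   # time to copy the data
    return reaction_s + transport_s + process_s


# Hypothetical job: 10 microseconds to notice the job, 1 GB of data over a
# 10 GB/s link, and 50 milliseconds of computation.
latency_s = total_latency(
    reaction_s=10e-6,
    data_bytes=1e9,
    bandwidth_bytes_per_s=10e9,
    process_s=50e-3,
)
print(f"total latency: {latency_s * 1e3:.2f} ms")
```

With these made-up numbers the transport time is the largest part, which is typical once we talk GBs.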
When latency is most important, we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency of switching context to and from the GPU is a lot lower than with discrete GPUs).
Processor utilisation: Throughput
Given the current algorithm, how much potential is left on the given hardware?
The algorithm running on the processor is possibly the bottleneck of the system. The metric we use for this balance is “FLOPS per byte”: the number of compute operations per byte of data that has to be moved. This means that the less data is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.
The image below shows where the above algorithms land on the roofline-model. You see that for many processors you need at least 4 FLOPS per byte to hit the frequency-wall; otherwise you’ll hit the bandwidth-wall.
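This is why HBM (High Bandwidth Memory) is so important: with more memory bandwidth, fewer FLOPS per byte are needed before an algorithm becomes compute-limited.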
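The roofline idea fits in a few lines of Python. This is only a sketch: the function names are made up, and the processor (4 TFLOPS peak compute, 1 TB/s memory bandwidth) is a hypothetical one, chosen so that the turning point lands at the 4 FLOPS per byte mentioned above.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline: performance is capped by compute or by memory bandwidth."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """FLOPS per byte needed before the algorithm becomes compute-limited."""
    return peak_flops / bandwidth_bytes_per_s


# Hypothetical processor: 4 TFLOPS peak compute, 1 TB/s memory bandwidth.
PEAK = 4e12
BANDWIDTH = 1e12

print("ridge point:", ridge_point(PEAK, BANDWIDTH), "FLOPS per byte")
for intensity in (1.0, 4.0, 8.0):
    limit = attainable_flops(PEAK, BANDWIDTH, intensity)
    wall = "bandwidth-wall" if limit < PEAK else "frequency-wall"
    print(f"{intensity} FLOPS/byte -> {limit / 1e12} TFLOPS ({wall})")

# Doubling the bandwidth (for example with faster memory) halves the ridge point.
print("ridge point with 2 TB/s:", ridge_point(PEAK, 2 * BANDWIDTH), "FLOPS per byte")
```

In this model an algorithm at 1 FLOPS per byte reaches only a quarter of the peak, while doubling the memory bandwidth moves the turning point from 4 to 2 FLOPS per byte.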
Processor utilisation: Latency
How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but measured at system level.
For FPGAs this latency can be very low (10s of nanoseconds) when data-cables are directly connected to the FPGA-chip. Such FPGAs are on a board with, for example, a network-port and/or a DisplayPort connector.
GPUs depend on how well they’re connected to the CPU. As this is a subject on its own, I’ll discuss it in another post.
Determining the theoretical speed of a system
A request “Make this software as fast as possible” is a lot easier (and cheaper) to solve than “Make this software as fast as possible on hardware X”. This is because there is no single fastest hardware (even though vendors would have us believe so); there is only the hardware that is best suited to a specific algorithm.
When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge.