Processors that can do 20+ GFLOPS per Watt (2012)

Posted by Vincent Hindriksen on 27 August 2012 with 44 Comments

energy-efficient — System for communicating power-efficiency of new equipment. “A” being best, “F” being worst. 2011-A is incomparable with 2012-A.

For yearly power usage there is a rule of thumb: a device that is continuously on costs its power draw in Watt times 1.5 in Euro per year. So the computer in front of me, which draws around 107 Watt, would cost me about €160 a year if I left it on. A moderate cluster with several GPUs of a few hundred Watt each would cost a few thousand Euros a year. I would say: very doable for most companies.
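The rule of thumb above can be written out as a small calculation. This is a minimal sketch: the function names are made up for illustration, and the electricity tariff is not a quoted price but simply the value the rule of thumb implies (about €0.17 per kWh).

```python
# Rule of thumb from the text: an always-on device costs roughly
# (power draw in Watt) x 1.5 Euro per year.
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years

# Tariff implied by the rule of thumb (an inference, not a quoted price):
# 1 Watt for a year is 8.76 kWh, which should cost 1.5 Euro.
IMPLIED_EURO_PER_KWH = 1.5 * 1000.0 / HOURS_PER_YEAR


def yearly_cost_rule_of_thumb(watt: float) -> float:
    """Yearly cost in Euro using the 'Watt times 1.5' shortcut."""
    return watt * 1.5


def yearly_cost_from_tariff(watt: float, euro_per_kwh: float) -> float:
    """Yearly cost in Euro for an always-on device at a given tariff."""
    kwh_per_year = watt * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * euro_per_kwh


if __name__ == "__main__":
    # The 107 Watt desktop from the text.
    print(f"rule of thumb: {yearly_cost_rule_of_thumb(107):.2f} Euro/year")
    print(f"implied tariff: {IMPLIED_EURO_PER_KWH:.4f} Euro/kWh")
```

With a different local tariff, `yearly_cost_from_tariff` gives the adjusted figure; the factor 1.5 only holds near the implied price.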

So why is performance per Watt important? There is more to a Watt than just the cost. The energy needed to cool a cluster is quite high, as most of the energy escapes as heat. And then there is the increasing demand for portable power. In case you are thinking of swiping your credit card for a top-10 supercomputer, these energy costs are extremely high.

In this article I try to get an overview of who is entering the 20+ GFLOPS/Watt area. Processors that do less than 20 GFLOPS/Watt need other advantages to survive. And you’ll see that all the green processors are programmed with OpenCL, the technology StreamHPC is all about.

IMPORTANT: The stated power sometimes includes and sometimes excludes memory transfers. So the comparison below IS NOT FAIR. The figures for the graphics cards include memory transfers, while those for the CPUs and SoCs do not.

The list

Understand that since I mix CPUs, GPUs and SoCs (= CPU+GPU), the list is really only an indication of what is possible. Also, a computer consists of more energy-consuming parts than just the processors: interconnects, memory, hard drives, etc.

Disclaimer: The list below is incomplete and based on theoretical values. TDP is assumed to be consumed when the processor is working at maximum performance. Actual GFLOPS/Watt values can be much lower, depending on many factors. If you want to buy hardware specifically for the highest GFLOPS/Watt, have your software tested on the device.

Processor	Type	Year	GFLOPS (32bit)	GFLOPS (64bit)	Watt (TDP)	GFLOPS/Watt (32bit)	GFLOPS/Watt (64bit)
Adapteva Epiphany-IV	Epiphany	2013	100	N/A	2	50	N/A
Movidius Myriad	ARM SoC: LEON3+SHAVE	2012	15.28	N/A	0.32	48	N/A
Nvidia GT 630, 2nd revision (GK208)	X86 GPU	2013	692	?	25	27.68	?
AMD Radeon HD 8970M	X86 GPU	2013	2304	144	100?	23	1.44
Nvidia Tesla K10	X86 GPU	2012	4577	190	225	20.34	0.84
ZiiLabs ZMS-40	ARM SoC	2012	58	N/A	?	20?	N/A
ARM + MALI T604	ARM SoC	2012	8 + 68	N/A	4?	19?	N/A
NVidia GTX 690	X86 GPU x 2	2012	5621	234?	300	18.74	0.78
Geforce GTX Titan	X86 GPU	2013	4500	1300	250	18	5.2
GeForce GTX 680	X86 GPU	2012	3090	128	195	15.85	0.65
AMD Radeon HD 7970 GHz	X86 GPU	2012	4300	1075	300+	14.3	3.58
Intel Xeon Phi 5110P	X87	2012	2022	1011	225	8.99	4.49
AMD A10-5800K + HD 7660D	X86 SoC	2012	121 + 614	?	100	7.35	?
Intel Core i7-3770 + HD4000	X86 SoC	2012	225 + 294.4	112 + 73.6	77	6.74	2.41
NVIDIA CARMA (complete board)	ARM + GPU	2012	? + 200	?	40	5.00	?
IBM Power A2	Power CPU	2012	204?	204	55	3.72?	3.72
Intel Core i7-3770	X86 CPU	2012	225	112	?	?	?
AMD A10-5800K	X86 CPU	2012	121	60?	?	?	?

Beware that the list is updated, while this text is not!

The list contains recent and generally available processors, but I will add any processor you want to see in the list – just request it in a comment.

Please also point me to sources where official data can be found on these processors, as it seems to be top-secret data. As not all the data was available, I had to make some guesses.

Below you find a graph of the list, including architectures grouped by GFLOPS + GFLOPS/Watt.

Below is a perhaps more interesting view: Watt/GFLOPS. This projection has the advantage that low-power processors (< 2 Watt) don’t get overrated and are closer together.

Watt/GFLOPS (lower is better) vs GFLOPS, excluding the CPUs. You can see the Radeons doing best when it comes to performance and Watt/GFLOPS. The upper-left area is where we need to go.
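Both projections follow directly from the GFLOPS and TDP columns of the list. The sketch below recomputes them for a few rows copied from the table; the function names are illustrative, and the numbers are the same theoretical peak values the disclaimer warns about.

```python
# (name, GFLOPS 32-bit, TDP in Watt), copied from the list above.
ROWS = [
    ("Adapteva Epiphany-IV", 100.0, 2.0),
    ("Nvidia Tesla K10", 4577.0, 225.0),
    ("GeForce GTX 680", 3090.0, 195.0),
    ("Intel Xeon Phi 5110P", 2022.0, 225.0),
]


def gflops_per_watt(gflops: float, watt: float) -> float:
    """Efficiency: higher is better."""
    return gflops / watt


def watt_per_gflops(gflops: float, watt: float) -> float:
    """Inverse projection: lower is better."""
    return watt / gflops


if __name__ == "__main__":
    # Sort from most to least efficient, as in the list.
    for name, gflops, watt in sorted(
        ROWS, key=lambda r: gflops_per_watt(r[1], r[2]), reverse=True
    ):
        print(
            f"{name:24s} {gflops_per_watt(gflops, watt):6.2f} GFLOPS/Watt "
            f"{watt_per_gflops(gflops, watt):7.4f} Watt/GFLOPS"
        )
```

In the inverse view the 2 Watt Epiphany and the 225 Watt Tesla K10 differ by about 0.03 Watt/GFLOPS instead of 30 GFLOPS/Watt, which is why the low-power parts sit closer to the rest.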

CPU vs GPU

Let’s be clear:

A GPU needs a CPU as a host.
A GPU is great at vector computations; a CPU is much better at scalar computations.

In other words, a mix between a scalar and a vector processor is best. But once a problem can be defined as a vector-problem, the GPU is much, much faster than a CPU.

64 bit vs 32 bit

64-bit values take twice the memory of 32-bit values. As memory usage consumes energy, and the same transfer delivers only half as many values to the processor, there are two reasons why 64-bit computation consumes more energy. Due to architecture differences, CPUs have a penalty for 32 bit and GPUs a penalty for 64 bit.

Notice that most X86 alternatives have no 64-bit support, or only recently started offering it. GPUs crunch double-precision numbers at roughly a fourth or less of their 32-bit performance roof.
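The 64-bit penalty can be read off the list by dividing the 64-bit peak by the 32-bit peak. A minimal sketch, using only rows where both values are given without a question mark; the helper name is illustrative.

```python
# (name, GFLOPS 32-bit, GFLOPS 64-bit), theoretical peaks from the list.
PEAKS = [
    ("Geforce GTX Titan", 4500.0, 1300.0),
    ("GeForce GTX 680", 3090.0, 128.0),
    ("AMD Radeon HD 7970 GHz", 4300.0, 1075.0),
    ("Intel Xeon Phi 5110P", 2022.0, 1011.0),
    ("Intel Core i7-3770 (CPU only)", 225.0, 112.0),
]


def fp64_ratio(gflops_64: float, gflops_32: float) -> float:
    """Fraction of the 32-bit peak that remains at 64 bit."""
    return gflops_64 / gflops_32


if __name__ == "__main__":
    for name, sp, dp in PEAKS:
        print(f"{name:32s} fp64/fp32 = {fp64_ratio(dp, sp):.3f}")
```

The GPUs in this selection land between about 0.04 and 0.29, while the Xeon Phi and the CPU keep about half of their 32-bit peak.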

Architectures

ARM, X86/X87, Power and Epiphany each make different architectural choices to reach their targeted trade-off between precision, power consumption and performance optimisation (control unit). These choices sometimes make it impossible to keep pace with other architectures in a certain direction.

Current winner: Adapteva Epiphany

Their 64-core Epiphany-IV is programmable with OpenCL (buggy compiler though), and at 50 GFLOPS/Watt it is worth investing time in porting software if you need a portable device. People who have already ported their software to OpenCL have an advantage here. Adapteva even claims 72 GFLOPS/Watt, as you can read here. With a 100-core CPU coming up, they will probably raise the bar even further.

X86 CPUs have the advantage of precision and legacy code, of which precision is the biggest advantage. As X86 GPUs (with Nvidia on top) offer great performance per Watt, entering the 20+ GFLOPS/Watt range, this could be very interesting for defending the X86 market against ARM.

ARM processors have a lot of software written for them (via Android) and are very flexible in design, while keeping power usage for the CPU part around 1 Watt. For instance, ZiiLabs’ processor can be compared to the design of Adapteva, but with an ARM CPU attached to it.

Conclusion

There is much more to it than just the number of GFLOPS/Watt, and one can only speculate on which architecture will be mainstream in a few years. Luckily, recompiling for other architectures is getting easier with compiler technologies such as LLVM, so we don’t need to worry too much. Except for redesigning our software for multi-core, of course. You have read above that new architectures are programmed with OpenCL. It is better to invest in this technology now than later.

The LEAP-conference is all about exactly this subject. Meet StreamHPC at this very unique event on 21 May 2013.

More reading

As memory-access takes energy, minimising memory-calls can lower consumption. This article on the ARM blog explains how this is done with MALI GPUs.

The Mont Blanc project is a supercomputer based on ARM. This 12 page PDF shows some numbers and specifications of this supercomputer.

As supercomputers eat lots of power, The Green 500 tries to encourage building greener HPC.

44 thoughts on “Processors that can do 20+ GFLOPS per Watt (2012)”

david moloney 30 August 2012

If you were at HotChips 2011 you would have seen Movidius Myriad which delivers 50GFLOPS/W
- StreamHPC 30 August 2012
  
  Ok, added to the list – I will update the text later. How much GFLOPS does it deliver?
david moloney 30 August 2012

Also Epiphany is shown as ARM based in your table which I’m sure must be a mistake.
- StreamHPC 30 August 2012
  
  Oops, that’s not ARM at all! Thanks for noticing!
  - pip010 3 September 2012
    
    64 RISC units, but not mentioning what. good chance it is ARM!
MySchizoBuddy 30 August 2012

Can you include Tilera chips on the list.
http://www.tilera.com/

I don’t know how many GFlops it delivers
- StreamHPC 30 August 2012
  
  True, not mentioned. I only found http://www.tgdaily.com/business-and-law-features/39408-tilera-goes-pro-with-tilepro64 from 2007 – they do not provide any information on actual performance/Watt anywhere. Or very hidden.
PENG ZHAO 31 August 2012

How about Nvidia Tesla K10 and Geforce GTX 690? I found some figures.
Tesla K10:
Power: 225 W
Single float: 4577 Gigaflops, 20.342 GFlops/W
Double float: 190 Gigaflops, 0.8444 GFlops/W

GTX 690:
Power: 300 W
Single float: 5621Gigaflops, 18.74 GFlops/W
Double float: ?

The single float computation power is impressive, but the double float one is rubbish.
Even worse Nvidia seems to stop the update of their OpenCL implementation.
- StreamHPC 31 August 2012
  
  The GTX 690 is a double GPU, so therefore I chose to put the 680 in the list – maybe good point to add double-GPU cards too.
  
  It seems that my source for the K20 was completely wrong. I’ll update for the K10 for now.
E P 31 August 2012

The table says FLOPS/Watt, instead of GFLOPS/Watt.

Can you, please, include Integer arithmetic?

Depending on floating point (especially 32-bit) is sometimes not an option due to accumulation of errors. So, a lot of integer arithmetic algorithms have been developed. Main point in porting them to OpenCL will be keeping the integer arithmetic calculations. And, that becomes even more important having in mind that a lot of devices increase performance and/or decrease power consumption at the expense of accuracy.
- StreamHPC 31 August 2012
  
  It was extremely difficult (and exhausting) to find the data already in the list. I will therefore focus on what is already there and try to complete the list for just 32-bit and 64-bit (be it floats or integers).
  
  The trade-off between precision and the other aspects of computing is an interesting subject though.
rahul garg 31 August 2012

Corrections:

1. The 3770K’s peak (CPU-only) is about 225 GFlops (at base frequency, with turbo slightly higher).

2. Knight’s corner has fp32 at twice the rate of fp64. So I expect 2 teraflop for fp32 for knights corner.

3. 3770K CPU-only fp64 peak is half of fp32 peak = 112 gflops.
rahul garg 31 August 2012

Another correction: HD 4000 on 3770K has fp64 peak of 73.6 gflops.
- StreamHPC 1 September 2012
  
  Thanks for all the feedback! You’re great! Together we can make the picture.
Stuart 13 September 2012

http://www.kalray.eu/en/technology/mppa-256.html seems to be a good performer but no OpenCL support.
- StreamHPC 20 March 2013
  
  They have their own compilers, but it would be a good choice for Kalray to start supporting OpenCL besides their own.
Mxgolfcpu 18 September 2012

Since the GPU for AMD APUs can be programmed through OpenCL, the FLOPs of the integrated GPU should also be considered. You can calculate the GFLOPs for the AMD APUs through GFLOPs = (# of x86 cores * (128 bit (FPUs) / 32 bit (SP Operation)) * CPU Frequency) + (# of shader units * (64 bit (shader) / 32 bit (SP Operation)) * GPU Frequency)

For the 3.8GHz A10-5800K with 800MHz GPU that is 675.2 GFLOPs for the APU by itself

You should be able to do similar calculations for the other devices
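The formula in this comment can be sketched as follows. The comment does not state the core or shader counts; the 4 x86 cores and 384 shader units used below are assumptions, chosen because they reproduce the 675.2 GFLOPs quoted for the A10-5800K. Function and parameter names are illustrative.

```python
def peak_gflops(units: int, flops_per_cycle: float, ghz: float) -> float:
    """Theoretical peak: execution units x operations per cycle x clock."""
    return units * flops_per_cycle * ghz


def apu_peak_gflops(
    cpu_cores: int, cpu_ghz: float, shader_units: int, gpu_ghz: float
) -> float:
    """Single-precision peak of an APU, following the comment's formula."""
    # CPU part: 128-bit FPU width / 32-bit operands = 4 operations per cycle.
    cpu = peak_gflops(cpu_cores, 128 / 32, cpu_ghz)
    # GPU part: 64 bit per shader / 32-bit operands = 2 operations per cycle.
    gpu = peak_gflops(shader_units, 64 / 32, gpu_ghz)
    return cpu + gpu


if __name__ == "__main__":
    # Assumed configuration: 4 cores at 3.8 GHz, 384 shaders at 0.8 GHz.
    total = apu_peak_gflops(cpu_cores=4, cpu_ghz=3.8, shader_units=384, gpu_ghz=0.8)
    print(f"{total:.1f} GFLOPs")
```

Note that this counts 4 operations per cycle per CPU core, which gives 60.8 GFLOPs for the CPU part; the 121 GFLOPS in the list corresponds to twice that.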
- StreamHPC 20 September 2012
  
  I have chosen to only use the officially given GFLOPS and Watt as told by the vendor. This will make sense later.
Bharat 20 September 2012

does the GTX 690 give 5621GFlops/s? I tried out N-body simulation in gpu computing sdk and it was giving 600GFlops/s with one card, which is far far less than 2800GFlops/s…is something wrong with my card
- Sean Happe 21 December 2012
  
  That’s the thing about GPUs.. they’re fragile.. only an exceedingly small number of applications match the structure of the hardware.. most applications don’t match, so they don’t get anywhere near the peak performance. In contrast, for CPUs, most applications get a high percentage of the peak. Normally, divide GPU peak by 5 and that’s the maximum you can expect from a well tuned application.. Unfortunately, the power will not go down in proportion to delivered GFLOPs.. you’ll still burn a high percent of the peak power, but won’t get much of the performance..
  - Donald Becker 27 December 2012
    
    A wide range of applications work reasonably well on GPGPUs, and a few commercially important ones work quite well. Just as with commodity clusters, the set of addressable applications increased well beyond what was initially expected. (You pushed a hot button: the claim of an “exceedingly small number of applications” echoed what was loudly repeated two decades ago about clusters, just as they were becoming viable.)
    
    I certainly agree that GPGPUs typically get a lower percentage of peak. But it’s not that low with tuned code, and it’s more than offset by an astonishingly high peak number.
    
    Recent GPGPUs have sophisticated internal power management and use power proportional to the work they do.
    
    Disclosure: I work on the CARMA project at NVIDIA, although it’s a small proportion of almost three decades of working with parallel and cluster systems.
MySchizoBuddy 24 September 2012

leaked specs of AMD 8870 shows 24 Gflops/W that’s quite an impressive increase from the 7870 which does 12.8 Gflops/W

http://www.extremetech.com/extreme/136636-amd-hd-8000-specs-leak-point-to-major-performance-boost
Michael 16 December 2012

How is it that the top rated green500 is only 2Gflops/W, yet this article speaks of 20+ being the norm?
- StreamHPC 16 December 2012
  
  Good point! The Green500 describes the whole system, whereas the list above covers solely the processor (and sometimes processor + memory). So the Green 500 includes all memory, hard disks, network, case cooling, etc.
  
  But this is exactly why it is quite difficult to compare processors. For example, GPUs have their own memory, meaning that the CPUs perform worse in comparison. On the other hand, GPUs need a CPU to operate.
  
  If you want to know what is best for you, focus on the system you have in mind – as there is no “the norm”. A cluster is incomparable to a 3-GPU-desktop or a powerful tablet.
Leandro 20 March 2013

what’s the frequency?? there is no point in just having 50 GFLOPS/W if it takes a whole life to get these GFLOPS done!
- StreamHPC 20 March 2013
  
  I suggest you read http://en.wikipedia.org/wiki/FLOPS and https://streamhpc.com/knowledge/what-is/opencl/
jipe4153 18 June 2013

The new Nvidia chip GK208 runs at 692 GFLOPS @ 25 watts which yields roughly 27.5 GFLOPS / watt [1].

Also, you do know that the numbers are counted without the memories included, so the performance numbers are a bit irrelevant…

[1] http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#GeForce_600_Series
- StreamHPC 18 June 2013
  
  Thanks for the suggestion. I could find the TDP of 25W at http://www.geforce.com/hardware/desktop-gpus/geforce-gt-630/specifications
  But the GFLOPS of the “GT 630, 2nd revision” I could not find anywhere, only sites citing each other.
  - jipe4153 18 June 2013
    
    Zotac:
    http://webcache.googleusercontent.com/search?q=cache:http://www.zotacusa.com/specsheet/ZT-60408-20L.pdf
    
    http://www.newegg.com/Product/Product.aspx?Item=N82E16814500305
    
    http://www.newegg.com/Product/Product.aspx?Item=N82E16814500304
  - jipe4153 18 June 2013
    
    Take core clock times number of cores * 2 (for FMA instructions)
    
    Hence:
    
    0.902 GHz * 384 cores * 2 => 692 GFLOPS
    
    The core clock is well documented.
  - StreamHPC 19 June 2013
    
    Added it. Many thanks for the tip!
  - jipe4153 19 June 2013
    
    hth!
jipe4153 18 June 2013

And btw Knights Corner is rated at 300 watt…
- StreamHPC 19 June 2013
  
  225 officially. You have other sources?
  - jipe4153 19 June 2013
    
    Yes the 5110P is indeed 225 watt (not 200 which was previously there)
    
    I was thinking of their high end 7120X which is rated at 300 watt [1].
    
    I have however seen documents and reports where the 5110P was running at closer to ~240 watt, so their 225 watt is not MAX TDP as is used by Nvidia and AMD. The 225 watt is a marketing number…
    
    [1] http://exxactcorp.com/index.php/product/prod_detail/488?utm_content=jimmy.pettersson%40hpcsweden.se&utm_source=VerticalResponse&utm_medium=Email&utm_term=Intel%20Xeon%20Phi%207120X&utm_campaign=The%20New%20Intel%20Xeon%20Phi%207100%20and%203100%20Series%20Coprocessorscontent
desi 30 July 2013

For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
Add Intel® Core™ Host Processor  >1TFlop!

Watts is either 65 or 47, depending on mobile or desktop cpu.

http://www.khronos.org/assets/uploads/developers/library/2013-siggraph-opencl-bof/OpenCL-Intel-BOF_SIGGRAPH-2013.pdf
jipe4153 14 January 2014

Hi Vincent,

The Tegra K1 GPU does 365 GFLOPS @ 5 watts (whole SoC TDP) and it’s already shipping to oems.

This would place it at ~73 GFLOPS / watt for the whole SoC (excluding the arm core FLOPS).

This is likely the most efficient general purpose processor on this planet…

Regards
- StreamHPC 14 January 2014
  
  True. I put up a new article specially for the 2014 mobile GPUs. Because there the fight is on bandwidth rather than compute power. The GPU is very interesting, though!
  - Ibrahim Awwal 29 September 2015
    
    I’d be interested in this article or a newer one, but I can’t find it on this site. Could you link to it if you remember which one it is? Thanks!
Krishnaraj 24 February 2014

Curious about how you got numbers for Mali-T604. kyokojap.myweb.hinet.net/gpu_gflops/ says either 17, 34 or 81 gflops
jim 29 May 2014

OK, so a given processor is capable of a certain number of GFLOPS per Watt. If you are coding an algorithm that you expect to use 1/10th or 1/100th of its capacity, I am guessing that this doesn’t mean you only need 1/10th or 1/100th of the power to run that, right? How do you figure out the actual power requirements of your algorithm on a particular processor a priori from the manufacturer’s specifications? Is there a curve or closed form expression somewhere for these things?
kalvdans 15 January 2015

The PEZY-SC chip makes 3.0TFlops[1] single-precision with 90W[2] which gives 33 GFlops/W. Nice to see another small player in this market. Third place on the green500 list.

[1] http://pezy.co.jp/en/products/pezy-sc.html
[2] https://twitter.com/dadeba/status/534820796926271488