Still waiting for Exascale: Japan’s Fugaku outperforms all competition once again

FRANKFURT, Germany; BERKELEY, Calif.; and KNOXVILLE, Tenn.—The 58th edition of the TOP500 saw little change in the Top10. The Microsoft Azure system called Voyager-EUS2 was the only machine to shake up the top spots, claiming No. 10. Voyager-EUS2 is based on 48-core AMD EPYC processors running at 2.45 GHz, working together with NVIDIA A100 GPUs with 80 GB of memory each, and it utilizes a Mellanox HDR InfiniBand network for data transfer.

While there were no other changes to the positions of the systems in the Top10, Perlmutter at NERSC improved its performance to 70.9 Pflop/s. Housed at the Lawrence Berkeley National Laboratory, Perlmutter's increased performance couldn't move it from its previously held No. 5 spot.

Fugaku continues to hold the No. 1 position that it first earned in June 2020. Its HPL benchmark score is 442 Pflop/s, three times the performance of Summit at No. 2. Installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan, it was co-developed by RIKEN and Fujitsu and is based on Fujitsu's custom ARM A64FX processor. Fugaku also uses Fujitsu's Tofu D interconnect to transfer data between nodes.

In single or further-reduced precision, which is often used in machine learning and AI applications, Fugaku has a peak performance above 1,000 Pflop/s (1 Exaflop/s). As a result, Fugaku is often introduced as the first “Exascale” supercomputer.
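
To make the precision trade-off concrete, here is a minimal NumPy sketch (purely illustrative, not how Fugaku is benchmarked) showing that half-precision values occupy a quarter of the storage of double-precision ones, at the cost of a small rounding error that machine learning workloads typically tolerate:

```python
import numpy as np

# FP16 stores each value in 2 bytes versus 8 for FP64, so hardware can
# move and process far more of them per second; the price is rounding error.
x64 = np.random.rand(1000)            # float64 by default
x16 = x64.astype(np.float16)          # reduced-precision copy

print(np.float16(0).nbytes, "bytes per FP16 value,",
      np.float64(0).nbytes, "bytes per FP64 value")
print("max rounding error:", np.abs(x64 - x16.astype(np.float64)).max())
```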

While there were also reports of several Chinese systems reaching Exaflop-level performance, none of these systems submitted an HPL result to the TOP500.

Here’s a sum­ma­ry of the sys­tems in the Top10:

  • Fugaku remains the No. 1 system. It has 7,630,848 cores, which allowed it to achieve an HPL benchmark score of 442 Pflop/s, putting it 3x ahead of the No. 2 system on the list (a quick per-core calculation follows this list).
  • Summit, an IBM-built system at the Oak Ridge National Laboratory (ORNL) in Tennessee, USA, remains the fastest system in the U.S. and holds the No. 2 spot worldwide. It has a performance of 148.8 Pflop/s on the HPL benchmark, which is used to rank the TOP500 list. Summit has 4,356 nodes, each housing two Power9 CPUs with 22 cores each and six NVIDIA Tesla V100 GPUs, each with 80 streaming multiprocessors (SMs). The nodes are linked together with a Mellanox dual-rail EDR InfiniBand network.
  • Sierra, a system at the Lawrence Livermore National Laboratory, CA, USA, is at No. 3. Its architecture is very similar to that of the No. 2 system, Summit. It is built with 4,320 nodes, each with two Power9 CPUs and four NVIDIA Tesla V100 GPUs. Sierra achieved 94.6 Pflop/s.
  • Sunway TaihuLight, a system developed by China's National Research Center of Parallel Computer Engineering & Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, in China's Jiangsu province, is listed at the No. 4 position with 93 Pflop/s.
  • Perlmutter at No. 5 was newly listed in the Top10 last June. It is based on the HPE Cray “Shasta” platform and is a heterogeneous system with AMD EPYC-based nodes and 1,536 NVIDIA A100-accelerated nodes. Perlmutter improved its performance to 70.9 Pflop/s.
  • Selene, now at No. 6, is an NVIDIA DGX A100 SuperPOD installed in-house at NVIDIA in the USA. The system is based on AMD EPYC processors with NVIDIA A100 GPUs for acceleration and a Mellanox HDR InfiniBand network. It achieved 63.4 Pflop/s.
  • Tianhe-2A (Milky Way-2A), a system developed by China's National University of Defense Technology (NUDT) and deployed at the National Supercomputer Center in Guangzhou, China, is now listed as the No. 7 system with 61.4 Pflop/s.
  • A system called “JUWELS Booster Module” is No. 8. The BullSequana system, built by Atos, is installed at the Forschungszentrum Juelich (FZJ) in Germany. The system uses AMD EPYC processors with NVIDIA A100 GPUs for acceleration and a Mellanox HDR InfiniBand network, similar to the Selene system. It is the most powerful system in Europe, with 44.1 Pflop/s.
  • HPC5 at No. 9 is a PowerEdge system built by Dell and installed by the Italian company Eni S.p.A. It achieves a performance of 35.5 Pflop/s thanks to its NVIDIA Tesla V100 accelerators and Mellanox HDR InfiniBand network.
  • Voyager-EUS2, a Microsoft Azure system installed at Microsoft in the U.S., is the only new system in the Top10. It achieved 30.05 Pflop/s and is listed at No. 10. Its architecture is based on 48-core AMD EPYC processors running at 2.45 GHz, working together with NVIDIA A100 GPUs with 80 GB of memory each and utilizing a Mellanox HDR InfiniBand network for data transfer.
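
As a back-of-the-envelope check on the Fugaku figures above, dividing the HPL score by the core count gives the average sustained rate per core:

```python
# Fugaku's HPL result spread across its cores (figures from the list above).
rmax = 442e15                 # 442 Pflop/s expressed in flop/s
cores = 7_630_848
print(f"{rmax / cores / 1e9:.1f} Gflop/s per core")   # prints 57.9
```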

Other TOP500 highlights 

While there were not many changes to the Top10, we did see a smattering of shifts within the Top15. The new Voyager-EUS system from Microsoft followed its sibling into the No. 11 spot, while the SSC-21 system from Samsung introduced itself to the list at No. 12. Polaris, also a new system, came in at No. 13, while the new CEA-HF took No. 15.

As on the last list, AMD processors are seeing a lot of success. Frontera, which has a Xeon Platinum 8280 processor, got bumped by Voyager-EUS2, which has an AMD EPYC processor. What's more, all of the new Top15 machines described above have AMD processors.

Unsurprisingly, systems from China and the USA dominated the list. While China dropped from 186 systems to 173, the USA increased from 123 machines to 150. All told, these two countries account for nearly two-thirds of the supercomputers on the TOP500.

The new edition of the list didn't showcase much change in terms of system interconnects. Ethernet still dominated at 240 machines, while InfiniBand accounted for 180. Omnipath interconnects took 40 spots on the list, 34 systems used custom interconnects, and only six used proprietary networks.

Green500 results 

The system to claim the No. 1 spot on the Green500 was MN-3 from Preferred Networks in Japan. Relying on the MN-Core chip, an accelerator optimized for matrix arithmetic, this machine was able to achieve an incredible power efficiency of 39.38 gigaflops/watt. It delivered 29.7 gigaflops/watt on the last list, clearly showcasing some impressive improvement. It also enhanced its standing on the TOP500 list, moving from No. 337 to No. 302.
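
The Green500 metric is sustained HPL performance divided by the average power drawn during the run. A quick calculation from the two efficiency figures reported above shows the size of MN-3's jump between lists:

```python
# gigaflops/watt = HPL Rmax (in Gflop/s) / average power (in watts).
# MN-3's efficiency on the previous list versus this one:
previous, current = 29.7, 39.38
print(f"{(current / previous - 1) * 100:.1f}% more work per watt")   # ~32.6%
```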

The new SSC-21 Scalable Module, an HPE Apollo 6500 system installed at Samsung Electronics in South Korea, achieved an impressive 33.98 gigaflops/watt. It did so by submitting a power-optimized run of the HPL benchmark. It is listed at position 292 in the TOP500.

NVIDIA installed a new liquid-cooled DGX A100 prototype system called Tethys. With a power-optimized HPL run, Tethys achieved 31.5 gigaflops/watt and garnered the No. 3 spot on the Green500. It is listed at position 296 in the TOP500.

The Wilkes-3 system improved its results but was still pushed down to the No. 4 spot on the Green500. Wilkes-3, which is housed at the University of Cambridge in the U.K., had a power efficiency of 30.8 gigaflops/watt. However, it was pushed from No. 100 to No. 281 on the TOP500 list.

The University of Florida in the USA, with its HiPerGator AI system, was pushed from the No. 2 spot to the No. 5 spot. This machine held steady at 29.52 gigaflops/watt. This NVIDIA system has 138,880 cores and relies on the AMD EPYC 7742 processor. Despite this impressive performance, HiPerGator AI was pushed from No. 22 to No. 31 on the TOP500.

HPCG Results 

The TOP500 list has incorporated the High-Performance Conjugate Gradient (HPCG) benchmark results, which provide an alternative metric for assessing supercomputer performance and are meant to complement the HPL measurement.
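
As a rough illustration of the algorithm behind the benchmark, here is a plain conjugate gradient solver in NumPy. The real HPCG benchmark runs a preconditioned CG on a huge sparse system derived from a 3D stencil, so this dense toy version only sketches the idea:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Solve Ax = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x                      # residual
    p = r.copy()                       # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # next conjugate direction
        rs_old = rs_new
    return x

# Tiny SPD test system standing in for HPCG's sparse stencil matrix.
n = 200
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)            # guaranteed symmetric positive-definite
b = np.random.rand(n)
x = conjugate_gradient(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))   # ~1e-8 or smaller
```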

The HPCG results here are very similar to the last list. Fugaku was the clear winner with 16.0 HPCG-petaflops, while Summit retained its No. 2 spot with 2.93 HPCG-petaflops. Perlmutter, a USA machine housed at Lawrence Berkeley National Laboratory, took the No. 3 spot with 1.91 HPCG-petaflops.

HPL-AI Results 

The HPL-AI benchmark seeks to highlight the convergence of HPC and artificial intelligence (AI) workloads based on machine learning and deep learning by solving a system of linear equations using novel, mixed-precision algorithms that exploit modern hardware.
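
A minimal sketch of that idea in NumPy: factor and solve in low precision, then iteratively refine the answer against the full-precision residual. The actual HPL-AI implementation uses FP16 and reuses its LU factors on real hardware; this simplified version uses FP32 and re-solves at each step for brevity:

```python
import numpy as np

def mixed_precision_solve(A, b, refinements=5):
    """Solve Ax = b cheaply in float32, then refine toward float64 accuracy."""
    A32, b32 = A.astype(np.float32), b.astype(np.float32)
    x = np.linalg.solve(A32, b32).astype(np.float64)    # low-precision solve
    for _ in range(refinements):
        r = b - A @ x                                   # residual in FP64
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += d.astype(np.float64)
    return x

# Hypothetical demo system: well-conditioned, so refinement converges fast.
n = 300
A = np.random.rand(n, n) + n * np.eye(n)
b = np.random.rand(n)
x = mixed_precision_solve(A, b)
print("residual norm:", np.linalg.norm(A @ x - b))      # near FP64 accuracy
```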

Achieving an HPL-AI benchmark result of 2 Exaflops, Fugaku is leading the pack in this regard. With such excellent metrics year after year, and considered by many to be the first “Exascale” supercomputer, Fugaku is clearly an exciting system.