Today's mainstream supercomputers are "PC clusters"—in other words, a multitude of PC servers equipped with general-purpose CPUs that are connected together to form a cluster. As the cluster scale increases, the speed and stability of each PC (node) and the network that connects an enormous number of nodes become problematic. A Fujitsu research team tackled this problem to develop Oakforest-PACS, a supercomputer jointly introduced by the University of Tokyo and the University of Tsukuba that successfully demonstrated globally top-tier performance. In recognition of this large-scale PC cluster configuration technology, the team received the prize for Science and Technology from Japan's Minister of Education, Culture, Sports, Science, and Technology in April 2018. In this interview, Kohta Nakashima talks about what went on behind the scenes to make the impossible possible as well as his passion for research.
Combining Small Computers for High Performance
Nakashima: "Originally, PC clusters were attempts to build a supercomputer by connecting a multitude of personal computers with one another. Despite having 'PC' in the name, today's PC clusters basically use many commodity servers, which are cost-efficient servers with general-purpose processors that are in common use. So, a PC cluster is actually a large array of servers with slightly better specs than standard PCs.
Each such server is referred to as a compute node. For example, this (image above) is a server, or compute node, with Intel CPUs. One box contains eight nodes. The Oakforest-PACS PC cluster installed at the University of Tokyo and the University of Tsukuba has 8,208 nodes. There are stacks of about a thousand of these boxes. Our technology's key point is how we bring out performance by combining these servers developed by the Business Division."
A Research Career Tracking the History of PC Clusters
"PC clusters became a hot research topic in the latter half of the 1990s. At that time, I was a university student and had a great interest in PC cluster technologies that improved performance by adding more computers. In the lab, I researched software technology and was attempting to find a way to transfer external data coming in over a network to memory as fast as possible.
When I joined Fujitsu Laboratories around 2002, use of technologies for high-speed computer connections (InfiniBand and Myrinet) had begun to spread. As PC clusters rapidly became the mainstream method for building supercomputers, the RSCC PC cluster that Fujitsu delivered to Riken was ranked seventh globally among all supercomputers. As a new recruit, I marveled at this technology while watching my senior colleagues carry out the project."
On-Site Problems Give Birth to New Technologies
"When building a large-scale cluster of general-purpose servers, the quality of performance inevitably varies from server to server. Among a thousand servers, several will be defective. It is essential to check how each component works, what performance it provides, and what is slowing down the cluster. Moreover, faults occur sporadically rather than constantly, so it is difficult to determine which component is malfunctioning. When a large-scale cluster has an intricate configuration, determining how to control it is also problematic. We must devise a method to configure a system that improves performance when the entire system performs parallel computing. These problems of ours led to the development of performance analysis and network control technologies."
Challenges to Building Oakforest-PACS: Reality Diverges from Theory
"The more advanced the components used, the more difficult it is to make them function together and bring out their true performance. In addition, in larger systems, minor problems tend to impair the functionality and performance of the entire system. Oakforest-PACS, a PC cluster that commenced operation in 2016, was the most difficult PC cluster we had built so far.
The cluster uses Intel Xeon Phi, which was then the latest many-core processor; this was a challenging processor to employ because it has up to 68 cores, whereas current mainstream processors have at most 20. In addition, its unconventional architecture features two types of memory: high-bandwidth memory and high-capacity memory.
Making full use of this processor was Fujitsu's role. We evaluated the processor's characteristics and prepared for configuration with the problem of scaling up in mind. However, when we actually connected the 8,208 nodes, the cluster did not work at all, contradicting our theory! Problems we had never expected emerged. We had to consult with the SEs and CEs responsible for building the cluster, locate the apparent causes of the problem, actually isolate the causes using analytics technology, and thus build the cluster alongside the Business Division, which developed the servers, and Intel, which designed the processor. The key to resolving the problems was to find the best way to analyze the situation and to organize information for better understanding."
The Key to Solving Unknown, Difficult Problems Lies in Past Experience
"When an unanticipated problem arises, the solution must be found on a trial-and-error basis, including the analysis method. We could overcome such difficulties because of the technologies we have accumulated. For example, drawing on our past development experience in major projects and our expertise in analyzing networks of actual systems, we created software to isolate problems that occurred in the Oakforest-PACS cluster while it was under development and to analyze them on the fly.
Even though the situation became so hopeless at one point that we lost confidence that we could carry out the project successfully, we strengthened our resolve to overcome the obstacles no matter what. Since we, the research team, had placed our bets on this project, we were united in our desire to advance it and had to share this determination with the other sections. By acting on my own initiative amidst hopeless circumstances, I think I was able to contribute meaningfully to our unified efforts to move the project forward."
All Team Members Cheer across Japan and the US
"The semiannual TOP500 list ranks the world's top 500 supercomputers, and we are motivated to get ours on the list. Oakforest-PACS ranked sixth globally in November 2016.
When we achieved the target performance for the first time after fully executing the program, all of our several dozen members, who were in the Kashiwa campus where Oakforest-PACS is housed, in the research laboratory and office in Kawasaki, at home, and in the Intel offices on the east and west coasts of the US, were on the phone together. When notified of the results, we were all excited, and I heard thunderous applause from the US. That was the best moment."
Part 2 of this interview concerns a team that creates innovations and Nakashima's current research on AI.
Project Director, Advanced Computer Systems Project, Computer Systems Laboratory, Fujitsu Laboratories Ltd.
Kohta Nakashima graduated from the Department of Electrical Engineering and Computer Science of Kyushu University's School of Engineering in 2000 and earned his master's degree there in 2002.
That same year, he joined Fujitsu Laboratories Ltd.
He has since engaged in research and development of technologies to build faster PC clusters and to control high-speed networks.
He obtained his Ph.D. in engineering from Okayama University and received a prize along with four other researchers in the category of "Development of Configuration Technology for Large-Scale Clusters" in the 2018 Commendation for Science and Technology from Japan's Ministry of Education, Culture, Sports, Science, and Technology.