A Parallel Graph Environment for Real-World Data Analytics Workflows
Vito Giovanni Castellana1,a, Maurizio Drocco1,b, John Feo1,c, Jesun Firoz2,k, Thejaka Kanewala2,l, Andrew Lumsdaine1,d, Joseph Manzano1,e, Andrés Marquez1,f, Marco Minutoli1,g, Joshua Suetterlein1,h, Antonino Tumeo1,i and Marcin Zalewski1,j
1High Performance Computing Pacific Northwest National Laboratory Richland, WA, USA
avitoGiovanni.castellana@pnnl.gov
bmaurizio.drocco@pnnl.gov
cjohn.feo@pnnl.gov
dandrew.lumsdaine@pnnl.gov
ejoseph.manzano@pnnl.gov
fandres.marquez@pnnl.gov
gmarco.minutoli@pnnl.gov
hjoshua.suetterlein@pnnl.gov
iantonino.tumeo@pnnl.gov
jmarcin.zalewski@pnnl.gov
2School of Informatics, Computing, and Engineering Indiana University Bloomington, IN, USA
kjsfiroz@iu.edu
ljthejkane@iu.edu
ABSTRACT
Economic competitiveness and national security depend increasingly on the insightful analysis of large data sets. The diversity of real-world data sources and analytic workflows imposes challenging hardware and software requirements on parallel graph platforms. The irregular nature of graph methods is poorly served by the deep memory hierarchies of conventional distributed systems, and requires new processor and runtime system designs that tolerate memory and synchronization latencies. Moreover, the efficiency of relational table operations and matrix computations is not attainable when data is stored in common graph data structures. In this paper, we present HAGGLE, a high-performance, scalable data analytics platform. The platform’s hybrid data model supports a variety of distributed, thread-safe data structures, parallel programming constructs, and persistent and streaming data. An abstract runtime layer enables us to map the stack to conventional distributed computer systems with accelerators. The runtime uses multithreading, active messages, and data aggregation to hide memory and synchronization latencies on large-scale systems.
Keywords: Graph Analytics, Attributed Graphs