Architectural Improvements for GPGPU Platforms

Figure: GeForce GTX 980 block diagram.
A little over a decade ago, graphics processing units (GPUs) were fixed-function processors built around a pipeline dedicated to rendering 3-D graphics. As their potential for massive compute parallelism became apparent, GPU vendors such as NVIDIA and AMD made these platforms programmable, and programming languages such as CUDA and OpenCL emerged to enable general-purpose computation on GPUs. As a result, general-purpose GPUs (GPGPUs) have become increasingly popular across almost all market segments (datacenters, supercomputers, desktop/server machines, embedded systems, etc.). The tremendous horsepower of GPGPUs comes from their ability to efficiently deliver massive data/thread-level parallelism, floating-point execution, and fast Single-Instruction Multiple-Data (SIMD) processing. Compared to conventional CPU architectures (multi/many-core chip multiprocessors and uniprocessor systems), the GPGPU transistor budget is devoted mainly to compute resources, which keeps the control logic simple (e.g., GPGPUs forgo out-of-order and speculative execution). This, in turn, limits the class of general-purpose applications that can fully exploit GPGPU platforms: to perform well, programmers must distribute the data to be processed in parallel with high uniformity and regularity, keep control flow regular (e.g., predictable loops), and use data access patterns that exploit the high off-chip memory bandwidth (e.g., coalesced memory accesses). Consequently, current GPGPU designs pose a barrier to migrating a wider range of general-purpose applications onto the GPGPU, namely those featuring irregular memory access patterns, fine-grained synchronization between threads, or irregular control flow, as graph-based parallel applications (IRREGULAR-APPs) exhibit.
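To make this contrast concrete, the following CUDA sketch (hypothetical kernels written for illustration, not code from the group) places a regular, GPGPU-friendly kernel next to an irregular one: in the first, consecutive threads touch consecutive elements, so a warp's accesses coalesce into a few wide memory transactions; in the second, a data-dependent index and branch scatter the accesses and make threads of the same warp diverge.

#include <cstdio>
#include <cuda_runtime.h>

// Regular kernel: uniform control flow and coalesced accesses
// (thread i reads/writes element i), the pattern GPGPUs are built for.
__global__ void saxpy_coalesced(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Irregular kernel in the spirit of IRREGULAR-APPs: the indirection idx[]
// scatters a warp's loads across memory (uncoalesced) and the data-dependent
// branch causes warp divergence.
__global__ void gather_irregular(int n, const int *idx, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = idx[i];          // data-dependent, scattered access
        if (x[j] > 0.0f)         // data-dependent branch -> divergence
            y[i] = 2.0f * x[j];
        else
            y[i] = 0.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y; int *idx;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; idx[i] = (i * 7919) % n; }

    dim3 block(256), grid((n + block.x - 1) / block.x);
    saxpy_coalesced<<<grid, block>>>(n, 2.0f, x, y);
    gather_irregular<<<grid, block>>>(n, idx, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x); cudaFree(y); cudaFree(idx);
    return 0;
}

On current hardware the second kernel typically runs far below peak memory bandwidth; that gap is precisely what the architectural techniques described below aim to close.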

Prof. Jose L. Abellan, his colleagues at the BIO-HPC research group, and other external collaborators focus their research on three main lines. First, they are investigating novel architectural improvements that can consolidate the GPGPU as a mainstream platform for all kinds of general-purpose parallel applications. To meet this ambitious and challenging goal, they are exploring GPGPU-specific hardware techniques such as optimized synchronization, improved cache coherence, optimized SIMD execution, more efficient task/warp scheduling, and more advanced control logic to speed up the execution of IRREGULAR-APPs on the GPGPU platform. Second, since the BIO-HPC research group has recognized expertise in developing emerging Bioinformatics applications (BIOAPPs) for high-performance computing (HPC) systems, they are investigating modifications to the GPGPU architecture that remove the major performance bottlenecks limiting speedups when running BIOAPPs on the GPGPU platform. To that end, they start with a comprehensive on-line/off-line characterization of these GPGPU workloads to gain insight into potential architectural improvements; this strategy of tailoring the GPGPU architecture to particular application domains is also being extended to other HPC applications. Finally, the third research line consists of enhancing the energy efficiency of GPGPU platforms. To this end, novel GPGPU-specific thermal and power management techniques are being explored, together with state-of-the-art technologies such as nano-photonics and 3-D stacking.
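As a minimal illustration of the on-line characterization step in the second research line (a sketch under our own assumptions; the kernel and the metric are illustrative, not the group's actual tooling), the CUDA program below times a histogram kernel whose per-bin atomics are a classic source of fine-grained synchronization and contention on GPUs, and reports elapsed time and effective read bandwidth using CUDA events; off-line analysis would complement this with a profiler or a cycle-level simulator.

#include <cstdio>
#include <cuda_runtime.h>

#define BINS 256

// Histogram with per-element atomics: heavily contended bins serialize the
// atomic updates, exposing the kind of fine-grained-synchronization
// bottleneck that workload characterization tries to quantify.
__global__ void histogram_atomic(const unsigned char *in, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);
}

int main() {
    const int n = 1 << 24;
    unsigned char *in; unsigned int *bins;
    cudaMallocManaged(&in, n);
    cudaMallocManaged(&bins, BINS * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) in[i] = (unsigned char)(i % BINS);
    cudaMemset(bins, 0, BINS * sizeof(unsigned int));

    // On-line characterization: time the kernel with CUDA events and derive
    // a simple effective-bandwidth figure from the bytes read.
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    dim3 block(256), grid((n + block.x - 1) / block.x);

    cudaEventRecord(start);
    histogram_atomic<<<grid, block>>>(in, n, bins);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("histogram: %.3f ms, %.2f GB/s read, bins[0] = %u\n",
           ms, (n / 1e9) / (ms / 1e3), bins[0]);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(in); cudaFree(bins);
    return 0;
}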

For further information, please contact Prof. Jose L. Abellan: jlabellan [at] ucam [dot] edu

Some previous group results

  • Jose L. Abellan, Juan Fernandez and Manuel E. Acacio. “A G-line-based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs”. Proc. of the International Conference on Parallel Processing (ICPP 2010). September 2010.
  • Jose L. Abellan, Juan Fernandez and Manuel E. Acacio. “GLocks: Efficient Support for Highly-Contended Locks in Many-Core CMPs”. Proc. of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011). May 2011. Best Paper Award in the Architectures track.
  • Jose M. Cecilia, Jose L. Abellan, Juan Fernandez, Manuel E. Acacio, Jose M. Garcia and Manuel Ujaldon. “Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE”. The Journal of Supercomputing, 62(2): 787-803. 2012.
  • Jose L. Abellan, Alberto Ros, Juan Fernandez and Manuel E. Acacio. “ECONO: Express Coherence Notifications for Many-Core CMPs”. Proc. of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2013). July 2013.
  • Tiansheng Zhang, Jose L. Abellan, Ajay Joshi and Ayse K. Coskun. “Thermal Management of Manycore Systems with Silicon-Photonic Networks”. Proc. of the Design, Automation & Test in Europe Conference (DATE 2014). March 2014.
  • Amir Kavyan Ziabari, José L. Abellán, Rafael Ubal, Chao Chen, Ajay Joshi and David R. Kaeli. “Leveraging Silicon-Photonic NoC for Designing Scalable GPUs”. Proc. of the International Conference on Supercomputing (ICS 2015), pp. 273-282. 2015.
  • Baldomero Imbernon, Antonio Llanes, Jorge Peña-García, José L. Abellán, Horacio Pérez Sánchez and José M. Cecilia. “Enhancing the Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs”. Proc. of the International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO 2015), Part II, pp. 620-626. 2015.
  • Amir Kavyan Ziabari, José L. Abellán, Yenai Ma, Ajay Joshi and David R. Kaeli. “Asymmetric NoC Architectures for GPU Systems”. Proc. of the International Symposium on Networks-on-Chip (NOCS 2015), pp. 25:1-25:8. 2015.