Developing a Highly Parallelized TCR Synthesis Algorithm on GPGPU and FPGA for Accelerating the Study of Immune Systems
Publisher
The University of Arizona.Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.Abstract
The $V(D)J$ recombination process is the primary mechanism for generating a diverse repertoire of T-cell receptors (TCRs) essential to the adaptive immune system for recognizing a wide variety of diseases. The diverse set of TCR are required by the immune system to increase the chance of accurate identification of the foreign invaders, which results in successful recovery from diseases. Furthermore, analysis of the TCR repertoire helps immunologists to understand the functionality of immune system in presence of different diseases and find the correlation between diseases and immune receptors. However, modeling this diverse TCR repertoire is computationally challenging as the total number of TCRs to be generated and processed can exceed $10^{18}$ sequences. This massive scale of data processing poses as the barrier for immunologists to successfully understand the functionality of human immune system. Therefore, reducing the timescale of modeling the TCR repertoire will help immunologists to test their assumptions and solve their fundamental questions. In this study, we propose FPGA and GPU-based implementation of $V(D)J$ recombination process for accelerating the analysis of TCR repertoire. For the GPU-based implementation, we propose a \emph{bit-wise} implementation of the $V(D)J$ recombination algorithm, which reduces the memory footprint and the execution time by factors of 4 and 2 respectively compared to the current state-of-the-art GPU-based implementation. We devise an encoding procedure to convert the input data set from character based domain to binary domain and pack a sequence of four characters into a single byte for the bit-wise implementation. We also proposed new \emph{indexing} scheme for addressing input data that are not aligned with the byte-addressing. We present a multi-GPU implementation, experimentally identify suitable workload partitioning strategies for both single- and multi-GPU implementations, and finally expose the relationship between workload size and limited scalability offered by the algorithm on a cluster with up to eight GPUs. We show that the bit-wise implementation reduces the execution time from a time scale of 40.5 hours to 18.9 hours on a single GPU and to 4.3 hours on a 8-GPU configuration. For the FPGA-based implementation, we first utilize the N-level parallelization approach that is used for the GPU-based implementation. Simulation results show that this approach does not perform as expected for the FPGA-based implementation of $V(D)J$ recombination process due to the communication overhead between FPGA components. Therefore, we propose the VJ level parallelization approach to reduce the communication among components. We show that the VJ-level architecture reduces the execution time by a factor of 2.34 in comparison with the N-level parallelization approach.Type
textElectronic Thesis
Degree Name
M.S.Degree Level
mastersDegree Program
Graduate CollegeElectrical & Computer Engineering