
CS205-Final-Project

Final Project. Current team: Peter Chang, Yong Li Dich, Alexander Wu, Anita Chandrahas

Introduction

The tunicate, commonly known as the sea squirt, exhibits the phenomenon of blood flow direction reversal. Two main mechanisms have been proposed for how the tunicate carries out this singular behavior: 1) two pacemakers with the same rate but with natural deviations, and 2) two pacemakers with different rates that shift after every k pumps, where k is between 1 and infinity. The math was initially coded in Python to generate a video simulation of the blood flow, in order to observe the two possible causes of the blood flow reversal.

One major issue with the research was the time needed to generate simulations. Each 30-second video modeling the blood reversal process required about 45 minutes of computation, which is far too long when multiple parameters need to be tested. For this project, in order to enable more efficient and productive research that can test more hypotheses about this phenomenon, the team implemented various parallelization methods to drastically speed up the simulations. The main goal was to reduce the simulation generation time, allowing for a better understanding of the phenomenon of blood flow reversal, though the domain of blood flow simulation itself is also interesting to explore (for example, the Lattice Boltzmann approach).

Background: Basic Physiological Equations

In order for a heart to pump blood, a pacemaker is required at the end of the heart fibers. This pacemaker creates electric jolts at a specific interval in order to send waves throughout the entire fiber. The heart of a sea squirt may be modeled as having two pacemakers, one at either end of the heart fiber (Krijgsman, 1956; Waldrop and Miller, 2015), which allows blood to flow in both directions. A unique feature of wave mechanics within a heart fiber is that colliding waves do not pass through each other as most waves do; rather, the two waves "collapse" upon collision. This allows only one of the directions to be dominant at any given moment.

Based on the Mitchell-Schaeffer model, two main components govern the propagation of waves within the heart: the dynamics of individual heart cells and the diffusion of voltage between adjacent cells. For an individual cell, two differential equations govern how electric potential is stored. The first is the primary equation for the voltage v (Cain and Schaeffer), in which τ_in and τ_out are physiological constants and h is a gating variable between 0 and 1 that represents how open or closed the gate admitting external voltage into the cell is. The gating variable h is in turn governed by a piecewise ordinary differential equation in which τ_open and τ_close are once again physiological constants and v_crit is the voltage level that determines which branch of the dynamics the cell follows.
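For reference, the Mitchell-Schaeffer single-cell equations are commonly written in the following form (a standard statement of the model, using the symbols described above):

$$\frac{dv}{dt} = \frac{h\,v^2(1-v)}{\tau_{in}} - \frac{v}{\tau_{out}}$$

$$\frac{dh}{dt} = \begin{cases} \dfrac{1-h}{\tau_{open}}, & v < v_{crit} \\ -\dfrac{h}{\tau_{close}}, & v \geq v_{crit} \end{cases}$$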

Voltage also diffuses along the fiber: the change in a cell's voltage depends on the cell itself and the cells immediately adjacent to it. We estimate this change with the standard heat equation and its explicit finite-difference discretization, where K is the diffusion coefficient (0.001 for most biological cases) and u denotes the estimated voltage at time step m and cell location k.
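In symbols, a standard statement of the diffusion law and of the explicit update used to approximate it (the usual forward-time, centered-space scheme, with Δt the timestep and Δx the cell size) is:

$$\frac{\partial u}{\partial t} = K\,\frac{\partial^2 u}{\partial x^2}, \qquad u_k^{m+1} = u_k^m + \frac{K\,\Delta t}{\Delta x^2}\left(u_{k+1}^m - 2u_k^m + u_{k-1}^m\right)$$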

We must also consider boundary conditions on the cells at the two ends of the fiber. By assuming that no voltage passes through the ends, i.e. du/dx(0,t) = 0 and du/dx(L,t) = 0, each end cell sees a doubled contribution from its single adjacent cell, where N is the number of cells within the heart fiber.
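With cells indexed 1 through N, the usual ghost-cell treatment of these no-flux conditions gives boundary updates of the form (a standard discretization consistent with the interior scheme above):

$$u_1^{m+1} = u_1^m + \frac{2K\,\Delta t}{\Delta x^2}\left(u_2^m - u_1^m\right), \qquad u_N^{m+1} = u_N^m + \frac{2K\,\Delta t}{\Delta x^2}\left(u_{N-1}^m - u_N^m\right)$$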

Potential Hypotheses for Blood Flow Reversal

  1. Controlled Shifting Method: In this hypothesis, the two pacemakers at the ends of the heart fiber do not pump at the same rate. One pacemaker starts with a slow rate and the other with a fast rate; after each pump, the slow pacemaker speeds up by a small amount and the fast pacemaker slows down. The faster side dominates initially, but the two eventually trade dominance and the direction of blood flow changes. Intuitively, this hypothesis will clearly generate blood flow reversals; the main question was whether the simulations it generates are realistic.

  2. Random Variation Method: In this hypothesis, the two pacemakers share the same average rate, but each pump varies with some standard deviation. This hypothesis requires more testing to determine whether reversals can occur and for which parameters (such as diffusion rate, fiber length, number of cells, etc.) the model remains biologically consistent. This parameter-heavy model is the reason why parallelizing the code matters: many runs with different constants are needed to determine whether the model is sufficient.

Technical Description of Parallel Software Solution

In order to reduce the runtimes associated with generating simulations of blood flow, we implemented several different parallelization techniques.

SIMT Parallelization
We employed single instruction, multiple thread (SIMT) parallelization in two different ways. First, we converted the original Python implementation of the simulation to Cython. While Cython's syntax mirrors Python's, Cython importantly supports calling C functions and declaring C types on variables and class attributes, so upon compilation we were able to take advantage of the intrinsic efficiency of C relative to Python. More important, though, was Cython's ready support for parallelization. In particular, we used the prange() function in the cython.parallel module to parallelize, via multithreading, the for loops inside the computationally intensive regions of the simulation. An example of the Cython implementation is included below.
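A minimal sketch of this pattern follows; the function and array names here are illustrative placeholders, not the project's exact code.

```cython
# cython: boundscheck=False, wraparound=False
# Minimal sketch of a prange()-parallelized diffusion update.
# Function and variable names are illustrative, not the project's exact code.
from cython.parallel import prange

def diffuse_step(double[:] u, double[:] u_new, double coeff, int N):
    cdef int k
    # Interior updates are independent, so the loop can be split
    # across OpenMP threads with the GIL released.
    for k in prange(1, N - 1, nogil=True):
        u_new[k] = u[k] + coeff * (u[k + 1] - 2.0 * u[k] + u[k - 1])
    # No-flux boundaries: the single neighbor contributes twice.
    u_new[0] = u[0] + coeff * (2.0 * u[1] - 2.0 * u[0])
    u_new[N - 1] = u[N - 1] + coeff * (2.0 * u[N - 2] - 2.0 * u[N - 1])
```

Compiling with OpenMP enabled (e.g. with -fopenmp) lets prange() distribute the loop iterations across threads.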

In addition, we manually converted the original Python version of the blood flow simulation to the C programming language. This offered speedup both from the increased efficiency of C and from the ability to integrate a number of C-compatible parallel programming models. The main SIMT parallel programming model that we selected to test was OpenACC. With OpenACC, we retained the translated C implementation of the simulation algorithm and added OpenACC directives to enable SIMT parallelization in the same highly parallelizable regions of code as in the Cython version. Specifically, we used parallel loop clauses to achieve parallelization, in combination with gang, worker, and vector clauses to specify more explicitly how the parallelism is mapped across thread blocks, warps, and CUDA threads, respectively.
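As an illustration, the same diffusion update might be annotated as in the following sketch (placeholder names, not the project's exact code):

```c
/* Sketch of an OpenACC-annotated diffusion update; names are placeholders. */
void diffuse_step(const double *restrict u, double *restrict u_new,
                  double coeff, int N)
{
    /* gang/worker/vector map the independent per-cell updates onto
       thread blocks, warps, and CUDA threads, respectively. */
    #pragma acc parallel loop gang worker vector \
        copyin(u[0:N]) copyout(u_new[0:N])
    for (int k = 1; k < N - 1; k++) {
        u_new[k] = u[k] + coeff * (u[k + 1] - 2.0 * u[k] + u[k - 1]);
    }
    /* No-flux boundaries: the single neighbor contributes twice. */
    u_new[0]     = u[0]     + coeff * (2.0 * u[1]     - 2.0 * u[0]);
    u_new[N - 1] = u[N - 1] + coeff * (2.0 * u[N - 2] - 2.0 * u[N - 1]);
}
```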

SPMD Parallelization
We also sought to use single program, multiple data (SPMD) parallelization to achieve greater speedup in the simulation. Here we designed a hybrid OpenACC + MPI program that lets multiple processes execute the same program simultaneously while operating on different subsets of the data. To implement this hybrid approach, we built upon the OpenACC version of the simulation by first initializing the MPI execution environment and its communicator prior to the bulk updating procedures in the simulation. We then broadcast the array storing voltage values to all other processes in the communicator via the MPI_Bcast() function. In doing so, we enable multiple copies of the OpenACC procedures to run across a corresponding number of nodes by distributing the computation among them. We identified the voltage array as the best "message" to broadcast because of how frequently it is used in the simulation and because the procedures that update voltage values across the simulation's time points parallelize well. After the simulation is completed, the root node that initially distributed the data collects the results back and the algorithm finishes its execution.
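A skeletal version of this structure might look like the sketch below; the array names, sizes, and the update function are placeholders, and details such as halo exchange between neighboring chunks are omitted.

```c
/* Hybrid OpenACC + MPI skeleton; names and sizes are placeholders. */
#include <mpi.h>
#include <stdlib.h>

/* Stand-in for the OpenACC-accelerated update applied to one chunk of cells. */
static void update_chunk(double *u, int count)
{
    #pragma acc parallel loop
    for (int k = 0; k < count; k++)
        u[k] += 0.0;   /* placeholder for the real voltage/diffusion update */
}

int main(int argc, char **argv)
{
    int rank, size;
    const int N = 600;                            /* number of heart cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u = malloc(N * sizeof(double));
    if (rank == 0)
        for (int k = 0; k < N; k++) u[k] = 0.0;   /* root initializes voltages */

    /* Broadcast the voltage array so every process starts from the same state. */
    MPI_Bcast(u, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank runs the OpenACC procedures on its share of the cells
       (assumes N is divisible by the number of ranks for simplicity). */
    int chunk = N / size;
    update_chunk(u + rank * chunk, chunk);

    /* Root collects the updated chunks back into the full array. */
    if (rank == 0)
        MPI_Gather(MPI_IN_PLACE, chunk, MPI_DOUBLE,
                   u, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    else
        MPI_Gather(u + rank * chunk, chunk, MPI_DOUBLE,
                   NULL, 0, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(u);
    MPI_Finalize();
    return 0;
}
```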

Benchmarking

Since individual heart fiber cells range between 10 and 100 micrometers in length, we ran a few separate cases for benchmarking. L = 3.0 cm is a biologically reasonable fiber length for a tunicate. Consequently, N = 300 gives 100 micrometer cells, N = 600 gives 50 micrometer cells, and N = 3000 gives 10 micrometer cells.

N = 300

Python Serial

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 10 | 10 | 1.06 | - |
| 100 | 10 | 6.02 | - |
| 100 | 100 | 5.76 | - |
| 1000 | 600 | 52.2 | - |
| 10000 | 600 | 527.0 | 0.016 |
| 30000 | 600 | 1753.3 | 0.0144 |

Cython

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 10000 | 600 | 463.67 | 0.018 |
| 30000 | 600 | 1553.78 | 0.016 |
| 60000 | 600 | 3242.53 | 0.016 |

We can see that, overall, the Python implementation has very poor performance, and the Cython parallelization does little to actually improve the throughput.

C Implementation

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 125.01 | 0.203 |
| 60000 | 600 | 307.41 | 0.165 |
| 150000 | 600 | 856.34 | 0.148 |
| 500000 | 600 | 2986.51 | 0.142 |

OpenACC

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 61.09 | 0.415 |
| 60000 | 600 | 123.01 | 0.412 |
| 150000 | 600 | 308.05 | 0.412 |
| 500000 | 600 | 1029.38 | 0.411 |

OpenACC + MPI

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 60.87 | 0.416 |
| 60000 | 600 | 120.19 | 0.422 |
| 150000 | 600 | 299.80 | 0.423 |
| 500000 | 600 | 1001.02 | 0.423 |

We can see that the C implementation already provides much faster simulation generation than the Python code. Additionally, parallelizing the C code produced much better speedups than parallelizing the Python code. Using OpenACC, OpenACC + MPI, and then OpenACC on the NVIDIA Tesla P100, our throughput increased drastically and the computation time was, at worst, halved.

To better show the effects of the parallelization, we doubled N, which increases the computation in the regions that we had parallelized. An N of 600 works well and maintains good biological accuracy, with the heart cells now being 50 micrometers in length. We did not increase N any further because of numerical stability: if K·Δt/Δx² = 0.001·(timestep)/(cell size)² exceeds 1/2, the explicit numerical approximations we use diverge and no longer provide accurate simulations. Using a value of N larger than 600 would therefore necessitate a smaller timestep, which would increase computation time more than the parallelization speeds it up.
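Restating the standard stability bound for the explicit scheme in the symbols used earlier (with Δx = L/N):

$$\frac{K\,\Delta t}{\Delta x^2} \le \frac{1}{2} \quad\Longleftrightarrow\quad \Delta t \le \frac{\Delta x^2}{2K} = \frac{(L/N)^2}{2K}$$

so doubling N cuts the largest stable timestep by a factor of four.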

N=600

Python Serial

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 10 | 10 | 1.02 | - |
| 100 | 10 | 10.04 | - |
| 100 | 100 | 8.28 | - |
| 1000 | 600 | 87.96 | - |
| 10000 | 600 | 784.6 | 0.0215 |
| 30000 | 600 | 2480.43 | 0.0203 |

Cython Implementation

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 10000 | 600 | 832.72 | 0.020 |
| 30000 | 600 | 2601.10 | 0.019 |
| 60000 | 600 | 5411.19 | 0.019 |

When we increased N, we actually saw a decrease in the benefits of parallelization using Cython; in fact, the Cython code performed worse than the serial Python code. At the same time, though, the throughput of the Python code decreased with problem size while the Cython throughput stayed relatively constant, so it is plausible that for longer simulations the Cython code would eventually overtake the Python code again. However, at these absolute speeds, running the experiment in Python/Cython is infeasible anyway, and the focus should be on simulations using the C code.

C Implementation

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 501.39 | 0.101 |
| 60000 | 600 | 999.53 | 0.101 |
| 150000 | 600 | 2532.83 | 0.100 |
| 500000 | 600 | 8472.32 | 0.100 |

OpenACC

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 118.47 | 0.427 |
| 60000 | 600 | 238.15 | 0.425 |
| 150000 | 600 | 599.23 | 0.422 |
| 500000 | 600 | 1989.60 | 0.424 |

OpenACC + MPI

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 110.92 | 0.456 |
| 60000 | 600 | 220.81 | 0.459 |
| 150000 | 600 | 548.23 | 0.461 |
| 500000 | 600 | 1830.49 | 0.461 |

We can see that we get the best performance out of the parallelization with N = 600. This is encouraging, since an N of 600 is likely the most biologically accurate while also providing enough work to see a clear performance increase from parallelization. The speedup observed from this implementation would easily allow biological hypotheses to be tested in a more reasonable amount of time, enabling much faster scientific research.

Advanced Features

NVIDIA Tesla P100 GPU Accelerators

To further explore ways to improve the execution of our simulation algorithm, we assessed the benefits of using NVIDIA Tesla P100 GPU accelerators, which are among the most advanced accelerators available on the market. In hopes of further expediting the blood flow simulation process, we evaluated the performance of our serial and OpenACC implementations on these P100 GPUs via the Bridges supercomputer. The results of our analysis are included below.

C Serial (NVIDIA Tesla P100): N = 300

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 216.46 | 0.117 |
| 60000 | 600 | 446.11 | 0.121 |
| 150000 | 600 | 1000.32 | 0.127 |
| 500000 | 600 | 3368.88 | 0.125 |

OpenACC (NVIDIA Tesla P100): N = 300

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 55.99 | 0.453 |
| 60000 | 600 | 103.77 | 0.489 |
| 150000 | 600 | 271.70 | 0.467 |
| 500000 | 600 | 903.99 | 0.467 |

C Serial (NVIDIA Tesla P100): N = 600

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 374.12 | 0.135 |
| 60000 | 600 | 720.99 | 0.140 |
| 150000 | 600 | 1798.59 | 0.140 |
| 500000 | 600 | 6054.04 | 0.139 |

OpenACC (NVIDIA Tesla P100): N = 600

| Simulated time (ms) | Resolution | Runtime (s) | GFlop/s |
|---|---|---|---|
| 30000 | 600 | 99.86 | 0.506 |
| 60000 | 600 | 200.11 | 0.505 |
| 150000 | 600 | 501.82 | 0.504 |
| 500000 | 600 | 1662.24 | 0.507 |

As can be seen from the data, both our serial and OpenACC-parallelized implementations demonstrated noticeably higher throughput on the P100 nodes than on the GPUs available on Odyssey. We also observe differences in the amount of speedup between the two machines. Comparing the C serial implementation with the OpenACC version on each machine, the speedup achieved by OpenACC on the Bridges P100 GPUs appears greater than that on the Odyssey GPUs when N = 300. However, once we increase the problem size to N = 600, the speedup achieved on the P100 GPUs appears to be smaller than the speedup observed on Odyssey.

Modeling: The Lattice Boltzmann Model (LBM) (see the accompanying Python code)

Our main research problem was to parallelize simulations to test the two major hypotheses for blood flow reversal, but we also wanted to explore blood flow simulation itself. We chose to use a 2D Lattice Boltzmann model with approximate values of tunicate blood density and velocity.

The chosen lattice model combines the macroscopic Navier-Stokes equations for fluid dynamics with the microscopic kinetics of individual molecules. This is especially useful since blood is a multiphase, non-Newtonian, viscoelastic fluid; these properties mean the continuum approximations of Navier-Stokes are too simple to accurately model blood flow. On the other hand, modeling the kinetics of individual molecules with Boltzmann's theory has a large computational cost. The Lattice-Boltzmann method simplifies Boltzmann's original idea by restricting the number of particles and confining the velocity vectors to the nodes of a lattice. LBM is therefore an appropriate mesoscopic model, in which a fixed number of particles are limited to a discrete set of directional vectors on a lattice.

Overall, there are a few advantages of using LBM to model blood flow.

  1. It can be used to simulate multiphase flow
  2. It can more readily include complex boundary conditions
  3. It can easily be parallelized.

The Lattice-Boltzmann method uses discrete particles on a lattice whose populations can be summed to recover a simplified 2D Navier-Stokes model. We focus on the two-dimensional blood flow simulation by using LBM to model Navier-Stokes.

(Figure: lattice scheme used to model Navier-Stokes.)

The basic process of the Lattice-Boltzmann method is as follows (see the sketch after this list):

  1. Each point on the lattice has particles with discrete velocities.

  2. Transport phase: shift the particle data along each independent velocity vector.

  3. Relaxation phase: determine the microscopic dynamics toward local equilibrium and the macroscopic transport coefficients (tuned to obtain the desired dynamics).

  4. Repeat the transport and relaxation phases.
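The following minimal D2Q9 sketch illustrates the transport and relaxation phases on a periodic lattice. It is a generic BGK-style illustration with placeholder parameters, not the project's exact code; boundary conditions and blood-specific values are omitted.

```python
import numpy as np

# D2Q9 lattice: nine discrete velocities and their standard weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, ux, uy):
    """Local equilibrium distributions for density rho and velocity (ux, uy)."""
    cu = 3.0 * (c[:, 0, None, None] * ux + c[:, 1, None, None] * uy)
    usq = 1.5 * (ux**2 + uy**2)
    return rho * w[:, None, None] * (1.0 + cu + 0.5 * cu**2 - usq)

def lbm_step(f, tau):
    """One LBM step: transport (streaming) then relaxation (BGK collision)."""
    # Transport phase: shift each population along its velocity vector
    for i, (cx, cy) in enumerate(c):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    # Macroscopic moments summed from the populations
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
    # Relaxation phase: relax toward local equilibrium with relaxation time tau
    f -= (f - equilibrium(rho, ux, uy)) / tau
    return f, rho, ux, uy

# Example: a 200 x 80 periodic lattice initialized at rest, run for 1000 steps
f = np.ones((9, 200, 80)) * w[:, None, None]
for _ in range(1000):
    f, rho, ux, uy = lbm_step(f, tau=0.6)
```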

Results: the Lattice Boltzmann model shows that, at least for the approximate values for blood flow, the velocity does not converge to the expected values. Overall, this points to limitations of a simple 2D LBM model. For multiphase fluids, Lattice-Boltzmann assumes that all components have the same viscosity; more accurate results have been obtained with a bi-viscosity model of blood flow (Liu, 2012). An additional reason the D2Q9 LBM model does not match the expected curve could be that the compressibility error becomes dominant. One way to improve the model would be to use incompressible boundary conditions.

Conclusions

The phenomenon of blood flow reversal in tunicates is one of the most fascinating biological properties observed in nature. Efforts to better understand this unique process have included developing complex mathematical models drawing from wave mechanics, physiology, and fluid dynamics. Simulations of some of these models are, however, computationally intensive, and their time-consuming nature serves as a barrier to learning more about these creatures. In this project, SIMT and SPMD parallel programming models were developed to tackle this obstacle, and they succeeded in improving both the speed and the efficiency of the algorithms' execution. Exploring advanced GPU technology with the NVIDIA Tesla P100 accelerators, as well as complementary approaches to blood flow simulation such as the Lattice Boltzmann model, provided further insight into potential improvements for the future. By incorporating these features into the analysis pipeline, we hope that these methods serve as a set of first steps toward more accessible and efficient tools for researchers in the field.

Citations

B. J. Krijgsman, Biological Reviews, 31, 288, 1956

C. C. Mitchell and D. G. Schaeffer, Bulletin of Mathematical Biology, 65, 767, 2003

L. D. Waldrop and L. Miller, Journal of Experimental Biology, 218, 2753, 2015

J. W. Cain and D. G. Schaeffer, SIAM Review, 48, 537, 2006

J. W. Cain, E. G. Tolkacheva, D. G. Schaeffer, and D. J. Gauthier, Physical Review E, 70, 061906, 2004

M. E. Kriebel, Journal of General Physiology, 50, 2097, 1967

M. E. Kriebel, Biological Bulletin, 134, 434, 1968

C. H. Luo and Y. Rudy, Circulation Research, 74, 1071, 1994

Y. Liu, Applied Mathematical Modelling, 36(7), 2890-2899, 2012
