Parallelizing IntaRNA – Analysis

I attempted to parallelize a part of the project IntaRNA. It is a new general and fast approach to the prediction of RNA–RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. Taking up this project in an interdisciplinary topic posed a lot of challenges. I had to make the code faster with no knowledge of the underlying algorithm and no understanding of the input and the output.

The author Martin-mann had already made the serial code optimized to the maximum possible and marked the regions that needed to be parallelized. This is how I knew where to begin from, however I wanted to see the figures for myself and so I decided to do an analysis of the runtime of the project. I used gprof to do code profiling. I had some troubles interpreting the results however I figured out that the fillHybridE was approximately taking 162.27s out of the total runtime of 249s which is around 65% of the total time and was getting called once for each target.

%     cumulative   self                     self        total

time   seconds    seconds  calls    s/call       s/call         name

0.06    162.27        0.09         20       0.00        0.02     PredictorMfe2dHeuristic::fillHybridE()

The code structure for fillHybridE consists of four nested for loops. Martin had suggested me to parallelize the inner two loops, however I wanted to have as much as parallel region as possible and I started with the outermost for loop. I have described in detail the approach taken by me and the mistakes I made in my previous blogs The Project and Fixing Race Conditions.

I used an Intel core i5 processor with four cores to perform the analysis. Sample input data was provided by Martin https://github.com/BackofenLab/IntaRNA/issues/18#issuecomment-274523517. I used variations of this data to vary the word load. The given graphs show the variations in time with respect to the work load.

timevsquery.png As the inner for loop deals with the query size, we can see the parallel code keeps getting faster as the query length increases.timevstarget.png

The speedup obtained by increasing the number of targets is not very prominent as it adds the overhead of creation and removal of threads on each iteration of outer for loop.

Advertisements

Fixing Race Conditions

As I progressed with my work I found out that all my previous conclusions, from profiling, in the post The Project were incorrect. I started working on the file PredictorMfe2dHeuristic.cpp and manually timed the function fillHybridE. It showed me a completely different statistics. It was taking approximately more than 50% of the total execution time.

I started with parallelizing the inner loop and found out of bound exception. The call to getE function inside the innermost for loop was raising an exception claiming that i2 +w2 > hybridE.size2(). This was interesting as this condition was the test condition for the loop and the loop should have failed before executing this instruction. It took me a long time to figure this out and with help from my colleague Saurav Sood and my mentor Viraj Kumar I figured out that it was because of a race condition. I am not actually sharing variable across threads but since the variables were declared outside the loop it was getting shared. I shifted all the declarations and also moved updateOptima inside the loop to avoid variable sharing. I introduced a critical section around the call to updateOptima as shared variables get updated in the function.

The code run successfully giving correct results however the time of execution doubled. The first thing that came to me was the critical section that I had introduced but I tried the code without it and still got the same results. Also I realized that since I shifted the updateOptima into parallelizable region nothing stopped me form parallelizing the outermost loop itself. This would also reduce the overhead or creation and deletion of threads on each iteration. However the number of seconds didn’t come down even by a single digit.

After some looking around and experimenting I figured out that the extra time that was getting added was because of the unnecessary changes that I had made to the code in the process of testing and my solution ultimately gave a performance improvement. I am able to run a 45 test case target against a single query in 29.153s which was taking 38.191s in serial execution.

Martin pointed out that parallelizing the outermost loop is breaking the underlying logic and so I had to give up some gain in performance by parallelizing the second for loop. I also avoided the critical section by moving updateOptima outside the inner for loop though it required me to do some redundant calculations by recalculating energy for i1 and i2 pair. Despite of the redundancy I was getting performance gain.

I can save the result of getE when calculated the first time. This presented me with a perfect time-memory trade off. I will be using i1 x i2 extra space to avoid the repeated calculation. The comparison between the two is shown by the following graph.graph.png

We see that we don’t get any improvement in performance even if we use the extra memory to avoid the repeated calculations. Therefore I decided to let the final code have repeated calculation of energy.

The Project

I took up the task of parallelizing the code for the open source project IntaRNA. It is a new general and fast approach to the prediction of RNA–RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. It can successfully predict bacterial sRNA targets and determined the exact locations of the interactions with a higher accuracy than competing programs. Taking up this projects gives me an opportunity to work on an interdisciplinary topic and also to apply the skills learned in a class room to a real life application.

The first step was to install the software and understand how it works and what it does. I had a few troubles however with the help of the author and managing organization I was able to get it running and try some test cases. The executable accepts a target and a query string which are RNA sequence and give out details of possible interaction between the two sequence.

intarna

With a crude idea of the input and output I went through the compilation process. I understood the make file. The author martin-mann had marked parts of code which can be parallelized. I went ahead with profiling just to get more information of the working. I changed the -O3 (optimization flag) to -pg to enable the profiler. I had to switch O3 off as the optimizer and profiler doesn’t work well together.

Profiling was not an easy task. For small inputs the output of gprof showed lots of functions with 0% execution time for all. On trying bigger inputs I got some numbers, but it was not very helpful. Maximum amount of time taken were by vector access functions and even they were as small as 4-5 %.

In a process to find parallelizable regions I looked at the function fillHybrid(). This function computes hybridization energy and fills it in the matrix. According to the results of profiling this function consumes 1.52% of the time and is called 45 times which is the number of targets I provided.

According to the results of profiling this function looks like a possible candidate for parallelization. On taking a closer look we find that it contains nested loops but only the inner loop can be parallelized as variables are updated in the outer loop, which might cause race conditions.

I went ahead and parallelized the inner loop to compare the results. The parallelized code was a little faster compared to the serial one. However the speedup is almost negligible and as well be caused by noise. Therefore my conclusion will be that parallelizing this piece won’t give us much of an advantage.

Running the code n 45 targets I got the following result:

Time taken by serial code: 11m 25s

Time taken by making inner loop parallel: 11m 18s

Experiment: Initializing array in parallel

The first experiment I tried involves writing to elements of an array. For a 2d array this operation is of complexity n^2 and often needs to be parallelized. If we are writing to separate array locations it is also free from race conditions. The first experiment is to compare performance of filling up an array in serial and parallel.

The code used for the experiment is very simple. I take the size of the array as input from the user. Create an array of size nxn and then initialize array index i, j with the product ixj.

void compute(int **arr,int n){

    for(int i = 0;i<n;i++){

        for(int j = 0;j<n;j++){

            arr[i][j]=i*j; 

        }

}

}

The results obtained clearly show that the parallel operations are faster. However as the array size reaches almost 30,000 the performance for both the serial and parallel execution becomes the same. The reason for this is at array size 30,000 the memory occupied is 3.6GB and with 4GB ram on the system we start getting a lot of cache misses which dominate the performance.

Array size

Serial(in sec)

Parallel(in sec)

27,000

2.558

1.213

28,000

5.074

2.447

29,000

15.729

12.131

30,000

34.403

26.737