As I progressed with my work I found out that all my previous conclusions, from profiling, in the post The Project were incorrect. I started working on the file PredictorMfe2dHeuristic.cpp and manually timed the function fillHybridE. It showed me a completely different statistics. It was taking approximately more than 50% of the total execution time.
I started with parallelizing the inner loop and found out of bound exception. The call to getE function inside the innermost for loop was raising an exception claiming that i2 +w2 > hybridE.size2(). This was interesting as this condition was the test condition for the loop and the loop should have failed before executing this instruction. It took me a long time to figure this out and with help from my colleague Saurav Sood and my mentor Viraj Kumar I figured out that it was because of a race condition. I am not actually sharing variable across threads but since the variables were declared outside the loop it was getting shared. I shifted all the declarations and also moved updateOptima inside the loop to avoid variable sharing. I introduced a critical section around the call to updateOptima as shared variables get updated in the function.
The code run successfully giving correct results however the time of execution doubled. The first thing that came to me was the critical section that I had introduced but I tried the code without it and still got the same results. Also I realized that since I shifted the updateOptima into parallelizable region nothing stopped me form parallelizing the outermost loop itself. This would also reduce the overhead or creation and deletion of threads on each iteration. However the number of seconds didn’t come down even by a single digit.
After some looking around and experimenting I figured out that the extra time that was getting added was because of the unnecessary changes that I had made to the code in the process of testing and my solution ultimately gave a performance improvement. I am able to run a 45 test case target against a single query in 29.153s which was taking 38.191s in serial execution.
Martin pointed out that parallelizing the outermost loop is breaking the underlying logic and so I had to give up some gain in performance by parallelizing the second for loop. I also avoided the critical section by moving updateOptima outside the inner for loop though it required me to do some redundant calculations by recalculating energy for i1 and i2 pair. Despite of the redundancy I was getting performance gain.
I can save the result of getE when calculated the first time. This presented me with a perfect time-memory trade off. I will be using i1 x i2 extra space to avoid the repeated calculation. The comparison between the two is shown by the following graph.
We see that we don’t get any improvement in performance even if we use the extra memory to avoid the repeated calculations. Therefore I decided to let the final code have repeated calculation of energy.