| # of cores | Native / Guest | Multiprocessing directives | Execution time (s) |
|---|---|---|---|
| 1 | Guest | Top-level loop | 76.4 |
| 4 | Guest | Top-level loop | 20.1 |
| 4 | Native | Top-level loop | 19.8 |
| 4 | Native | Matrix multiplication subroutine | 23.0 |
The same code runs only about 1.5% slower in the guest (20.1 s) than it does natively (19.8 s).
Also interesting: putting the multiprocessing directives into a particularly intensive subroutine (such as matrix multiplication), but not in the top-level program, still yields a substantial improvement (23.0 s versus 76.4 s serial), not that far from parallelizing at the top level with a serial subroutine (19.8 s). (Of course, this assumes the program calls the routine frequently.) My concern was that the overhead of flipping into and out of the MP environment on every call to an MP subroutine would negate the benefit, but at least in this case it appears not to be a problem.
So if I identify the routines that are called frequently and are particularly compute-intensive, I can improve performance inside the subroutine libraries and not have to worry about it when coding the main routine. (Of course, I probably should have written more efficient code to begin with, but that's another story.)