Modern hardware systems consist of highly complex multi-processor architectures that are often supported by specialised accelerator devices. This places an increased burden on programmers to write software capable of fully exploiting parallel, co-operative, and heterogeneous modes of execution, a challenging task in which failure can lead to significantly sub-optimal parallel program execution. Parallel execution profiling and post-mortem performance analysis are well-established techniques for investigating potentially sub-optimal execution and thus guiding subsequent optimisation efforts. However, micro-architectural support for recording on-chip activity during the execution of parallel programs is limited. Furthermore, due to the complexity of the interactions between the application, the dynamic runtime system, and the underlying hardware, modern systems can produce large volumes of intricate profiling data during parallel program execution, rendering manual performance analysis difficult, time-consuming, and error-prone. Moreover, even where performance analysis identifies regions of sub-optimal execution and determines the underlying performance issues, the configuration changes required to optimise performance may remain unclear. This thesis explores three aspects of this research area: 1. parallel performance profiling techniques; 2. analysis and interpretation of profiling data; and 3. guiding performance optimisation based on interpreted performance characteristics. To relieve the programmer of the burden of manual analysis and optimisation effort, this thesis focuses on statistical, empirical approaches that automate these aspects. It is shown that such approaches enable more comprehensive generation of profiling data, automated extraction of insights from that profiling data, and the potential automation of subsequent performance optimisation decisions.