Heterogeneous computing is becoming a common approach to speeding up processing, especially in embedded systems, which must minimise power consumption. Dedicated processors such as graphics processing units (GPUs), digital signal processors (DSPs) and field-programmable gate arrays (FPGAs) are often used alongside traditional central processing units (CPUs) to meet real-time processing needs whilst staying within a restricted power budget. In such systems, communication between the various processors and the management of tasks across them are important challenges. This work studied the interfacing options between a traditional CPU and an FPGA device with the aim of obtaining a high transfer rate. A memory-based custom bridge with configurable transaction translation was designed to interface a CPU and an FPGA. The bridge makes use of a flash memory controller that is widely available in embedded systems, enabling the addition of a reconfigurable hardware accelerator without dedicated interfaces such as Peripheral Component Interconnect Express (PCIe). The bridge consists of two sub-interfaces that together handle all communication scenarios: one allows access to non-prefetchable memory, and the other uses stream buffers to prefetch data and so improve bandwidth for sequential access, achieving up to 148.45 MB/s, an improvement of about 20% over existing designs. The developed bridge was incorporated into the Intel FPGA OpenCL framework to enable OpenCL-based FPGA acceleration for embedded systems. This involved developing an FPGA design with the bridge as part of the fixed elements, together with the software required to configure the fixed elements and to handle communication between the CPU and the FPGA, including direct memory access (DMA) between the two. The result demonstrates that OpenCL can be supported on low-cost embedded platforms, lowering the entry point for FPGA-accelerated computing.
The work also presents an automatic optimiser that generates CPU schedules for Halide, a domain-specific language that separates the algorithm from the schedule of a conventional program. The optimiser avoids losing optimisation opportunities because of the way a function is expressed, introduces a new way of analysing the pipeline to generate schedules for optimal performance, and improves performance by up to 50% compared with Halide's built-in auto-scheduler.