FPGAs are rapidly gaining traction in the domain of HPC thanks to the advent of FPGA-friendly data-flow workloads, as well as their flexibility and energy efficiency. However, these devices pose a new challenge in terms of how best to support their communication, since standard protocols are known to hinder their performance greatly, either by requiring CPU intervention or by consuming too much FPGA logic. Hence, the community is moving towards custom-made solutions. This paper analyses an optimization to our custom, reliable interconnect with connectionless transport: a mechanism to register and track inbound RDMA communication at the receive side. This way, it provides completion notifications directly to the remote node, saving one round-trip
latency. The entire mechanism is designed to sit within the fabric of the
FPGA, requiring no software intervention. Our solution is able to reduce
the latency of a receive operation by around 20% for small message sizes
(4 KB) over a single hop; longer distances would see an even larger improvement. Synthesis results over a wide parameter range confirm that this optimization scales with both the number of concurrent outstanding RDMA operations and the maximum message size.