In modern digital wireless communication, software-defined radio (SDR) technology is widely used. SDR enables the implementation of a communication PHY layer entirely in software, using either an FPGA programmed with HDL or a general-purpose processor (GPP) running interpreted or compiled code. The first approach (FPGA with HDL) is known for the very high performance it can achieve and for a straightforward workflow that leads directly to a deployable hardware implementation (enabled by the reusability of the HDL code and the generated netlist). The second approach (GPP with interpreted or compiled code) is mostly known for its accessibility (PCs are commonly used), code reusability across platforms, hardware abstraction, support for easy-to-use interpreted languages like Python, and automatic code generation via SDR development toolkits like GNU Radio.

Performance trade-offs

These benefits of the GPP-based approach, however, come with a trade-off in performance. Latency is the concern raised most often, because most communication standards specify the delay tolerated between components of the processing chain. The additional latency observed when implementing a wireless PHY on a GPP (compared to the FPGA approach) comes from two sources:

1- Firstly, additional latency is introduced by the link used to pass the data between the radio DACs/ADCs and the GPP (virtually non-existent between radio DACs/ADCs and an FPGA). Typical links (USB, Gigabit Ethernet, PCI Express) buffer and transfer data in diverse ways which, in some cases, result in an increased latency that varies between technologies, both in magnitude and predictability.

Figure 1

This problem can be addressed by choosing the right physical link based on a latency requirement and by using drivers designed for real-time and low latency data exchange, such as Nutaq’s Real-Time Data Exchange (RTDEx).

2- Secondly, additional latency is introduced by the scheduling and buffering of the individual processing blocks of the PHY. When a PHY is programmed on a GPP with interpreted or compiled code, each processing block is typically implemented as a function that reads a vector of data from memory, processes it, and stores the output back to memory. The same applies to GNU Radio implementations. The latency of an individual processing block is not negligible, because it adds up across the whole GNU Radio flow graph. Memory read/write operations have a cost, so buffer sizes are usually increased to limit the frequency of memory accesses. Each call to the processing block (function) then processes as much data as possible, which increases the overall throughput. It also increases latency across the flow graph, since latency is typically proportional to buffer size.
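As a rough illustration of this proportional relationship, the sketch below estimates the delay each buffer adds when a block waits for a full buffer before running, then sums that delay over a linear chain of blocks. The block names, buffer sizes, and sample rate are made-up values, and the model deliberately ignores processing time and scheduler effects.

```python
def buffer_latency_s(buffer_items, sample_rate_hz):
    """Time needed to fill a buffer of `buffer_items` samples at `sample_rate_hz`."""
    return buffer_items / sample_rate_hz


def chain_latency_s(buffers, sample_rate_hz):
    """Sum the fill time of every buffer in a linear chain of blocks."""
    return sum(buffer_latency_s(n, sample_rate_hz) for n in buffers.values())


# Hypothetical three-block chain running at 1 MS/s (illustrative values only).
SAMPLE_RATE_HZ = 1_000_000
large = {"filter": 8192, "demod": 8192, "decoder": 8192}
small = {"filter": 1024, "demod": 1024, "decoder": 1024}

latency_large_ms = chain_latency_s(large, SAMPLE_RATE_HZ) * 1e3
latency_small_ms = chain_latency_s(small, SAMPLE_RATE_HZ) * 1e3
print(f"large buffers: {latency_large_ms:.3f} ms")
print(f"small buffers: {latency_small_ms:.3f} ms")
```

In this simplified model, dividing every buffer size by eight divides the estimated buffering delay by eight as well.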

Figure 2

When working with FPGAs, the use of internal FIFOs to interconnect processing blocks mitigates the impact of buffering on the processing time by eliminating external memory read/write operations.


GNU Radio addresses this problem by providing a mechanism to set a latency limit for an entire flow graph and, if required, to adjust the limit of each processing block individually. Buffer sizes can be reduced until task overruns start occurring. Increasing the frequency of read/write operations uses more processing resources, but the blocks run at minimal latency. For a given throughput, the end result is the ability to trade GPP resources for delay reduction (until task overruns occur).

Building on this, researchers from the University of Tokyo have suggested a method for automatic buffer-size tuning based on the average throughput of each individual processing block, leading to an optimal delay for an entire GNU Radio flow graph without manual intervention.

Unless otherwise specified, GNU Radio runs a dynamic scheduler aimed at optimizing the throughput of processing blocks. Processing blocks in a flow graph pass vectors of samples from their output to the input of the next block. The sizes of these vectors are adjusted automatically depending on the time required to process the input data. Throughput (the amount of data processed in a given period of time) is prioritized over latency. The buffer size for each processing block is set so that it can hold as much data as the block can possibly generate each time it is called. As a result, processing blocks are often called with a large number of samples to process. In terms of processing speed (throughput), this is highly efficient, since time is spent processing samples rather than reading and writing data to memory. Smaller vectors would cause the scheduler to call the processing block more often, requiring more read/write operations to retrieve and store the samples, which increases the workload and reduces the throughput. The downside of optimizing for throughput alone is the large latency incurred while blocks process large vectors of data.
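The efficiency argument above can be sketched with a simple cost model in which every call to a block pays a fixed overhead plus a per-sample processing cost. This is not how the GNU Radio scheduler measures anything; the overhead and per-sample figures are assumptions chosen only to show the trend.

```python
def throughput_items_per_s(chunk_items, call_overhead_s, per_item_cost_s):
    """Items processed per second when each call handles `chunk_items` items."""
    time_per_call = call_overhead_s + chunk_items * per_item_cost_s
    return chunk_items / time_per_call


# Assumed costs (hypothetical): 10 us of overhead per call, 50 ns per sample.
CALL_OVERHEAD_S = 10e-6
PER_ITEM_COST_S = 50e-9

rates = {}
for chunk in (64, 1024, 16384):
    rates[chunk] = throughput_items_per_s(chunk, CALL_OVERHEAD_S, PER_ITEM_COST_S)
    print(f"{chunk:6d} items/call -> {rates[chunk] / 1e6:.2f} M items/s")
```

Under these assumptions, throughput rises with the vector size and approaches the limit set by the per-sample cost, while the time spent on each call (and therefore the latency) grows with it.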

GNU Radio added a new feature in version 3.5.1 to address this problem. For applications with strict latency requirements, the gr_top_block can take (as an argument) a limit on the number of output samples a processing block handles per call. The example discussed here uses a limit of 1000 samples.

A processing block may receive fewer samples than this given number, but never more. The parameter serves as an upper limit to the latency any block of the flow graph may have. One drawback, however, is that by limiting the number of samples per call, the scheduler’s overhead is increased and the efficiency (throughput) of each block is reduced. On the other hand, this enables projects with extra processing resources to use them to reduce latency and guarantee that the given requirement will be met. This method exerts global latency control over the entire flow graph.
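The effect of such a cap can be pictured with a toy model that splits the samples available to a block into successive calls. The function name split_into_calls is hypothetical and the model is not the GNU Radio scheduler, which may also hand a block fewer samples for other reasons.

```python
def split_into_calls(available_items, max_noutput_items):
    """Return the item count handed to the block on each successive call."""
    calls = []
    remaining = available_items
    while remaining > 0:
        n = min(remaining, max_noutput_items)  # never exceed the cap
        calls.append(n)
        remaining -= n
    return calls


calls = split_into_calls(2500, 1000)
print(calls)  # [1000, 1000, 500]
```

With the cap at 1000, the 2500 available samples need three calls instead of one, which is where the extra scheduler overhead comes from.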

In a subsequent release, set_max_noutput_items was added to the GNU Radio API in order to give a processing block the ability to override the global latency parameter with one of its own. In the example given in Figure 2, the global setting is 1000 samples (max), except for the processing block named flt, which can receive up to 2000 samples.
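A minimal sketch of how the two settings might be combined from Python is shown below. It assumes a top block exposing start(max_noutput_items) and blocks exposing set_max_noutput_items(n), as the GNU Radio Python API does; the helper configure_latency and the variable names tb and flt are illustrative and not part of GNU Radio.

```python
def configure_latency(top_block, overrides, global_max=1000):
    """Apply per-block caps, then start the flow graph with a global cap.

    top_block: object with start(max_noutput_items), e.g. a gr.top_block.
    overrides: iterable of (block, cap) pairs; each block must provide
               set_max_noutput_items(cap).
    """
    for block, cap in overrides:
        # Per-block cap overrides the global one for that block only.
        block.set_max_noutput_items(cap)
    # Every other block is limited to global_max items per call.
    top_block.start(global_max)


# Usage with an already connected flow graph (names are assumptions):
#   configure_latency(tb, [(flt, 2000)])   # global cap 1000, flt capped at 2000
#   tb.wait()
```

Passing the overrides as a list of pairs keeps the helper independent of how a particular application stores its blocks.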


A more advanced feature, set_max_output_buffer, was also added to the GNU Radio API to restrict the actual size (specified in items) of each output buffer. Unlike the set_max_noutput_items method (based on the number of samples taken by a block), this method, based on the actual buffer size, prevents a buffer from accumulating data when the subsequent block lags. Experimental test results for GNU Radio features aimed at better controlling the latency can be found at

In their article, the University of Tokyo authors see these new features as an opportunity to develop a method for automatically tuning the buffer size based on the average throughput of each individual processing block, leading to an optimal delay for an entire GNU Radio flow graph without manual intervention. The average throughput of a processing block is computed as the number of output samples produced over a given execution time.
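The throughput measure described above can be sketched as a small accumulator. This only illustrates the definition (output samples divided by execution time); the class name ThroughputMeter and the sample values are hypothetical, and the authors' actual tuning algorithm is not reproduced here.

```python
class ThroughputMeter:
    """Accumulate output samples and execution time for one processing block."""

    def __init__(self):
        self.total_items = 0
        self.total_time_s = 0.0

    def record(self, output_items, elapsed_s):
        """Log one call that produced `output_items` in `elapsed_s` seconds."""
        self.total_items += output_items
        self.total_time_s += elapsed_s

    def average(self):
        """Average throughput in samples per second over all recorded calls."""
        if self.total_time_s <= 0:
            raise ValueError("no execution time recorded yet")
        return self.total_items / self.total_time_s


meter = ThroughputMeter()
meter.record(1000, 0.002)  # made-up measurements
meter.record(500, 0.001)
print(f"{meter.average():.0f} samples/s")
```

Averaging over accumulated totals rather than per-call rates keeps short, noisy calls from dominating the estimate.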

By modifying the gr_block executor class, the authors implemented a method for measuring the average throughput of each block and comparing the average throughput to a metric they define as the magnification ratio. By modifying the gr_buffer class, they have implemented a method for dynamic buffer reallocation.

Table 1 from Infocom Saganuma shows the experimental results of tuned buffer sizes and average measured latency when using the flow graph benchmark as a test bench. The latency was reduced from 16.01 ms to 4.42 ms by the tuning algorithm.

Table 1


Part of the extra latency observed when implementing a wireless communication PHY layer on a GPP (compared to the FPGA approach) comes from the buffering of individual processing blocks of the PHY. The SDR development toolkit GNU Radio has features to deal with this phenomenon, enabling it to be used in the development of wideband and low-latency PHY layers. GNU Radio lets you control the maximum delay of a complete flow graph, as well as that of individual processing blocks, in exchange for extra processing resources.