Context

As part of a research project, the REDS institute was tasked with designing a system that retrieves a data stream from a camera and transfers it to a CPU with minimal delay. We used a SoC-FPGA board from Xilinx’s Zynq family. This type of device combines a CPU core (PS) with a programmable FPGA part (PL). To meet the throughput requirement, the chosen solution was to use a DMA, implemented in the FPGA part (PL), to transfer data to the processor’s main memory (PS). Unfortunately, the performance achieved was well below the customer’s expectations, even though the DMA’s input data rate (on the PL side) was adequate: the actual throughput available in user space under Linux was much lower. It turned out that the team lacked experience in DMA configuration and driver development under Linux. Nevertheless, the need for DMA transfers is common to a large number of projects at the institute.

System-on-Chip (SoC) FPGA components feature highly efficient buses between the processor system (PS) and the programmable logic (PL). However, implementing a Direct Memory Access (DMA) system to fully leverage these high-speed buses remains a significant challenge.

Introduction

In this post we explore the DMA solutions offered by Xilinx, document their setup, and establish benchmarks and guidelines to facilitate the implementation of such systems in future projects. The aim is to build a comprehensive documentation base from scratch, starting with a design that transfers the contents of a counter in the PL to the Linux system running on the PS. The configuration is then made progressively more complex to thoroughly explore the available capabilities.

The proposed DMA provides data at a set frequency. The goal is to deliver this data to a user space application with the highest possible throughput, determining the most effective strategy for this transmission. With a DMA that enables transmission at a specific frequency, the system must supply the user space with a data amount as close as possible to the maximum transmitted by the DMA. Therefore, it is crucial to identify the bottleneck and minimize it to the greatest extent possible.

System

Board – Zybo Z7

The Zybo Z7 is a feature-rich, ready-to-use development board built around the Xilinx Zynq-7000 family. The Zynq family is based on the Xilinx All Programmable System-on-Chip (AP SoC) architecture, which tightly integrates a dual-core ARM Cortex-A9 processor with Xilinx 7-series Field Programmable Gate Array (FPGA) logic.

ZYNQ processor:

  • 667 MHz dual-core Cortex-A9 processor
  • DDR3L memory controller with 8 DMA channels and 4 High Performance AXI3 Slave ports
  • High-bandwidth peripheral controllers: 1 Gb Ethernet, USB 2.0, SDIO
  • Low-bandwidth peripheral controllers: SPI, UART, CAN, I2C
  • Programmable from JTAG, Quad-SPI flash, and microSD card
  • Programmable logic equivalent to Artix-7 FPGA

Memory:

  • 1 GB DDR3L with 32-bit bus at 1066 MHz
  • 16 MB Quad-SPI Flash with factory programmed 128-bit random number and 48-bit globally unique EUI-48/64™ compatible identifier

This project has been developed with the Zybo Z7-10 model.

Zybo Z7-10 (Zybo reference manual)

The Zynq APSoC is divided into two distinct subsystems: the PS and the PL. The figure below shows an overview of the Zynq APSoC architecture, with the PS colored in light green and the PL in yellow.

Zynq AP SoC architecture (Zybo reference manual)

AXI DMA IP

The use of a Direct Memory Access (DMA) controller is essential to offload the Central Processing Unit (CPU) when transferring large amounts of data. There are several methods to perform DMA transfers in a Zynq-based device.

The Zynq chip offers a hard DMA controller embedded in the PS side of the chip, the PS DMA Controller (DMAC), based on the ARM PrimeCell DMA Controller (PL330).

The main disadvantage of the DMAC is that its interface to the PL goes through the AXI General Purpose (GP) ports (shown in the picture below), which have higher latency and lower bandwidth, due to their narrower data width and the central interconnect delays, than the AXI High Performance (HP) ports used by the AXI DMA IP.

The AMD LogiCORE™ IP AXI Central Direct Memory Access (CDMA) core is a soft AMD Intellectual Property (IP) core that can be used with the Vivado™ Design Suite. The AXI CDMA provides high-bandwidth Direct Memory Access (DMA) between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol. An optional Scatter Gather (SG) feature can be used to offload control and sequencing tasks from the system CPU. Initialization, status, and control registers are accessed through an AXI4-Lite slave interface, suitable for the AMD MicroBlaze™ processor.

The AXI CDMA provides a high-performance CDMA function for embedded systems.

AXI Central Direct Memory Access v4.1 LogiCORE IP Product Guide (PG034)

Overview of the blocks:

  • Register module: This block contains control and status registers that allow you to configure AXI CDMA through the AXI4-Lite slave interface.
  • Scatter/Gather block: The AXI CDMA can optionally include Scatter/Gather (SG) functionality for off-loading CPU management tasks to hardware automation. The Scatter/Gather Engine fetches and updates CDMA control transfer descriptors from system memory through the dedicated AXI4 Scatter Gather Master interface. The SG engine provides internal descriptor queuing, which allows descriptor prefetch and processing in parallel with ongoing CDMA data transfer operations.
  • DMA controller: This block manages overall CDMA operation. It coordinates DataMover command loading and status retrieval, and updates the status back to the Register Module.
  • DataMover: The DataMover is used for high-throughput transfer of data. The DataMover provides CDMA operations with 4 KB address boundary protection, automatic burst partitioning, and can queue multiple transfer requests. Furthermore, the DataMover provides byte-level data realignment (for up to 512-bit data widths) allowing the CDMA to read from and write to any byte offset combination.

AXI DMA has two independent high-speed channels:

  • MM2S (Memory-Mapped to Stream): transports data from system memory to a stream target
  • S2MM (Stream to Memory-Mapped): transports data from a stream source, for instance a VGA sensor, to system memory

AXI Central Direct Memory Access with endpoint

This DMA has a Scatter Gather (SG) mode available. SG DMA relies on the use of descriptors. A descriptor is a data structure that provides the DMA controller with information about each segment of the transfer. It typically includes the source address, destination address, transfer length, and optionally a link to the next descriptor. Descriptors are chained together in memory, forming a linked list that the DMA controller follows.
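As a rough illustration of what such a descriptor contains, the sketch below uses generic, hypothetical field names; the exact AXI DMA buffer descriptor layout is defined in PG021.

#include <stdint.h>

/* Illustrative scatter-gather descriptor. The field names are hypothetical
 * and do not match the exact AXI DMA buffer descriptor register layout. */
struct sg_descriptor {
    uint32_t next_desc;  /* physical address of the next descriptor        */
    uint32_t src_addr;   /* source address of this segment                 */
    uint32_t dst_addr;   /* destination address of this segment            */
    uint32_t length;     /* number of bytes to transfer                    */
    uint32_t control;    /* start/end-of-frame flags, interrupt enable ... */
    uint32_t status;     /* completion and error bits written by the DMA   */
};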

The SG mode reduces CPU involvement in data transfer operations. DMA in cyclic mode is a type of scatter gather DMA designed to handle audio and video streams by continuously cycling through a buffer without stopping (see the Cyclic DMA Mode section of PG021: https://docs.amd.com/r/en-US/pg021_axi_dma/Cyclic-DMA-Mode).

The AXI DMA can be run in cyclic mode by making certain changes to the Buffer Descriptor (BD) chain setup. In cyclic mode, DMA fetches and processes the same BDs without interruption. The DMA continues to fetch and process until it is stopped or reset. To enable cyclic operation, the BD chain should be set up as shown in the figure below. 

Cyclic Buffer Descriptor setup
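The only structural change needed for cyclic operation is that the last descriptor points back to the first one instead of terminating the chain. A minimal sketch, reusing the hypothetical descriptor structure from the previous example:

#include <stddef.h>
#include <stdint.h>

/* Chain n descriptors into a ring: descriptor i points to i + 1 and the last
 * one points back to the first, so the DMA never runs out of descriptors.
 * 'phys' holds the physical address of each descriptor as seen by the DMA. */
static void link_descriptors_cyclic(struct sg_descriptor *desc,
                                    const uint32_t *phys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        desc[i].next_desc = phys[(i + 1) % n];
}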

Linux

The Processing System (PS) runs Linux (linux-xlnx). This Linux distribution is modified to include the libraries (listed later) that enable benchmarking of the various components impacting data transfer performance.

Buildroot is used to generate the toolchain, the rootfs, the bootloader and the Linux binary.

The specifications of the system are: 

The generated bitstream is loaded by the bootloader during the boot phase.

Custom counter

Initially, the project was intended to use images from a camera. However, this did not allow for flexible streaming or the ability to validate the received data. Therefore, it was decided to create a data generator based on a counter.

Each data point consists of 32 bits. The counter can be configured to generate a continuous stream of data or to generate chunks of data with a configurable pause between chunks.

The custom counter allows simulation of a data stream and eases the detection of missing data. This stream can be configured in various ways to simulate the transfer of different types of data. The stream parameters are detailed below.

  • Clock tick: clock of the counter.
  • Frame: window of N clock ticks reserved for sending the data stream. One frame contains one or multiple packets.
  • Packet: contains N data words, each sent on one clock tick without interruption.
  • Pause: between two packets, a configurable number of “pause” ticks during which no data is sent.

Chronogram of the custom counter functionalities
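Since each 32-bit value is simply the previous one plus one, the receiver can detect and quantify missing data with a simple check. The sketch below is only illustrative; the function and variable names are not part of the project’s code.

#include <stddef.h>
#include <stdint.h>

/* Walk through a received chunk of counter samples and count how many values
 * are missing with respect to a strictly incrementing sequence. '*expected'
 * carries the next expected value from one chunk to the next. */
static uint32_t count_missing(const uint32_t *chunk, size_t n, uint32_t *expected)
{
    uint32_t missing = 0;

    for (size_t i = 0; i < n; i++) {
        missing += chunk[i] - *expected;  /* 0 when nothing was lost */
        *expected = chunk[i] + 1;
    }
    return missing;
}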

Benchmark

Setup

AXI DMA

The DMA buffer size is 2²⁰ bytes and it is used as a circular buffer. The size of a single buffer descriptor should not exceed half of the DMA buffer size (2¹⁹ bytes): while one half of the buffer is being transferred to memory, the DMA keeps accepting data from the stream into the other half, which preserves data coherency.

The DMA is connected to the HP0 port. The clock speed is 100 MHz, and the bus width is 32 bits, resulting in a theoretical maximum transfer speed of 400 MB/s. 

Application

A Linux kernel driver is implemented to allocate the DMA buffers, initialize the DMA, configure it in cyclic mode so that data is continuously transferred into the circular buffer, and expose configuration options to control the DMA settings from user space.
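A minimal sketch of how such a driver can set up the cyclic transfer through the standard Linux dmaengine framework is shown below. This is not the project’s exact code: the channel name "rx", the buffer sizes and the omitted error handling are assumptions.

#include <linux/dma-mapping.h>
#include <linux/dmaengine.h>
#include <linux/err.h>

#define RING_SIZE   (1 << 20)        /* 2^20-byte circular DMA buffer       */
#define PERIOD_SIZE (RING_SIZE / 2)  /* one interrupt per half of the ring  */

static int start_cyclic_dma(struct device *dev)
{
    struct dma_async_tx_descriptor *desc;
    struct dma_chan *chan;
    dma_addr_t dma_handle;
    void *ring;

    /* "rx" is a hypothetical channel name; it must match the device tree. */
    chan = dma_request_chan(dev, "rx");
    if (IS_ERR(chan))
        return PTR_ERR(chan);

    /* Coherent (non-cacheable on this platform) buffer filled by the DMA. */
    ring = dma_alloc_coherent(dev, RING_SIZE, &dma_handle, GFP_KERNEL);
    if (!ring)
        return -ENOMEM;

    /* Cyclic transfer: the engine wraps around the ring indefinitely and
     * raises an interrupt every PERIOD_SIZE bytes. */
    desc = dmaengine_prep_dma_cyclic(chan, dma_handle, RING_SIZE, PERIOD_SIZE,
                                     DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
    if (!desc)
        return -EIO;

    desc->callback = NULL;           /* a completion callback would be set here */
    dmaengine_submit(desc);
    dma_async_issue_pending(chan);
    return 0;
}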

The kernel driver provides a data buffer to user space via the copy_to_user() function.

The HP0 port does not go through the snoop control unit, so cache coherency is not ensured by hardware. A non-cacheable buffer is therefore reserved to store the data coming from the DMA. When a DMA transfer completes the acquisition of a chunk of data, there is a time window before the next transfer starts.

Our approach involves initially transferring the acquired data into a non-cacheable buffer in kernel space. This buffer ensures data consistency and real-time processing capability by bypassing the CPU cache. Immediately following the DMA transfer, the data is quickly copied into a second, cacheable buffer. This secondary buffer allows for much faster data access due to the benefits of caching, which is contingent upon the size of the cache line and the manner in which the data is accessed.
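A sketch of this double-buffer strategy is shown below. The structure and names are ours and only illustrate the idea: copy from the non-cacheable ring right after the DMA interrupt, then serve user space from the cacheable staging buffer.

#include <linux/fs.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* Hypothetical driver state. */
struct transfer_ctx {
    void   *ring;         /* non-cacheable DMA ring buffer              */
    void   *staging;      /* cacheable kernel buffer holding one chunk  */
    size_t  read_offset;  /* offset of the half just filled by the DMA  */
    size_t  chunk_size;
};

/* DMA completion callback: move the fresh chunk into the cacheable buffer. */
static void dma_complete_cb(void *data)
{
    struct transfer_ctx *ctx = data;

    memcpy(ctx->staging, (char *)ctx->ring + ctx->read_offset, ctx->chunk_size);
}

/* read() handler: hand the staged chunk over to user space. */
static ssize_t chunk_read(struct file *file, char __user *ubuf,
                          size_t count, loff_t *ppos)
{
    struct transfer_ctx *ctx = file->private_data;
    size_t len = count < ctx->chunk_size ? count : ctx->chunk_size;

    if (copy_to_user(ubuf, ctx->staging, len))
        return -EFAULT;
    return len;
}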

As it stands, the driver does not ensure the validity of the data. If the buffer being read is overwritten by the DMA transferring data, the behavior is undefined.

The kernel driver exposes values to configure the DMA. The counter used to simulate the data stream is configurable with a similar system.

The AXI DMA driver does not provide any means to determine the size of the transferred data; one must rely on the values used during configuration and assume that the hardware is functioning correctly. This feature can be added to the AXI DMA driver.

System analysis

The rootfs was updated to include the tools used to analyze the system performance and its limitations, as well as to measure execution times and identify bottlenecks.

1- Tinymembench

Tinymembench is a simple memory benchmark program that tries to measure the peak bandwidth of sequential memory accesses and the latency of random memory accesses. Bandwidth is measured by running different assembly routines on aligned memory blocks and attempting different prefetch strategies.

Tinymembench output

The C copy speed (342.6 MB/s) represents the baseline performance of a simple memory copy operation. It reflects the speed you can expect from straightforward copying operations.

2- Cache calibrator

Cache calibrator is a tool designed to measure and calibrate the performance characteristics of the CPU cache. This tool is useful for understanding the behavior and efficiency of the cache subsystem, which is critical for optimizing software performance, especially in high-performance and real-time applications.

Usage: 

./calibrator [MHz] [size] [filename]
  • [MHz]: gives the CPU’s clock speed in megahertz
  • [size]: gives the amount of memory to be used
  • [filename]: gives the filename where the results are stored

Command used: 

cache_calibrator 667 16M cache_calibration_results.txt

Cache calibrator output

Caches:

  • Level: indicates this is the L1 cache
  • Size: The L1 cache size is 32 KB.
  • Linesize: The size of each cache line is 32 bytes.
  • Miss-latency: The time it takes to fetch data from the next memory level when there is a cache miss. For the L1 cache, this is 44.65 ns (or 30 cycles).
  • Replace-time: The time it takes to replace an old cache line with a new one when there is a cache miss. This is also 44.67 ns (or 30 cycles), which suggests that the miss latency and replace time are effectively the same in this context.

TLBs:

  • Level: Indicates this is the L1 TLB.
  • Entries: The number of entries in the TLB is 32.
  • Pagesize: Each TLB entry maps a page of 4 KB.
  • Miss-latency: The time it takes to handle a TLB miss, which is 18.78 ns (or 13 cycles). A TLB miss occurs when a virtual memory address translation is not found in the TLB, and the system must look up the address mapping in the page table.

3- Ftrace

Ftrace is configured to gather information about the bottleneck functions; it is used to measure the amount of time spent in those functions.

DMA transfer performance

Variables:

  • Bf: Bus frequency in Hz
  • Bs: Bus size in bytes
  • Cs: Chunk size in bytes
  • Cl: Clock ticks lost

The formula used to calculate the maximum theoretical throughput is shown below:
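Assuming the counter delivers one Bs-byte word per clock tick and Cl ticks are lost at the start of each chunk, the throughput can be written as:

\[
\text{Throughput} = \frac{B_f \cdot B_s \cdot C_s}{C_s + C_l \cdot B_s}
\]

which tends towards Bf · Bs (400 MB/s here) as the chunk size grows.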

The data lost per transfer corresponds to the number of clock ticks during which the DMA cannot accept data from the counter. This amounts to 5 clock ticks every time a transfer from the DMA to memory starts. This was verified with the application by trying to transfer data without interruption: the custom counter IP was modified to keep sending data (and updating the values) even when the DMA could not receive it in its buffer. For every chunk of data sent, the result is the same: 5 values are lost.

Applying this formula to the setup:

  • Bf = 100’000’000 (100 MHz)
  • Bs = 4
  • Cs = x
  • Cl = 5
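Evaluating the expression above at, for example, a chunk size of 1500 bytes gives:

\[
\frac{10^8 \cdot 4 \cdot 1500}{1500 + 5 \cdot 4} \approx 394.7\ \text{MB/s}
\]

that is, roughly 98.7% of the 400 MB/s ceiling.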

Plot of the throughput formula

For chunk sizes above 1500 bytes, the amount of data lost is negligible and is not the bottleneck of the system. It is therefore required to use chunk sizes of at least 1500 bytes.

Copy to user performance

The Ftrace setup was used to measure the execution time of the copy_to_user function.

Copy to user performance (comparing chunk sizes)

The execution time scales linearly; slightly better performance is observed with bigger chunks of transferred data.

Use case

To assess the system’s behavior in a typical real-life application, we simulate a video streaming task that transfers N bytes at a chosen framerate.

From the DMA benchmarks, the maximum throughput to be expected is close to 400 MB/s; this is the upper bound on the amount of data that can be transferred.

Period of 1 iteration (data transfer and data processing): 
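With an iteration defined as one frame, the period window is simply the inverse of the framerate:

\[
T_{iter} = \frac{1}{\text{framerate}}
\]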

Copy to user execution time: 
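As an estimate, assuming the copy runs at the memcpy baseline measured earlier with tinymembench:

\[
t_{copy} \approx \frac{\text{image size}}{\text{copy bandwidth}}
\]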

For the measurements, we use a framerate of 120 FPS and a 640×480 grayscale image with 16 bits per pixel (640 × 480 × 2 = 614’400 bytes).

Theoretical benchmarking

Calculate the period window:
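With the chosen framerate of 120 FPS:

\[
T_{iter} = \frac{1}{120\ \text{FPS}} \approx 8.333\ \text{ms}
\]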

Calculate the copy to user execution time:
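Taking the measured memory copy bandwidth of roughly 342 MB/s as the baseline (an assumption consistent with the tinymembench result above):

\[
t_{copy} \approx \frac{614'400\ \text{B}}{342\ \text{MB/s}} \approx 1.796\ \text{ms}
\]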

Calculate ratio of the window period required to transfer the data from the kernel space to the user space.
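Dividing the copy time by the period window:

\[
\frac{t_{copy}}{T_{iter}} \approx \frac{1.796\ \text{ms}}{8.333\ \text{ms}} \approx 21.56\,\%
\]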

From this result, 78.44% of the iteration period remains to process the data.

Real benchmarking

Ftrace is used to measure the execution time of the copy_to_user function. The data gathered to analyze the performance comes from a transfer of 10’000 images.

Execution time (µs)
  • Minimum: 2’239
  • Average: 2’241
  • Maximum: 2’248

The execution time is stable: the variance and standard deviation are low, and the narrow range between minimum and maximum reinforces this stability.

The expected execution time was 1.796491 milliseconds and the observed average execution time is 2.241 milliseconds; the observed value is about 25% higher than the theoretical one.

Calculate ratio of the window period required to transfer the data from the kernel space to the user space:
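Using the measured average copy time and the same 8.333 ms period window:

\[
\frac{t_{copy}}{T_{iter}} \approx \frac{2.241\ \text{ms}}{8.333\ \text{ms}} \approx 26.89\,\%
\]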

From this result, 73.11% of the iteration period remains to process the data.

Conclusion

System analysis and various tests have allowed us to choose the optimal approach for delivering data to the user space. It was decided to copy the data into a cacheable buffer, enabling faster access to contiguous data. The final implementation consists of a kernel driver that, via a callback, services an interrupt from the DMA signaling that a chunk of data has been transferred and is available. A user space application then requests the copy of this data into a cacheable user space buffer. The identified bottleneck lies in the data copy process (copy_to_user), and the duration of the copy determines the remaining time available for applying an algorithm or analyzing the received data.

The analysis focused on a specific use case involving the transfer of 640x480x2 images at 120 FPS. The driver and user space application provide flexible configurations to modify these variables. Additionally, parameters such as chunk size and other settings detailed in the user guide can be adjusted. Depending on the simulated use case, the user space application for data access may require modifications.

It is important to note that the implementation of such a system depends on the needs. If it is unnecessary to access all the data, or if the size of the data to be analyzed is smaller, then another architecture might be more suitable. The development of the solution follows this principle. Both the driver and the user space application possess a certain flexibility, making the adaptation of the solution less laborious. It is possible to quickly modify the configuration to avoid buffer copying, allowing the user space application to access the data via mmap. Although accessing the data this way is slower, it becomes available more quickly. If there is no need to access the data in its entirety, this approach is advisable.

The system and benchmark demonstrate that no single solution perfectly handles every use case. Each approach has its pros and cons, but throughput and data availability can be monitored and traced to ensure the system adapts to the specific use case. The benchmarked use case successfully met the requirements, proving that different solutions are viable depending on the context.