Hardware-accelerated computation is a rising trend today. More and more, we are moving towards heterogeneous systems in which a CPU and one or more hardware accelerators collaborate to speed up computationally intensive tasks. FPGAs, with their reconfiguration capabilities, are very good candidates for hardware acceleration. One very important aspect in this field is data transfers between the CPU and a hardware accelerator, as they are critical in such a high-performance environment. On FPGAs, though, how does data transfer happen?
In this article, we will show you how to perform data transfers between a userspace application in Linux and a hardware entity in the FPGA through DMA operations, using Altera-Intel's mSGDMA IP. The demonstration is carried out on a DE1-SoC board running Linux, which comprises an Altera-Intel Cyclone V SoC.
More specifically, we will design a dummy entity in the FPGA which will receive data from one DMA and send data through a second DMA. Each DMA will interface with the platform's SDRAM. Then, we will write a simple character device driver for Linux which will drive both DMAs. It will provide simple read and write operations between one or more buffers in SDRAM and the FPGA. Finally, we will measure the throughput of a transfer from software to the FPGA with our test design.
The mSGDMA IP
But first, let's present the mSGDMA IP (modular Scatter-Gather Direct Memory Access). Its purpose is to carry out DMA transfers to and/or from an FPGA. As the name suggests, this DMA engine has a modular design: it gives the hardware designer the ability to tailor the DMA to their needs by enabling or disabling major features, so that the hardware footprint on the FPGA can be kept minimal. Furthermore, the mSGDMA is highly configurable with regard to data width, FIFO depth, maximum transfer length, etc.
Three mutually exclusive modes are available:
- Memory-Mapped to Streaming: data read from a memory location in SDRAM is buffered by the DMA and made available on an Avalon Streaming interface. Through this interface, a hardware entity can consume the incoming data.
- Streaming to Memory-Mapped: the same as Memory-Mapped to Streaming, but data flows the other way. A hardware entity provides data through the Avalon Streaming interface; the DMA buffers it, then writes it to a memory location in SDRAM.
- Memory-Mapped to Memory-Mapped: the DMA copies data from one memory location to another.
Instantiation and setup of the mSGDMA
For the rest of the article, we assume that you are familiar with Qsys (now Platform Designer in Quartus 18.0 and above) and able to create a new design from scratch.
In our example, we will instantiate two mSGDMAs. One will be in Memory-Mapped to Streaming mode, the other in Streaming to Memory-Mapped mode. This configuration will show both read and write operations in action.
First, start with an initial design that comprises a PLL and the HPS entity.
Before adding the two mSGDMAs, the HPS has to be configured so that it exposes two ports of the FPGA2SDRAM bus, one for each mSGDMA. FPGA2SDRAM is a bus that directly connects the FPGA fabric to the SDRAM of the HPS and isn't shared with any other peripheral. To expose the two ports, edit the HPS entity; then, in the FPGA interface tab, look for the FPGA-to-HPS SDRAM interface section and add a new port. We suggest configuring the two ports such that one is in write-only mode and the other in read-only mode. You can leave the interface at its maximum width, 256 bits, but note that a smaller width will lower the theoretical peak bandwidth of the interface.
Also, don’t forget to enable FPGA-to-HPS interrupts in the HPS configuration panel as we will make use of interrupts.
At this point, the HPS entity is correctly configured for the two mSGDMAs.
Instantiate two mSGDMAs and, in their respective configuration panels (right-click on the instance, then "edit"), configure them such that one is in Memory-Mapped to Streaming mode and the other in Streaming to Memory-Mapped mode. In the same configuration panel, tweak the parameters of each mSGDMA to your liking. Here are the main parameters:
- Data width: indicates how data is "packetized". The DMA handles words of 8, 16, 32 bits, and so on. Among other things, this parameter sets the data width of the Avalon Streaming interface.
- Data path FIFO depth: sets the size of the data FIFO buffer. This buffer holds data waiting to be transferred or consumed, depending on the direction of the transfer. The larger the buffer, the lower the chance that the DMA blocks because one end is not ready to consume data. Note that a larger buffer takes up more space in the FPGA.
- Descriptor FIFO depth: sets the size of the descriptor FIFO buffer. The larger the better, because if your DMA transactions are small (for example because your data is scattered in memory), you may want to chain as many descriptors as possible. Again, a larger buffer has a larger footprint in the FPGA.
- Maximum transfer length: specifies the maximum transfer length per descriptor.
- Burst enable and maximum burst count: enables or disables burst transfers and defines the maximum burst length allowed.
Once they are instantiated and appropriately configured, connect the mSGDMAs: each one connects to one of the Avalon MM ports exposed by the HPS, and the PLL drives them. Moreover, the CSR and descriptor_slave ports of each mSGDMA must be connected to the lightweight HPS-to-FPGA AXI interface of the HPS. These ports give access to the registers that drive a mSGDMA. Once connected, they get mapped in memory at their respective offsets relative to the base address 0xff200000. In our case, for example, all registers of the CSR port of msgdma_1 are available from address 0xff200040. Finally, connect the interrupt sources of the two mSGDMAs to the HPS, each on a distinct interrupt line.
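As a reference for the driver code later on, these mappings can be captured as a few constants. This is a minimal sketch based on our Qsys address map; only the 0xff200040 offset of msgdma_1's CSR port is given above, so the offset assigned to msgdma_0 here is an assumption and must match your own design:

/* Base addresses as assigned by Qsys in our design; adjust them
 * to match your own address map. */
#define LWHPS2FPGA_BASE   0xff200000UL /* lightweight HPS-to-FPGA bridge */
#define MSGDMA0_CSR_BASE  (LWHPS2FPGA_BASE + 0x00) /* assumed for msgdma_0 */
#define MSGDMA1_CSR_BASE  (LWHPS2FPGA_BASE + 0x40) /* msgdma_1, as above */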
The final step is to connect the Streaming Source and Streaming Sink interfaces of both mSGDMAs. For the demonstration, we are going to create a very simple entity called dummy that interfaces with both mSGDMAs. The data coming from the Streaming Source will be fed into the dummy entity and discarded. As for the Streaming Sink, the dummy entity will feed the interface with a counter value, incremented at each read request.
Now you can compile your design and program your FPGA. On our side, we target the Cyclone V SoC.
Software architecture and implementation
To make transfers happen, we are going to write a basic Linux driver specifically for the device we just created. The operations to be implemented by the driver are very simple. Given a device node /dev/msgdma_test attached to the driver, we want to be able to write/read data to/from the device:
# Write 1 GiB of data to the device.
dd if=/dev/zero of=/dev/msgdma_test bs=1M count=1024

# Read 1 GiB of data from the device.
dd if=/dev/msgdma_test of=/dev/null bs=1M count=1024
From now on, we assume that you have basic knowledge of Linux device driver development. To understand the rest of this section, you should know the basic principles of character device drivers, platform device drivers and interrupt handling.
We start off with a very basic character device driver. It should at least support read and write file operations. When the device is probed, the driver needs to initialize a few things first.
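The snippets that follow rely on a few definitions, which we sketch here first. This is a minimal sketch rather than the exact code of our driver: the register offsets follow the CSR and Standard Descriptor Format register maps of the Embedded Peripherals IP User Guide, we assume Qsys mapped each mSGDMA's descriptor slave port right after its CSR port (as in our design), and struct msgdma_data, the field names and the setbit_reg32() helper are our own.

#include <linux/bits.h>
#include <linux/completion.h>
#include <linux/io.h>
#include <linux/types.h>

/* One mSGDMA as seen from the lightweight HPS-to-FPGA bridge:
 * the CSR port (0x00..0x1f) followed by the descriptor slave
 * port in Standard Descriptor Format (0x20..0x2f). */
struct msgdma_reg {
	/* CSR port */
	u32 csr_status;          /* 0x00 */
	u32 csr_ctrl;            /* 0x04 */
	u32 csr_fill_level;      /* 0x08 */
	u32 csr_resp_fill_level; /* 0x0c */
	u32 csr_seq_number;      /* 0x10 */
	u32 csr_reserved[3];     /* 0x14..0x1f */
	/* Descriptor slave port */
	u32 desc_read_addr;      /* 0x20 */
	u32 desc_write_addr;     /* 0x24 */
	u32 desc_len;            /* 0x28 */
	u32 desc_ctrl;           /* 0x2c */
};

/* Bit masks used below, per the mSGDMA register maps. */
#define RESETTING          BIT(6)  /* CSR status: Resetting */
#define IRQ_PENDING        BIT(9)  /* CSR status: IRQ, write 1 to clear */
#define RESET_DISPATCHER   BIT(1)  /* CSR control: Reset Dispatcher */
#define GLOBAL_INT_EN_MASK BIT(4)  /* CSR control: Global Interrupt Enable */
#define GO                 BIT(31) /* descriptor control: Go */
#define TX_COMPLETE_IRQ_EN BIT(14) /* descriptor control: Transfer Complete IRQ */

/* Per-device driver state (names are ours). */
struct msgdma_data {
	struct msgdma_reg *msgdma0_reg; /* MM-to-ST DMA, write path */
	struct msgdma_reg *msgdma1_reg; /* ST-to-MM DMA, read path */
	void *wr_buf;                   /* DMA bounce buffer, CPU view */
	dma_addr_t wr_buf_phys;         /* same buffer, bus address */
	size_t buf_size;
	struct completion wr_done;
};

/* Small read-modify-write helper used throughout. */
static inline void setbit_reg32(void *reg, u32 mask)
{
	iowrite32(ioread32(reg) | mask, reg);
}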
As you will see, initializing the mSGDMA is very straightforward. First, it needs to be reset. To do so, set the Reset Dispatcher bit in the CSR control register, and poll the Resetting bit of the CSR status register until it clears to 0.
setbit_reg32(&reg->csr_ctrl, RESET_DISPATCHER);
while (ioread32(&reg->csr_status) & RESETTING)
	;
Next, as we will use interrupts, enable them by setting the Global Interrupt Enable Mask bit, also in the CSR control register.
setbit_reg32(&data->msgdma0_reg->csr_ctrl, GLOBAL_INT_EN_MASK);
That's all that is needed for initialization.
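Putting it together, here is what the probe function could look like. This is a sketch under a few assumptions: the driver is a platform driver matched through the device tree, the two register windows and the interrupt come from the device tree node, and the character device registration as well as the post-reset handling of msgdma_1 are elided. The msgdma0_irq_handler referenced here is shown further below.

#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/interrupt.h>
#include <linux/platform_device.h>

static int msgdma_probe(struct platform_device *pdev)
{
	struct msgdma_data *data;
	int irq, ret;

	data = devm_kzalloc(&pdev->dev, sizeof(*data), GFP_KERNEL);
	if (!data)
		return -ENOMEM;

	/* Map the two register windows declared in the device tree
	 * (0xff200000 and 0xff200040 in our Qsys design). */
	data->msgdma0_reg = devm_platform_ioremap_resource(pdev, 0);
	if (IS_ERR(data->msgdma0_reg))
		return PTR_ERR(data->msgdma0_reg);
	data->msgdma1_reg = devm_platform_ioremap_resource(pdev, 1);
	if (IS_ERR(data->msgdma1_reg))
		return PTR_ERR(data->msgdma1_reg);

	/* One coherent bounce buffer for the write path. */
	data->buf_size = 1 << 20; /* 1 MiB, arbitrary */
	data->wr_buf = dmam_alloc_coherent(&pdev->dev, data->buf_size,
					   &data->wr_buf_phys, GFP_KERNEL);
	if (!data->wr_buf)
		return -ENOMEM;

	init_completion(&data->wr_done);

	irq = platform_get_irq(pdev, 0);
	if (irq < 0)
		return irq;
	ret = devm_request_irq(&pdev->dev, irq, msgdma0_irq_handler, 0,
			       "msgdma0", data);
	if (ret)
		return ret;

	platform_set_drvdata(pdev, data);

	/* Reset the dispatcher, then enable global interrupts,
	 * exactly as described above. */
	setbit_reg32(&data->msgdma0_reg->csr_ctrl, RESET_DISPATCHER);
	while (ioread32(&data->msgdma0_reg->csr_status) & RESETTING)
		;
	setbit_reg32(&data->msgdma0_reg->csr_ctrl, GLOBAL_INT_EN_MASK);

	return 0;
}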
Now let's focus on the read and write operations that the driver must provide. When a userspace application wants to write data to or read data from the device, it invokes the write or read system call on the device node's file descriptor, passing a buffer address and a length. When the device's driver receives such a request, it has to initiate DMA transactions.
Submitting a DMA transaction is performed by writing to the memory-mapped descriptor slave port. In the Standard Descriptor Format, the port exposes four registers. Together, they form a descriptor to be consumed by the DMA:
- Read address: specifies the source address, in memory, of a transfer. This field has no meaning if the corresponding mSGDMA is in Streaming to Memory-Mapped mode, as there is no source address in this case.
- Write address: specifies the destination address, in memory, of a transfer. This field has no meaning if the corresponding mSGDMA is in Memory-Mapped to Streaming mode, as there is no destination address in this case.
- Length: specifies the length of a transfer. Recall that the mSGDMA has a maximum transfer length parameter set at instantiation.
- Control: specifies how the mSGDMA behaves when executing the descriptor. For example, it allows interrupt signaling when the current transfer completes.
When all fields have been filled with appropriate values, writing a '1' to the GO bit of the Control register submits the descriptor, which is then queued and ready to execute.
So, submitting a DMA transaction should look like this:
static void msgdma_push_descr(struct msgdma_reg *reg,
			      dma_addr_t rd_addr,
			      dma_addr_t wr_addr,
			      u32 len,
			      u32 ctrl)
{
	iowrite32(rd_addr, &reg->desc_read_addr);
	iowrite32(wr_addr, &reg->desc_write_addr);
	iowrite32(len, &reg->desc_len);
	iowrite32(ctrl | GO, &reg->desc_ctrl);
}
Transferring to or from a userspace buffer might be done in multiple transactions by the DMA. The first reason is that the mSGDMA may have a maximum transfer length per descriptor that is smaller than the length of the buffer; this maximum depends on its configuration at instantiation. The second reason is that the buffer might be scattered in physical memory, generally because of paging. So, the driver's read or write function should issue len/MAX_TX_LEN descriptors (rounded up) to the corresponding DMA, and the last one should fire an interrupt to signal the end of the transfer sequence. Activating an interrupt on a descriptor is as simple as:
/* Submits a descriptor with IRQ signaling enabled, which ends up
 * writing the TX_COMPLETE_IRQ_EN bit to '1' in the control field
 * of the descriptor. */
msgdma_push_descr(reg, src, dst, len, TX_COMPLETE_IRQ_EN);
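To show how these pieces fit together, here is a sketch of a complete write path built on the helpers above. It is an illustration under the same assumptions as before, not our exact driver code: the user buffer is bounced through the coherent buffer allocated in probe(), split into descriptors of at most MAX_TX_LEN bytes (a value that must match the IP's maximum transfer length parameter), and the caller sleeps until the interrupt handler signals completion. A production driver should also check the Descriptor Buffer Full bit of the CSR status register before pushing a descriptor.

#include <linux/fs.h>
#include <linux/uaccess.h>

#define MAX_TX_LEN 1024 /* example value; must match the IP parameter */

static ssize_t msgdma_write(struct file *filp, const char __user *buf,
			    size_t len, loff_t *off)
{
	struct msgdma_data *data = filp->private_data;
	struct msgdma_reg *reg = data->msgdma0_reg;
	size_t pos = 0;

	/* Bounce the user data into the DMA-coherent buffer. */
	len = min_t(size_t, len, data->buf_size);
	if (copy_from_user(data->wr_buf, buf, len))
		return -EFAULT;

	reinit_completion(&data->wr_done);

	/* One descriptor per MAX_TX_LEN chunk, IRQ on the last one only. */
	while (pos < len) {
		size_t chunk = min_t(size_t, len - pos, MAX_TX_LEN);

		msgdma_push_descr(reg,
				  data->wr_buf_phys + pos, /* read address */
				  0, /* no write address in MM-to-ST mode */
				  chunk,
				  pos + chunk == len ? TX_COMPLETE_IRQ_EN : 0);
		pos += chunk;
	}

	if (wait_for_completion_interruptible(&data->wr_done))
		return -ERESTARTSYS;

	return len;
}

/* IRQ handler registered in probe(): acknowledge and wake the writer. */
static irqreturn_t msgdma0_irq_handler(int irq, void *dev_id)
{
	struct msgdma_data *data = dev_id;

	/* The IRQ bit of the CSR status register is write-1-to-clear. */
	iowrite32(IRQ_PENDING, &data->msgdma0_reg->csr_status);
	complete(&data->wr_done);

	return IRQ_HANDLED;
}

The open() file operation, elided here, would simply store the msgdma_data pointer in filp->private_data.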
Up to this point, this is all that is needed to perform DMA transfers with the mSGDMA through a Linux device driver.
Data bandwidth through the mSGDMA
Let's have a look at transfer times. The following command writes 1 GiB of data to the device and measures the time between the first and last byte sent:
root@socfpga:~# time dd if=/dev/zero of=/dev/msgdma_test bs=1M count=1024
1024+0 records in
1024+0 records out

real    0m3.092s
user    0m0.000s
sys     0m1.676s
The dd command took a little more than 3 seconds to execute. By extrapolation, we can get a rough estimate of the mean throughput: 2^30 bytes sent over 3.092 seconds yields approximately 347.3 MB/s.
Of course, this estimate doesn't exactly match the real rate at which data is transferred by the mSGDMA. The dd command has some overhead, but most importantly, by design the driver leaves the mSGDMA idle between consecutive write requests. On the software side, there clearly is room for optimization.
Lastly, we can estimate the peak throughput achievable by the mSGDMA from the following characteristics:
- Clock frequency: the FPGA-to-SDRAM bus port, the mSGDMA and the entity attached to the Streaming interface must run in the same clock domain. The frequency will depend on the FPGA design.
- Port width on the FPGA-to-SDRAM bus: the wider the port, the more data the mSGDMA can transfer on the bus each clock cycle.
- Data port width on the Streaming Sink interface of the mSGDMA: the wider the port, the higher the throughput on the interface at the same frequency.
In our case, we configured the mSGDMA with the following parameters:
- FPGA-to-SDRAM data port width : 256 bits or 32 bytes
- Streaming interface data width : 32 bits or 4 bytes
The compiled design achieved 115 MHz on the Cyclone V. If we suppose that data is available and consumed on the Streaming interface each clock cycle, the 4-byte-wide Streaming interface at 115 MHz gives a theoretical peak throughput of 460 MB/s; the Streaming interface is thus the bottleneck. Consequently, we achieved a little more than 75% of the theoretical capacity of this configuration.
References
Embedded Peripherals IP User Guide, chapter 30 (mSGDMA core): https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_embedded_ip.pdf
Comments
Nice tutorial, and a very useful piece of IP. Would you be willing to share the full source for the mSGDMA character device driver?
Thank you for the feedback, we appreciate it!
A repository has been set up and the complete design (Quartus project and device driver) has been pushed to it.
Here is the link to the repository : https://gitlab.com/reds-public/msgdma_example
Nice work, the source code you shared is very useful to me!
I am wondering how I could use this driver from regular userspace C code?
I want to initiate an mSGDMA transaction after I use OpenCV to read an image.
The driver registers a character device. You can attach it to a device node with the mknod command via its major/minor number. Once the device node is created, it becomes the interface to the driver. Open the device node like a regular file and you can perform simple read and write operations on it to stream your data to/from the mSGDMA on FPGA.
See http://wiki.linuxquestions.org/wiki/Mknod for creating a device node.
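For example (the major number below is a placeholder; look up the one actually assigned to the driver in /proc/devices):

root@socfpga:~# grep msgdma /proc/devices
root@socfpga:~# mknod /dev/msgdma_test c <major> 0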
Very nice tutorial! Very instructive! Thank you.
I saw some dts code to build the device tree in your software part.
Is there any tutorial you would recommend to understand what's going on there?