QEMU is a wonderful system emulator capable of running full x86 systems with PCIe and NVMe, amongst many other platforms. QEMU is not only versatile and fast, but also open-source !

At REDS we use QEMU to facilitate hardware development through emulation, for example emulating FPGA platforms in software before going to real hardware, e.g., Full System Simulation (FSS) or Zynq Software-Hardware CoSimulation. The full visibility over the system that emulation provides makes development and debugging very effective. With a QEMU machine running a full Linux operating system we can debug not only userspace and kernel code but also the underlying (emulated) busses and hardware. This allows us to see everything and gives us a good baseline against which to compare behaviour on a non-emulated machine.

As an example, I’ll show how to add custom commands to the emulated NVMe device in QEMU. Let’s imagine we are developing a new NVMe firmware and want to experiment quickly on an existing, reliable base: QEMU is perfect for this.

In this blog post I’ll show how to add custom admin and IO commands to the existing QEMU NVMe emulated device. The NVMe specifications are available from nvmexpress.org and fully describe the NVMe model, architecture and command sets.

NVMe commands are sent through NVMe queues: each NVMe controller has one admin queue for admin commands and possibly multiple IO queues for IO commands. Once commands are executed, the controller fills completion queues that allow the host to know which commands have finished. (The queues reside in host memory and are accessed by the NVMe controller through PCIe when the host rings the controller’s doorbell).
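For reference, each command is a fixed 64-byte submission queue entry (SQE). Below is a simplified sketch of its layout; QEMU’s own definition (NvmeCmd in include/block/nvme.h) is equivalent but slightly more detailed.

#include <stdint.h>

/* Simplified layout of a 64-byte NVMe submission queue entry (SQE). */
struct sqe_sketch {
    uint8_t  opcode;     /* command opcode, the field we will play with below */
    uint8_t  flags;
    uint16_t cid;        /* command identifier, echoed back in the completion */
    uint32_t nsid;       /* namespace identifier */
    uint64_t rsvd;       /* dwords 2-3 */
    uint64_t mptr;       /* metadata pointer */
    uint64_t dptr[2];    /* data pointer (PRPs or SGL) */
    uint32_t cdw10;      /* command specific dwords 10..15 */
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};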

Following NVMe commands down into QEMU NVMe emulation

(The code and links in this article refer to QEMU 6.1 source code)

First, let’s have a look at how this is implemented in the QEMU NVMe emulation. In the QEMU project the NVMe controller code lies in hw/nvme/ctrl.c.

Let’s imagine the host wrote an admin command in the admin queue; for the moment there is no interaction with the NVMe controller. The host then rings the controller’s doorbell, which is done by writing a register of the controller in the PCIe memory space. Doorbells are registers that hold the current head and tail indices of the queues (which are implemented as circular buffers). Adding a new command to a submission queue means the queue has a new tail, and writing this new tail to the controller’s doorbell lets the controller know that there are new commands in the queue. (The host can write multiple commands before ringing the bell for better performance).
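To make the doorbell mechanism concrete, here is a small host-side sketch (the names are illustrative and not taken from a real driver; the offset formula is the one given by the NVMe Base Specifications):

#include <stddef.h>
#include <stdint.h>

/* Host-side sketch of ringing a submission queue tail doorbell.
 * regs points to the controller's BAR0 registers, dstrd is CAP.DSTRD. */
static void ring_sq_tail_doorbell(volatile uint32_t *regs, unsigned int sqid,
                                  uint32_t new_tail, unsigned int dstrd)
{
    /* SQ y Tail Doorbell lives at offset 0x1000 + (2 * y) * (4 << CAP.DSTRD) */
    size_t offset = 0x1000 + (2 * sqid) * (4 << dstrd);

    regs[offset / sizeof(uint32_t)] = new_tail;
}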

Writing to an NVMe doorbell results in nvme_process_db() being called in the controller. After some checks and after updating the queue head/tail values from the doorbell, the controller sets a timer associated with the queue. For example :

timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);

Where sq is a submission queue (cq would be a completion queue). Ringing a doorbell informs the controller that the host has added commands to a submission queue or removed completions from a completion queue. (The controller consumes from the submission queue and fills the completion queue).

In the QEMU NVMe emulation each queue has a timer; when the timer fires (here 500 ns of simulated time in the future), an associated callback is called. The timers are created in the nvme_init_sq() / nvme_init_cq() functions respectively, and from there we can find out which callbacks are associated with them.

static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
                         uint16_t sqid, uint16_t cqid, uint16_t size)
{
    ...
    sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
    ...
}

The callbacks are nvme_process_sq() and nvme_post_cqes() respectively. For simplicity let’s follow the submission queue on the way down; we’ll go back up through the completion later.

nvme_process_sq() iterates over the submission queue commands and copies them from host memory into the controller’s internal memory; internally the controller holds the commands as “requests” stored in QEMU Tail Queue (QTAILQ) structures. Inside the controller the requests are created once at (NVMe) queue initialisation and only move around until the (NVMe) queue is deleted, at which point the resources are freed. Each submission queue has two QTAILQ structures, “req_list” and “out_req_list”. The completion queues also hold a QTAILQ structure, “req_list”. Throughout a request’s life it is moved between these three structures. Once nvme_process_sq() has copied a command from the host into an internal “request” structure, the filled request is moved from the “req_list” into the “out_req_list”, which represents the fact that it is being processed. Later it will be moved to the completion queue’s “req_list”, and finally it returns to the initial submission queue’s “req_list” for future use.
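A simplified sketch of this bookkeeping, leaving out the DMA of the command from host memory and the queue head update that the real nvme_process_sq() also performs:

/* Sketch : take a free request, mark it as being processed and copy into it
 * the command that was read from the host submission queue. */
NvmeRequest *req = QTAILQ_FIRST(&sq->req_list);
QTAILQ_REMOVE(&sq->req_list, req, entry);
QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
memcpy(&req->cmd, &cmd, sizeof(cmd));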

nvme_process_sq() not only moves the requests but also calls nvme_admin_cmd() or nvme_io_cmd(), depending on whether the command is an admin or an IO command.

static void nvme_process_sq(void *opaque)
{
    ...
    while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
        ...
        status = sq->sqid ? nvme_io_cmd(n, req) :
            nvme_admin_cmd(n, req);
        if (status != NVME_NO_COMPLETE) {
            req->status = status;
            nvme_enqueue_req_completion(cq, req);
        }
    }
}

It does so by checking the queue ID: if the queue ID is 0 it is the admin queue, otherwise it is an IO queue.

NVMe admin commands

The admin commands are handled by nvme_admin_cmd(); this function is a big switch-case that decodes the command opcode. But before entering the switch-case it checks whether the command is supported by looking in the “nvme_cse_acs” (Commands Supported and Effects – Admin Command Set) table. This table holds information about the supported commands (and can be queried with a “get log page” NVMe admin command). The flag “NVME_CMD_EFF_CSUPP” must be set for the opcode, otherwise the command is rejected with an “NVME_INVALID_OPCODE” return code.

static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
{
    ...
    if (!(nvme_cse_acs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
        trace_pci_nvme_err_invalid_admin_opc(req->cmd.opcode);
        return NVME_INVALID_OPCODE | NVME_DNR;
    }
    ...
    switch (req->cmd.opcode) {
    case NVME_ADM_CMD_DELETE_SQ:
        return nvme_del_sq(n, req);
    case NVME_ADM_CMD_CREATE_SQ:
        return nvme_create_sq(n, req);
    case NVME_ADM_CMD_GET_LOG_PAGE:
        return nvme_get_log(n, req);
    case NVME_ADM_CMD_DELETE_CQ:
        return nvme_del_cq(n, req);
    case NVME_ADM_CMD_CREATE_CQ:
        return nvme_create_cq(n, req);
    case NVME_ADM_CMD_IDENTIFY:
        return nvme_identify(n, req);
    case NVME_ADM_CMD_ABORT:
        return nvme_abort(n, req);
    case NVME_ADM_CMD_SET_FEATURES:
        return nvme_set_feature(n, req);
    case NVME_ADM_CMD_GET_FEATURES:
        return nvme_get_feature(n, req);
    case NVME_ADM_CMD_ASYNC_EV_REQ:
        return nvme_aer(n, req);
    case NVME_ADM_CMD_NS_ATTACHMENT:
        return nvme_ns_attachment(n, req);
    case NVME_ADM_CMD_FORMAT_NVM:
        return nvme_format(n, req);
    default:
        assert(false);
    }

    return NVME_INVALID_OPCODE | NVME_DNR;
}

The admin commands are described in the NVMe Base Specifications, Section 5 “Admin Command Set”.

Here we can clearly see the parallel between the specifications and the switch-case above; in QEMU the definitions of the command opcodes can be found in include/block/nvme.h. If we look at the bottom of the specification’s opcode table, we find that the NVMe specifications reserve opcodes 0xC0 to 0xFF for “Vendor specific” commands.

If we want to add a new admin command, we should choose an opcode in the range 0xC0 to 0xFF. We can then add a case to the switch-case like this

    case NVME_ADM_CMD_MY_NEW_COMMAND:
        // NVME_ADM_CMD_MY_NEW_COMMAND = 0xC0 for example
        return run_my_new_admin_command();
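The opcode itself also needs to be defined somewhere. In QEMU 6.1 the existing NVME_ADM_CMD_* values live in include/block/nvme.h; a minimal addition could look like this (the value is just an example, any free vendor specific opcode works):

/* Vendor specific admin opcode used in this example,
 * added next to the existing NVME_ADM_CMD_* definitions. */
enum {
    NVME_ADM_CMD_MY_NEW_COMMAND = 0xc0,
};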

With this simple change we added a new command ! Of course we should now implement the “run_my_new_admin_command()” function, which could be as simple as

static uint16_t run_my_new_admin_command(void) {
    printf("Hello world through NVMe admin cmd !\n");
    return NVME_SUCCESS;
}

Of course one would want the command to do a little more, but this is a good start ! The return codes are defined in the same header, in case one would want to return something more sinister than “NVME_SUCCESS”. The QEMU emulation uses the reserved value 0xFFFF, named “NVME_NO_COMPLETE”, to indicate that the command has not finished yet. This is outside of the NVMe standard and is never returned to the host; it is used internally to handle things like asynchronous IO, which will set the correct return code after completion.
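As a hint of how to do “a little more”, the handler can take the controller and request as parameters, like the other handlers in the switch-case do, and look at the command dwords the host filled in. A minimal sketch (the call site then becomes run_my_new_admin_command(n, req)):

static uint16_t run_my_new_admin_command(NvmeCtrl *n, NvmeRequest *req)
{
    /* cdw10..cdw15 of the SQE are free for vendor specific use,
     * here we simply print cdw10 as passed by the host. */
    uint32_t arg = le32_to_cpu(req->cmd.cdw10);

    printf("Hello world through NVMe admin cmd ! (cdw10 = 0x%x)\n", arg);
    return NVME_SUCCESS;
}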

We now have code for our new command, but we still must update the “nvme_cse_acs” table in order to be able to use it.

static const uint32_t nvme_cse_acs[256] = {
    [NVME_ADM_CMD_DELETE_SQ]        = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_CREATE_SQ]        = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_GET_LOG_PAGE]     = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_DELETE_CQ]        = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_CREATE_CQ]        = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_IDENTIFY]         = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_ABORT]            = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_SET_FEATURES]     = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
    [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
    [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
    // Add support flag for our new command
    [NVME_ADM_CMD_MY_NEW_COMMAND]   = NVME_CMD_EFF_CSUPP,
};

Now the check for the support flag at the beginning of nvme_admin_cmd() will pass with our new opcode and it can be decoded by the switch case.

NVMe IO commands

For NVMe IO commands the emulation is very similar to the admin commands: it goes through nvme_io_cmd(), which checks the IO command opcode.

As for the admin commands, there is a test for support before going through the switch-case. Here the per-namespace “iocs” (I/O Commands Supported) table holds the support flags.

static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
{
    ...
    if (!(ns->iocs[req->cmd.opcode] & NVME_CMD_EFF_CSUPP)) {
        trace_pci_nvme_err_invalid_opc(req->cmd.opcode);
        return NVME_INVALID_OPCODE | NVME_DNR;
    }
    ...
    switch (req->cmd.opcode) {
    case NVME_CMD_WRITE_ZEROES:
        return nvme_write_zeroes(n, req);
    case NVME_CMD_ZONE_APPEND:
        return nvme_zone_append(n, req);
    case NVME_CMD_WRITE:
        return nvme_write(n, req);
    case NVME_CMD_READ:
        return nvme_read(n, req);
    case NVME_CMD_COMPARE:
        return nvme_compare(n, req);
    case NVME_CMD_DSM:
        return nvme_dsm(n, req);
    case NVME_CMD_VERIFY:
        return nvme_verify(n, req);
    case NVME_CMD_COPY:
        return nvme_copy(n, req);
    case NVME_CMD_ZONE_MGMT_SEND:
        return nvme_zone_mgmt_send(n, req);
    case NVME_CMD_ZONE_MGMT_RECV:
        return nvme_zone_mgmt_recv(n, req);
    default:
        assert(false);
    }

    return NVME_INVALID_OPCODE | NVME_DNR;
}

If we want to add a custom IO command we do exactly the same as for the admin commands, but we first have to check the NVMe specifications for the “Vendor specific” range. In the Base Specifications, Section 7 “I/O Commands”, there is only a limited table with a few opcodes.

The missing opcodes 0x01-0x0C are actually defined in the NVMe Command Set Specifications. There we can find the full table.

Here we can find the “Vendor specific” range, from 0x80 to 0xFF. We can add a new command exactly the same way as we did for the admin command: add a case to the switch-case.

    case NVME_CMD_MY_NEW_COMMAND:
        // NVME_CMD_MY_NEW_COMMAND = 0x80 for example
        return run_my_new_io_command();

And add the associated function.

static uint16_t run_my_new_io_command(void) {
    printf("Hello world through NVMe io cmd !\n");
    return NVME_SUCCESS;
}
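To actually move data back to the host, the handler can use the controller’s internal DMA helpers. The sketch below is hypothetical and assumes the nvme_c2h() (controller-to-host) helper present in this version of ctrl.c, which maps the command’s data pointer and copies a buffer to the host, as the Identify command does for its data; the extended signature with n and req would also be needed at the call site.

static uint16_t run_my_new_io_command(NvmeCtrl *n, NvmeRequest *req)
{
    /* Hypothetical variant that also returns a small payload to the host
     * through the data pointer of the command. */
    char msg[] = "Hello world through NVMe io cmd !";

    printf("%s\n", msg);
    return nvme_c2h(n, (uint8_t *)msg, sizeof(msg), req);
}

From the guest this could be exercised with nvme io-passthru by adding the --read and --data-len options, so that the host provides a buffer at least as large as the message.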

We also have to add the support flag in the “iocs” table (nvme_cse_iocs_nvm).

static const uint32_t nvme_cse_iocs_nvm[256] = {
    [NVME_CMD_FLUSH]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
    [NVME_CMD_WRITE_ZEROES]         = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
    [NVME_CMD_WRITE]                = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
    [NVME_CMD_READ]                 = NVME_CMD_EFF_CSUPP,
    [NVME_CMD_DSM]                  = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
    [NVME_CMD_VERIFY]               = NVME_CMD_EFF_CSUPP,
    [NVME_CMD_COPY]                 = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
    [NVME_CMD_COMPARE]              = NVME_CMD_EFF_CSUPP,
    // Add support flag for our new command
    [NVME_CMD_MY_NEW_COMMAND]       = NVME_CMD_EFF_CSUPP,
};

We now have two new commands, let’s find out how the result is returned to the host.

Following a completion back up to the host

Upon completion of the admin or IO command, the result (in our case “NVME_SUCCESS”) is set in the “request” status field at the end of the nvme_process_sq() function.

static void nvme_process_sq(void *opaque)
{
    ...
    while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
        ...
        status = sq->sqid ? nvme_io_cmd(n, req) :
            nvme_admin_cmd(n, req);
        if (status != NVME_NO_COMPLETE) {
            req->status = status;
            nvme_enqueue_req_completion(cq, req);
        }
    }
}

Here the nvme_process_sq() function checks that the status is not “NVME_NO_COMPLETE”, which is used to indicate that the operation is not done yet (e.g., async IO in the emulation). This has nothing to do with the NVMe standard; it is only used internally in the emulation (and a valid NVMe return code is set by a callback for these cases).

If the status is not “NVME_NO_COMPLETE”, e.g., “NVME_SUCCESS” or any other return code, then the function nvme_enqueue_req_completion() is called. This moves the “request” from the submission queue’s “out_req_list” to the completion queue’s “req_list”. Then nvme_enqueue_req_completion() sets the completion queue timer to fire 500 ns in the future.

The callback function associated with the timer is nvme_post_cqes(). Upon activation this function takes all “requests” stored in the completion queue’s “req_list” and, for each of them, writes the result back into the completion queue (which resides in host memory) by calling pci_dma_write(). Internally the “request” data structure is moved back to the submission queue’s “req_list” for future use, completing the circle. Once all pending completions (“requests” in the completion queue’s “req_list”) have been written to the host, the NVMe controller raises an interrupt with nvme_irq_assert(). This lets the host know there are new completions in the completion queue.
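What gets written back to the host is a fixed 16-byte completion queue entry (CQE); a simplified sketch of its layout is shown below (QEMU’s equivalent is NvmeCqe in include/block/nvme.h).

#include <stdint.h>

/* Simplified layout of a 16-byte NVMe completion queue entry (CQE). */
struct cqe_sketch {
    uint32_t result;    /* command specific result (DW0) */
    uint32_t rsvd;      /* DW1, reserved for most commands */
    uint16_t sq_head;   /* current submission queue head pointer */
    uint16_t sq_id;     /* submission queue the command came from */
    uint16_t cid;       /* command identifier taken from the SQE */
    uint16_t status;    /* phase tag and status field, e.g., NVME_SUCCESS */
};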

Here we have come full circle and followed an NVMe command down to execution and back up to the host through the completion. Our new commands should print a simple message in the output of the QEMU emulator upon activation and return a successful completion. More intricate operations can be studied by looking at the other commands already implemented.

Testing our new commands

In order to modify and run QEMU we need the source code; we can simply clone the git repository.

git clone https://gitlab.com/qemu-project/qemu.git

For the example code to work, let’s check out the version I used for this post and create a new branch so that we can hack away.

cd qemu
git checkout stable-6.1
git checkout -b my_hacky_branch

From here, apply the modifications to the QEMU source code. This is the fun part: you’ll have to explore the project on your own, find the different files and make your own changes (the relevant functions are all linked above).

If you just want to test the example code above you can apply the following git patch.

git am < 0001-Blog-example-new-commands.txt

This will apply the changes described above.

Building QEMU

mkdir build # from inside the qemu directory
cd build
../configure --target-list=x86_64-softmmu
make -j$(nproc)
./qemu-system-x86_64 -h # Show help to ensure executable is built

Feel free to enable KVM or other options through the “configure” script. If the configure command fails you probably lack some dependencies; install them accordingly. Configure may also report missing optional features, which is not an issue as long as the script finishes and the build succeeds. It is best to build once before starting to hack away, to ensure that the base project builds (and that all required support libraries are installed).

Running QEMU

Before we run QEMU we need a disk image to back our NVMe drive; we will use the qcow2 format, which generates a tiny file that only grows when written to. Since we won’t write to the NVMe drive in this example it will stay very small. We will run Ubuntu from a live-cd so no other drive is necessary.

# Note : the qcow2 format generates a small file that will grow when the disk is written to
./qemu-img create -f qcow2 nvm.qcow 1G # 1 Gbyte

Download the Ubuntu live-cd, either from the website or directly with the command line

wget https://releases.ubuntu.com/21.10/ubuntu-21.10-desktop-amd64.iso

We will run QEMU with the “q35” machine, 4 CPUs and 4G of RAM, but feel free to change this. If emulation is slow, don’t hesitate to use an accelerator such as KVM on Linux or HVF on macOS (see the documentation for others). Setting up acceleration for your system will make the VM much faster and more responsive.

./qemu-system-x86_64 -M q35 -smp 4 -m 4G \
     -drive file=nvm.qcow,if=none,id=nvm \
     -device nvme,serial=deadbeef,drive=nvm \
     -cdrom ubuntu-21.10-desktop-amd64.iso

The Ubuntu ISO is mounted as a CD-ROM to use it as a live-cd. (For further development it is recommended to add a drive to the VM and install the operating system instead of running the live-cd).

Testing our new NVMe commands from userspace

In order to test our new NVMe commands let’s use nvme-cli, which is available in Ubuntu as an apt package

sudo apt install nvme-cli

This utility allows us to send NVMe commands from the command line. It relies on libnvme, and libnvme itself uses the ioctl interface provided by the Linux NVMe driver.
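To illustrate what happens under the hood, here is a minimal standalone sketch that sends the same kind of vendor specific admin command directly through the kernel’s passthrough ioctl (opcode 0xC0 is the one we chose above; error handling is kept to a minimum):

/* Minimal admin command passthrough using the Linux NVMe ioctl interface.
 * Build with: gcc -o admin_passthru admin_passthru.c  (run as root) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    int fd = open("/dev/nvme0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/nvme0");
        return 1;
    }

    struct nvme_admin_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = 0xc0;   /* our vendor specific admin command */

    /* Negative means the ioctl failed, positive is an NVMe status code,
     * 0 means the command completed successfully. */
    int status = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
    printf("result: %d\n", status);

    close(fd);
    return status ? 1 : 0;
}

Running this in the guest should produce the same “Hello world” message in the QEMU console as the nvme-cli command shown below.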

The nvme-cli utility provides the following command to send admin commands. While there are many options, they should look familiar. Here we find the “--opcode” (or “-o”) option, which allows us to set the opcode. The other options and flags refer to other fields of the NVMe SQE (Submission Queue Entry) and are described in the NVMe Base Specifications, section 3.3.3.1. In the QEMU NVMe emulation they correspond to the NvmeCmd structure (and its specialisations, e.g., NvmeRwCmd).

nvme-admin-passthru <device> [--opcode=<opcode> | -o <opcode>]
                [--flags=<flags> | -f <flags>] [--rsvd=<rsvd> | -R <rsvd>]
                [--namespace-id=<nsid>] [--cdw2=<cdw2>] [--cdw3=<cdw3>]
                [--cdw10=<cdw10>] [--cdw11=<cdw11>] [--cdw12=<cdw12>]
                [--cdw13=<cdw13>] [--cdw14=<cdw14>] [--cdw15=<cdw15>]
                [--data-len=<data-len> | -l <data-len>]
                [--metadata-len=<len> | -m <len>]
                [--input-file=<file> | -i <file>]
                [--read | -r ] [--write | -w]
                [--timeout=<to> | -t <to>]
                [--show-command | -s]
                [--dry-run | -d]
                [--raw-binary | -b]
                [--prefill=<prefill> | -p <prefill>]

So let’s send our new NVMe admin command (You can use the -s option to show the NVMe SQE in the terminal).

sudo nvme admin-passthru /dev/nvme0 --opcode=0xc0

And voila ! If we check in our QEMU console, we can see the simple but very satisfying “Hello world through NVMe admin cmd !” message.

We can do the same for the IO command with

sudo nvme io-passthru /dev/nvme0 --opcode=0x80 --namespace-id=1

And we get our “Hello world through NVMe io cmd !” message.

Our new commands are executed in the emulated NVMe controller !

But why ?

Why would one want to add custom commands to NVMe ?

Well there are plenty of reasons, for fun and for profit.

The reason could be as mundane as customising the glow of an RGB-adorned NVMe drive. (This could also be achieved through a PCIe command, and some manufacturers do this).
(Image of a Gigabyte NVMe SSD)

Or more involved such as implementing compute functions inside an NVMe drive.
(Image from Computational Storage Architecture and Programming Model)

The possibilities are endless !

NGD have implemented TCP over NVMe (see slide 9, I suppose through custom read/write commands) to communicate with their computational storage processors through SSH.

With these kinds of possibilities, and the existence of NVMe-oF (NVMe over Fabrics) which makes it possible to run NVMe commands over TCP, we could imagine mad-scientist craziness such as running NVMe-oF over TCP over NVMe and going full circle 🙂

References