Next Friday, March 16th, our university is going to hold its annual open day (if you are close to Yverdon-les-Bains, it is a great opportunity to see many interesting projects!). We decided to present an SDR demo: a live feed from a professional camera goes through a custom-developed encoder, the signal is transmitted to a receiver and then to a decoder, and the result is finally displayed on a large screen. The whole chain is depicted in the picture below.

We decided to start from the great gr-dtv library available in the latest GNU Radio distribution. After some minor modifications (we added a re-sync mechanism on the receiver side and a modified gr-net sink able to output specific MPEG2-TS packets) we had a system up and running, but with a data rate slightly too low for our purposes. We decided to dig a little deeper into why, and this post details how we did it.
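
For the curious, the re-sync part is conceptually simple: MPEG2-TS packets are 188 bytes long and each one starts with the sync byte 0x47, so alignment can be recovered by hunting for that pattern in the byte stream. Here is a minimal sketch of the idea (illustrative only; the function name and signature are ours, not gr-dtv's):

#include <cstddef>
#include <cstdint>

// Illustrative only: find the offset of the first plausible MPEG2-TS
// packet boundary by checking that the 0x47 sync byte repeats every
// 188 bytes. Returns -1 if no alignment is found.
static const size_t TS_PACKET_SIZE = 188;
static const uint8_t TS_SYNC_BYTE = 0x47;

int find_ts_alignment(const uint8_t* buf, size_t len, size_t packets_to_check = 4)
{
    if (len < packets_to_check * TS_PACKET_SIZE)
        return -1;
    for (size_t offset = 0; offset < TS_PACKET_SIZE; offset++) {
        bool aligned = true;
        for (size_t k = 0; k < packets_to_check; k++) {
            if (buf[offset + k * TS_PACKET_SIZE] != TS_SYNC_BYTE) {
                aligned = false;
                break;
            }
        }
        if (aligned)
            return (int)offset;
    }
    return -1;
}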

We focus here on the receiver (stable/src/gnuradio/gr-dtv/examples/dvbt_rx_demo_8k.py), but similar techniques and principles can be applied more broadly. The first thing we tried, without great success, was the GNU Radio built-in performance monitor ctrlport; we then switched to more standard Linux monitoring tools.
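
For reference, ControlPort is not active by default; in the GNU Radio releases we used it is switched on in ~/.gnuradio/config.conf (the exact knobs may differ between versions, so check the documentation of your release):

[ControlPort]
on = True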

After starting the design with

/usr/bin/python2 -u stable/src/gnuradio/gr-dtv/examples/dvbt_rx_demo_8k.py

we used top to get the PID of the process. With this information we looked at which thread was demanding the most resources. htop can be handy for that, if correctly configured to display custom thread names (press the F2 key and, in the "Display options" sub-menu, select "Show custom thread names"). In our case:

htop -p PID

gave us

As you can see in the figure, thread 28667 (the Viterbi decoder) is the most demanding one.
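
As a side note, htop can only show meaningful names here because the flowgraph names its worker threads; on Linux this ultimately relies on pthread_setname_np(). Here is a minimal sketch of the mechanism (illustrative only, GNU Radio wraps this in its own thread helpers; build with g++ -pthread):

#include <pthread.h>
#include <unistd.h>

// Illustrative only: give a thread a name (at most 15 characters on
// Linux) so that tools like htop and perf can display it.
static void* worker(void*)
{
    pthread_setname_np(pthread_self(), "viterbi_decoder");
    for (;;)
        sleep(1); // stand-in for the real signal-processing loop
    return NULL;
}

int main()
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}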

Another way to tackle the problem is to use perf together with the FlameGraph approach. While running the GNU Radio flowgraph, we can record a trace with

sudo perf record -g -F 99 -p PID

and then extract a great image from the acquired trace with

sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flame.svg
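
The stackcollapse-perf.pl and flamegraph.pl scripts are part of Brendan Gregg's FlameGraph toolkit, available at https://github.com/brendangregg/FlameGraph.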

The full trace we obtained is available here (if you open the file with Firefox, for instance, you can click and zoom into the interesting parts of the image). As you can see once more, the Viterbi decoder is the most time-consuming part of the design. perf can also be used to dig even deeper into the design.

With

sudo perf top -t TID

we can annotate the execution of the thread on the fly and see the critical regions involved, as shown in the picture below.

Now, looking at the sources (src/gnuradio/gr-dtv/lib/dvbt/dvbt_viterbi_decoder_impl.cc), we can see what the critical code looks like:

// Trace back
for (i = 0, pos = store_pos; i < (ntraceback - 1); i++) {
  // Obtain the state from the output bits
  // by clocking in the output bits in reverse order.
  // The state has only 6 bits
  beststate = ppresult[pos][beststate] >> 2;
  pos = (pos - 1 + ntraceback) % ntraceback;
}

On amd64 the modulo operation is implemented with an IDIV instruction, and that is the critical point here. If you need to convince yourself, look up the latency of IDIV on r64 operands in Agner Fog's instruction tables.

After looking at the initialization of pos we know it always lies in [0, ntraceback) at the top of the loop, so we can modify the code as follows:

// pos - 1 + ntraceback lies in [ntraceback - 1, 2 * ntraceback - 2],
// so a single conditional subtraction is enough to wrap it back
pos = (pos - 1 + ntraceback);
if (pos >= ntraceback) {
  pos -= ntraceback;
}
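
If you want to convince yourself in isolation, a microbenchmark along these lines makes the gap easy to see (illustrative only, not the harness we used; the loop-carried dependency on pos keeps the division on the critical path, just like in the decoder):

#include <chrono>
#include <cstdio>

// Illustrative microbenchmark: wrap-around with % versus a conditional
// subtraction. ntraceback is read through a volatile so the compiler
// cannot replace the division with a cheaper constant sequence.
volatile int ntraceback_v = 24;

static long run_modulo(long iters, int n)
{
    long sum = 0;
    int pos = 0;
    for (long i = 0; i < iters; i++) {
        pos = (pos - 1 + n) % n; // IDIV on amd64
        sum += pos;
    }
    return sum;
}

static long run_condsub(long iters, int n)
{
    long sum = 0;
    int pos = 0;
    for (long i = 0; i < iters; i++) {
        pos = pos - 1 + n;
        if (pos >= n)
            pos -= n;
        sum += pos;
    }
    return sum;
}

int main()
{
    const long iters = 500000000L;
    const int n = ntraceback_v;
    using namespace std::chrono;

    steady_clock::time_point t0 = steady_clock::now();
    long s1 = run_modulo(iters, n);
    steady_clock::time_point t1 = steady_clock::now();
    long s2 = run_condsub(iters, n);
    steady_clock::time_point t2 = steady_clock::now();

    printf("modulo:   %ld ms (sum=%ld)\n",
           (long)duration_cast<milliseconds>(t1 - t0).count(), s1);
    printf("cond sub: %ld ms (sum=%ld)\n",
           (long)duration_cast<milliseconds>(t2 - t1).count(), s2);
    return 0;
}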

Nothing fancy, but after recompiling and measuring in our scenario we got a 30% improvement, which will do just fine for our demo next week. Whenever we find the time we will write about what we did after this, but the magnitude of the improvement from such an easy tweak seemed well worth a blog post.

There is a lot more that can be done; feel free to share your advice and experience, and we will be happy to learn from you!

Keep in touch,
A.