Next Friday, March 16th, our University is going to have its annual open day event (if you are close to Yverdon-les-Bains it is a great opportunity to see many interesting projects!). We decided to present an SDR demo: we take a live feed from a professional camera, pass it through a custom-developed encoder, transmit the signal to a receiver, then to a decoder, and finally display it on a large screen. The whole chain is depicted in the picture below.
We decided to start from the great gr-dtv library available in the latest GNU Radio distribution. After some minor modifications (we added a re-sync mechanism on the receiver side and a modified gr-net sink able to output specific MPEG2-TS packets) we had a system up and running, but with a slightly insufficient data rate for our purposes. We decided to dig a little deeper into why, and this post details how we did it.
We focus here on the receiver (stable/src/gnuradio/gr-dtv/examples/dvbt_rx_demo_8k.py), but similar techniques and principles can be applied more broadly. The first thing we tried, without great success, was the GNU Radio built-in performance monitor ctrlport; we then switched to more standard Linux monitoring tools.
After starting the design with
/usr/bin/python2 -u stable/src/gnuradio/gr-dtv/examples/dvbt_rx_demo_8k.py
we used top to get the PID of the process. With this information we looked at which thread was demanding the most resources. htop can be handy for that, if correctly configured to display custom thread names (this can easily be achieved by pressing the F2 key and, in the “Display options” sub-menu, selecting “Show custom thread names”). In our case:
htop -p PID
gave us
As you can see in the figure, thread 28667 (the Viterbi decoder) is the most critical one.
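As an aside, the names htop shows here are not invented by htop: GNU Radio names each block's worker thread after the block, and the kernel exposes that name in /proc/<pid>/task/<tid>/comm. If you prefer to script this check instead of configuring htop, here is a minimal sketch (our own throwaway tool, Linux only, not part of the workflow above):

// list_threads.cc -- build with: g++ -O2 -std=c++17 list_threads.cc -o list_threads
// Usage: ./list_threads PID
// Prints one line per thread of the given process: the TID and the thread
// name, read from /proc/PID/task/TID/comm (the same source htop uses).
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " PID\n";
        return 1;
    }
    const std::filesystem::path task_dir =
        std::filesystem::path("/proc") / argv[1] / "task";
    for (const auto& entry : std::filesystem::directory_iterator(task_dir)) {
        std::ifstream comm(entry.path() / "comm");
        std::string name;
        std::getline(comm, name);
        std::cout << entry.path().filename().string() << "  " << name << "\n";
    }
    return 0;
}

Cross-referencing those TIDs with per-thread CPU usage (for example with top -H -p PID) gives the same picture as the figure above.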
Another way to tackle the problem is to use the FlameGraph approach together with perf. While the GNU Radio flowgraph is running, we can record a trace with
sudo perf record -g -F 99 -p PID
and then generate a nice interactive image from the acquired trace with
sudo perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flame.svg
The full trace we obtained is available here (if you open the file with firefox, for instance, you can click and zoom into the interesting parts of the image). As you can see once more, the Viterbi decoder is the most time-consuming part of the design. perf can be used to dig even deeper.
With
sudo perf top -t TID
we can annotate the execution of the thread on the fly and see the critical regions involved, as can be seen in the picture below.
Now, looking at the sources (src/gnuradio/gr-dtv/lib/dvbt/dvbt_viterbi_decoder_impl.cc), we can see what the critical code looks like:
// Trace back
for (i = 0, pos = store_pos; i < (ntraceback - 1); i++) {
    // Obtain the state from the output bits
    // by clocking in the output bits in reverse order.
    // The state has only 6 bits
    beststate = ppresult[pos][beststate] >> 2;
    pos = (pos - 1 + ntraceback) % ntraceback;
}
The modulo operation on amd64 is implemented with IDIV, and that is the critical point here; if you need to convince yourself, look up the latency of the IDIV instruction operating on r64 operands in Agner Fog’s latency tables.
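You can also watch the idiv appear directly in the generated code. Below is a minimal, standalone sketch (the file and function names are ours, not gr-dtv’s) that reproduces the same wrap-around update; compile it with -O2 and disassemble the object file:

// wrap.cc -- build with: g++ -O2 -c wrap.cc && objdump -d wrap.o
// Same wrap-around update as in the trace-back loop above. Because
// ntraceback is only known at run time, the compiler cannot replace the
// modulo with cheaper multiply/shift tricks and has to emit a real idiv.
int wrap_mod(int pos, int ntraceback) {
    return (pos - 1 + ntraceback) % ntraceback;
}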
After looking at how pos is initialized and updated (it always stays in the range [0, ntraceback), so a single conditional subtraction is enough), we can modify the code as follows:
pos = (pos - 1 + ntraceback);
if (pos >= ntraceback) {
    pos -= ntraceback;
}
Nothing fancy, but after recompiling and measuring in our scenario, we achieved a 30% benefit that will do just fine for our demo next week. When we have the time, we will write about what we did after this, but such an easy tweak with this magnitude of improvement seemed well worth a blog post.
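For the curious, the gap can also be timed in isolation. Below is a rough micro-benchmark sketch (our own toy code, not gr-dtv’s): it runs the wrap-around update many times with the modulo version and with the compare-and-subtract version, timing each with std::chrono. It measures a single in-cache dependency chain, not the full decoder, so don’t expect it to reproduce the 30% figure; it only shows the per-iteration cost difference.

// bench_wrap.cc -- build with: g++ -O2 -std=c++17 bench_wrap.cc -o bench_wrap
// Usage: ./bench_wrap [ntraceback]   (default 12; taking the value from the
// command line keeps the divisor unknown at compile time, as in the decoder)
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>

int main(int argc, char** argv) {
    const int ntraceback = (argc > 1) ? std::atoi(argv[1]) : 12;
    const std::int64_t iters = 200'000'000;
    volatile int sink = 0;  // volatile stores keep the loops from being discarded

    // Variant 1: wrap-around with modulo (an IDIV sits in the dependency chain)
    const auto t0 = std::chrono::steady_clock::now();
    int pos = 0;
    for (std::int64_t i = 0; i < iters; i++) {
        pos = (pos - 1 + ntraceback) % ntraceback;
    }
    sink = pos;
    const auto t1 = std::chrono::steady_clock::now();

    // Variant 2: wrap-around with a compare and a subtraction
    pos = 0;
    for (std::int64_t i = 0; i < iters; i++) {
        pos = pos - 1 + ntraceback;
        if (pos >= ntraceback) {
            pos -= ntraceback;
        }
    }
    sink = pos;
    const auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "modulo : " << ms(t1 - t0).count() << " ms\n";
    std::cout << "branch : " << ms(t2 - t1).count() << " ms\n";
    return 0;
}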
There is a lot more that can be done, so feel free to share your advice and experience; we will be happy to learn from you!
Keep in touch,
A.
2 comments
Hi Alberto, great flow – easy to follow and I get similar results. I’ve also dropped the ctrlport work.
1/ Would you be able to tell me how you would measure time?
2/ How to get the underlying OS included as well?
73s C
Hi Chris,
thanks for stopping by.
I’m not sure I understood your questions correctly, but I’ll try to answer them here.
1/ Would you be able to tell me how you would measure time?
In my case, I just measured the mean data rate, which was the most pertinent metric. In general, if you have access to the source code, you can use C++ std::chrono time stamping or, if you are on a C codebase, some version of rdtsc (but use it with care).
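To make that concrete, here is a minimal sketch of the std::chrono approach (work_under_test is just a made-up stand-in for whatever region you want to time):

#include <chrono>
#include <iostream>

// Stand-in for the code region you actually want to time.
void work_under_test() {
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) {
        x = x + 0.5;
    }
}

int main() {
    const auto start = std::chrono::steady_clock::now();
    work_under_test();
    const auto stop = std::chrono::steady_clock::now();
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "work_under_test took " << us.count() << " us\n";
    return 0;
}

steady_clock is the right choice here because it is monotonic; system_clock can jump if the wall clock is adjusted.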
2/ How to get the underlying OS included as well?
perf is capable of following your code in kernel space as well. For some more examples, have a look at the great one-liners on this page:
http://www.brendangregg.com/perf.html
Best,
A.