Fault-finding at the ends of the earth

Posted on Mon, Apr 29, 2024 in software with tags c++, embedded, arm.

This is a tale from many months ago, when I was working on an embedded ARM target.

In my private journal I wrote:

Today I feel like I saddled up and rode my horse to the literal ends of the earth. I was fault-finding in the setting-up-of-the-universe that happens before your program starts up, and in the tearing-it-down-again that happens after you declare you’re finished.

If you know C++, you might guess that this is a story about the construction and destruction of static objects. You’d be right. Specifically, it is about the destructors of statically allocated objects. You’d never think they’d run on a bare-metal embedded target.

Well, they can. If your target supports exit() - e.g. if you are running with newlib - then an atexit handler is registered for you, and that handler runs the static destructors. If your program then calls exit() (as, say, your on-silicon unit tests might, at the end of a test run) then things are at risk of turning to custard.
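To see the mechanism in isolation, here is a minimal sketch that builds on a desktop compiler as well as on a target. The names Tracer, tracer_alive and finish_test_run are made up for illustration; they are not from the original project or from newlib.

```cpp
#include <cstdio>
#include <cstdlib>

// Counts live Tracer objects. Hypothetical, for illustration only.
int tracer_alive = 0;

// Stand-in for any object with static storage duration, e.g. a driver.
struct Tracer {
    Tracer()  { ++tracer_alive; std::puts("static constructor: runs before main()"); }
    ~Tracer() { --tracer_alive; std::puts("static destructor: runs during exit()"); }
};

// Constructed by the startup code before main() is entered.
static Tracer tracer;

// Hypothetical end-of-run hook, of the sort a test runner might call.
// std::exit() runs the registered atexit handlers and static destructors,
// and only afterwards hands control to the platform's final exit routine.
[[noreturn]] inline void finish_test_run(int failures) {
    std::puts("calling exit()");
    std::exit(failures == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
}
```

Call finish_test_run() from main() and the destructor message should appear after "calling exit()". On a hosted build, returning from main() has the same effect as calling exit(), so the destructor runs either way.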

You might have enabled an interrupt for some peripheral on the silicon. In order to do anything really useful, the ISR might reference a static object. If you do this, you’d damn well better make sure the object has a static destructor that disables the interrupt, or hilarity is one day going to ensue. You know, the sort of hilarity that involves being savaged by a horde of angry rampaging badgers, or your socks catching fire.

But wait, I hear you say, it called exit! The program no longer exists! Well, sure it doesn’t; but what happens on exit? On this particular ARM target, running tests via a debugger as part of a CI chain, once the atexit handlers have run, the process signals final completion with a semihosting call, which is a special flavour of debug breakpoint. It is… not fast. If your interrupt fires regularly, the goblins are going to get you before the pseudo-system-call completes. Your test framework will fail the test executable for hanging, despite it somehow passing all of its tests.

There was an actual bug in there, and it was mine. Class X, which contained an RTOS queue and enabled an interrupt, only had a default destructor, so nothing ever disabled that interrupt again. On exit, somewhen between the static destructors and completion of the semihosting exit call, the ISR fired. It duly failed to insert an item into the now-destroyed queue, so it jumped to the internal panic routine. That routine contained a breakpoint and then went nowhere fast, waiting for a debugger command that was never going to arrive; hence the time-out. Maybe it would have been useful to have a library option to skip the static destructors, but I probably wouldn’t have been aware of it ahead of time anyway.
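The shape of the fix is roughly the following. This is a sketch, not the real code: the interrupt-enable bit and the queue are simulated so that it builds on a desktop, and every name here (g_irq_enabled, irq_enable, irq_disable, Queue, on_interrupt) is a stand-in I have invented for illustration. The real class used the RTOS queue and the peripheral's own registers.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Simulated interrupt-enable bit. On real silicon this would be a write to
// an interrupt controller or peripheral register (assumption).
static std::atomic<bool> g_irq_enabled{false};
static void irq_enable()  { g_irq_enabled.store(true); }
static void irq_disable() { g_irq_enabled.store(false); }

// Minimal stand-in for an RTOS queue.
class Queue {
public:
    bool push(int v) {
        if (count_ == items_.size()) return false;  // full
        items_[count_++] = v;
        return true;
    }
    std::size_t size() const { return count_; }
private:
    std::array<int, 8> items_{};
    std::size_t count_ = 0;
};

class X {
public:
    // queue_ is already constructed when the constructor body runs,
    // so it is safe to enable the interrupt here.
    X() { instance_.store(this); irq_enable(); }

    // The destructor body runs BEFORE the members are destroyed, so the
    // interrupt is switched off while queue_ is still valid.
    ~X() { irq_disable(); instance_.store(nullptr); }

    X(const X&) = delete;
    X& operator=(const X&) = delete;

    // What the ISR would call. Returns false if there is no live object
    // or the queue is full.
    static bool on_interrupt(int sample) {
        X* self = instance_.load();
        return self != nullptr && self->queue_.push(sample);
    }

    std::size_t pending() const { return queue_.size(); }

private:
    Queue queue_;
    static std::atomic<X*> instance_;
};

std::atomic<X*> X::instance_{nullptr};
```

The detail that matters is the order of operations in the destructor: disable the interrupt first, then let the members go. The null check in on_interrupt() is only a second line of defence. On real hardware an interrupt can still land between that check and the push, so it is no substitute for disabling the interrupt.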

The static destructor ordering fiasco can also be yours for the taking, but thankfully that hadn’t bitten me. Nevertheless, it was a rough day.

Cover image: Cyber Bug Search, by juicy_fish on Freepik