The Potential Pitfalls of “Free” Software: A Firmware Engineer’s Tale
As a seasoned firmware engineer, I've encountered my fair share of perplexing bugs. But few have been as challenging and enlightening as an insidious SDRAM initialization bug I stumbled upon in the free software provided by a prominent chip manufacturer. In this blog post, I'll take you through how the bug was discovered, how I unraveled its mysteries, and how it was eventually fixed.

The Discovery

I was tasked with starting development on a new MPU, so I bought three identical evaluation kits. The kits didn't come with a display, but they did have a display connector, so I added one. I downloaded the MPU manufacturer's bare-metal software suite, since we were not interested in running Linux on this particular MPU. The free suite included startup code, drivers for the MPU's peripherals, and many examples that used those peripherals individually. The evaluation kit had external SDRAM, and the suite included code to initialize the specific SDRAM part used on the kit.

Everything seemed straightforward, and thanks to the included software I got our code running on the MPU quickly. It all appeared to be working well, so I was getting ready to hand off one of the three evaluation kits to another developer. As a quick sanity check, I loaded the same software that was working fine on my kit onto the second kit, and to my dismay the system behaved erratically. The LCD sometimes showed garbage (seldom in the same locations), there were lockups at random times, and the kit was generally unpredictable. The bad behavior didn't show up immediately, but it would usually appear within a minute of powering up.

The Investigation

Since I had a third evaluation kit, I put the same code on it and saw behavior similar to the second kit's. It appeared that I had been lucky in picking the first kit to start with.
I then did a longer-term test on that first kit just to confirm that it didn't have the same issues. But regardless of how long it ran, this first kit worked perfectly every time. I went back to the second and third kits and confirmed they consistently showed bad behavior at random times, usually within the first minute of powering up.

Sometimes the hardest part of debugging is getting the problem to reproduce consistently, so in some respects I was in a good position to start. Unfortunately, even though the kits were consistently misbehaving, it was seldom the same error at the same time.

When starting to debug, before making any code changes, I usually create a feature branch in git so I can always return to a state where the error was known to exist. The next step I take is to observe as much as I can about when and how the bug appears. Before blindly debugging, I like to get as much information