The Potential Pitfalls of “Free” Software: A Firmware Engineer’s Tale
As a seasoned firmware engineer, I’ve encountered my fair share of perplexing bugs. But few have been as challenging and enlightening as an insidious SDRAM initialization bug I stumbled upon in the free software provided by a prominent chip manufacturer. In this blog post, I’ll take you through the journey of how this bug was discovered, the process of unraveling its mysteries, and the eventual triumph of fixing it.
The Discovery
I was tasked with starting to develop for a new MPU, so I bought three identical evaluation kits. The kits didn’t come with a display, but they did have a connector so a display could be added, which I did. I downloaded the MPU manufacturer’s suite of software for bare metal since we were not interested in running Linux on this particular MPU. The free suite of software included startup code, many examples using the individual peripherals found on the MPU, as well as drivers for those peripherals. This evaluation kit had external SDRAM and included software to initialize the specific SDRAM used on the kit. Everything seemed straightforward, and I got our software running on the MPU quickly thanks to the included suite of software. Everything seemed to be working well, so I was getting ready to hand off one of the three evaluation kits to another developer.
As a quick sanity check, I put the same software that was working fine on my evaluation kit onto the second kit and, to my dismay, the system behaved erratically. The LCD sometimes showed garbage (rarely in the same locations), there were lockups at random times, and the kit generally exhibited unpredictable behavior. The bad behavior didn’t show up immediately but would usually happen within a minute of powering up.
The Investigation
Since I had a third development kit, I put the same code on this kit and noticed similar behavior to the second kit. It appeared that I had been lucky that I chose the initial kit first. I then did a longer-term test on that first kit just to confirm that it didn’t have the same issues. But, regardless of how long it was running, this first kit worked perfectly every time.
I went back to the second and third kits and confirmed they consistently showed bad behavior at random times, usually within the first minute of powering the system up. Sometimes, the hardest part of debugging a problem is being able to consistently get it to exhibit the problem, so in some respects, I was in a good position to start debugging. Unfortunately, even though the kits were consistently exhibiting bad behavior, it was seldom the same error at the same time.
When starting to debug, before making any code changes, I usually create a feature branch in Git so I can always return to a state where the error was known to exist. The next step is to observe as much as I can about when and how the bug appears, rather than debugging blindly. I spent several minutes trying to find a pattern to the bad behavior, but again, it appeared to be random in both when and where it would occur.
Once I have observed the bug and gotten a feel for where and what might be causing it, I will usually pull out my JTAG debugger and start setting some breakpoints. In this case, since the failures were seemingly so random, I wasn’t sure where to set breakpoints, so I instead let the code run within the debugger and hoped that once it failed, I could look at the memory and get a clue as to what the culprit was. I did manage to halt the target over JTAG after a failure, but that didn’t provide any “aha” moments. Instead, it solidified my feeling that the issue wasn’t really related to my software, but rather to the SDRAM on the kit, which is where my code was running.
I started looking at the components on the development kits to ensure that they were all the same from kit to kit. From what I could see, all three kits were identical, including the SDRAM. I then started looking into the provided code that was used to initialize the SDRAM. SDRAM initialization is a delicate process, involving precise timing and configuration to ensure the memory is ready for access, and luckily the manufacturer provided us with the exact timing for the specific SDRAM that was on the development kit. Upon cursory inspection, everything looked fine.
Another useful debug approach is to search various support forums to see if anyone else has run into a similar issue. In this case, since it was a relatively new development kit, the chip manufacturer’s customer support forum seemed like the best place to go for help. Unfortunately, the little information that was there was regarding the Linux code as opposed to the bare metal code that we were using. Being an introvert, I hated to admit that it was time to make a call to the manufacturer itself for some assistance.
The Debugging
I contacted a local FAE from the manufacturer, and he came to my office to witness the problem firsthand. He was sympathetic and quickly realized this issue was beyond his expertise, so he helped me contact one of their design engineers via the manufacturer’s help desk. Unfortunately, that’s where things went south. Not only was the turnaround time in answering my questions very long, but the design engineer also became very defensive and basically said that it had to be something in my software, even though I explained to him that my software was working perfectly fine on one of the three development kits. I had assumed the manufacturer would want to get to the bottom of this issue, but between the finger-pointing and the painfully long delays between help desk questions and answers, I realized that they weren’t going to help me, and I had to go back to debugging this myself.
I went back to the SDRAM initialization code and started going through it with a fine-tooth comb. I also needed the datasheet of the SDRAM on the development kits to ensure that the timing was meeting their specifications. The initialization code itself looked fine. Looking at the timing specs, the minimum values specified in the SDRAM datasheet were the same numbers used by the SDRAM initialization code, at least as far as the comments were concerned.
I was getting a bit discouraged but luckily one of my coworkers pointed me to the SDRAM initialization in the Linux codebase to compare it to the bare metal codebase that we were using.
The Linux codebase looked completely different, and it took a while to compare apples to apples, because the Linux timing parameters used absolute values, whereas the bare metal used computed values through a macro. Although the comments declared the same minimum timing values as those in the bare metal code, once I looked at the actual values being used, I noticed that there were three timing parameters that were larger in the Linux codebase by one, specifically the trcd, trp, and trfc timing parameters.
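For illustration only: one common way a computed cycle count can land exactly one below a hand-entered value is integer truncation in a nanoseconds-to-cycles conversion. Whether that is what happened in the manufacturer’s macro is not known. The sketch below is not the manufacturer’s code; the function names, the 133 MHz clock, and the 18 ns and 60 ns timing values are all hypothetical and chosen only to show the effect.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts a minimum time in ns to clock cycles by
 * truncating. Any fractional cycle is dropped, so the programmed delay
 * can end up shorter than the requested minimum. */
static uint32_t ns_to_cycles_floor(uint32_t ns, uint32_t clk_hz)
{
    assert(clk_hz != 0u);
    return (uint32_t)(((uint64_t)ns * clk_hz) / 1000000000u);
}

/* Hypothetical helper: same conversion, but rounded up, so the programmed
 * delay is never shorter than the requested minimum. */
static uint32_t ns_to_cycles_ceil(uint32_t ns, uint32_t clk_hz)
{
    assert(clk_hz != 0u);
    return (uint32_t)(((uint64_t)ns * clk_hz + 999999999u) / 1000000000u);
}
```

With these made-up numbers, 18 ns at 133 MHz is 2.394 cycles: the truncating version yields 2 cycles (about 15 ns, below the minimum) while the rounding-up version yields 3. A comment next to either call could still honestly read “18 ns,” which is why comparing the values actually written to the controller matters more than comparing the comments.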
I hadn’t tested the Linux codebase on these three new development kits, so I didn’t know for sure that it would work correctly, but since most of the users were using Linux and no one else had complained about this combination of MPU and SDRAM, I figured it was an easy test to change those three timing parameters in the bare metal code and then run it on one of the “bad” development kits.
The Fix
To my relief, with the new timing values for trcd, trp, and trfc, both of the “bad” kits started working just as solidly and reliably as the first “good” kit. Just to confirm, I put the new timing code in the “good” kit, and it continued to work well, so I was convinced that these SDRAM timing values were the culprit of the intermittent random failures I was seeing on the “bad” kits. I tried changing only one or two of the timings and empirically determined that all three values needed to be changed for these kits to work perfectly over time.
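A soak test over the SDRAM is one way to make this kind of before-and-after comparison less dependent on watching the LCD for garbage. The following is a minimal sketch, not the test used in this story: the function names and the pattern are my own assumptions, and the base address and size of the SDRAM region are board-specific values you would have to supply. On real hardware, the test code and its stack would need to run from memory other than the SDRAM being tested (internal SRAM, for example).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value expected at word index i for a given seed (hypothetical pattern). */
static uint32_t pattern_word(size_t i, uint32_t seed)
{
    return ((uint32_t)i * 2654435761u) ^ seed;
}

/* Writes the whole region, then reads it all back.
 * Returns the number of mismatched words found in this pass. */
static size_t sdram_test_pass(volatile uint32_t *base, size_t words, uint32_t seed)
{
    size_t errors = 0;

    assert(base != NULL && words > 0);

    /* Fill the entire region first so the read-back happens well after
     * each write, across many different rows. */
    for (size_t i = 0; i < words; i++) {
        base[i] = pattern_word(i, seed);
    }

    /* Verify every word against the value that was written. */
    for (size_t i = 0; i < words; i++) {
        if (base[i] != pattern_word(i, seed)) {
            errors++;
        }
    }
    return errors;
}

/* Repeats the pass with changing seeds, since the failures described
 * here took up to a minute to appear. Returns the total mismatch count. */
static size_t sdram_soak(volatile uint32_t *base, size_t words, unsigned passes)
{
    size_t total = 0;

    for (unsigned p = 0; p < passes; p++) {
        total += sdram_test_pass(base, words, (uint32_t)p);
        total += sdram_test_pass(base, words, 0xFFFFFFFFu - (uint32_t)p);
    }
    return total;
}
```

A nonzero return from `sdram_soak` on a “bad” kit with the original timings, and zero with the adjusted timings, would give a repeatable pass/fail signal for each of the trcd, trp, and trfc changes. A clean result is only evidence, not proof, that the timing margins are adequate.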
The unfortunate part about the ultimate fix was that I never really understood why the bare metal code had a problem to begin with. As mentioned, the values used in the macro to determine the actual timing values looked to be correct. Honestly, I’m not sure I ever would have changed the timing parameters to their new values without having access to the Linux codebase, which had slightly different values for those three timing parameters, because there was nothing that pointed to them having a problem, other than the empirical “bad” behavior on two of the three development kits.
The Reflection
It is only an assumption, but based on my observations, the timing parameters were right on the hairy edge and were good enough on some MPU/SDRAM/board combinations but were apparently not good enough for all potential combinations. This experience was a stark reminder of the importance of attention to detail in firmware engineering. It also highlighted the value of a thorough understanding of the hardware you’re working with.
The Takeaway
For you, the aspiring firmware engineer, or the seasoned veteran, this tale serves as a lesson in perseverance and the importance of a methodical approach to problem-solving. Bugs like these are not just obstacles; they are opportunities to learn and grow. Ultimately, there are so many peripherals on modern MPUs that it is almost a necessity to use already-written code to initialize and use them. In many cases, that means either using a Linux codebase or some bare metal code provided by the MPU manufacturer, and that can save a lot of time and effort. But, as with most things, especially free things, you should cautiously trust but vigorously verify. Don’t assume that code is free of bugs just because the manufacturer provided it.
Conclusion
This SDRAM initialization bug was a formidable foe, but with a combination of technical acumen and a systematic approach, it was conquered. As you embark on your own firmware engineering journey, remember that every bug tells a story, and within that story lies the potential for personal and professional development.
I hope this recount of my encounter with this SDRAM initialization bug has provided you with insights into the world of firmware engineering. Stay curious, stay diligent, and happy coding!
Author Bio:
Jim Weber is a senior firmware engineer at Amulet Technologies. With over three decades of experience in design and development, he takes pride in his work and hates bugs, unless of course they’re a feature. In his spare time, Jim has recently become addicted to pickleball and can often be found dinking and driving around his local courts.