Introduction to ESP32 and ESP-IDF Framework

The ESP32 is a popular series of low-cost, low-power microcontrollers with integrated Wi-Fi and dual-mode Bluetooth. These microcontrollers have gained popularity in the IoT world due to their robustness and cost-effectiveness.

To complement the hardware, Espressif developed ESP-IDF (Espressif IoT Development Framework), a software development framework for the ESP32. ESP-IDF offers tools and libraries for developing ESP32 applications efficiently. For this article, the most important aspect of the ESP32 and ESP-IDF is their approach to memory management.

The ESP32 microcontrollers feature both internal and external memory options, with internal memory being limited and precious. Managing this limited resource efficiently is key to ensuring smooth and reliable operation of applications, particularly in embedded systems where memory constraints are always a concern.

Memory Management Challenges in Embedded Firmware

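Memory management in embedded systems like the ESP32 is a complex task. A typical issue revolves around the allocation and use of internal RAM, which is split into instruction RAM (IRAM) and data RAM (DRAM), and external RAM (SPIRAM, a separate RAM chip attached over the SPI bus, not to be confused with the external flash memory).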

Case Study: Troubleshooting Reboot Issues

My journey into memory management challenges began with a peculiar case where a network of ESP32 devices in an office setup experienced repeated reboots. The network consisted of one gateway (GW) and three nodes. Despite various firmware updates and fixes, the GW kept rebooting with various errors, mainly stack overflows in certain tasks. There were also reports of corrupted flash memory and of sudden device resets caused by a Watchdog Timer (WDT). The sections below describe each problem in turn.

Memory leak

The biggest problem I was dealing with was a nasty memory leak. I discovered it in the first place by monitoring individual task CPU and memory usage. A certain task that the device uses to display custom LED animations was handling new animation data incorrectly. Freeing of memory allocated for animation data was not handled properly and this would eventually cause the system to reboot. I only became aware of this after executing longer tests which periodically sent animation data to this task.
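The kind of long-running check described above can be sketched on the host. This is a minimal illustration, not the monitoring code used in the project: `heap_looks_leaky` and its parameters are hypothetical names, and the samples are assumed to come from periodic calls to an ESP-IDF function such as `esp_get_free_heap_size()` or `heap_caps_get_free_size(MALLOC_CAP_INTERNAL)` taken after each test cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether a series of free-heap samples
 * looks like a leak. The heuristic flags a series that never recovers
 * (each sample is less than or equal to the previous one) and whose
 * total drop is at least min_total_drop bytes.
 */
bool heap_looks_leaky(const size_t *free_bytes, size_t count,
                      size_t min_total_drop)
{
    assert(free_bytes != NULL || count == 0);

    if (count < 2) {
        return false; /* not enough data to see a trend */
    }
    for (size_t i = 1; i < count; i++) {
        if (free_bytes[i] > free_bytes[i - 1]) {
            return false; /* memory was returned, so no steady decline */
        }
    }
    /* The series is non-increasing here, so this cannot underflow. */
    return (free_bytes[0] - free_bytes[count - 1]) >= min_total_drop;
}
```

A heuristic like this only becomes meaningful when the same workload is repeated many times, which matches the observation that the leak showed up only in longer tests.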

Internal RAM limitations

While monitoring memory usage on the system, I also observed that the ESP32's internal RAM was nearly exhausted, at about 98% usage. This was particularly concerning since essential functions and interrupt service routines (ISRs) are typically stored in IRAM for faster access. (Note that IRAM stands for Instruction RAM and should not be mistaken for internal RAM; see picture.) However, the memory statistics showed that I still had ample Data RAM (DRAM) available, and this hinted at a possible solution.

The Watchdog (WDT) Conundrum

What frustrated me the most were sudden system resets caused by WDT. The device would just reset itself without a crash and therefore without any clue where the bug originated from. I only knew upon reboot that a WDT triggered a reset on the PRO CPU core of the device.

ESP-IDF uses watchdog timers to reset the system when it stops responding: an interrupt watchdog and a task watchdog, both of which monitor the two CPU cores. In this case, however, the WDT reset always occurred on the PRO CPU, indicating a potential issue with task scheduling or interrupt handling.

I deduced that the APP CPU was stuck handling a cache error: the reboots were caused by a "cache disabled" error. Some research on the web showed that this error typically occurs when there is some form of concurrent flash memory access. By forcing logs to be printed before the WDT reset occurred, I obtained a backtrace showing that the BLE Mesh stack (a feature of ESP-IDF) was storing message replay parameters in Non-Volatile Storage (NVS, part of the main flash memory).

I could also see that a certain timer interrupt handler was invoking the FreeRTOS function xQueueSendFromISR(). This function is supposed to be safe to call from interrupt handlers marked with the IRAM_ATTR attribute (which places them in instruction memory for quick access and execution), but it clearly accessed the flash memory in some way. My theory is that this triggers the panic handler, which disables all watchdogs except for the system watchdog and subsequently reboots the device.

The Corrupted Flash Issue

Occasionally, the flash memory of a device became corrupted, rendering the device unusable until the flash was erased. It occurred rarely and only on a few devices, but the issue was a serious risk in mass production. Some investigation led me to believe this was a somewhat known issue with a particular series of flash memory chips, produced by XMC, which came with some of the ESP32 boards installed in the devices. Even Espressif did not manage to determine why the flash chip can unpredictably receive random commands, some of which can disable the normal functioning of the chip.

Lessons Learned and Solutions

Stopping a memory leak

To effectively resolve the issues with the task of handling LED animations, I had to fundamentally re-evaluate and refine its underlying logic. My primary objective was to establish a foolproof system where all dynamically allocated resources were meticulously tracked and appropriately freed upon task completion.

Originally, my approach to managing shared resources between two asynchronous tasks involved the use of static pointers. While this method provided a means of resource sharing, it inadvertently introduced vulnerabilities, particularly in the form of memory leaks. These leaks occurred due to the inadequate tracking and releasing of the allocated resources, leading to gradual accumulation and eventual system instability.

To address these shortcomings, I shifted my strategy to a FreeRTOS queue (the xQueue API). This change represented a significant improvement in communication management and resource sharing between tasks. The queue offered me a more structured and reliable way to pass data between tasks: ownership of each buffer moves from the sending task to the receiving task together with the message. This not only gave better control over shared resources but also helped in systematically tracking and releasing dynamically allocated memory.

By implementing the xQueue, I could ensure that every allocation was properly accounted for and deallocated at the appropriate time. This approach effectively sealed the memory leak, enhancing the stability and reliability of the task.
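The ownership rule behind this fix can be illustrated with a small host-side sketch. It is not the project's code: a single-threaded ring buffer stands in for the FreeRTOS queue so the example runs anywhere, and all names (`anim_msg_t`, `anim_send`, `anim_receive_and_play`, `anim_live_buffers`) are hypothetical. On the device, the enqueue and dequeue steps would presumably be calls to `xQueueSend()` and `xQueueReceive()`, which copy the small message struct by value and handle synchronization between tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ANIM_QUEUE_DEPTH 4

/* Message passed by value; the heap buffer belongs to whoever holds it. */
typedef struct {
    size_t   length;
    uint8_t *frames;
} anim_msg_t;

/* Stand-in for a FreeRTOS queue (single-threaded, host only). */
typedef struct {
    anim_msg_t slots[ANIM_QUEUE_DEPTH];
    size_t     head;
    size_t     count;
} anim_queue_t;

static size_t live_buffers; /* bookkeeping used only to check the sketch */

/* Producer: copy the data into a fresh buffer and hand it to the queue. */
bool anim_send(anim_queue_t *q, const uint8_t *data, size_t length)
{
    assert(q != NULL && data != NULL && length > 0);

    if (q->count == ANIM_QUEUE_DEPTH) {
        return false; /* queue full: nothing was allocated, nothing leaks */
    }
    uint8_t *copy = malloc(length);
    if (copy == NULL) {
        return false;
    }
    memcpy(copy, data, length);
    live_buffers++;

    size_t tail = (q->head + q->count) % ANIM_QUEUE_DEPTH;
    q->slots[tail] = (anim_msg_t){ .length = length, .frames = copy };
    q->count++;
    return true; /* ownership of the buffer now lies with the consumer */
}

/* Consumer: take one message, use it, then free it exactly once. */
size_t anim_receive_and_play(anim_queue_t *q)
{
    assert(q != NULL);

    if (q->count == 0) {
        return 0; /* nothing queued */
    }
    anim_msg_t msg = q->slots[q->head];
    q->head = (q->head + 1) % ANIM_QUEUE_DEPTH;
    q->count--;

    size_t played = msg.length; /* placeholder for driving the LEDs */
    free(msg.frames);
    live_buffers--;
    return played;
}

size_t anim_live_buffers(void)
{
    return live_buffers;
}
```

The design point is that there is exactly one owner at any time: the sender until the message is queued, the receiver afterwards. With a real queue, a failed send would mean the sender still owns the buffer and has to free it itself.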

Reducing Internal RAM Usage

I was already using the feature where memory for dynamic allocation with malloc() can be allocated from SPI RAM (Serial Peripheral Interface Random Access Memory). SPI RAM is an external memory module interfaced through the SPI bus, expanding the available memory beyond the ESP32’s internal capacity. This type of RAM is especially useful for handling larger or more memory-intensive applications that exceed the capabilities of the microcontroller's onboard memory.

To make effective use of this expanded memory resource, I had to adjust the maximum malloc() size for allocating memory in internal RAM. By doing so, I ensured that more memory allocations would utilize the abundant external SPI RAM instead of the limited internal memory.
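As a rough illustration, the relevant sdkconfig entries might look like the fragment below. The option names are the ones ESP-IDF's Kconfig uses as far as I know, but they can differ between ESP-IDF versions, and the numeric values are placeholders chosen for the example, not the values used in this project.

```
# Let malloc() fall back to external SPI RAM
CONFIG_SPIRAM_USE_MALLOC=y
# Allocations up to this size (bytes) always go to internal RAM (example value)
CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL=4096
# Internal RAM (bytes) kept in reserve for DMA and internal-only allocations (example value)
CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL=32768
```

Lowering the "always internal" threshold pushes more mid-sized allocations out to SPI RAM, at the cost of slower access for those buffers.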

Configuring ESP-IDF options to minimize IRAM (Instruction RAM, used primarily for storing executable code and critical functions) usage also proved to be an effective strategy. This approach helped in preserving the valuable IRAM space for essential executable code and ISR (Interrupt Service Routine) handling, while larger and less critical data could be stored in the expansive SPI RAM.
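The article does not list which options were changed, so the fragment below only shows candidates that the ESP-IDF documentation on IRAM optimization mentions, as far as I recall. Treat the names as assumptions to verify against the ESP-IDF version in use, since several were renamed over time.

```
# Move selected non-ISR FreeRTOS functions from IRAM to flash
CONFIG_FREERTOS_PLACE_FUNCTIONS_INTO_FLASH=y
# Move ring buffer functions from IRAM to flash
CONFIG_RINGBUF_PLACE_FUNCTIONS_INTO_FLASH=y
# Keep Wi-Fi code out of IRAM (trades throughput for IRAM space)
# CONFIG_ESP_WIFI_IRAM_OPT is not set
# CONFIG_ESP_WIFI_RX_IRAM_OPT is not set
```

Any function moved to flash can no longer be called while the flash cache is disabled, so options like these need to be checked against every interrupt handler that is registered as IRAM-safe.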

By implementing these changes, I increased the availability of Internal RAM and reduced the risk of exhausting internal memory resources. This adjustment has been crucial in maintaining system stability and efficiency, particularly in scenarios where memory demand is high.

Diagram.png

Escaping watchdog reboots

In this scenario, I addressed the issue of unexpected system resets triggered by the Watchdog Timer (WDT) by implementing a strategic change in how the timer interrupt handler operates. The key to resolving this was preventing the timer interrupt handler from executing while the cache was disabled. This adjustment was crucial because the cache disabled errors were leading to system instability and triggering the WDT resets.

The solution involved modifying the interrupt flag associated with the timer interrupt handler. In the ESP-IDF framework, when you add an interrupt handler, you specify certain flags that define its behavior and priority. Initially, my interrupt handler was using the ESP_INTR_FLAG_IRAM flag. This flag indicates that the interrupt handler is located in IRAM (Instruction RAM) and is safe to execute when the flash cache is disabled. However, this setup contributed to the WDT reset issues I was experiencing, likely due to conflicts arising from cache disabling.

To mitigate this, I changed the interrupt flag from ESP_INTR_FLAG_IRAM to ESP_INTR_FLAG_LOWMED. ESP_INTR_FLAG_LOWMED allows the interrupt to be assigned any low or medium priority level (levels 1 to 3). The decisive part of the change is dropping ESP_INTR_FLAG_IRAM: ESP-IDF keeps interrupts registered without that flag disabled while the flash cache is off, so the timer interrupt handler can no longer run during those periods, when it is unsafe to do so.
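The effect of the flag change can be modelled on the host. This is a simplified sketch, not ESP-IDF code: the `DEMO_` constants are local copies of the bit values I believe `esp_intr_alloc.h` uses, and `demo_isr_may_run` and `demo_accepts_level` are hypothetical helpers that only encode the documented rule that handlers without the IRAM flag are held off while the flash cache is disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the ESP-IDF flag bits (assumed values, check the header). */
#define DEMO_INTR_FLAG_LEVEL1 (1 << 1)
#define DEMO_INTR_FLAG_LEVEL2 (1 << 2)
#define DEMO_INTR_FLAG_LEVEL3 (1 << 3)
#define DEMO_INTR_FLAG_IRAM   (1 << 10)
#define DEMO_INTR_FLAG_LOWMED \
    (DEMO_INTR_FLAG_LEVEL1 | DEMO_INTR_FLAG_LEVEL2 | DEMO_INTR_FLAG_LEVEL3)

/* A handler may run with the cache off only if it was registered as IRAM-safe. */
bool demo_isr_may_run(int flags, bool cache_enabled)
{
    return cache_enabled || (flags & DEMO_INTR_FLAG_IRAM) != 0;
}

/* LOWMED is a mask of acceptable levels, not a single priority. */
bool demo_accepts_level(int flags, int level)
{
    assert(level >= 1 && level <= 7);
    return (flags & (1 << level)) != 0;
}
```

In this model a handler registered with only the low/medium mask is never entered while the cache is off, which is the behaviour the fix relies on. The trade-off is added latency: the interrupt is serviced only after the flash operation finishes.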

Flash Memory Stability

To address the problem of XMC chips causing flash memory corruption, Espressif's support team provided a solution in the form of a firmware vendor patch. This patch locks the status registers, either permanently or until the next power cycle, which prevents write protection from being set afterwards and so counters the source of the error. Even if the flash chip receives random commands in the future, it can no longer lock itself.

The XMC flash memory chip has three 8-bit wide status registers (SR1, SR2, and SR3) that can be set. I opted for locking the status registers until the next power cycle by setting the SRP1 bit in status register 2 to "1" and the SRP0 bit in status register 1 to "0". Together, these Status Register Protect bits select the lock mode; as far as I understand the datasheet, setting both bits to "1" selects the one-time programmable (OTP) mode, which is permanent. With the power-cycle strategy, you prevent the memory chip from setting write protection during runtime, but still leave the option to write to the status registers after a power cycle if the need arises.
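The bit manipulation itself is small and can be sketched as follows. The bit positions are an assumption based on the layout commonly used by Winbond-compatible SPI flash chips (SRP0 in bit 7 of SR1, SRP1 in bit 0 of SR2) and must be checked against the XMC datasheet. The type and function names are hypothetical, and the actual register write through the chip's write-status-register commands is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; verify against the flash chip's datasheet. */
#define DEMO_SR1_SRP0 (1u << 7)
#define DEMO_SR2_SRP1 (1u << 0)

typedef struct {
    uint8_t sr1;
    uint8_t sr2;
} demo_status_regs_t;

/*
 * Compute the register values for "locked until power cycle":
 * SRP1 = 1 and SRP0 = 0, leaving every other bit unchanged.
 */
demo_status_regs_t demo_lock_until_power_cycle(demo_status_regs_t current)
{
    demo_status_regs_t next = current;

    next.sr1 = (uint8_t)(next.sr1 & ~DEMO_SR1_SRP0); /* clear SRP0 */
    next.sr2 = (uint8_t)(next.sr2 | DEMO_SR2_SRP1);  /* set SRP1 */

    /* Sanity check on the resulting combination. */
    assert((next.sr1 & DEMO_SR1_SRP0) == 0 && (next.sr2 & DEMO_SR2_SRP1) != 0);
    return next;
}
```

Using a read-modify-write on the current values keeps unrelated bits in the status registers untouched.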

Conclusion

Managing memory in embedded systems like ESP32 requires a careful balance between performance and resource utilization. The ESP-IDF framework provides the tools and flexibility to optimize memory usage, but it demands a thorough understanding of the underlying hardware and software interactions. My investigation into the reboot issues not only resolved specific problems but also provided valuable insights into effective memory management strategies in embedded firmware development.


Written by
Mario Matković

Embedded Software Developer
