We recently migrated one of our websites to Azure Arm64 VMs. However, as soon as we pushed the infrastructure change to production, we started to see our server process restart intermittently. Sometimes restarts happened within a few seconds of each other; at other times there were none for hours. While the redundancy in our setup ensured minimal end-user impact, we wanted to address the issue quickly.
Looking at the logs
A quick look at the logs showed the following error before process restarts:
malloc(): corrupted top size
Aborted (core dumped)
This is a Node.js-based Next.js website that does nothing memory-intensive, so we were surprised to see a memory-related error. A quick look at `top` also showed that we had adequate memory available for our running processes. So this looked like memory corruption rather than memory exhaustion.
Our next challenge was to identify what caused the memory corruption. Analyzing the logs further, we could not find a single website URL that was causing the issue.
Reproducing the issue
With this information in hand, we went back to our test environment (which was also running on an Azure Arm64 VM) and set up more detailed logging. We then visited a large number of our website URLs to see if we could reproduce the restart.
Eventually, we found a couple of URLs where the Node.js process would exit with the corrupted memory error message.
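The sweep described above can be approximated with a small script. This is only a sketch of the idea, not the tooling we used: `sweepUrls` is a hypothetical helper, and it assumes a `fetch`-compatible function is available (the global `fetch` in recent Node.js versions, or one passed in). Requests are made one at a time so that a failure can be attributed to a specific URL.

```javascript
// Hypothetical reproduction helper: request each URL in turn and record
// which requests fail, for example because the server process died.
async function sweepUrls(urls, fetchFn = globalThis.fetch) {
  const failures = [];
  for (const url of urls) {
    try {
      const res = await fetchFn(url);
      if (!res.ok) {
        failures.push({ url, reason: `HTTP ${res.status}` });
      }
    } catch (err) {
      // A refused or reset connection often means the process went down.
      failures.push({ url, reason: String(err.message || err) });
    }
  }
  return failures;
}
```

A plain HTTP request for a page does not load the images on it, so a sweep like this would need the image URLs in the list as well, or a real browser driving the pages.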
Identifying the root cause
Once we could reproduce the issue, we narrowed it down to the images loading on these pages. Our images were being served by the Next.js next/image component, which internally uses the `sharp` package to optimize the images it serves.
So it appeared that for some images (not all), the sharp image optimization logic was corrupting memory and causing our Node.js process to exit. Looking through the current and past issues for lovell/sharp on GitHub took us to this issue, which summarized our experience.
Issue details & fix
On probing further, we learned that the libspng library used by lovell/sharp had a memory corruption bug when decoding a paletted PNG on Arm64. libspng fixed this in v0.7.2, which lovell/sharp picked up in v0.31.0.
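To find which images might be affected, one option is to check the PNG header directly. A paletted PNG has colour type 3 in its IHDR chunk, which sits at a fixed offset right after the 8-byte file signature. The helper below is a minimal sketch; `isPalettedPng` is a hypothetical name, and it inspects only the header bytes without decoding the image.

```javascript
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Returns true when the buffer starts with a PNG whose IHDR colour type
// is 3 (indexed colour, i.e. a paletted image).
function isPalettedPng(buf) {
  // 8-byte signature + 4-byte length + 4-byte "IHDR" + 13 bytes of IHDR data.
  if (buf.length < 29) return false;
  if (!buf.subarray(0, 8).equals(PNG_SIGNATURE)) return false;
  if (buf.toString('latin1', 12, 16) !== 'IHDR') return false;
  // IHDR data starts at byte 16: width (4), height (4), bit depth (1),
  // then colour type (1) at byte 25.
  return buf.readUInt8(25) === 3;
}
```

Passing it the contents of each image file (for example from `fs.readFileSync`) would give a list of candidates to test against the optimizer.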
By pinning our sharp dependency in package.json to v0.31.0, we forced next/image to pick up this version of the sharp library (instead of the older one) for image optimization. With this change, the specific images that had previously caused the Node.js process to exit were optimized as expected.
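A pin like this can silently regress if the lockfile changes, so a startup guard is one way to catch that early. The sketch below is an illustration rather than what we deployed: `isAtLeast` and `assertSharpVersion` are hypothetical helpers, and they only handle plain `major.minor.patch` strings.

```javascript
// Compare two plain "major.minor.patch" version strings numerically.
// Pre-release tags and version ranges are deliberately not handled.
function isAtLeast(version, minimum) {
  const a = version.split('.').map(Number);
  const b = minimum.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const x = a[i] || 0;
    const y = b[i] || 0;
    if (x !== y) return x > y;
  }
  return true;
}

// Hypothetical guard: throws if the installed sharp version predates the fix.
// The caller supplies the installed version string.
function assertSharpVersion(installed, minimum = '0.31.0') {
  if (!isAtLeast(installed, minimum)) {
    throw new Error(`sharp ${installed} is older than required ${minimum}`);
  }
}
```

The installed version can be read with `npm ls sharp`, or at runtime from the package's own package.json, assuming the package exposes that file to `require`.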
Once the change went into production, we watched our production Node.js processes for any restarts. With no restarts observed for a couple of days, we were able to mark the issue as addressed.