Update
After connecting to the ACA console, I finally found out that `/tmp` is not even mounted on tmpfs; it is just writing to `/overlay`, so it was a simple disk-full issue.
Feeling fked again.
TL;DR
`tmpfs` is allocated half the size of physical memory by default. Reference:
The limit of allocated bytes for this tmpfs instance. The default is half of your physical RAM without swap. If you oversize your tmpfs instances the machine will deadlock since the OOM handler will not be able to free that memory.
If `/tmp` is full during a write, the write fails, but it will not trigger the OOM killer, since memory is not actually full.
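To make the failure mode concrete: a tmpfs-backed `/tmp` has its own size limit, so writes can fail with ENOSPC while overall memory usage still looks fine. Below is a minimal sketch (assuming a Linux container and a Go service, neither of which is stated in this post) that prints how big `/tmp` is and how much of it is left:

```go
package main

import (
	"fmt"
	"syscall"
)

// Report how big /tmp is and how much is free. On a default tmpfs mount,
// the reported size is typically half of physical RAM, so /tmp can fill up
// long before the memory monitor shows the machine as full.
func main() {
	var st syscall.Statfs_t
	if err := syscall.Statfs("/tmp", &st); err != nil {
		panic(err)
	}
	total := st.Blocks * uint64(st.Bsize)
	free := st.Bavail * uint64(st.Bsize)
	fmt.Printf("/tmp: total=%d MiB free=%d MiB\n", total>>20, free>>20)
}
```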
Introduction
Last weekend, I got several phone notifications that the success rate of our service had dropped below the critical threshold.
Every new request failed with the error:
`failed to create output directory: mkdir /tmp/<redacted>: no space left on device`
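The message looks like a wrapped `os.MkdirAll` failure. The actual service code is not shown in this post, but a hypothetical reconstruction (all names are made up) that also detects the ENOSPC case explicitly could look like this:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// makeOutputDir is a hypothetical reconstruction of the failing call;
// the real service code and directory name are not shown in the post.
func makeOutputDir(name string) (string, error) {
	dir := filepath.Join(os.TempDir(), name)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		// errors.Is unwraps the *fs.PathError returned by MkdirAll, so the
		// "no space left on device" case can be reported explicitly.
		if errors.Is(err, syscall.ENOSPC) {
			return "", fmt.Errorf("temp storage exhausted: %w", err)
		}
		return "", fmt.Errorf("failed to create output directory: %w", err)
	}
	return dir, nil
}

func main() {
	if _, err := makeOutputDir("example"); err != nil {
		fmt.Println(err)
	}
}
```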
Considering we had an in-progress deployment, the very first suspect was the code. Was it due to the new version?
However, there was no diff related to the temporary file logic.
Then we noticed the error was only thrown on 2 nodes; we have a proactive monitoring system (which sends real requests to the service, so the monitors stay alive even when there are no user requests). If the problem were in the code, every node should throw the same error.
/tmp
As the error message shows, there is no space left in `/tmp`. What happened to it?
Here we wasted some time, since I did not realize the `/tmp` directory is mounted on tmpfs (actually it may not be, depending on `/etc/fstab`, but in most modern distros it is tmpfs).
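Whether `/tmp` is really tmpfs on a given node can be checked from inside the container even without `/etc/fstab`, for example by reading `/proc/mounts`. A minimal sketch, assuming a Linux container and Go:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print the filesystem type backing /tmp by scanning /proc/mounts.
// Each line has the form "<device> <mountpoint> <fstype> <options> ...".
func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[1] == "/tmp" {
			fmt.Printf("/tmp is mounted as %s\n", fields[2])
			return
		}
	}
	// No dedicated mount entry: /tmp is just a directory on the root filesystem.
	fmt.Println("/tmp has no dedicated mount; it lives on the root filesystem")
}
```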
Our service is hosted on Azure (Azure Container Apps, in an Azure Container Apps Environment). The storage limitation of ACA is not documented (or I didn't find it); the only evidence is here:
In addition to a different core size and memory size, each workload profile is allocated a different storage size. This allocated space is used for the runtime. Do not use this storage for your application data. Instead, use a storage mount.
Thus, I spent a lot of time trying to find the storage limitation. But a colleague pointed out that it should be mounted as tmpfs, which would occupy RAM.
However, the resource consumption monitors showed RAM usage at 60% (which is not normal for our service), and if memory were actually full, the service should have panicked.
Kernel configuration
With the above investigation, we could assume the issue is about RAM, since the resource consumption monitor showed 60% RAM usage. We could also assume there is a limit on `/tmp`: it cannot use 100% of RAM. If the limit is 50% of RAM and the tmpfs is full, plus roughly another 10% used by the service itself, that adds up to the 60% we observed. Sounds plausible.
Then I found this:
The limit of allocated bytes for this tmpfs instance. The default is half of your physical RAM without swap. If you oversize your tmpfs instances the machine will deadlock since the OOM handler will not be able to free that memory.
Everything is explained.
Validation
Actually, I did not find any of this in the Azure documentation, so I tried to `cat /etc/fstab` in the ACA console. However, I got nothing. I can only assume everything is on default settings, as described in the kernel document above.
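Since `/etc/fstab` was empty, one way to check the half-of-RAM default from inside the container is to compare the size `statfs` reports for `/tmp` against `MemTotal` from `/proc/meminfo`. A rough sketch under those assumptions (Go, Linux):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"syscall"
)

// memTotalBytes parses the MemTotal line of /proc/meminfo (the value is in kB).
func memTotalBytes() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "MemTotal:") {
			var kb uint64
			fmt.Sscanf(sc.Text(), "MemTotal: %d kB", &kb)
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

func main() {
	var st syscall.Statfs_t
	if err := syscall.Statfs("/tmp", &st); err != nil {
		panic(err)
	}
	tmpSize := st.Blocks * uint64(st.Bsize)
	ram, err := memTotalBytes()
	if err != nil {
		panic(err)
	}
	// On a default tmpfs mount, tmpSize should be roughly ram/2.
	fmt.Printf("/tmp size: %d MiB, MemTotal/2: %d MiB\n", tmpSize>>20, (ram/2)>>20)
}
```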
Conclusion
The quickest action to resolve this was to restart the node. Thank god the service is stateless, so restarting does not cost too much. Due to privacy limitations, we cannot know which user input caused this case, or why the service created such a huge temp file. But we can still improve in several ways to avoid this kind of issue.
Improvement
- Coding:
  - Find out why the service did not remove its temp files; it's effectively a kind of memory leak, IMO.
  - Every IO operation should have a timeout to avoid hanging. In this case it not only blocked one request, it also blocked other requests (a sketch of both points follows this list).
  - Avoid using `/tmp`, even if it's convenient. Many threads run together on a node, so sharing `/tmp` is somewhat out of control and may end in deadlock. And if `/tmp` is full, it will not cause OOM (which might actually be better, as the service would then be restarted automatically).
- Ops:
  - Load balancing: Ideally, the freest node should take the next incoming request; currently all nodes race for it. In the worst case, the failing node may take all following requests and return an error for every one of them.
  - Improve the probe endpoint: currently it only indicates that the service is running; it may need to consider more factors, such as CPU, RAM, and success rate. If a node keeps failing requests, it should be kicked out of rotation.
  - Scaling: Considering `tmpfs` can only use 50% of memory, 50% usage of all memory is dangerous. A scaling rule based on RAM usage should be added.
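For the coding items above, here is a sketch of what the temp-file handling could look like: a per-request directory created with `os.MkdirTemp`, always removed with `defer`, and the work guarded by a context timeout. Function and parameter names are illustrative, not from the actual service; the base directory would ideally be a storage mount rather than the default temp dir.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// processRequest writes intermediate output into a private temp directory under
// baseDir (ideally a storage mount rather than /tmp), always cleans it up, and
// aborts if the work exceeds the deadline.
func processRequest(ctx context.Context, baseDir string, payload []byte) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	// A per-request directory avoids collisions between concurrent requests.
	dir, err := os.MkdirTemp(baseDir, "job-*")
	if err != nil {
		return fmt.Errorf("create temp dir: %w", err)
	}
	defer os.RemoveAll(dir) // never leak temp files, even on error paths

	done := make(chan error, 1)
	go func() {
		// Placeholder for the real work; here we just persist the payload.
		done <- os.WriteFile(filepath.Join(dir, "output.bin"), payload, 0o600)
	}()

	select {
	case <-ctx.Done():
		return fmt.Errorf("request timed out: %w", ctx.Err())
	case err := <-done:
		return err
	}
}

func main() {
	// For this demo, fall back to the system temp dir as the base directory.
	if err := processRequest(context.Background(), os.TempDir(), []byte("example")); err != nil {
		fmt.Println(err)
	}
}
```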