I am running a web crawler in a Xen-based VM (on Linode); it makes heavy use of network namespaces to control use of VPNs on a per-crawl-process basis. It's been running mostly continuously for the past six months.
Intermittently, once or twice a week, the dispatcher process goes to tear down a network namespace and I hit a kernel bug. The visible symptoms are that it becomes impossible to create new network namespaces until the VM is rebooted, and the syslog is flooded with this message, again until the VM is rebooted:
MMM DD hh:mm:ss XXXXX kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
I have been able to find some public bug reports relating to this message ...
... but they all apply to kernels much older than the Linode stock kernel, which has been 3.16.x for quite some time now.
I am looking for concrete, step-by-step advice on how to work around the bug and/or turn this into an actionable bug report. Note that a regression analysis is not practical as it would take months to produce a result (and I can't afford to be taking the crawler up and down even more than I already am).
The error is telling you that the reference count on the interface is still greater than zero, so something is still using the interface.
The message is generated by the netdev_wait_allrefs function in net/core/dev.c in the kernel source.
There could be a bug in the kernel code, but as you said, the existing reports all concern much older versions, and no one else has reported anything since, which we would expect for something this central. This code path is used when unregistering any network device: not just lo, but eth, tun, etc.
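Since the message names the device and the outstanding count, it may help to summarise the flood before digging further. This is only a rough sketch: summarize_unregister is a name I made up, and it assumes the log lines look exactly like the one you quoted.

```shell
#!/bin/sh
# Hypothetical helper: summarise "unregister_netdevice" messages.
# Reads kernel log lines on stdin and prints one line per distinct
# (interface, usage count) pair as: <occurrences> <interface> <usage count>
summarize_unregister() {
    # Keep only matching lines, reduced to "<interface> <usage count>".
    sed -n 's/.*unregister_netdevice: waiting for \([^ ]*\) to become free\. Usage count = \([0-9][0-9]*\).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}
```

You would feed it the kernel log, for example summarize_unregister < /var/log/kern (the path is an assumption, adjust it for your system). If every line reports lo with the same count, that points at a single leaked reference rather than several.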
I would trap the error in the logs and see what process has use of the lo interface.
To do this, use inotifywait (from the inotify-tools package) in a shell script to watch the log, and when the error occurs dump a list of processes and what is using the interfaces:
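#!/bin/sh
# Watch the kernel log; when the error shows up, snapshot sockets and processes.
LOG="/var/log/netdev.log"
KERN="/var/log/kern"

while inotifywait -e modify "$KERN"; do
    # -q: only the exit status matters, do not echo the matching line
    if tail -n1 "$KERN" | grep -q unregister_netdevice; then
        echo "$(date): error detected..." >> "$LOG"
        ss -nlput >> "$LOG"   # listening TCP/UDP sockets with owning processes
        ps -Af >> "$LOG"      # full process list
        # other commands, send an sms?
    fi
done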
You can change the commands that run when this fires to gather different information. I am assuming the message ends up in /var/log/kern; I am not sure which distribution you run, so adjust the path if your kernel messages go elsewhere.
My suspicion is that a process has hung, leaving the interface in use.
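To test that suspicion, one of the extra commands could record which processes still sit in which network namespace. The following is a minimal sketch, not something I have run against your setup: list_netns_procs is a made-up name, and it assumes /proc/<pid>/ns/net is readable, which needs root for other users' processes.

```shell
#!/bin/sh
# Hypothetical helper: list processes grouped by network namespace.
# Prints "<namespace id> <pid> <command name>" sorted by namespace, so
# processes sharing a namespace end up on adjacent lines.
list_netns_procs() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        # The link target looks like "net:[4026531956]"; skip processes
        # that exited in the meantime or that we are not allowed to read.
        ns=$(readlink "$dir/ns/net" 2>/dev/null) || continue
        comm=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s\n' "$ns" "$pid" "$comm"
    done | sort
}
```

Calling list_netns_procs >> "$LOG" in place of the "other commands" comment would show whether any process is still attached to the namespace being torn down. If none is, the leftover reference is more likely held inside the kernel than by a hung process.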