We came across a weird problem on ESXi 5. Occasionally, one of our hosts would suddenly be unable to start any VMs: the running VMs were fine, but any attempt to start a new one would fail with an “Unknown internal error”. The first time this happened, I restarted the management agents and finally suspended all VMs and rebooted the server. After that everything was OK for a few months, and then the problem came back.
This time I decided to figure out what was going on. The log files in /var/log contain a lot of useful information, and from them I could see that the problem was caused by ESXi running out of disk space on the device used for /var/log. The driver for the Adaptec 5405z RAID controller in the machine was writing a huge log file that was never rotated, so after a few months it consumed all the disk space.
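If you suspect the same kind of problem, a quick way to confirm it is to check free space with df -h and then look for the biggest files under /var/log. The function below is a minimal sketch of that second step. The name largest_files is made up for illustration, and it assumes that find, wc, sort and head are available in the shell you are using.

```shell
# Hypothetical helper: list the largest files under a directory, biggest first.
# Usage: largest_files DIR [COUNT]   (COUNT defaults to 10)
largest_files() {
    dir=$1
    count=${2:-10}
    # wc -c prints "<bytes> <path>" for each file;
    # sort -rn orders those lines numerically, largest first.
    find "$dir" -type f -exec wc -c {} \; | sort -rn | head -n "$count"
}
```

Calling largest_files /var/log 5 prints the five largest files with their sizes in bytes, which should make a runaway log file easy to spot.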
The workaround was to add a line to root's crontab which deletes the Adaptec log file every Sunday at midnight. Note that the crontab entry is lost on the next reboot, so you also have to add a line to /etc/rc.local to re-add it. The first line below is the command for /etc/rc.local, and the second is the crontab entry itself:
echo "0 0 * * 0 rm /var/log/arcconf.log" >> /var/spool/cron/crontabs/root
0 0 * * 0 rm /var/log/arcconf.log
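One drawback of a plain echo with >> is that it appends the entry again every time it runs, so running it by hand and then again from rc.local leaves duplicates in the crontab. A minimal sketch of a variant that only appends the line when it is missing is shown here. The function name ensure_cron_line is made up for illustration, and it assumes the shell's grep supports the -q, -x and -F options.

```shell
# Hypothetical helper: append LINE to FILE unless an identical line is already there.
# Usage: ensure_cron_line FILE LINE
ensure_cron_line() {
    file=$1
    line=$2
    # -F: treat the line as a fixed string (it contains '*' characters)
    # -x: only match a whole line, -q: no output, just an exit status
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

In /etc/rc.local this would be called as ensure_cron_line /var/spool/cron/crontabs/root "0 0 * * 0 rm /var/log/arcconf.log". Depending on the ESXi build, crond may need to be restarted before it picks up the new entry.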
It seems strange that a product as robust as ESXi 5 doesn’t protect itself against this situation.
Anyway, the moral of this post is that if ESXi is producing any error messages you can’t interpret, have a look in /var/log.