We have been experiencing random lock-ups on some LTSP thin clients. There have been posts about random lock-ups on the LTSP mailing lists and on the #ltsp IRC channel. As it became evident that our setups are not the only ones affected, we decided to dig deeper into the issue.
Investigating these lock-ups has not been trivial, as most of the time the thin clients run fine for weeks. And when they hang, everything freezes: display, mouse, keyboard, network. Nothing responds to anything and nothing is sent to syslog. There has also been no pattern to the freezes. Sometimes it's Evolution, sometimes Firefox. Sometimes it's enough to click the mouse once to freeze everything. The only thing in common seemed to be that freezes happened more frequently when a particular Java applet was running.
At first all the regular debugging methods were tried (remote syslog, running top through ssh, xrestop, etc.), but nothing seemed to be wrong. There was plenty of free memory at all times, no errors were logged in syslog, and not even X was hogging memory. Updating to the Ubuntu 8.10 kernel did not help either. After it became clear that syslog and friends would not help here, it was time to turn to kernel serial debugging. With no prior experience in this art, I found various howtos around the net helpful. Thanks also to the helpful people on #ltsp for their encouraging words.
The howtos on the net are comprehensive, so I won't go through all the possible options here. Ubuntu's kernel has all the required options compiled in, so no kernel compiling is needed. For more information, see for example the Remote Serial Console HOWTO at tldp.org.
So after equipping myself with a basic Belkin usb-serial adapter and a null-modem cable, I was ready for some kernel fun. As the kernel needs the console= parameter to define the serial port where everything is sent, the boot parameters under /var/lib/tftpboot/ltsp/i386/pxelinux.cfg/default needed some changes.
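As a rough illustration of that kind of change, the Python sketch below appends console parameters to the APPEND line of a pxelinux configuration. The function name, the sample boot entry, and the 115200n8 speed are assumptions made for the example and are not taken from the actual setup; the speed only has to match whatever the listening end is configured for.

```python
def add_serial_console(config_text, port="ttyS0", baud=115200):
    """Return config_text with console= parameters added to each APPEND line.

    The port and baud rate defaults are illustrative assumptions.
    """
    serial = f"console={port},{baud}n8"
    out = []
    for line in config_text.splitlines():
        words = line.split()
        # pxelinux keywords are case-insensitive; only APPEND lines are touched.
        if words[:1] and words[0].lower() == "append":
            # console=tty0 keeps kernel messages on the local screen as well;
            # parameters that are already present are not added twice.
            missing = [p for p in ("console=tty0", serial) if p not in words]
            line = " ".join([line.rstrip()] + missing)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical boot entry for demonstration, not the real LTSP file.
example = "default ltsp\nlabel ltsp\nkernel vmlinuz\nappend ro initrd=initrd.img\n"
print(add_serial_console(example), end="")
```

Running the function twice on the same text leaves it unchanged, so it is safe to reapply after the file has been regenerated.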
Normally the file looks something like this:
After adding the required parameters, it became like this:
Now when the thin client boots, pxelinux passes the console= parameters to the kernel and the kernel sends its output to /dev/ttyS0. There needs to be something on the other end listening to the thin client's kernel. This time I connected the USB-serial adapter to my laptop, which recognized the adapter as /dev/ttyUSB0. As I remembered from the old days how minicom works, I turned to it. It is started in setup mode from the command line with:
The serial port settings need to match the settings passed as kernel parameters:
Once the settings are correct, the kernel on the other end should start answering SysRq key commands. Turning up the log level is a good idea (ctrl-a f sends a break from minicom), so that the dump shows up automatically when the kernel crashes:
Now I was ready to see what happens when the terminal locks up, but of course it didn't. Being frustrated over a thin client that was too stable didn't feel right either. Finally, after I had stressed the thin client in every possible way, it froze and things started happening on the minicom screen:
The next task is to figure out what the message from the kernel actually means. Googling and searching the bug reports on Launchpad and kernel.org quickly revealed quite a few r8169-related bugs, but the exact cause of this particular freeze is still under investigation. Various patches touch the code in question and the dumps from lock-ups look different with newer kernel versions, but the bug is still there in Ubuntu's kernel 126.96.36.199.
We have already managed to nail some other freezes with the same method, so this certainly seems to be an effective way of finding out why some LTSP thin clients are freezing.
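Since a freeze can take weeks to show up, it may help to keep a timestamped copy of everything the serial console prints instead of relying on someone watching the terminal. The following Python sketch is one possible way to do that; the function name and the sample input are assumptions for illustration, and it presumes the serial port has already been configured with the matching speed by other means.

```python
import io
from datetime import datetime


def log_console(stream, out, now=datetime.now):
    """Copy lines from a binary stream to a text log, prefixing a timestamp.

    Returns the number of lines written.
    """
    count = 0
    for raw in iter(stream.readline, b""):
        # Kernel output is mostly ASCII; undecodable bytes are replaced
        # rather than aborting in the middle of a dump.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        out.write(f"{now().isoformat(timespec='seconds')} {text}\n")
        out.flush()  # keep the log current if the capture is interrupted
        count += 1
    return count


# Placeholder input standing in for serial data; not real kernel output.
demo = io.BytesIO(b"first line\r\nsecond line\r\n")
log_console(demo, io.StringIO())
```

For a real capture, the stream could be the adapter's device file opened in unbuffered binary mode, for example `open("/dev/ttyUSB0", "rb", buffering=0)`, and `out` a log file opened for appending.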