We have been experiencing random lock-ups on some ltsp thin clients There have been posts about random lock-ups on ltsp mailing lists and on #ltsp irc channel. As it became evident that our setups are not the only ones affected, we decided to dig deeper in to the issue.
Investigating these lock-ups has not been trivial as most of the time the thin clients run fine for weeks. And when they hang everything freezes - display, mouse, keyboard, network. Nothing responds to anything and nothing is sent to syslog. There has also been no pattern to the freezes. Sometimes it's Evolution, sometimes Firefox. Sometimes it's enough to click mouse once to freeze everything. Only thing in common seemed to be that if a particular java applet was running, freezes would happen more frequently.
At first all regular debugging methods were tried - remote syslog, running top through ssh, xrestop, etc - but nothing seemed to be wrong. There was plenty of free memory at all times, no errors logged on syslog and not even X was hogging memory. Updating to Ubuntu 8.10 kernel did not help either. After it became clear that syslog and friends would not help here, it was time to turn to kernel serial debugging. With no prior experience in this art, various howtos around the net were helpful. Thanks also to the helpful people on #ltsp on encouraging words.
The howto's on the net are comprehensive, so I won't go through all the possible options here. Ubuntu's kernel has all the required options compiled in, so no kernel compiling is needed. For more information check for example the Remote Serial Console HOWTO at tldp.org.
So after equipping myself with a basic Belkin usb-serial adapter and a null-modem cable, I was ready for some kernel fun. As the kernel needs the console= parameter to define the serial port where everything is sent, the boot parameters under /var/lib/tftpboot/ltsp/i386/pxelinux.cfg/default needed some changes.
Normally the file looks something like this:
DEFAULT vmlinuz ro initrd=initrd.img quiet splash IPAPPEND 2
DEFAULT vmlinuz ro initrd=initrd.img console=tty console=ttyS0,115200n8 IPAPPEND 2
minicom -o -s
+-----------------------------------------------------------------------+ | A - Serial Device : /dev/ttyUSB0 | | B - Lockfile Location : /var/lock | | C - Callin Program : | | D - Callout Program : | | E - Bps/Par/Bits : 115200 8N1 | | F - Hardware Flow Control : No | | G - Software Flow Control : No | | | | Change which setting? | +-----------------------------------------------------------------------+
ctrl-a f 9
[35774.932022] Modules linked in: ipv6 wmi video output sbs sbshc pci_slot container battery ac lp padlock_aes crypto_blkcipher aes_generic sg snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_mpu401_uart snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event evdev snd_seq snd_timer snd_seq_device snd pata_acpi ata_generic serio_raw parport_pc pata_via button parport snd_page_alloc i2c_viapro libata soundcore psmouse i2c_core shpchp scsi_mod pci_hotplug via_agp dock floppy agpgart pcspkr af_packet usbhid hid uhci_hcd ehci_hcd via_rhine mii usbcore r8169 thermal processor fan unionfs squashfs nbd fbcon tileblit font bitblit softcursor fuse
[35774.932022]
[35774.932022] Pid: 0, comm: swapper Not tainted (2.6.27-7-generic #1)
[35774.932022] EIP: 0060:[<c017708b>] EFLAGS: 00000246 CPU: 0
[35774.932022] EIP is at handle_IRQ_event+0x1b/0x80
[35774.932022] EAX: 0000000b EBX: de9ac8c0 ECX: de9ac8c0 EDX: de9ac8c0
[35774.932022] ESI: 0000000b EDI: c049a330 EBP: c04a7dec ESP: c04a7ddc
[35774.932022] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[35774.932022] CR0: 8005003b CR2: b6e44000 CR3: 1d961000 CR4: 00000690
[35774.932022] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[35774.932022] DR6: ffff0ff0 DR7: 00000400
[35774.932022] [<c0178d43>] handle_level_irq+0x83/0x100
[35774.932022] [<c037e61d>] ? _spin_lock+0xd/0x10
[35774.932022] [<c0106c15>] do_IRQ+0x45/0x80
[35774.932022] [<c0105003>] common_interrupt+0x23/0x30
[35774.932022] [<df865561>] ? rtl8169_rx_interrupt+0x11/0x4e0 [r8169]
[35774.932022] [<df868530>] rtl8169_poll+0x30/0xc0 [r8169]
[35774.932022] [<c02f4825>] net_rx_action+0xf5/0x230
[35774.932022] [<c0137682>] __do_softirq+0x92/0x120
[35774.932022] [<c013776d>] do_softirq+0x5d/0x60
[35774.932022] [<c01378e5>] irq_exit+0x55/0x90
[35774.932022] [<c0106c1a>] do_IRQ+0x4a/0x80
[35774.932022] [<c0179c5f>] ? rcu_needs_cpu+0x3f/0x50
[35774.932022] [<c0105003>] common_interrupt+0x23/0x30
[35774.932022] [<c011aba5>] ? native_safe_halt+0x5/0x10
[35774.932022] [<c010aaed>] default_idle+0x5d/0x60
[35774.932022] [<c010288d>] cpu_idle+0x7d/0x140
[35774.932022] [<c036edc3>] rest_init+0x53/0x60
[35774.932022] =======================
Happy debugging!