How to Fix Linux Kernel Crashes in Systems With USB2 SMARTBoards

tl;dr use Linux 3.6 or later

Those who are still reading this, this story describes how we debugged and got solved initially mysterious problem of oopsing kernels. It is written hoping others would find it useful and we would remember how to do something similar next time.

Starting point

The story started already a year ago when we got first reports from schools about freezing computers. Soon we found out that the freezing computers had SMART Boards (interactive whiteboards made by SMART Technologies) connected to them through USB.

The software that comes with SMART Boards (Notebook and drivers) is proprietary and no source code is available. Because there was no source code to look at, Veli-Matti did some debugging with strace and patched libusb back then to find out what the driver binary was actually doing. It turned out that it was frantically scanning the usb bus, connecting to the board, writing, polling and finally closing the connection. And it looping this over and over multiple times per second.

Luckily searching revealed that “a lsusb bomb” is a good way to trigger all kinds of usb problems. After we realised that USB2 support has been problematic with SMART Boards, we realised that by disabling USB2 support the problem went away. Our problem was not the same as this article in the knowledge base, but it helped us:

http://smarttech.com/us/Support/Browse+Support/Support+Documents/KB1/131884.aspx

Later we found out that we were not the only ones having kernel problems with SMART Boards:

http://lkml.indiana.edu/hypermail/linux/kernel/1112.1/00007.html

After we disabled USB2 support in the problematic machines, life was good again for some time. We reported the problem and waited for a fix that never came. It was time for a real fix.

Study goals

Develop a reliable method for reproducing the issue with specific hardware, i.e. in our labs.
Develop a reliable method for reproducing the issue with generic hardware, so that others (e.g. kernel developers) could reproduce it.
Identify the root cause and pinpoint the exact code section.
Gather as much information as possible and report findings to kernel developers.
Optionally develop a patch which fixes the issue and send it upstream.

Capturing the oops

So, it was starting to be quite obvious, that the system instability was somehow caused by SMARTBoards and/or the proprietary software controlling them. The next step was to capture the oops to get a bit of a solid ground for studying the problem further.

The easiest way to capture the oops is via serial console. See Veli-Matti’s earlier post on how to do it in LTSP, but the basic setup is this:

Connect your host to your serial terminal with a serial cable. Use an USB-adaptor in your terminal if required.
Add console=ttyS0,115200n81 to your kernel’s boot parameters.
Run serial terminal emulator (minicom, screen, cutecom, etc.) with corresponding serial parameters: 115200bps (9600bps did not work reliably), no parity bit, 8 bits per character, and 1 bit long stops.

And this is what we got:

[  182.520363] BUG: unable to handle kernel paging request at 00efe000
[  182.526733] IP: [<c11381d1>] __kmalloc+0xb1/0x1f0
[  182.531505] *pdpt = 000000002ed88001 *pde = 0000000000000000
[  182.537339] Oops: 0000 [#1] SMP
[  182.540628] Modules linked in: nls_utf8 isofs snd_hda_codec_hdmi snd_hda_codec_realtek bnep rfcomm bluetooth parport_pc ppdev snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event i915 snd_seq drm_kms_helper drm snd_timer snd_seq_device snd coretemp dcdbas hid_generic microcode soundcore i2c_algo_bit snd_page_alloc psmouse mac_hid serio_raw mei video lp parport usbhid hid e1000e
[  182.578242]
[  182.579752] Pid: 4482, comm: usboops Not tainted 3.5.4-opinsys #2 Dell Inc. OptiPlex 790/0D28YY
[  182.588589] EIP: 0060:[<c11381d1>] EFLAGS: 00210206 CPU: 0
[  182.594138] EIP is at __kmalloc+0xb1/0x1f0
[  182.598280] EAX: 00000000 EBX: 00efe000 ECX: 0001e0a7 EDX: 0001e0a6
[  182.604617] ESI: f3002200 EDI: 00efe000 EBP: eecffe80 ESP: eecffe4c
[  182.610951]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  182.616409] CR0: 8005003b CR2: 00efe000 CR3: 36bf5000 CR4: 000407f0
[  182.622744] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  182.629078] DR6: ffff0ff0 DR7: 00000400
[  182.632956] Process usboops (pid: 4482, ti=eecfe000 task=f18cb280 task.ti=eecfe000)
[  182.640695] Stack:
[  182.642729]  000220af c1421199 000000c0 00efe000 0001e0a6 00022001 c142b66a 000000d0
[  182.650666]  00000004 0001e0a7 f690f660 f19e4900 f3200a00 eecfff10 c142b66a 00000004
[  182.658606]  00000000 00000000 00000000 00000004 0009a45d eedeb0c0 eecffefc f690f670
[  182.666541] Call Trace:
[  182.669015]  [<c1421199>] ? usb_alloc_urb+0x19/0x40
[  182.673950]  [<c142b66a>] ? usbdev_do_ioctl+0x165a/0x1c10
[  182.679410]  [<c142b66a>] usbdev_do_ioctl+0x165a/0x1c10
[  182.684697]  [<c142bc20>] ? usbdev_do_ioctl+0x1c10/0x1c10
[  182.690166]  [<c142bc2d>] usbdev_ioctl+0xd/0x10
[  182.694749]  [<c1157aa2>] do_vfs_ioctl+0x82/0x5b0
[  182.699506]  [<c108ef6b>] ? ktime_get_ts+0xeb/0x120
[  182.704441]  [<c12bf8f0>] ? copy_to_user+0x40/0x60
[  182.709287]  [<c115869a>] ? poll_select_copy_remaining+0xca/0x110
[  182.717100]  [<c108f0f4>] ? getnstimeofday+0x54/0x120
[  182.723854]  [<c115803f>] sys_ioctl+0x6f/0x80
[  182.729890]  [<c104b5a2>] ? sys_gettimeofday+0x32/0x70
[  182.736743]  [<c15a619f>] sysenter_do_call+0x12/0x28
[  182.743394] Code: 01 00 00 8b 06 64 03 05 e4 7f 90 c1 8b 50 04 8b 18 85 db 89 5d d8 0f 84 28 01 00 00 8b 7d d8 8d 4a 01 8b 46 14 89 4d f0 89 55 dc <8b> 04 07 89 45 e0 89 c3 89 f8 8b 3e 64 0f c7 0f 0f 94 c0 84 c0
[  182.765014] EIP: [<c11381d1>] __kmalloc+0xb1/0x1f0 SS:ESP 0068:eecffe4c
[  182.772621] CR2: 0000000000efe000
[  182.786426] ---[ end trace 8c5bb233276a5431 ]---

But the problem was that it was quite difficult to reproduce. Oopsing was certain, but the time window varied a lot. Sometimes it happened on freshly booted systems almost immediately when SMARTBoard was used and sometimes it took more than a day. Without an easily reproducible method, determining whether a newer kernel would fix the problem is almost impossible. We had to come up with a method which would make the kernel oops reliably and instantly. Enter lsusb_bomb.sh:

#!/bin/sh                                                                                                                                                                                                                                   
set -eu

lsusb_loop() {
    for _ in $(seq 1 10000)
    do
        lsusb -v >/dev/null 2>&1
    done
}

for _ in $(seq 1 15)
do
    lsusb_loop&
done

Veli-Matti’s previous debugging work on this issue had revealed that bombing the USB bus with numerous simultaneous lsusb-processes made the whole system really unstable. And confirmed that the bug still existed in newer kernels. Running lsusb_bomb.sh on a system which had SMARTBoard plugged in and proprietary SMARTBoard software running caused the kernel to oops in a couple of minutes. The first goal of the study was achieved.

Emulating SMARTBoard with USB mouse and a custom program

At this point we had a way to crash kernel from userspace when SMARTBoard was connected to USB bus and proprietary SMART Board processes were running. But to be able to get help from the kernel community, we had to come up with more generic way to reproduce the issue. Sending kernel mailing list a bug report describing a problem which occurs only when a special device is being controlled by a proprietary software did not sound a good idea back then and it does not sound good now either. If you are able to show kernel developers how to easily reproduce the bug you are reporting, you are more likely to get feedback.

Because we were able to crash the kernel from userspace, we knew we should be able to reproduce the problem reliably with any hardware, given just the right conditions. We deciced to spy the proprietary SMARTBoard driver process by wiretapping the communication line between the process and the kernel. Sounds thrilling, but it was actually the most boring part of the debugging process. We just added several syslog() calls to different places inside libusb and replaced the libusb object the SMART Board Service was using with our own tracing version of it.

Based on the captured trace, we wrote a short program, usboops.c, which was emulating the proprietary SMART Board driver well enough (many communication details were ignored, because they were not relevant from the crashing point of view) with an execption that it was designed to communicate with an ordinary USB mouse (actually any USB HID device with an interrupt IN enpoint could be used):

#include <usb.h>

/* Check `lsusb -v' for correct values on your machine. An USB mouse is                                                                                                                                                                      * a good choice. */
static const int ID_VENDOR = 0x093a;
static const int ID_PRODUCT = 0x2510;
static const int B_INTERFACE_NUMBER = 0;
static const int B_ENDPOINT_ADDRESS = 0x81; /* Any interrupt IN endpoint. */
static const int W_MAX_PACKET_SIZE = 4;

int main(void)
{
        struct usb_bus *bus;
        struct usb_device *device;
        usb_dev_handle *dev_handle = NULL;

        usb_init();
        usb_find_busses();
        usb_find_devices();
        for (bus = usb_get_busses(); bus; bus = bus->next) {
                for (device = bus->devices; device; device = device->next) {
                        if (device->descriptor.idVendor == ID_VENDOR
                            && device->descriptor.idProduct == ID_PRODUCT) {
                                dev_handle = usb_open(device);
                                break;
                        }
                }
        }
        usb_detach_kernel_driver_np(dev_handle, B_INTERFACE_NUMBER);
        usb_claim_interface(dev_handle, B_INTERFACE_NUMBER);

        while (1) {
                char buf[W_MAX_PACKET_SIZE];
                usb_interrupt_read(dev_handle, B_ENDPOINT_ADDRESS,
                                   buf, sizeof(buf), 1);
        }

        return 0;
}

And usboops.c proved to be extremely efficient. We managed to crash all test systems in matter of seconds. We found out that v3.6 did not suffer from the problem but all stable releases were crashing. Only requirements were an USB mouse and usboops.c. The second goal of our study was achieved.

Patch hunting

We were getting closer to the solution. Based on our tests, it seemed that v3.6 (a0d271cbfed1dd50278c6b06bead3d00ba0a88f9) was not suffering from the issue, but v3.5.4 (d61ed4631511b08d2e14924eab16a9ddaed44df6) definitely was. Good news was that v3.6 seemed to fix the problem, but we had to pinpoint the commit which seemed to fix the problem, because:

None of our production systems were using Linux 3.6 at that time and backporting newer kernel to our systems would have broken many things. If we could pinpoint the commit which is the first to fix the problem, then we might be able to extract the fix and backport it to older stable releases.
It was not 100% sure that v3.6 fixed the problem. It might have been that our test case (usboops.c + mouse) was just failing.

Bad news was there was 11130 commits separating v3.5.4 and v3.6. Luckily, there’s a magnificient tool for hunting bugs/fixes: git bisect. Bisecting, compiling and testing -loop took only less than 4 hours (log₂(11130) < 14), and lead us to a conclusion that the following commit would be The Fix:

commit df2022553dd8d34d49e16c19d851ea619438f0ef
Author: Alan Stern <stern@rowland.harvard.edu>
Date:   Wed Jul 11 11:22:26 2012 -0400

    USB: EHCI: use hrtimer for interrupt QH unlink

    This patch (as1577) adds hrtimer support for unlinking interrupt QHs
    in ehci-hcd.  The current code relies on a fixed delay of either 2 or
    55 us, which is not always adequate and in any case is totally bogus.
    Thanks to internal caching, the EHCI hardware may continue to access
    an interrupt QH for more than a millisecond after it has been unlinked.

    In fact, the EHCI spec doesn't say how long to wait before using an
    unlinked interrupt QH.  The patch sets the delay to 9 microframes
    minimum, which ought to be adequate.

    Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

We had achieved the third goal.

Reporting

Now we were ready to report our finding to the kernel community. We had an easily reproducible bug. Check. We had bisected the history and found a commit which seemed to fix the bug but which was not part of any stable releases. Check. The bug hadn’t been reported earlier by anyone else. Check. Then, writing and reporting the bug was just a matter of being accurate and following instructions.

And the result is here: http://www.spinics.net/lists/linux-usb/msg71648.html

We got the confirmation we were asking for only 41 minutes and 15 seconds later. Crazy.

The fourth goal was achieved.

Epilogue

We never had to try achieving the fifth goal. Alan had done it for us. And to be honest, we were lucky. It probably would have been to big bite to chew at this point, but the adventure was fruitful, interesting and educational. Our first fixed kernel bug is still lying somewhere in the future, waiting for us to smash it with our virtual flyswatter. But, Bug, where ever you are, beware! We are coming after you.

Research and Development