
RE: system hang + kernel oops because of high usb load? (seems to be reproducable) (followup)

>regarding why run_timer_list loops so long with interrupts disabled,
>my guess is that someone swamps the system with timers or that
>the lists somehow got corrupt.
>Try adding a loopcounter in run_timer_list and a function that print
>the timerlists etc. and call that from the watchdog_bite_hook()
>in arch/cris/kernel/traps.c

I think Johan's idea is good. Let's do some deep diving into the code
(boring internal kernel stuff ahead):

>Trace; c0049c50 <default_idle+0/c>

CPU is idle

>Trace; c004b34e <sIRQ17_interrupt+18/2e>

We receive an Ethernet packet

>Trace; c004b830 <do_IRQ+82/86>

do_IRQ calls the appropriate IRQ handler. The code
(arch/cris/kernel/irq.c) looks something like this:

irq_enter(); /* Keep track of IRQ handling in progress */
if (!fast_IRQ) /* Ethernet IRQs are not fast */
handle interrupt;
irq_exit(); /* IRQ handling no longer in progress */

if (softIRQs pending)

>Trace; c000a5b4 <do_softirq+58/9c>

The packet has been queued onto the kernel's receive queue and we
now handle stuff like timers, packets from the receive queue etc.
In kernel/softirq.c the code basically does this:

if (pending_jobs)
  enable IRQs;
  while (more pending)
    run the pending softirq handler;
  disable IRQs;

So clearly the IRQs are on when tasklet_hi_action is called.
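
To make the IRQ state during softirq handling concrete, here is a small
user-space model of the pattern above. It is only a sketch: all the sirq_*
names are made up for illustration and are not the kernel's own, and the
"IRQ flag" is just an int. It records what a handler sees when it runs.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only; the sirq_* names are hypothetical, not kernel APIs. */
#define SIRQ_COUNT 4

static int sirq_irqs_on = 1;                 /* stands in for the CPU IRQ flag */
static unsigned int sirq_pending;            /* one bit per raised softirq */
static void (*sirq_vec[SIRQ_COUNT])(void);   /* registered handlers */
static int sirq_seen_irqs_on = -1;           /* IRQ state observed by a handler */

/* Register a handler for softirq number nr. */
void sirq_open(unsigned int nr, void (*action)(void))
{
    assert(nr < SIRQ_COUNT);
    sirq_vec[nr] = action;
}

/* Mark softirq number nr as pending. */
void sirq_raise(unsigned int nr)
{
    assert(nr < SIRQ_COUNT);
    sirq_pending |= 1u << nr;
}

/* Example handler: remembers whether IRQs were on while it ran. */
void sirq_record_action(void)
{
    sirq_seen_irqs_on = sirq_irqs_on;
}

/* Run all pending handlers; returns how many handlers were called. */
int sirq_run(void)
{
    int ran = 0;

    sirq_irqs_on = 0;                 /* the pending mask is sampled with IRQs off */
    while (sirq_pending != 0) {
        unsigned int pending = sirq_pending;
        unsigned int nr;

        sirq_pending = 0;
        sirq_irqs_on = 1;             /* handlers run with IRQs enabled */
        for (nr = 0; nr < SIRQ_COUNT; nr++) {
            if ((pending & (1u << nr)) != 0 && sirq_vec[nr] != NULL) {
                sirq_vec[nr]();
                ran++;
            }
        }
        sirq_irqs_on = 0;             /* off again before re-checking the mask */
    }
    sirq_irqs_on = 1;                 /* model: the caller had IRQs on */
    return ran;
}
```

Registering sirq_record_action, raising it and calling sirq_run() leaves
sirq_seen_irqs_on set to 1, which is the same conclusion as above: the
handler itself does not run with IRQs blocked.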

>Trace; c000a7c0 <tasklet_hi_action+62/80>

Traverses a list but doesn't touch IRQ enable/disable

>Trace; c000a884 <bh_action+24/58>

Checks that no IRQ is being handled and calls timer_bh

>Trace; c000d358 <timer_bh+a0/a4>

>Trace; c000d738 <run_timer_list+e6/142>

Disables IRQs and traverses the timer list. It does something like:

  while (more timers expired)
    enable IRQs;
    fn(data); /* Call timer callback */
    disable IRQs;

  while(timers added by timer callbacks)
    add timer on pending list

IRQs are enabled while each callback runs, which should allow a timer
interrupt to occur between timer callbacks.

So as far as I can see, the only point where timer IRQs can be blocked
for long is while adding queued timers to the pending list, and I can't
see why a driver should add thousands of timers.
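
To illustrate the argument, here is a rough user-space model of the two
loops. It is a sketch under assumptions, not the kernel code: the model_*
names, the fixed-size pool and the "burst" callback are all invented for
the example. It counts the longest run of list steps done with IRQs off.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only; every model_* name is hypothetical. */
struct model_timer {
    struct model_timer *next;
    void (*fn)(unsigned long);
    unsigned long data;
};

#define MODEL_POOL_SIZE 4096

static int model_irqs_enabled = 1;            /* stands in for the CPU IRQ flag */
static struct model_timer *model_expired;     /* timers due to run now */
static struct model_timer *model_queued;      /* timers added by callbacks */
static struct model_timer *model_pending;     /* list the queued timers end up on */
static struct model_timer model_pool[MODEL_POOL_SIZE];

/* Put a timer on the expired list so the next run calls it. */
void model_add_expired(struct model_timer *t)
{
    t->next = model_expired;
    model_expired = t;
}

/* Well-behaved callback: does nothing, but checks that IRQs are on. */
void model_noop_callback(unsigned long data)
{
    (void)data;
    assert(model_irqs_enabled);
}

/* Misbehaving callback: queues `count` new timers (capped by the pool). */
void model_burst_callback(unsigned long count)
{
    unsigned long i;

    assert(model_irqs_enabled);
    for (i = 0; i < count && i < MODEL_POOL_SIZE; i++) {
        model_pool[i].next = model_queued;
        model_queued = &model_pool[i];
    }
}

/* Returns the longest run of list steps executed with IRQs disabled. */
unsigned long model_run_timer_list(void)
{
    unsigned long steps = 0, longest = 0;

    model_irqs_enabled = 0;
    /* First loop: IRQs are on during each callback, so at most one
     * list step happens per IRQs-off stretch. */
    while (model_expired != NULL) {
        struct model_timer *t = model_expired;

        model_expired = t->next;
        steps++;
        if (steps > longest)
            longest = steps;
        model_irqs_enabled = 1;
        t->fn(t->data);
        model_irqs_enabled = 0;
        steps = 0;
    }
    /* Second loop: IRQs stay off until every queued timer is moved. */
    while (model_queued != NULL) {
        struct model_timer *t = model_queued;

        model_queued = t->next;
        t->next = model_pending;
        model_pending = t;
        steps++;
    }
    if (steps > longest)
        longest = steps;
    model_irqs_enabled = 1;
    return longest;
}
```

In this model a single timer whose callback queues 3000 timers gives an
IRQs-off stretch of 3000 steps, while a quiet timer gives 1. So only the
second loop can grow with the number of timers added by callbacks.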

>Trace; c000d752 <run_timer_list+100/142>

This is just after the first while loop but before the second. I can't
really see why the watchdog should hit there. I have to think about this
for a while. Meanwhile it is a good idea to add the printout of the
list to check if it seems corrupted. Print out the timer callbacks and
check which driver they point to.
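
A printout with a loop counter could look roughly like the sketch below.
It is written as plain user-space C so it can be tried outside the kernel;
struct dbg_timer, dbg_dump_timer_list and the max_entries bound are
hypothetical names, and in the kernel the printing would go through printk
and walk the real timer lists instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a kernel timer entry. */
struct dbg_timer {
    struct dbg_timer *next;
    unsigned long expires;
    void (*function)(unsigned long);
};

/*
 * Walk the list and print each callback address. Returns the number of
 * entries, or -1 if more than max_entries were seen, which suggests the
 * list is either swamped with timers or corrupt (for example a cycle).
 */
int dbg_dump_timer_list(const struct dbg_timer *head, int max_entries)
{
    const struct dbg_timer *t;
    int count = 0;

    assert(max_entries > 0);
    for (t = head; t != NULL; t = t->next) {
        if (count >= max_entries) {
            printf("timer list: more than %d entries, swamped or corrupt?\n",
                   max_entries);
            return -1;
        }
        /* Assumes a function pointer fits in unsigned long. */
        printf("timer %d: expires=%lu function=%#lx\n",
               count, t->expires, (unsigned long)t->function);
        count++;
    }
    printf("timer list: %d entries\n", count);
    return count;
}
```

The printed function addresses can then be looked up in System.map to see
which driver owns each callback.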