
Kernel oops using DMA on MCM


After fixing the memory mirroring problem by connecting two pins on the MCM,
today I received a new version of the PCB with the problem fixed, and now we
are on to the next round of problems :)

We are mainly using the system to drive a synchronous serial port with DMA.
When I tried to use the driver that I had developed for the developer board
(which had an ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),
sooner or later I get a kernel oops. This driver does full-duplex
continuous DMA transmit and receive using a synchronous serial port of the
ETRAX 100LX CPU. It had been rigorously tested on the developer board,
and was working reliably.

What is interesting is that the kernel appears to be crashing outside of
the interrupt handlers, tasklets, user-to-kernel read/write functions, and
open/close functions. I assume this because I enclosed these functions
with printk's, and the output suggests the crash happens outside all of
them. So I suspect the DMA transfer is doing something wrong.

Unfortunately, analysing the kernel oops log using ksymoops does not help,
as the stack trace is definitely incorrect. According to the stack trace,
the kernel is crashing while executing the "execve" system call and is
getting an MMU bus fault. However, the application in question is just the
"cat" process, which is reading from the device file at that time. So my
assumption is that whatever is causing the kernel to crash overwrites the
stack memory, possibly causing the kernel to really call the "execve"
function. I have included the error output and the ksymoops output in
this mail.
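
One way to test the "something overwrites memory" hypothesis is to place known guard words on both sides of each DMA buffer and check them periodically. The following is only a minimal user-space sketch of that idea, not code from the driver: the struct, the function names, the guard pattern and the buffer size are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values, chosen only for this sketch. */
#define GUARD_PATTERN 0xDEADBEEFu
#define BUF_BYTES     256

/* A data buffer fenced by a guard word on each side. */
struct guarded_buf {
    uint32_t head_guard;      /* must always hold GUARD_PATTERN */
    uint8_t  data[BUF_BYTES]; /* the area the DMA is allowed to touch */
    uint32_t tail_guard;      /* must always hold GUARD_PATTERN */
};

/* Set both guards and clear the data area. */
static void guarded_buf_init(struct guarded_buf *b)
{
    b->head_guard = GUARD_PATTERN;
    memset(b->data, 0, sizeof(b->data));
    b->tail_guard = GUARD_PATTERN;
}

/* Return 1 if both guards are untouched, 0 if either was overwritten. */
static int guarded_buf_intact(const struct guarded_buf *b)
{
    return b->head_guard == GUARD_PATTERN &&
           b->tail_guard == GUARD_PATTERN;
}
```

If guarded_buf_intact() ever returns 0 for a buffer that only the DMA channel writes to, the transfer has written outside its descriptor's buffer. A check like this only catches overruns adjacent to the buffer, so a clean result does not rule out writes to unrelated addresses.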

There does not seem to be a general problem with executing new processes;
after a reasonable amount of testing, spawning processes does not seem to
cause any problem unless they use the driver. All processes that use the
driver, however, randomly cause kernel crashes.

After browsing through some posts on the list, it seems that there is a
cache bug which may be the reason for what is happening. Is this bug new in
the ETRAX 100LX v2? If so, that would make it a more likely cause. I looked
at the ethernet driver for the workaround, but it seems there are two
workarounds. One is flushing the CPU L1 cache before returning descriptors;
the other is aligning some data so that the cache bug does not occur. These
did not really help me understand what the actual bug is, though. I read the
errata.txt on the web site but could not find any information on the bug.
Can you please give me some information on what the bug is?
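
To make the second workaround concrete: as far as I can tell, aligning the data means making each DMA buffer start on a cache-line boundary and span whole cache lines, so that no unrelated data shares a line with it. Below is a minimal sketch of the arithmetic in plain C. The 32-byte line size and the helper names are assumptions for illustration and are not taken from the ethernet driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; check the chip documentation. */
#define ASSUMED_CACHE_LINE 32u

/* Round a byte count up to a whole number of cache lines. */
static size_t round_up_to_line(size_t n)
{
    return (n + ASSUMED_CACHE_LINE - 1u) &
           ~(size_t)(ASSUMED_CACHE_LINE - 1u);
}

/* Return the first cache-line-aligned address at or after p.
 * The caller must over-allocate by ASSUMED_CACHE_LINE - 1 bytes
 * so that the aligned buffer still fits in the allocation. */
static void *align_to_line(void *p)
{
    uintptr_t a = (uintptr_t)p;

    a = (a + ASSUMED_CACHE_LINE - 1u) &
        ~(uintptr_t)(ASSUMED_CACHE_LINE - 1u);
    return (void *)a;
}
```

With these helpers, a buffer of len bytes would be carved out of an allocation of round_up_to_line(len) + ASSUMED_CACHE_LINE - 1 bytes, starting at align_to_line() of the allocation's base address.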

Do you think there is anything different between the two versions of the CPU
that can cause the problem?

Below is the output from the debug port when the kernel crashes:

<4>Oops: 0000
<4>IRP: 416d341a SRP: c0033552 DCCR: 000004a8 USP: 9ffffbec MOF: 00000000
<4> r0: 35568000  r1: c079bc48   r2: 00000003  r3: 00012580
<4> r4: c079bc48  r5: c079bc08   r6: c0661da4  r7: 00000812
<4> r8: c0660000  r9: 9fffe934  r10: 00000000 r11: 00000000
<4>r12: 00000000 r13: 3556a000 oR10: 00000000
<4>R_MMU_CAUSE: 416d303c
<4>Process cat (pid: 67, stackpage=c0660000)
<4>Stack from 9ffffbec:
<4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>Call Trace: 
<4>Stack from c0661b30:
<4>       c0008398 c0661c70 c004639a c00464fe 00000000 00000000 416d2000
<4>       c00a3104 00000000 c0661c2c c00465c2 c0661c2c c066a314 c0008398
<4>       c0660000 00000812 c0661da4 c079bc08 c079bc48 00012580 c0661c2c
<4>Call Trace: [<c0008398>] [<c004639a>] [<c00464fe>] [<c00465c2>] [<c0008398>] [<c004961e>] [<c0014e8a>]
<4>       [<c0049418>] [<c00461ba>] [<c0049418>] [<c00461ba>] [<c0033552>] [<c003420e>] [<c0033aea>] [<d4dc000a>]
<4>       [<c0014e8a>] [<c0026312>] [<c0025836>] [<c002650e>] [<c00455f0>]
<4>Code:  Bad IP value.


And below is the ksymoops output:

Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; 416d341a Before first symbol   <=====

>>IRP; 416d341a Before first symbol
>>SRP; c0033552 <padzero+48/4a>
>>DCCR; 000004a8 Before first symbol
>>USP; 9ffffbec Before first symbol
>>IRP; 416d341a Before first symbol
>>SRP; c0033552 <padzero+48/4a>
>>DCCR; 000004a8 Before first symbol
>>USP; 9ffffbec Before first symbol
>>MOF; 00000000 Before first symbol
>>r0; 35568000 Before first symbol
>>r1; c079bc48 <_end+6a7f28/70c2e0>
>>r3; 00012580 Before first symbol
>>r4; c079bc48 <_end+6a7f28/70c2e0>
>>r5; c079bc08 <_end+6a7ee8/70c2e0>
>>r6; c0661da4 <_end+56e084/70c2e0>
>>r7; 00000812 Before first symbol
>>r8; c0660000 <_end+56c2e0/70c2e0>
>>r9; 9fffe934 Before first symbol
>>r13; 3556a000 Before first symbol

Trace; c0008398 <printk+0/14c>
Trace; c004639a <show_stack+0/90>
Trace; c00464fe <show_registers+d4/146>
Trace; c00465c2 <die_if_kernel+34/46>
Trace; c0008398 <printk+0/14c>
Trace; c004961e <do_page_fault+202/2b0>
Trace; c0014e8a <do_generic_file_read+3a8/3ae>
Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
Trace; c00461ba <mmu_bus_fault+28/30>
Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
Trace; c00461ba <mmu_bus_fault+28/30>
Trace; c0033552 <padzero+48/4a>
Trace; c003420e <load_elf_binary+724/9fc>
Trace; c0033aea <load_elf_binary+0/9fc>
Trace; d4dc000a <END_OF_CODE+145c000a/????>
Trace; c0014e8a <do_generic_file_read+3a8/3ae>
Trace; c0026312 <search_binary_handler+52/fe>
Trace; c0025836 <copy_strings+0/1ae>
Trace; c002650e <do_execve+150/1aa>
Trace; c00455f0 <sys_execve+2a/42>
Trace; c00460be <system_call+50/58>

1 warning issued.  Results may not be reliable


Many thanks, and thanks again for the superb support you've provided for the
PCB connection problem.


Mahmut Fettahlioglu
Senior Software Engineer

Open Access Pty Ltd
PO Box 301
Crows Nest NSW 1585
Phone		02 9978 7009
Fax		02 9978 7099
Email		<mahmut.fettahlioglu@xxxxxxx.au>