
RE: Kernel oops using DMA on MCM



Hi,

>We are mainly using the system to drive a synchronous serial port with DMA.
>When I tried to use the driver that I had developed for the developer board
>(which had a ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),

There are some minor differences in the synchronous serial port between R1
and R2, but I don't think they are related to your problem. Are the OOPSes
identical each time, or do they look different?

>According to the stack trace, it appears as if the kernel is crashing when 
>trying to execute "execve" system call, and is getting a mmu bus fault.

I agree that it seems bogus, but it is strange that the trace shows a
quite logical sequence of operations.

>One is flushing the CPU L1 cache before returning descriptors, the other is
>aligning some data so that cache bug does not occur. These did not really
>help to understand what is the actual bug though. I read the errata.txt in
>the web site but could not find any info on the bug. Can you please give me
>some information on what is the bug?

Short version: If the input DMA buffers are in the cache you may
               corrupt data at other "random" addresses.
               The bug is present in all ETRAX 100LX revisions.
Long version at the end of the email if you really want to know.

To protect the DMA buffers, two measures must be taken:
1. Make sure that the buffers don't share a cache line with any
other data. This is achieved by aligning the buffer to a cache line
boundary and making its size a multiple of the cache line size.
2. Make sure that the buffer is not present in the cache when DMA runs.
This is achieved by calling prepare_rx_descriptor when descriptors
are returned to the DMA.
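
As a rough illustration of measure 1, here is a small sketch in plain
C. The 32-byte line size is taken from the aligned(32) suggestion in
this mail; the struct rx_port, RX_PAYLOAD and owns_its_cache_lines
names are made up for the example and are not from any driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed line size, matching aligned(32) */

/* Round a byte count up to the next multiple of the cache line size. */
#define ROUND_UP_LINE(n) (((n) + CACHE_LINE - 1u) & ~(CACHE_LINE - 1u))

#define RX_PAYLOAD 100u  /* hypothetical payload size, not a multiple of 32 */

/* Hypothetical port structure: the buffer starts on a line boundary and
 * its size is padded to whole lines, so neighbours never share a line. */
struct rx_port {
    int  other_state;
    char in_buffer[ROUND_UP_LINE(RX_PAYLOAD)] __attribute__((aligned(32)));
    int  more_state;
};

/* True when the buffer starts on a line boundary and covers whole lines. */
static int owns_its_cache_lines(const void *buf, size_t size)
{
    return ((uintptr_t)buf % CACHE_LINE == 0) && (size % CACHE_LINE == 0);
}
```

Aligning the start alone is not enough: if the size is not padded, the
member that follows the buffer can end up in the buffer's last cache line.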

In the Ethernet driver the workaround is rather complicated, in order
to preserve performance. In most other drivers it is simple to add the
workaround.

For the synchronous serial port I suggest:

1. Align buffers by modifying struct sync_port:
   char in_buffer[IN_BUFFER_SIZE] __attribute__ ((aligned(32)));
2. In start_dma_in, call prepare_rx_descriptor just before each of
   *port->input_dma_first = virt_to_phys(&port->in_descr2); and
   *port->input_dma_first = virt_to_phys(&port->in_descr1);
   e.g. prepare_rx_descriptor(&port->in_descr2);
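
The ordering in step 2 can be sketched as a user-space model. Everything
with a _model suffix is a stand-in invented for the example (the real
descriptor type, prepare_rx_descriptor and virt_to_phys live in the
kernel and are not reproduced here); the only point is that the cache
eviction happens before the descriptor address is handed to the DMA.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the DMA descriptor; the field layout is not the real one. */
struct dma_descr_model { uint32_t buf; uint16_t sw_len; };

/* Stand-in for struct sync_port, with the aligned input buffer. */
struct sync_port_model {
    struct dma_descr_model in_descr1;
    struct dma_descr_model in_descr2;
    volatile uint32_t *input_dma_first;  /* models the DMA "first" register */
    char in_buffer[256] __attribute__((aligned(32)));
};

/* Records the order of operations: 1 = evict, 2 = register write. */
static int trace[8];
static int trace_len;

static void record(int what)
{
    if (trace_len < 8)
        trace[trace_len++] = what;
}

/* Placeholder for prepare_rx_descriptor: only records that it ran. */
static void prepare_rx_descriptor_model(struct dma_descr_model *d)
{
    (void)d;
    record(1);
}

/* Placeholder for virt_to_phys: identity mapping, truncated to 32 bits. */
static uint32_t virt_to_phys_model(const void *p)
{
    return (uint32_t)(uintptr_t)p;
}

/* Evict the buffer from the cache first, then give the descriptor to DMA. */
static void return_descriptor(struct sync_port_model *port,
                              struct dma_descr_model *d)
{
    prepare_rx_descriptor_model(d);
    *port->input_dma_first = virt_to_phys_model(d);
    record(2);
}
```

In the real driver the same two lines would simply sit next to each
other in start_dma_in, once for in_descr1 and once for in_descr2.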

Now the really long and boring bug description for the people
with too much time:

The problem occurs when the CPU performs a cached memory write that
spans two cache lines, where the first cache line is present in the
cache and the second cache line gives a dirty miss. If the DMA at
the same time writes to the cache, the first dword flushed by the
CPU miss may get corrupted. The problem only affects DMA data
input buffers. It does not occur with DMA descriptors (not even when
the DMA writes the status field), and it does not occur with output
DMA. All input DMA channels are affected.
Possible workarounds:
1. Make sure that the DMA data input buffers are not present in the 
cache. This can be achieved by accessing an address n * 8 k away from
the buffer position before the buffer is entered into the DMA receive
list. One address must be accessed for each cache line. 
2. Have the DMA input buffers in non-cached memory.
3. Avoid CPU writes that stretch over cache line borders where the
first cache line is clean and the second cache line is dirty.
Probably this solution is too difficult to be used in practice. 
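
For workarounds 1 and 3, the address arithmetic can be sketched as
below. This is only an illustration: it assumes a 32-byte cache line
and takes n = 1 for the "n * 8 k" distance, all function names are
invented, and it only computes addresses; the actual access (and
whether the aliased address is a valid place to read) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE   32u    /* assumed cache line size */
#define ALIAS_STRIDE 8192u  /* "n * 8 k" with n = 1 */

/* Start address of the cache line containing addr. */
static uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Number of cache lines touched by the byte range [addr, addr + len). */
static size_t lines_covered(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    return (size_t)((line_base(addr + len - 1u) - line_base(addr))
                    / CACHE_LINE) + 1u;
}

/* Address to access in order to push line i of the buffer out of the
 * cache (workaround 1: one access per cache line). */
static uintptr_t alias_for_line(uintptr_t addr, size_t i)
{
    return line_base(addr) + i * CACHE_LINE + ALIAS_STRIDE;
}

/* Workaround 3 has to avoid writes for which this returns non-zero
 * (in the clean-first, dirty-second case described above). */
static int write_crosses_line(uintptr_t addr, size_t len)
{
    return lines_covered(addr, len) > 1;
}
```

A 256-byte aligned buffer therefore needs eight accesses, one per line,
each 8 k above the line it is meant to evict.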

PS. The Axis office will be pretty empty between December 23 and
January 7 (lots of holidays at xmas in Sweden) but you can always
try to email us anyway. DS

/Mikael

-----Original Message-----
From: owner-dev-etrax@xxxxxxx.com [mailto:owner-dev-etrax@xxxxxxx.com]
On Behalf Of Fettahlioglu, Mahmut
Sent: Wednesday, December 18, 2002 9:09 AM
To: dev-etrax
Subject: Kernel oops using DMA on MCM


Hi,

After fixing the memory mirroring problem by connecting two pins on the MCM,
today I received a new version of the PCB with the problem fixed, and now we
are on the next round of problems :)

We are mainly using the system to drive a synchronous serial port with DMA.
When I tried to use the driver that I had developed for the developer board
(which had an ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),
sooner or later I get a kernel oops. This driver does full-duplex
continuous DMA transmit and receive using a synchronous serial port of the
ETRAX 100LX CPU. It had been rigorously tested on the developer board,
and was working reliably.

What is interesting is, it appears as if the kernel is crashing outside of
the interrupt handlers, tasklets, user-to-kernel read/write functions, and
open/close functions. I assume this is the case as I enclosed these
functions with printk's and it looks like the kernel crashes outside these
functions.  So I am suspecting the DMA transfer is doing something wrong. 

Analysing the kernel oops log using ksymoops does not help unfortunately as
the stack trace is definitely incorrect. According to the stack trace, it
appears as if the kernel is crashing when trying to execute "execve" system
call, and is getting a mmu bus fault. However, the application in question
is just the "cat" process which is reading from the device file at that
time. So my assumption is, whatever is causing the kernel to crash,
overwrites the stack memory, possibly causing the kernel to really call the
"execve" function. I have included the error output and ksymoops output to
this mail.

There does not seem to be a general problem with executing new
processes; after a reasonable amount of testing, spawning processes does
not seem to cause a problem at all unless they use the driver. All processes
that use the driver, however, are randomly causing kernel crashes.

After browsing through some posts in the list, it seems that there is a
cache bug which may be a reason for what is happening. Is this bug new in
Etrax 100LX v2? If so this would increase its odds. I looked at the ethernet
driver for the workaround, but it seems there are two workarounds. One is
flushing the CPU L1 cache before returning descriptors, the other is
aligning some data so that cache bug does not occur. These did not really
help to understand what is the actual bug though. I read the errata.txt in
the web site but could not find any info on the bug. Can you please give me
some information on what is the bug?

Do you think there is anything different between the two versions of the CPU
that can cause the problem?

Below is the output from the debug port when the kernel crashes:
--------------------------------------------------------

<4>Oops: 0000
<4>IRP: 416d341a SRP: c0033552 DCCR: 000004a8 USP: 9ffffbec MOF: 00000000
<4> r0: 35568000  r1: c079bc48   r2: 00000003  r3: 00012580
<4> r4: c079bc48  r5: c079bc08   r6: c0661da4  r7: 00000812
<4> r8: c0660000  r9: 9fffe934  r10: 00000000 r11: 00000000
<4>r12: 00000000 r13: 3556a000 oR10: 00000000
<4>R_MMU_CAUSE: 416d303c
<4>Process cat (pid: 67, stackpage=c0660000)
<4>
<4>Stack from 9ffffbec:
<4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>Call Trace: 
<4>Stack from c0661b30:
<4>       c0008398 c0661c70 c004639a c00464fe 00000000 00000000 416d2000 c01bc55c
<4>       c00a3104 00000000 c0661c2c c00465c2 c0661c2c c066a314 c0008398 c004961e
<4>       c0660000 00000812 c0661da4 c079bc08 c079bc48 00012580 c0661c2c c0656104
<4>Call Trace: [<c0008398>] [<c004639a>] [<c00464fe>] [<c00465c2>] [<c0008398>] [<c004961e>] [<c0014e8a>]
<4>       [<c0049418>] [<c00461ba>] [<c0049418>] [<c00461ba>] [<c0033552>] [<c003420e>] [<c0033aea>] [<d4dc000a>]
<4>       [<c0014e8a>] [<c0026312>] [<c0025836>] [<c002650e>] [<c00455f0>] [<c00460be>]
<4>Code:  Bad IP value.

--------------------------------------------------------

And below is the ksymoops output:
--------------------------------------------------------

Warning (Oops_read): Code line not seen, dumping what data is available


>>EIP; 416d341a Before first symbol   <=====

>>IRP; 416d341a Before first symbol
>>SRP; c0033552 <padzero+48/4a>
>>DCCR; 000004a8 Before first symbol
>>USP; 9ffffbec Before first symbol
>>IRP; 416d341a Before first symbol
>>SRP; c0033552 <padzero+48/4a>
>>DCCR; 000004a8 Before first symbol
>>USP; 9ffffbec Before first symbol
>>MOF; 00000000 Before first symbol
>>r0; 35568000 Before first symbol
>>r1; c079bc48 <_end+6a7f28/70c2e0>
>>r3; 00012580 Before first symbol
>>r4; c079bc48 <_end+6a7f28/70c2e0>
>>r5; c079bc08 <_end+6a7ee8/70c2e0>
>>r6; c0661da4 <_end+56e084/70c2e0>
>>r7; 00000812 Before first symbol
>>r8; c0660000 <_end+56c2e0/70c2e0>
>>r9; 9fffe934 Before first symbol
>>r13; 3556a000 Before first symbol

Trace; c0008398 <printk+0/14c>
Trace; c004639a <show_stack+0/90>
Trace; c00464fe <show_registers+d4/146>
Trace; c00465c2 <die_if_kernel+34/46>
Trace; c0008398 <printk+0/14c>
Trace; c004961e <do_page_fault+202/2b0>
Trace; c0014e8a <do_generic_file_read+3a8/3ae>
Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
Trace; c00461ba <mmu_bus_fault+28/30>
Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
Trace; c00461ba <mmu_bus_fault+28/30>
Trace; c0033552 <padzero+48/4a>
Trace; c003420e <load_elf_binary+724/9fc>
Trace; c0033aea <load_elf_binary+0/9fc>
Trace; d4dc000a <END_OF_CODE+145c000a/????>
Trace; c0014e8a <do_generic_file_read+3a8/3ae>
Trace; c0026312 <search_binary_handler+52/fe>
Trace; c0025836 <copy_strings+0/1ae>
Trace; c002650e <do_execve+150/1aa>
Trace; c00455f0 <sys_execve+2a/42>
Trace; c00460be <system_call+50/58>


1 warning issued.  Results may not be reliable

--------------------------------------------------------

Many thanks, and thanks again for the superb support you've provided for the
PCB connection problem.

Regards,

Mahmut
 
----------------------------------------------------------------------------
---------------------
Mahmut Fettahlioglu
Senior Software Engineer

Open Access Pty Ltd
PO Box 301
Crows Nest NSW 1585
 
Phone		02 9978 7009
Fax		02 9978 7099
Email		<mahmut.fettahlioglu@xxxxxxx.au>
----------------------------------------------------------------------------
---------------------