
RE: Kernel oops using DMA on MCM



Hi,

An addition to Mikael's comments:

The cache bug discussed is present in all ETRAX 100LX versions. The
occurrence of this bug, however, depends on many factors.
One important factor is the speed of the memory. The bug is more likely
to occur with a faster memory. If your problem is related to the cache
bug, the difference in memory speed could be the reason why you see
the problems on the MCM but not on the developer board.

Per Zander

On Wed, 18 Dec 2002, Mikael Starvik wrote:

> Hi,
> 
> >We are mainly using the system to drive a synchronous serial port with DMA.
> >When I tried to use the driver that I had developed for the developer board
> >(which had an ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),
> 
> There are some minor differences in the synchronous serial port between R1
> and R2. But I don't think that is related to your problem. Are the OOPSes
> identical each time or do they look different?
> 
> >According to the stack trace, it appears as if the kernel is crashing when 
> >trying to execute "execve" system call, and is getting a mmu bus fault.
> 
> I agree that it seems bogus, but it is strange that the trace shows a quite
> logical sequence of operations.
> 
> >One is flushing the CPU L1 cache before returning descriptors, the other is
> >aligning some data so that the cache bug does not occur. These did not really
> >help to understand what the actual bug is, though. I read the errata.txt on
> >the web site but could not find any info on the bug. Can you please give me
> >some information on what the bug is?
> 
> Short version: If the input DMA buffers are in the cache you may
>                corrupt data at other "random" addresses.
>                The bug is present in all ETRAX 100LX revisions.
> Long version at the end of the email if you really want to know.
> 
> To make sure that the DMA buffers are safe, two measures must be taken:
> 1. Make sure that the buffers don't share a cache line with any
> other data. This is achieved by aligning the buffer to a cache line
> boundary and making its size a multiple of the cache line size.
> 2. Make sure that the buffer is not present in the cache when DMA runs.
> This is achieved by calling prepare_rx_descriptor when descriptors
> are returned to the DMA.
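
A minimal sketch of measure 1, assuming a 32-byte cache line (the value
implied by the aligned(32) suggestion further down). The names
CACHE_LINE_ROUND_UP, RX_PAYLOAD_SIZE and struct rx_port_sketch are
hypothetical and are not taken from the driver; only the buffer layout
is the point here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache line, matching the aligned(32) suggestion. */
#define CACHE_LINE_SIZE 32

/* Round a size up to a whole number of cache lines. */
#define CACHE_LINE_ROUND_UP(n) \
    (((n) + CACHE_LINE_SIZE - 1) & ~(size_t)(CACHE_LINE_SIZE - 1))

/* Hypothetical payload size, deliberately not a multiple of 32. */
#define RX_PAYLOAD_SIZE 100u
#define RX_BUFFER_SIZE  CACHE_LINE_ROUND_UP(RX_PAYLOAD_SIZE)

/* Hypothetical port structure; the neighbours stand for "other data". */
struct rx_port_sketch {
    int other_state;
    /* Starts on a cache line boundary and fills whole cache lines. */
    char in_buffer[RX_BUFFER_SIZE] __attribute__((aligned(CACHE_LINE_SIZE)));
    int more_state;
};

/* Compile-time check: the buffer size is a multiple of the line size. */
static_assert(RX_BUFFER_SIZE % CACHE_LINE_SIZE == 0,
              "DMA input buffer must cover whole cache lines");

/* Run-time check that the buffer shares no cache line with its neighbours. */
static int buffer_is_isolated(const struct rx_port_sketch *port)
{
    int starts_on_line = ((uintptr_t)port->in_buffer % CACHE_LINE_SIZE) == 0;
    int whole_lines = (sizeof port->in_buffer % CACHE_LINE_SIZE) == 0;
    return starts_on_line && whole_lines;
}
```

Rounding the size up matters as much as the alignment: an aligned buffer
whose size is not a multiple of the cache line size would still share its
last cache line with whatever member follows it.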
> 
> In the Ethernet driver the workaround is rather complicated, in order to
> keep performance up. In most other drivers it is simple to add the
> workaround.
> 
> For the synchronous serial port I suggest:
> 
> 1. Align buffers by modifying struct sync_port:
> char in_buffer[IN_BUFFER_SIZE] __attribute__ ((aligned(32)));
> 2. In start_dma_in call prepare_rx_descriptor just before
>    *port->input_dma_first = virt_to_phys(&port->in_descr2); and
>    *port->input_dma_first = virt_to_phys(&port->in_descr1);
> e.g. prepare_rx_descriptor(&port->in_descr2);
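
The ordering in step 2 can be sketched outside the kernel as follows. This
is only an illustration under stated assumptions: the descriptor type, the
two *_stub functions and the input_dma_first_reg field are hypothetical
stand-ins for the real descriptor, prepare_rx_descriptor, virt_to_phys and
*port->input_dma_first, and the stubs do none of the real cache work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a DMA descriptor. */
struct dma_descr_sketch {
    char *buf;
    unsigned int len;
    int prepared;   /* set by the stub so the ordering can be observed */
};

/* Stub: the real prepare_rx_descriptor removes the buffer from the cache. */
static void prepare_rx_descriptor_stub(struct dma_descr_sketch *d)
{
    assert(d != NULL);
    d->prepared = 1;
}

/* Stub: the real virt_to_phys translates a kernel virtual address. */
static uintptr_t virt_to_phys_stub(void *p)
{
    return (uintptr_t)p;
}

/* Hypothetical port structure with the aligned input buffer from step 1. */
struct sync_port_sketch {
    struct dma_descr_sketch in_descr1;
    struct dma_descr_sketch in_descr2;
    uintptr_t input_dma_first_reg;  /* stands in for *port->input_dma_first */
    char in_buffer[64] __attribute__((aligned(32)));
};

/* Prepare the descriptor first, then hand it to the DMA channel. */
static void hand_descriptor_to_dma(struct sync_port_sketch *port,
                                   struct dma_descr_sketch *d)
{
    prepare_rx_descriptor_stub(d);
    port->input_dma_first_reg = virt_to_phys_stub(d);
}
```

The only point of the sketch is the order of the two statements in
hand_descriptor_to_dma: the descriptor is prepared before its address is
written to the DMA, never after.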
> 
> Now the really long and boring bug description for the people
> with too much time:
> 
> The problem occurs when the CPU performs a cached memory write that 
> reaches over two cache lines, the first cache line is present in the
> cache and the second cache line gives a dirty miss. If the DMA at 
> the same time writes to the cache, the first dword flushed by the 
> CPU miss may get corrupted. The problem is only related to DMA data
> input buffers. It does not occur with DMA descriptors (not even when
> the DMA writes the status field), and it does not occur with output 
> DMA. All input DMA channels are affected.
> Possible workarounds:
> 1. Make sure that the DMA data input buffers are not present in the
> cache. This can be achieved by accessing an address n * 8 k away from
> the buffer position before the buffer is entered into the DMA receive
> list. One address must be accessed for each cache line.
> 2. Have the DMA input buffers in non-cached memory.
> 3. Avoid CPU writes that stretch over cache line borders where the
> first cache line is clean and the second cache line is dirty.
> Probably this solution is too difficult to be used in practice. 
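
The address arithmetic behind the bug condition and workaround 1 can be
sketched as below, assuming a 32-byte cache line and taking "8 k" as 8192
bytes. Both function names are hypothetical. The sketch only computes
addresses; a real workaround would have to read from them in the kernel,
which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE    32u    /* assumption: 32-byte cache line */
#define CACHE_ALIAS_STRIDE 8192u  /* the "8 k" from the description */

/* Does a write of `size` bytes at `addr` reach over two cache lines? */
static int write_spans_two_lines(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/*
 * For each cache line covered by [buf, buf + len), store the address
 * n * 8 k above that line in out[]. Returns the number of cache lines
 * covered; at most max_out entries are written.
 */
static size_t alias_addresses(uintptr_t buf, size_t len, unsigned int n,
                              uintptr_t *out, size_t max_out)
{
    uintptr_t line = buf & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = buf + len;
    size_t count = 0;

    assert(out != NULL || max_out == 0);
    for (; line < end; line += CACHE_LINE_SIZE) {
        if (count < max_out)
            out[count] = line + (uintptr_t)n * CACHE_ALIAS_STRIDE;
        count++;
    }
    return count;
}
```

For example, a 4-byte write at 0x1e covers 0x1e to 0x21 and therefore
spans two cache lines, while the same write at 0x1c stays within one.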
> 
> PS. The Axis office will be pretty empty between December 23 and
> January 7 (lots of holidays at Christmas in Sweden) but you can always
> try to email us anyway. DS
> 
> /Mikael
> 
> -----Original Message-----
> From: owner-dev-etrax@xxxxxxx.com [mailto:owner-dev-etrax@xxxxxxx.com] On
> Behalf Of Fettahlioglu, Mahmut
> Sent: Wednesday, December 18, 2002 9:09 AM
> To: dev-etrax
> Subject: Kernel oops using DMA on MCM
> 
> 
> Hi,
> 
> After fixing the memory mirroring problem by connecting two pins on the MCM,
> today I received a new version of the PCB with the problem fixed, and now we
> are on the next round of problems :)
> 
> We are mainly using the system to drive a synchronous serial port with DMA.
> When I tried to use the driver that I had developed for the developer board
> (which had an ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),
> sooner or later I get a kernel oops. This driver does full-duplex
> continuous DMA transmit and receive using a synchronous serial port of the
> Etrax 100LX CPU. It had been rigorously tested on the developer board,
> and was working reliably.
> 
> What is interesting is, it appears as if the kernel is crashing outside of
> the interrupt handlers, tasklets, user-to-kernel read/write functions, and
> open/close functions. I assume this is the case as I enclosed these
> functions with printk's and it looks like the kernel crashes outside these
> functions.  So I am suspecting the DMA transfer is doing something wrong. 
> 
> Analysing the kernel oops log using ksymoops does not help unfortunately as
> the stack trace is definitely incorrect. According to the stack trace, it
> appears as if the kernel is crashing when trying to execute "execve" system
> call, and is getting a mmu bus fault. However, the application in question
> is just the "cat" process which is reading from the device file at that
> time. So my assumption is, whatever is causing the kernel to crash,
> overwrites the stack memory, possibly causing the kernel to really call the
> "execve" function. I have included the error output and ksymoops output in
> this mail.
> 
> There does not seem to be a general problem related to executing new
> processes; after a reasonable amount of testing, spawning processes does
> not seem to cause a problem at all unless they use the driver. All processes
> that use the driver, however, randomly cause kernel crashes.
> 
> After browsing through some posts in the list, it seems that there is a
> cache bug which may be a reason for what is happening. Is this bug new in
> Etrax 100LX v2? If so this would increase its odds. I looked at the ethernet
> driver for the workaround, but it seems there are two workarounds. One is
> flushing the CPU L1 cache before returning descriptors, the other is
> aligning some data so that the cache bug does not occur. These did not really
> help to understand what the actual bug is, though. I read the errata.txt on
> the web site but could not find any info on the bug. Can you please give me
> some information on what the bug is?
> 
> Do you think there is anything different between the two versions of the CPU
> that can cause the problem?
> 
> Below is the output from the debug port when the kernel crashes:
> --------------------------------------------------------
> 
> <4>Oops: 0000
> <4>IRP: 416d341a SRP: c0033552 DCCR: 000004a8 USP: 9ffffbec MOF: 00000000
> <4> r0: 35568000  r1: c079bc48   r2: 00000003  r3: 00012580
> <4> r4: c079bc48  r5: c079bc08   r6: c0661da4  r7: 00000812
> <4> r8: c0660000  r9: 9fffe934  r10: 00000000 r11: 00000000
> <4>r12: 00000000 r13: 3556a000 oR10: 00000000
> <4>R_MMU_CAUSE: 416d303c
> <4>Process cat (pid: 67, stackpage=c0660000)
> <4>
> <4>Stack from 9ffffbec:
> <4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> <4>Call Trace: 
> <4>Stack from c0661b30:
> <4>       c0008398 c0661c70 c004639a c00464fe 00000000 00000000 416d2000 c01bc55c
> <4>       c00a3104 00000000 c0661c2c c00465c2 c0661c2c c066a314 c0008398 c004961e
> <4>       c0660000 00000812 c0661da4 c079bc08 c079bc48 00012580 c0661c2c c0656104
> <4>Call Trace: [<c0008398>] [<c004639a>] [<c00464fe>] [<c00465c2>] [<c0008398>] [<c004961e>] [<c0014e8a>]
> <4>       [<c0049418>] [<c00461ba>] [<c0049418>] [<c00461ba>] [<c0033552>] [<c003420e>] [<c0033aea>] [<d4dc000a>]
> <4>       [<c0014e8a>] [<c0026312>] [<c0025836>] [<c002650e>] [<c00455f0>] [<c00460be>]
> <4>Code:  Bad IP value.
> 
> --------------------------------------------------------
> 
> And below is the ksymoops output:
> --------------------------------------------------------
> 
> Warning (Oops_read): Code line not seen, dumping what data is available
> 
> 
> >>EIP; 416d341a Before first symbol   <=====
> 
> >>IRP; 416d341a Before first symbol
> >>SRP; c0033552 <padzero+48/4a>
> >>DCCR; 000004a8 Before first symbol
> >>USP; 9ffffbec Before first symbol
> >>IRP; 416d341a Before first symbol
> >>SRP; c0033552 <padzero+48/4a>
> >>DCCR; 000004a8 Before first symbol
> >>USP; 9ffffbec Before first symbol
> >>MOF; 00000000 Before first symbol
> >>r0; 35568000 Before first symbol
> >>r1; c079bc48 <_end+6a7f28/70c2e0>
> >>r3; 00012580 Before first symbol
> >>r4; c079bc48 <_end+6a7f28/70c2e0>
> >>r5; c079bc08 <_end+6a7ee8/70c2e0>
> >>r6; c0661da4 <_end+56e084/70c2e0>
> >>r7; 00000812 Before first symbol
> >>r8; c0660000 <_end+56c2e0/70c2e0>
> >>r9; 9fffe934 Before first symbol
> >>r13; 3556a000 Before first symbol
> 
> Trace; c0008398 <printk+0/14c>
> Trace; c004639a <show_stack+0/90>
> Trace; c00464fe <show_registers+d4/146>
> Trace; c00465c2 <die_if_kernel+34/46>
> Trace; c0008398 <printk+0/14c>
> Trace; c004961e <do_page_fault+202/2b0>
> Trace; c0014e8a <do_generic_file_read+3a8/3ae>
> Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
> Trace; c00461ba <mmu_bus_fault+28/30>
> Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
> Trace; c00461ba <mmu_bus_fault+28/30>
> Trace; c0033552 <padzero+48/4a>
> Trace; c003420e <load_elf_binary+724/9fc>
> Trace; c0033aea <load_elf_binary+0/9fc>
> Trace; d4dc000a <END_OF_CODE+145c000a/????>
> Trace; c0014e8a <do_generic_file_read+3a8/3ae>
> Trace; c0026312 <search_binary_handler+52/fe>
> Trace; c0025836 <copy_strings+0/1ae>
> Trace; c002650e <do_execve+150/1aa>
> Trace; c00455f0 <sys_execve+2a/42>
> Trace; c00460be <system_call+50/58>
> 
> 
> 1 warning issued.  Results may not be reliable
> 
> --------------------------------------------------------
> 
> Many thanks, and thanks again for the superb support you've provided for the
> PCB connection problem.
> 
> Regards,
> 
> Mahmut
>  
> ----------------------------------------------------------------------------
> Mahmut Fettahlioglu
> Senior Software Engineer
> 
> Open Access Pty Ltd
> PO Box 301
> Crows Nest NSW 1585
>  
> Phone		02 9978 7009
> Fax		02 9978 7099
> Email		<mahmut.fettahlioglu@xxxxxxx.au>