
RE: Kernel oops using DMA on MCM



Thank you Per. This could explain why I started to get these oopses now. I
am going to give it a shot.

Mahmut

> -----Original Message-----
> From: Per Zander [mailto:per.zander@xxxxxxx.com]
> Sent: Wednesday, 18 December 2002 20:31
> To: Mikael Starvik
> Cc: 'Fettahlioglu, Mahmut'; dev-etrax
> Subject: RE: Kernel oops using DMA on MCM
> 
> 
> Hi,
> 
> An addition to Mikael's comments:
> 
> The cache bug discussed is present in all ETRAX 100LX versions. How
> often it occurs, however, depends on many factors. One important
> factor is the speed of the memory: the bug is more likely to occur
> with faster memory. If your problem is related to the cache bug, the
> difference in memory speed could be the reason why you see the
> problems on the MCM but not on the developer board.
> 
> Per Zander
> 
> On Wed, 18 Dec 2002, Mikael Starvik wrote:
> 
> > Hi,
> > 
> > >We are mainly using the system to drive a synchronous serial port with DMA.
> > >When I tried to use the driver that I had developed for the developer board
> > >(which had an ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),
> > 
> > There are some minor differences in the synchronous serial port between R1
> > and R2. But I don't think that is related to your problem. Are the OOPSes
> > identical each time or do they look different?
> > 
> > >According to the stack trace, it appears as if the kernel is crashing when
> > >trying to execute "execve" system call, and is getting a mmu bus fault.
> > 
> > I agree that it seems bogus, but it is strange that it is a quite logical
> > sequence of operations that is printed in the trace.
> > 
> > >One is flushing the CPU L1 cache before returning descriptors, the other is
> > >aligning some data so that cache bug does not occur. These did not really
> > >help to understand what is the actual bug though. I read the errata.txt in
> > >the web site but could not find any info on the bug. Can you please give me
> > >some information on what is the bug?
> > 
> > Short version: If the input DMA buffers are in the cache you may
> >                corrupt data at other "random" addresses.
> >                The bug is present in all ETRAX 100LX revisions.
> > Long version at the end of the email if you really want to know.
> > 
> > To make sure that the DMA buffers are not affected by the bug, two
> > measures must be taken:
> > 1. Make sure that the buffers don't share a cache line with any
> > other data. This is achieved by aligning the buffer to a cache line
> > boundary and making its size a multiple of the cache line size.
> > 2. Make sure that the buffer is not present in the cache when DMA runs.
> > This is achieved by calling prepare_rx_descriptor when descriptors
> > are returned to the DMA.
> > 
> > In the Ethernet driver this is rather complicated, for performance
> > reasons. In most other drivers it is simple to add the workaround.
> > 
> > For the synchronous serial port I suggest:
> > 
> > 1. Align buffers by modifying struct sync_port:
> > char in_buffer[IN_BUFFER_SIZE] __attribute__ ((aligned(32)));
> > 2. In start_dma_in call prepare_rx_descriptor just before
> >    *port->input_dma_first = virt_to_phys(&port->in_descr2); and
> >    *port->input_dma_first = virt_to_phys(&port->in_descr1);
> > e.g. prepare_rx_descriptor(&port->in_descr2);
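
The alignment half of this suggestion can be checked at compile time. The following is only a minimal sketch, assuming a 32-byte cache line (taken from the aligned(32) above) and a made-up IN_BUFFER_SIZE. The names struct sync_port_sketch, struct descr_sketch and dma_buffer_owns_its_cache_lines are hypothetical and are not the driver's real definitions; the call to prepare_rx_descriptor is left out because it exists only in the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32    /* assumption: taken from aligned(32) above */
#define IN_BUFFER_SIZE  4096  /* hypothetical value; must be a multiple of the line size */

/* Hypothetical, opaque stand-in for the driver's DMA descriptor type. */
struct descr_sketch {
    uint32_t words[4];
};

/* Hypothetical cut-down version of struct sync_port. */
struct sync_port_sketch {
    int busy;                       /* ordinary field placed before the buffer */
    struct descr_sketch in_descr1;
    struct descr_sketch in_descr2;
    /* The buffer starts on a cache line boundary ... */
    char in_buffer[IN_BUFFER_SIZE] __attribute__((aligned(CACHE_LINE_SIZE)));
    /* ... and, because its size is a whole number of lines, the next
     * field starts on a fresh cache line. */
    int after;
};

/* Refuse to compile if either condition of measure 1 is violated. */
_Static_assert(IN_BUFFER_SIZE % CACHE_LINE_SIZE == 0,
               "buffer size must be a multiple of the cache line size");
_Static_assert(offsetof(struct sync_port_sketch, in_buffer) % CACHE_LINE_SIZE == 0,
               "buffer must start on a cache line boundary");

/* Run-time form of the same check, usable for buffers allocated elsewhere.
 * Returns 1 when [addr, addr + len) covers whole cache lines only. */
static inline int dma_buffer_owns_its_cache_lines(uintptr_t addr, size_t len)
{
    return len != 0 &&
           addr % CACHE_LINE_SIZE == 0 &&
           len % CACHE_LINE_SIZE == 0;
}
```

The static assertions cover the layout inside the struct; the run-time helper is meant for the case where the buffer address is only known once the port structure has been allocated.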
> > 
> > Now the really long and boring bug description for the people
> > with too much time:
> > 
> > The problem occurs when the CPU performs a cached memory write that
> > spans two cache lines, where the first cache line is present in the
> > cache and the second cache line gives a dirty miss. If the DMA at
> > the same time writes to the cache, the first dword flushed by the
> > CPU miss may get corrupted. The problem is only related to DMA data
> > input buffers. It does not occur with DMA descriptors (not even when
> > the DMA writes the status field), and it does not occur with output
> > DMA. All input DMA channels are affected.
> > Possible workarounds:
> > 1. Make sure that the DMA data input buffers are not present in the
> > cache. This can be achieved by accessing an address n * 8 k away from
> > the buffer position before the buffer is entered into the DMA receive
> > list. One address must be accessed for each cache line.
> > 2. Have the DMA input buffers in non-cached memory.
> > 3. Avoid CPU writes that stretch over cache line borders where the
> > first cache line is clean and the second cache line is dirty.
> > This solution is probably too difficult to use in practice.
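
Workaround 1 and the write pattern behind the bug can be sketched in plain C. This is an illustration only, assuming 32-byte cache lines and an 8 KB cache in which an access n * 8 k away displaces the corresponding line, as the description above implies. The names evict_by_alias and write_straddles_cache_lines are hypothetical; a real driver would call prepare_rx_descriptor rather than this code.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u    /* assumption: from the aligned(32) suggestion */
#define CACHE_SIZE      8192u  /* assumption: from "n * 8 k" in workaround 1 */

/* Returns 1 if a write of `size` bytes at `addr` touches two (or more)
 * cache lines, i.e. the kind of CPU write involved in the bug. */
static inline int write_straddles_cache_lines(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/* Workaround 1: read one address per cache line, n * 8 k away from the
 * buffer, so that the buffer's own lines are displaced from the cache.
 * The caller must guarantee that buf + n * CACHE_SIZE .. + len is valid,
 * readable, cached memory. Returns the number of lines touched. */
static inline size_t evict_by_alias(const volatile unsigned char *buf,
                                    size_t len, size_t n)
{
    const volatile unsigned char *alias = buf + n * CACHE_SIZE;
    size_t lines = 0;

    for (size_t off = 0; off < len; off += CACHE_LINE_SIZE) {
        (void)alias[off];  /* volatile read: the access is not optimized away */
        lines++;
    }
    return lines;
}
```

For a 128-byte buffer this performs four reads, one per cache line, before the buffer would be handed back to the DMA receive list.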
> > 
> > PS. The Axis office will be pretty empty between December 23 and
> > January 7 (lots of holidays at xmas in Sweden) but you can always
> > try to email us anyway. DS
> > 
> > /Mikael
> > 
> > -----Original Message-----
> > From: owner-dev-etrax@xxxxxxx.com [mailto:owner-dev-etrax@xxxxxxx.com] On
> > Behalf Of Fettahlioglu, Mahmut
> > Sent: Wednesday, December 18, 2002 9:09 AM
> > To: dev-etrax
> > Subject: Kernel oops using DMA on MCM
> > 
> > 
> > Hi,
> > 
> > After fixing the memory mirroring problem by connecting two pins on the MCM,
> > today I received a new version of the PCB with the problem fixed, and now we
> > are on the next round of problems :)
> > 
> > We are mainly using the system to drive a synchronous serial port with DMA.
> > When I tried to use the driver that I had developed for the developer board
> > (which had an ETRAX 100LX v1 CPU) on the MCM chip (which has a v2 CPU),
> > sooner or later I get a kernel oops. This driver does full-duplex
> > continuous DMA transmit and receive using a synchronous serial port of the
> > ETRAX 100LX CPU. It had been rigorously tested on the developer board,
> > and was working reliably.
> > 
> > What is interesting is that the kernel appears to be crashing outside of
> > the interrupt handlers, tasklets, user-to-kernel read/write functions, and
> > open/close functions. I assume this is the case because I enclosed these
> > functions with printk's and it looks like the kernel crashes outside these
> > functions. So I suspect the DMA transfer is doing something wrong.
> > 
> > Analysing the kernel oops log using ksymoops does not help unfortunately as
> > the stack trace is definitely incorrect. According to the stack trace, it
> > appears as if the kernel is crashing when trying to execute "execve" system
> > call, and is getting a mmu bus fault. However, the application in question
> > is just the "cat" process which is reading from the device file at that
> > time. So my assumption is, whatever is causing the kernel to crash,
> > overwrites the stack memory, possibly causing the kernel to really call the
> > "execve" function. I have included the error output and ksymoops output to
> > this mail.
> > 
> > There does not seem to be a general problem with executing new
> > processes; after a reasonable amount of testing, spawning processes does
> > not seem to cause a problem at all unless they use the driver. All processes
> > that use the driver, however, randomly cause kernel crashes.
> > 
> > After browsing through some posts in the list, it seems that there is a
> > cache bug which may be a reason for what is happening. Is this bug new in
> > Etrax 100LX v2? If so this would increase its odds. I looked at the ethernet
> > driver for the workaround, but it seems there are two workarounds. One is
> > flushing the CPU L1 cache before returning descriptors, the other is
> > aligning some data so that cache bug does not occur. These did not really
> > help to understand what is the actual bug though. I read the errata.txt in
> > the web site but could not find any info on the bug. Can you please give me
> > some information on what is the bug?
> > 
> > Do you think there is anything different between the two versions of the CPU
> > that can cause the problem?
> > 
> > Below is the output from the debug port when the kernel crashes:
> > --------------------------------------------------------
> > 
> > <4>Oops: 0000
> > <4>IRP: 416d341a SRP: c0033552 DCCR: 000004a8 USP: 9ffffbec MOF: 00000000
> > <4> r0: 35568000  r1: c079bc48   r2: 00000003  r3: 00012580
> > <4> r4: c079bc48  r5: c079bc08   r6: c0661da4  r7: 00000812
> > <4> r8: c0660000  r9: 9fffe934  r10: 00000000 r11: 00000000
> > <4>r12: 00000000 r13: 3556a000 oR10: 00000000
> > <4>R_MMU_CAUSE: 416d303c
> > <4>Process cat (pid: 67, stackpage=c0660000)
> > <4>
> > <4>Stack from 9ffffbec:
> > <4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > <4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > <4>       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > <4>Call Trace: 
> > <4>Stack from c0661b30:
> > <4>       c0008398 c0661c70 c004639a c00464fe 00000000 00000000 416d2000 c01bc55c
> > <4>       c00a3104 00000000 c0661c2c c00465c2 c0661c2c c066a314 c0008398 c004961e
> > <4>       c0660000 00000812 c0661da4 c079bc08 c079bc48 00012580 c0661c2c c0656104
> > <4>Call Trace: [<c0008398>] [<c004639a>] [<c00464fe>] [<c00465c2>] [<c0008398>] [<c004961e>] [<c0014e8a>]
> > <4>       [<c0049418>] [<c00461ba>] [<c0049418>] [<c00461ba>] [<c0033552>] [<c003420e>] [<c0033aea>] [<d4dc000a>]
> > <4>       [<c0014e8a>] [<c0026312>] [<c0025836>] [<c002650e>] [<c00455f0>] [<c00460be>]
> > <4>Code:  Bad IP value.
> > 
> > --------------------------------------------------------
> > 
> > And below is the ksymoops output:
> > --------------------------------------------------------
> > 
> > Warning (Oops_read): Code line not seen, dumping what data is available
> > 
> > 
> > >>EIP; 416d341a Before first symbol   <=====
> > 
> > >>IRP; 416d341a Before first symbol
> > >>SRP; c0033552 <padzero+48/4a>
> > >>DCCR; 000004a8 Before first symbol
> > >>USP; 9ffffbec Before first symbol
> > >>IRP; 416d341a Before first symbol
> > >>SRP; c0033552 <padzero+48/4a>
> > >>DCCR; 000004a8 Before first symbol
> > >>USP; 9ffffbec Before first symbol
> > >>MOF; 00000000 Before first symbol
> > >>r0; 35568000 Before first symbol
> > >>r1; c079bc48 <_end+6a7f28/70c2e0>
> > >>r3; 00012580 Before first symbol
> > >>r4; c079bc48 <_end+6a7f28/70c2e0>
> > >>r5; c079bc08 <_end+6a7ee8/70c2e0>
> > >>r6; c0661da4 <_end+56e084/70c2e0>
> > >>r7; 00000812 Before first symbol
> > >>r8; c0660000 <_end+56c2e0/70c2e0>
> > >>r9; 9fffe934 Before first symbol
> > >>r13; 3556a000 Before first symbol
> > 
> > Trace; c0008398 <printk+0/14c>
> > Trace; c004639a <show_stack+0/90>
> > Trace; c00464fe <show_registers+d4/146>
> > Trace; c00465c2 <die_if_kernel+34/46>
> > Trace; c0008398 <printk+0/14c>
> > Trace; c004961e <do_page_fault+202/2b0>
> > Trace; c0014e8a <do_generic_file_read+3a8/3ae>
> > Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
> > Trace; c00461ba <mmu_bus_fault+28/30>
> > Trace; c0049418 <handle_mmu_bus_fault+b4/b8>
> > Trace; c00461ba <mmu_bus_fault+28/30>
> > Trace; c0033552 <padzero+48/4a>
> > Trace; c003420e <load_elf_binary+724/9fc>
> > Trace; c0033aea <load_elf_binary+0/9fc>
> > Trace; d4dc000a <END_OF_CODE+145c000a/????>
> > Trace; c0014e8a <do_generic_file_read+3a8/3ae>
> > Trace; c0026312 <search_binary_handler+52/fe>
> > Trace; c0025836 <copy_strings+0/1ae>
> > Trace; c002650e <do_execve+150/1aa>
> > Trace; c00455f0 <sys_execve+2a/42>
> > Trace; c00460be <system_call+50/58>
> > 
> > 
> > 1 warning issued.  Results may not be reliable
> > 
> > --------------------------------------------------------
> > 
> > Many thanks, and thanks again for the superb support you've provided for the
> > PCB connection problem.
> > 
> > Regards,
> > 
> > Mahmut
> >  
> > 
> > --------------------------------------------------------------
> > Mahmut Fettahlioglu
> > Senior Software Engineer
> > 
> > Open Access Pty Ltd
> > PO Box 301
> > Crows Nest NSW 1585
> >  
> > Phone		02 9978 7009
> > Fax		02 9978 7099
> > Email		<mahmut.fettahlioglu@xxxxxxx.au>
> > 
> > 
>