
Pwr Down testing with jffs1 and major problem.[resent with corrected sub.]



Oops! Same message as before but with correct subject.
The message had nothing to do with oops and jffs2!
Sorry!

-----Original Message-----
From: Vipin Malik
To: 'David Woodhouse '; Vipin Malik
Cc: 'jffs-dev '
Sent: 3/2/01 11:22 PM
Subject: RE: jffs2 testing and oops!

Ok, maybe not so dramatic and major :), but a serious problem
nonetheless. I'm surprised that it hasn't cropped up before.

Background:
I've done a lot of power down reliability testing with
jffs1 (aka jffs) on AMD CFI NOR flash.

My intent was to determine if jffs on mtd was reliable enough
to be used in a production system.
During the course of the testing I've been quite frustrated
with some problems of jffs "failing the grade" pretty poorly.

I have run a total of 4 tests so far. They fared as follows:

Test1: VFS refused to mount jffs as the root fs. Died with a "kernel
panic, out of memory with no killable processes" after 69 power cycles.

Test2: Got an unrecoverable error from the garbage collection thread
complaining that the "head offset" was on a non aligned offset. This
was after 7 power cycles!

Test3: Got an error from the GC thread that the free space accounting
was screwed and that the requested erasable size + head offset > end of
flash!
This was after 52 power cycles!

Test4: Same as Test #1. Kernel panic after 11 cycles. Kernel out
of memory with no killable processes while trying to mount
the jffs as the root fs.

I traced the non-aligned offset problem to some free space handling
in the jffs_scan_flash() routine during the mount. I think that I
have a fix in place for that. More testing will reveal if I really
do or not :)

Now the following is the big one. I speculated that the kernel panic
was caused by a non ending (or very long) loop that was leaking
memory during the mount process (most probably in jffs_scan_flash()).

In some conversations with David Woodhouse, he agreed with this
theory, except that he could not see how that could happen.

If code works the way it is supposed to, then he was right, there
was no infinite loop in the code.

But then I saw it happen (in the verbose debug log).

This was caused when a 0xffffffff is found in the flash. If you
look at the code, the same address is then read again. But this time
it was read as NOT A 0xffffffff!!!! This would cause a jffs_fm dirty
node to be allocated for 0 bytes of dirt on the flash!

This process would then repeat forever, thus leaking kernel
memory without progressing down the flash address space.
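
To make the failure mode concrete, here is a small simulation. It is
only a sketch in Python, not the real jffs_scan_flash() code: the names
UnstableFlash, scan, guard and max_records are made up for illustration,
and the record list merely stands in for the jffs_fm nodes. It assumes
one word that reads as erased on one read and as something else on the
next.

```python
ERASED = 0xFFFFFFFF


class UnstableFlash:
    """Toy flash: one word alternates between erased and garbage on each read."""

    def __init__(self, words, unstable_index):
        self.words = list(words)
        self.unstable_index = unstable_index
        self.toggle = False

    def read_word(self, index):
        if index == self.unstable_index:
            # First read looks erased, the re-read does not, and so on.
            self.toggle = not self.toggle
            return ERASED if self.toggle else 0xFFFFFFFE
        return self.words[index]


def scan(flash, n_words, guard, max_records=1000):
    """Walk the flash and return (start, length) records for erased runs."""
    records = []
    pos = 0
    while pos < n_words:
        if len(records) >= max_records:
            # Stand-in for the kernel running out of memory.
            raise MemoryError("scan made no progress at word %d" % pos)
        if flash.read_word(pos) != ERASED:
            pos += 1  # pretend this word is valid data and move on
            continue
        # The word looked erased: measure the run by re-reading from pos.
        run = 0
        while pos + run < n_words and flash.read_word(pos + run) == ERASED:
            run += 1
        if run == 0 and guard:
            # The two reads disagreed; always advance by at least one word.
            run = 1
        records.append((pos, run))
        pos += run
    return records
```

Without the guard the scan keeps appending zero-length records at the
same position until it hits max_records; with the guard it always moves
forward by at least one word.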

Hardware bug, timing problem you say! That's what I thought too.

So I wrote a "dd script" that dd'd a small amount of flash block
twice and did a "diff" on that. It failed :( I had a timing problem.
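
For reference, the same read-twice-and-compare check can be sketched in
Python. This is not the original script: read_region and first_mismatch
are hypothetical helper names, and the path you pass in (an mtd device
node, for example) is an assumption about your setup. Whether both
reads really reach the chip depends on the driver and on any caching in
between, which this sketch does not control.

```python
import os


def read_region(path, offset, length):
    """Read up to `length` bytes starting at `offset` using unbuffered reads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        chunks = []
        remaining = length
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:
                break  # end of file or device
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def first_mismatch(path, offset, length):
    """Read the same region twice; return the first differing offset or None."""
    first = read_region(path, offset, length)
    second = read_region(path, offset, length)
    for i, (a, b) in enumerate(zip(first, second)):
        if a != b:
            return offset + i
    if len(first) != len(second):
        return offset + min(len(first), len(second))
    return None
```

A return value of None only means the two passes agreed this time; it
does not prove that the region is stable.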

I triple checked my code, looked at the solder under a magnifying
glass, maxed out the wait states to flash. No luck.

I tried my other bank of flash memory. Success!

Ok, so I blew some chips. Bummer, but let's restart the power fail
test.

That was test #4. Kernel panic due to running out of memory.
Bummer :(

I fired up my test script again. This time it failed on the "good bank"
of flash memory too! I had destroyed the other bank also.
Dang the company that had made the eval board! Dumb hardware guys!

Driving home depressed in the cold and the rain, a thought struck me.
What if, there is no hardware problem and this is what is really
happening?!

Ok, to cut a long story short (yeah, it's too late for that now you say
:)
My flash chips are *really* flipping the bits that I read from
the same memory location. This is NOT a timing problem or a hardware
problem. And it's NOT the same bit either, nor at the same location.

What I suspect is happening is that power is failing in the middle
of a sector erase. The next time power is restored, the sector
is left in the state it was in when power failed, but reverts to
read mode. In flash memories, sectors are erased by injecting charge
into a floating gate (or something to that effect). This charge
transfer is not an abrupt process. What if power fails when just
enough charge has been transferred to the floating gate to bring it
into the "linear" region of the sense amplifiers (you get the idea)?
This may cause the read (or sense) amplifier to be in an unstable
state or even oscillate.

There was a very simple test for it. Just erase the sector where
I was seeing the problem and it should go away.

And it did. On both banks of memory!

Sorry for the long winded explanation. It took me a week to go through
all the tests and this process and find the answer.
You only had to read a few minutes worth :) Live with it.

The reason I call this a major problem for flash file systems is that
I have not seen any code that addresses it. Additionally,
this would affect jffs1 and jffs2 as well as any other file system
on any NOR flash memory (I don't know enough about NAND types).

I don't have a fix in place yet, but will try out something soon.
I guess, detect something like this and just mark the entire
erase sector as "dirty". Don't bother to read the data contents
a double word at a time to figure out what's in there.
The gc thread would then just erase the sector as part of the 
normal gc process. Once erased, the sector is then as good as new.
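
The idea above can be sketched roughly as follows. This is an
illustration in Python, not a patch for jffs: find_unstable_sectors,
make_flaky_reader, the read_word callback and the rereads count are all
hypothetical. It assumes an unstable word can be caught by reading it
more than once, which may miss words that happen to read the same on
every pass.

```python
def find_unstable_sectors(read_word, n_sectors, words_per_sector, rereads=2):
    """Return the sectors in which some word gave inconsistent reads.

    A sector is flagged on the first disagreement and the rest of it is
    skipped, so its contents are never interpreted word by word.
    """
    dirty = []
    for sector in range(n_sectors):
        base = sector * words_per_sector
        for i in range(words_per_sector):
            first = read_word(base + i)
            if any(read_word(base + i) != first for _ in range(rereads - 1)):
                dirty.append(sector)  # whole sector goes to the gc for erase
                break
    return dirty


def make_flaky_reader(words, unstable_index):
    """Test helper: the word at unstable_index flips its low bit on alternate reads."""
    state = {"reads": 0}

    def read_word(index):
        if index == unstable_index:
            state["reads"] += 1
            return words[index] ^ (state["reads"] & 1)
        return words[index]

    return read_word
```

For example, with three sectors of four words each and an unstable word
at index 5, only sector 1 is reported.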

Will keep you guys posted.


Vipin

-----Original Message-----
From: David Woodhouse
To: Vipin Malik
Cc: jffs-dev
Sent: 3/2/01 5:19 PM
Subject: Re: jffs2 testing and oops!

On Fri, 2 Mar 2001, Vipin Malik wrote:

>> Just for fun (and the fact that I am completely frustrated with jffs1
>> wrt pwr fail testing) I mounted jffs2 and started my "checkfs" program
>> that just writes out 100 files with random data+CRC in a round robin
>> fashion (no power fail yet).
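
The "checkfs" program itself is not shown in this thread. A rough guess
at the kind of test described (files of random data with a CRC, written
round robin) might look like the Python sketch below. The function
names, the file naming, the 4-byte CRC32 trailer and the default sizes
are all assumptions, not the original program.

```python
import os
import struct
import zlib


def write_record(path, size):
    """Write `size` random bytes followed by a 4-byte big-endian CRC32."""
    data = os.urandom(size)
    with open(path, "wb") as f:
        f.write(data)
        f.write(struct.pack(">I", zlib.crc32(data) & 0xFFFFFFFF))
        f.flush()
        os.fsync(f.fileno())  # push the data toward the medium


def verify_record(path):
    """Return True if the stored CRC32 matches the file's data."""
    with open(path, "rb") as f:
        blob = f.read()
    if len(blob) < 4:
        return False
    data = blob[:-4]
    stored = struct.unpack(">I", blob[-4:])[0]
    return (zlib.crc32(data) & 0xFFFFFFFF) == stored


def run_rounds(directory, n_files=100, rounds=1, size=512):
    """Rewrite file 0 .. n_files-1 in order, `rounds` times."""
    for _ in range(rounds):
        for i in range(n_files):
            write_record(os.path.join(directory, "file%03d.bin" % i), size)


def check_all(directory):
    """Return the sorted names of files whose CRC does not verify."""
    return sorted(
        name for name in os.listdir(directory)
        if not verify_record(os.path.join(directory, name))
    )
```

After a power cycle, check_all() on the mounted directory would list
the files that did not survive intact.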

>Sorry you're frustrated. We ought to be able to improve the behaviour of
>jffs1. I'm very grateful for the JFFS2 testing though.

