RE: $subject

On Fri, 2 Mar 2001, Vipin Malik wrote:

> This was caused when a 0xffffffff is found in the flash. If you
> look at the code, the same address is read again. But this time
> was read as NOT A 0xffffffff!!!! 


> There was a very simple test for it. Just erase the sector where
> I was seeing the problem and it should go away.
> And it did. On both banks of memory!

Heh. I actually found this in testing JFFS2 a day or so ago, too. I was 
triggering the 'ofs 0x%08x has already been seen.' complaint in 
jffs2_scan_eraseblock, because the same thing was happening - first 
reading 0xffffffff then jffs2_scan_empty was reading something different
and deciding that the length of empty flash was zero. 
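To make the failure mode concrete, here is a minimal sketch of a scan
that reads every word twice and flags the block if the two reads
disagree. It is not the actual jffs2_scan_eraseblock code; the
read_word_fn callback, the block_state enum and the fake_flash test
device are all made up for illustration. Note that it can only catch a
word which happens to flip between the two reads.

```c
#include <stddef.h>
#include <stdint.h>

#define EMPTY_WORD 0xffffffffu

/* Hypothetical read callback: returns the 32-bit word at index idx. */
typedef uint32_t (*read_word_fn)(void *ctx, size_t idx);

enum block_state { BLOCK_EMPTY, BLOCK_HAS_DATA, BLOCK_UNSTABLE };

/* Read each word twice; any disagreement marks the whole block unstable. */
enum block_state classify_block(read_word_fn rd, void *ctx, size_t nwords)
{
    enum block_state state = BLOCK_EMPTY;

    for (size_t i = 0; i < nwords; i++) {
        uint32_t first = rd(ctx, i);
        uint32_t second = rd(ctx, i);

        if (first != second)
            return BLOCK_UNSTABLE;      /* candidate for immediate erase */
        if (first != EMPTY_WORD)
            state = BLOCK_HAS_DATA;     /* stable, but not erased flash */
    }
    return state;
}

/* Fake device for experimenting: one word may misbehave on re-read. */
struct fake_flash {
    const uint32_t *words;
    size_t flaky_idx;       /* index of the bad word, SIZE_MAX for none */
    unsigned flaky_reads;   /* how often the bad word has been read */
};

static uint32_t fake_read(void *ctx, size_t idx)
{
    struct fake_flash *f = ctx;

    /* The bad word reads as erased once, then as something else. */
    if (idx == f->flaky_idx)
        return (f->flaky_reads++ == 0) ? EMPTY_WORD : 0xfffffffeu;
    return f->words[idx];
}
```

A block that comes back BLOCK_UNSTABLE would then be treated as pure
dirt rather than parsed word by word.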

> Why I say, major problem with flash file systems, is because 
> I have not seen any code that addresses this problem. Additionally
> this would affect jffs1 and jffs2 as well as any other file system
> on any NOR flash memory (I don't know enough about NAND types).

In jffs2 there's going to be no valid data in the offending block; only
dirt, so we just stick it on the erase_pending_list and it gets erased
shortly thereafter. This code appears to be working.

Thinks.... not sure what happens if there are still readable obsoleted 
nodes in the offending block. I'll check.

In jffs1 you should also be able to erase it immediately, but it's just 
a little more difficult to determine that's the case. The initial scan 
will see it as 'dirt' in the middle of the free area, and that's why 
everything gets confused.

> I don't have a fix in place yet, but will try out something soon.
> I guess, detect something like this and just mark the entire
> erase sector as "dirty". Don't bother to read the data contents
> a double word at a time to figure out what's in there.
> The gc thread would then just erase the sector as part of the 
> normal gc process. Once erased, the sector is then as good as new.

Yes, sounds reasonable, except I think you ought to do the erase 
immediately, and you need to make sure it doesn't confuse the scan code 
into thinking the head or tail of the log is at one end of the offending 
block when it shouldn't be. If we account lots of space which is actually 
free as dirty space just because it's between the offending node and the 
real head or tail of the log, we may run out of free space totally and 
kill the filesystem.

It may require turning the scan into a two-stage process - first building
the jffs_fm list of nodes, and then in the second pass going through it
looking for the largest expanse of space which is either clean or
erasable, doing the necessary erases and setting up the rest of the data
structures to make that the free area between the head and the tail
of the log.
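A rough sketch of what the second pass might look like, simplified to
one state per erase block rather than a real jffs_fm list, and assuming
the log wraps around the end of the device. The blk enum, struct
free_run and find_free_area() are invented names, not anything in the
current code.

```c
#include <stddef.h>

/* Per-block result of the first pass (hypothetical classification). */
enum blk { BLK_USED, BLK_CLEAN, BLK_ERASABLE };

struct free_run {
    size_t start;   /* index of the first block in the run */
    size_t len;     /* number of blocks in the run */
};

/*
 * Find the longest run of blocks that are clean or erasable, treating
 * the block array as circular so a run may wrap past the last block.
 */
struct free_run find_free_area(const enum blk *blocks, size_t n)
{
    struct free_run best = { 0, 0 };
    size_t run = 0;

    /* Walk the array twice so a wrapping run is seen in one piece. */
    for (size_t i = 0; i < 2 * n; i++) {
        if (blocks[i % n] == BLK_USED) {
            run = 0;
            continue;
        }
        if (run < n)
            run++;              /* never count more than n blocks */
        if (run > best.len) {
            best.len = run;
            best.start = (i + 1 - run) % n;
        }
    }
    return best;
}
```

The erasable blocks inside the chosen run would be erased before the
head and tail of the log are set up around it; everything else that is
erasable can be left for the normal gc.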

The scary thing is that I don't think we can detect the case where the
block is mostly erased and there's only one or two words which might be
doing this, and they happened to return 0xffffffff when we read from them.
Except by re-erasing all empty blocks when we mount :)

To avoid this, it's advisable to invoke the MTD power management suspend()
method before powering down the device in normal operation and, unless it's
absolutely necessary, only continue with the power down if/when it returns.

Obviously we should deal as well as possible with the case where that
hasn't happened, so the testing is very useful.
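The suspend-before-power-down rule could be sketched like this. The
struct flash_dev here is a made-up stand-in, not the kernel's struct
mtd_info, and the convention that suspend returns 0 on success and
nonzero when the chip is still busy is an assumption for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a flash device with a suspend hook. */
struct flash_dev {
    /* Assumed convention: 0 when suspended, nonzero if still busy. */
    int (*suspend)(struct flash_dev *dev);
};

/*
 * Decide whether to go ahead with a power down. Proceed if the device
 * suspended cleanly, or if the caller says power must go regardless.
 */
bool may_power_down(struct flash_dev *dev, bool must_power_down)
{
    int ret = dev->suspend ? dev->suspend(dev) : 0;

    if (ret == 0)
        return true;
    return must_power_down;     /* risky: an erase or write may be cut off */
}

/* Two fake suspend hooks for trying the logic out. */
static int suspend_ok(struct flash_dev *dev)
{
    (void)dev;
    return 0;
}

static int suspend_busy(struct flash_dev *dev)
{
    (void)dev;
    return -1;
}
```

The forced path is exactly the case the scan code still has to survive.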

