rescue code / partition table bugs

I'm working with Andrey Filippov of Elphel, Inc. on software for his
Elphel cameras, which use the Axis 100LX chip.
For years (according to Andrey), there has been a problem: after
creating a new flash image and loading it into the Axis 100LX based
device, it would warm boot fine, but on the next cold boot it would
die.  By placing uart puts into the middle of the code i could see
that it hangs in the middle of what is normally (when the debug
output is not present) a very tight loop that zeroes out the BSS for
the decompressor (which decompresses Linux).  I hoped that the latest
Axis code (which includes DRAM initialization fixes) would fix that,
but the problem still exists.  (The workaround is to do anything that
changes the checksum of the partition that contains the partition
table, Linux, and the read-only filesystem with most programs ...
this is a clue!)

In trying to track down that problem i discovered the following bugs
in the rescue program (funny name... it is really a boot stage
program), and in the mkptable script and the code that it generates.
(To review: the "rescue" program in the first flash partition inspects
the 2nd flash partition which must begin with some special code and a
partition table that identifies all remaining partitions. The special
code basically just jumps around the partition table when the "rescue"
code jumps to the first location in the 2nd partition (which it does,
at least in our configuration). The special code and the partition
table are created by a Perl script called mkptable.)

Beginning with mkptable, here is my fix (the incorrect code generation
is commented out, followed by the corrected code generation):

if ($arch eq "cris" || $arch eq "crisv10") {
        #---->           /* As it was... WRONG WRONG WRONG! -Ted */
        # pctp( 0x0f, 0x05 ); # NOP 0
        # pctp( 0x25, 0xf0 ); # DI   2
        # pctp( 0xed, 0xff ); # BA   4
        # pctp( 0x00, 0x00 ); # BA offset 6
        # pctp( 0x0f, 0x05 ); # NOP 8
        #---->           /* fixed -Ted */
        pctp( 0x0f, 0x05 ); # NOP 0
        pctp( 0xf0, 0x25 ); # DI   2
        pctp( 0xff, 0xed ); # BA   4
        pctp( 0x00, 0x00 ); # BA offset 6
        pctp( 0x0f, 0x05 ); # NOP 8
        #----> end

This code is what the rescue code jumps to.  Note that NOP is
correctly coded (0x050f ... this is little endian), but the bytes of
DI were backwards.  According to the ETRAX manual, 0x25f0 is DI, but
the old code produced opcode 0xf025, which is a short branch
instruction (BWF) ... conditional on the "P" bit being set, with a
branch offset of 24 negated ... which if taken will go to a location
before this flash partition.  If the P bit is not set (or even if it
is) we execute the next instruction.  The opcode for BA is 0xedff, but
instead we had 0xffed, which if i read the manual right is
MOVE.D PC,(R13+) ... which trashes both R13 and some memory location,
but may sometimes have little effect.  Following that is where the
branch offset is placed later in the code generation; it is always
0x0058 for us.  This just happens to be BCC .+2+0x58, which (if the
carry bit is clear) will take us just about where we want to be, only
minus a few instructions.  Since the first instructions following are
typically irrelevant, no one is the wiser.  If the carry bit is not
clear ... here i give up the chase.
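
The byte-order point above can be illustrated with a minimal C sketch.
It assumes that pctp() emits its two arguments in order, first
argument at the lower address, so that a little-endian CPU fetches
them as one 16-bit word.  The names le16() and check_byte_order() are
made up for this illustration and are not part of mkptable.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two bytes, given in the order they sit in flash,
 * into the 16-bit word a little-endian CPU would fetch. */
uint16_t le16(uint8_t first, uint8_t second)
{
    return (uint16_t)(first | (second << 8));
}

/* Checks the byte pairs quoted in the message (assumed pctp() order). */
void check_byte_order(void)
{
    assert(le16(0x0f, 0x05) == 0x050f); /* NOP, same in both versions */
    assert(le16(0x25, 0xf0) == 0xf025); /* old pair: not the DI opcode */
    assert(le16(0xf0, 0x25) == 0x25f0); /* fixed pair: DI */
    assert(le16(0xed, 0xff) == 0xffed); /* old pair: not the BA opcode */
    assert(le16(0xff, 0xed) == 0xedff); /* fixed pair: BA */
}
```

Calling check_byte_order() passes silently if the arithmetic matches
the opcode values given in the text.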

After fixing mkptable to generate the correct code, i next discovered
that the rescue code has a sanity check that expects the partition
table code to begin with NOP and DI as incorrectly coded ... so i had
to fix that also:

// Ted:WAS:  #define NOP_DI 0xf025050f
#define NOP_DI 0x25f0050f
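
As a rough illustration of why the constant had to change along with
mkptable, here is a C sketch that reads the first four bytes of the
partition as one little-endian 32-bit word and compares it with the
constant.  The helper names and the _OLD/_NEW suffixes are
hypothetical; the real check lives in the rescue code and may be
written differently.

```c
#include <assert.h>
#include <stdint.h>

#define NOP_DI_OLD 0xf025050fu /* value the sanity check used before */
#define NOP_DI_NEW 0x25f0050fu /* value after the fix */

/* Read four bytes as one little-endian 32-bit word. */
uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: nonzero if the bytes match the fixed NOP_DI. */
int starts_with_nop_di(const uint8_t *p)
{
    return le32(p) == NOP_DI_NEW;
}

void check_nop_di(void)
{
    const uint8_t fixed[4] = { 0x0f, 0x05, 0xf0, 0x25 }; /* new mkptable */
    const uint8_t old[4]   = { 0x0f, 0x05, 0x25, 0xf0 }; /* old mkptable */

    assert(starts_with_nop_di(fixed));
    assert(!starts_with_nop_di(old));
    assert(le32(old) == NOP_DI_OLD);
    /* Low half is NOP (0x050f), high half is DI (0x25f0). */
    assert((NOP_DI_NEW & 0xffffu) == 0x050fu);
    assert((NOP_DI_NEW >> 16) == 0x25f0u);
}
```

This is why a rescue partition built with the old constant rejects a
partition table generated by the fixed mkptable, and vice versa.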

I also noted that the rescue code
(os/linux/arch/cris/boot/rescue/head.S) computes r7 as either -1 (no
bootable partition found) or the offset of the bootable partition.
Note the word "offset" (which appears in the comments): to find the
actual address one would have to add PTABLE_START to it, but this
doesn't happen.  Since in our case we always boot from the default
partition (the one with the partition table in it), i just commented
out enough code to force the default case always to occur ... a
better solution could easily be found, of course.
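
The missing step can be sketched in C as follows.  The value of
PTABLE_START_EXAMPLE is invented for the illustration (the real
PTABLE_START depends on the flash layout), and boot_address() is a
hypothetical helper, not code from head.S.

```c
#include <assert.h>
#include <stdint.h>

#define PTABLE_START_EXAMPLE  0x10000u    /* made-up value, illustration only */
#define NO_BOOTABLE_PARTITION 0xffffffffu /* the -1 case, as 32 bits */

/* Turn the offset that head.S leaves in r7 into an absolute address.
 * Returns 1 and stores the address on success, or 0 if r7 was -1. */
int boot_address(uint32_t offset, uint32_t *addr)
{
    if (offset == NO_BOOTABLE_PARTITION)
        return 0;                           /* fall back to the default */
    *addr = PTABLE_START_EXAMPLE + offset;  /* the missing addition */
    return 1;
}
```

With the example base above, an offset of 0x58 would map to address
0x10058, while the -1 marker is reported as "no bootable partition".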

Since making these changes (and doing a flashitall to fix my rescue
code), i have relinked the kernel several times and reloaded it into
flash without problem.  This doesn't prove that i've actually fixed
anything, since the problem only occurred perhaps 25 percent of the
time anyway and was for all practical purposes "random".
Nevertheless, it seems likely that these miscodings could have caused
the problem (or some other problem, for that matter).

Anyone else making these changes should be aware that they must
reflash the rescue partition as well as the partition containing the
partition table.

-Ted Merrill
emBuild Software Design (www.embuild.com)

p.s. Is there any easy way to search the contents of the archives of
this dev-etrax mailing list? It would be acceptable to me if there
were a way to easily download the entire archive to my computer, where
i could do the search...