Dear list,

I need your help in order to solve one of the strangest
and most complicated problems existing in this universe.

First of all I'd like to mention that I have been using
FreeBSD nearly exclusively (along with Solaris and other
UNIXes) for many years and have never had a problem similar
to this. In fact, I never had *any* problem that required
external help. But now, I'm lost. I don't know what to
try, so I would be glad about any suggestion you could
give me.

I'm familiar with FreeBSD, shell scripting and C. My
skills cover the usual "admin things".

The accident that happened to me is a very strange
thing, strange with regard to why the usual means of
solving such a problem don't seem to fit. In fact, I'm
the second (!) person on earth who encountered this
problem, as far as my investigations revealed. So I'm
not sure if it's solvable at all.

In order to explain what it's about, I'd like to follow
this path:

1. What initially happened?
(impact)

2. How does the problem occur?
(examination)

3. What seems to be the reason?
(diagnosis)

4. What did I try to solve the problem?
(treatment)

5. What kind of solution should be possible?
(prognosis)

This should help to explain my problem properly. If there's
more to know, please ask me. I'll try to answer as precisely
as I can. And don't mind my bad English, it's not my native
language. It's a long story, sorry.





So here I'll go...





1. What initially happened?
---------------------------

First of all, we're talking about this device:

ad0: 114473MB at ata0-master UDMA100

The installation was a FreeBSD 5.4-p something on
a 2 GHz P4 machine with 768 MB SDR-SDRAM, working perfectly
for many years now. The disk contained some partitions
(ad0s1a as /, ad0s1d as /var, ad0s1e as /usr and ad0s1f as
/home), formatted as UFS 2 with Soft Updates (except for /).

While doing some web development (running: xterms with
Midnight Commander and its editor, and Opera), the system
suddenly stopped working, it froze. Some seconds later, it
rebooted. The last message on VT 0 was something like this,
if I remember correctly:

cannot free some inode: already free
automatic reboot

When the system came up again, I relied on fsck_ffs solving
all possible problems, as I knew it from the past. The
result: Many defects in the file system contents, most of
them didn't matter (can reinstall), but it wouldn't make
the /home partition completely accessible again. I could
copy the content from the archive and all the other users'
home directories (luckily), but under no circumstances
could I access my own (!) home directory again.

HEART ATTACK!!!

Of course, I didn't have a good backup (the last one was
many years old). This is because I never encountered any
problems, so I got lazy. Okay, that seems to be the revenge
now. When you don't do your backups, something will happen.
If you do your backups, nothing will happen, and you won't
need them at all. That's their purpose. I'm sure you're
familiar with this wisdom. :-)

We're talking about documentation, mail archives, program
sources and various projects here, data collections
created over many years of hard work. So it's understandable
why I want to get the stuff back as completely as possible.





2. How does the problem occur?
------------------------------

The problem occurred at system startup when running fsck_ffs.

** /dev/ad1s1f
** Last Mounted on /home
** Phase 1 - Check Blocks and Sizes
1035979 BAD I=259127
UNEXPECTED SOFT UPDATE INCONSISTENCY

1101472 DUP I=260035
UNEXPECTED SOFT UPDATE INCONSISTENCY

[...]

1117681 DUP I=260039
UNEXPECTED SOFT UPDATE INCONSISTENCY

1117682 DUP I=260039
UNEXPECTED SOFT UPDATE INCONSISTENCY

EXCESSIVE DUP BLKS I=260039
CONTINUE? yes

[...]

3774433638169537379 BAD I=260051
UNEXPECTED SOFT UPDATE INCONSISTENCY

7021223365635213949 BAD I=260051
UNEXPECTED SOFT UPDATE INCONSISTENCY

8030898235988077411 BAD I=260051
UNEXPECTED SOFT UPDATE INCONSISTENCY

7310315658325879925 BAD I=260051
UNEXPECTED SOFT UPDATE INCONSISTENCY

EXCESSIVE BAD BLKS I=260051
CONTINUE? yes

[...]

1485568 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485569 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485570 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485571 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485572 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485573 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485574 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

1485575 DUP I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

5707022222514874728 BAD I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

8091332836184380774 BAD I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

8598589197767749681 BAD I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

[...]

3631363939722683732 BAD I=290557
UNEXPECTED SOFT UPDATE INCONSISTENCY

EXCESSIVE BAD BLKS I=290557
CONTINUE? yes

INCORRECT BLOCK COUNT I=290557 (3104 should be 736)
CORRECT? yes

fsck_ffs: bad inode number 306176 to nextinode

As is obvious, fsck_ffs fails in phase 1. No recovery is
done.

In my opinion, this indicates a major defect of the file
system. Maybe many defects, one worse than the other. If
fsck_ffs can't repair it, it must be really bad.

Okay, I took the opportunity to take a new hard disk where I
already had installed FreeBSD 7. Why? Because other partitions
were damaged, too. On /dev/ad0s1a, /, nothing significant
happened, but for example on /dev/ad0s1e, /usr, the whole
X11R6/ subtree disappeared, and lost+found/ filled up with
many directory fragments. So I could not use the system
anymore.

I put in the new disk as ad0 and the former ad0 disk as
ad1, and repeated the check that had failed with fsck_ffs
from version 5, this time with fsck_ffs from version 7. NB
that no matter by which other name I called fsck_ffs, be it
fsck_ufs or fsck_4.2bsd, the problem would stay the same.

In order to do some tests, I made a 1:1 copy of the
defective partition. This is a wise step, because I can't
accidentally damage important data, and when I have messed
up a copy, I can pull a new one.

FreeBSD's dd program did the job well. It ran approx. 4
hours without any error message. The defect(s) of the
disk partition are replicated 1:1 in the image.

% cd ~/rescue
% dd if=/dev/ad1s1f of=ad1s1f.dd bs=1m
86566+1 records in
86566+1 records out
90772014080 bytes transferred in 15156.804004 secs (5988862 bytes/sec)

File size of ad1s1f.dd seemed to be good, the partition
contained in this file was correctly recognized:

% file ad1s1f.dd
ad1s1f.dd: Unix Fast File system [v2] (little-endian) last mounted on /mnt,
last written at Wed Jul 2 18:51:06 2008,
clean flag 0,
readonly flag 0,
number of blocks 44322272,
number of data blocks 42925108,
number of cylinder groups 472,
block size 16384,
fragment size 2048,
average file size 16384,
average number of files in dir 64,
pending blocks to free 0,
pending inodes to free 0,
system-wide uuid 0,
minimum percentage of free blocks 8,
TIME optimization
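
A quick plausibility check of the image size, as a minimal sketch in
Python. It only multiplies numbers already shown by dd and file above.
I'm assuming that "number of blocks" in the file output counts
fragments of 2048 bytes; the variable names are mine and not fields of
any tool.

```python
# Figures copied from the dd and file(1) output above.
image_bytes = 90772014080   # bytes transferred by dd
fs_frags = 44322272         # "number of blocks", assumed to be fragments
frag_size = 2048            # "fragment size"

fs_bytes = fs_frags * frag_size   # size the file system claims for itself
slack = image_bytes - fs_bytes    # partition bytes beyond the file system

print(f"file system: {fs_bytes} bytes, image: {image_bytes} bytes")
print(f"image is {slack} bytes larger than the file system")
```

If the assumption holds, the image is slightly larger than the file
system it contains, so at least nothing was cut off at the end.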

Of course, I tried to mount and access the partition's copy
using the vnode mechanism for memory disks:

% sudo mdconfig -a -t vnode -u 10 -f ad1s1f.dd
% mount -o ro /dev/md10 mnt/

Fine, mount worked, so I could see what's on the disk.

+<-/export/home/poly/rescue/mnt------v>+
| Name | Size | MTime |
|/.. |UP--DIR| |
|/.snap | 512|Dec 21 2004|
|/archiv | 512|Feb 27 2006|
|/backup | 512|Sep 23 2005|
|/gast | 1024|Aug 25 2005|
|/lost+found | 2048|Jul 1 10:15|
|/markus | 512|Nov 20 2003|
|/root | 1024|Apr 18 16:17|
|/surf | 1024|Feb 17 2005|
| .fsck_snapshot | 86567M|Jun 30 20:47|
|?poly | 0|Jan 1 1970| <===
+--------------------------------------+
|/.. |
+--------------------------------------+
poly@r55:~/rescue/mnt% [^]
1Help 2Menu 3View 4Edit 5Copy 6RenMov 7Mkdir 8Delete 9PullDn 10Quit

Within the Midnight Commander, the name of the home directory
has been marked with red color and a leading question mark.
Do you recognize the timestamp? Strange. Furthermore, I could
not change into this directory.

% cd mnt/poly
mnt/poly: Not a directory.

% file mnt/poly
mnt/poly: cannot open `mnt/poly' (Bad file descriptor)

But I didn't give up hope yet. The data from within the home
directory seemed to be present. The corresponding inodes don't
seem to be marked as unused. I think this is what "orphan
inodes" are called?

Where do I get this idea from? There's an interesting match
in the disk usage figures I found when trying some df and du
examinations:

% df -h
Filesystem Size Used Avail Capacity Mounted on
/dev/md10 82G 75G 716M 99% /export/home/poly/rescue/mnt

At this point, a strange situation already occurs: The disk
is 82 GB, 75 GB are used, but less than 1 GB is free. So
there's something missing?

I remember that at the point the disk got mad there were only
approx. 700 MB free on /home. This matches the numbers above.
But where's the rest?

% sudo du -sch mnt
du: mnt/poly: Bad file descriptor
du: mnt/archiv/cr/clips.w32/s01.wmv: Bad file descriptor
du: mnt/archiv/cr/clips.w32/s02.wmv: Bad file descriptor
52G mnt
52G total

The disk is 82 GB, 75 GB are used, and the data structures
that are still present make up 52 GB. So there must be
approx. 20 GB somewhere. This could be the content of my
home directory, the important data, my life, the universe,
and everything. :-)
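
The df and du numbers above can be put side by side in a small
calculation. This is only a sketch in Python under my own assumptions:
that "number of data blocks" in the file output counts 2048-byte
fragments, that df holds back the minfree reserve ("minimum percentage
of free blocks 8") before reporting Avail, and that df and du round to
whole GB. The variable names are mine.

```python
GIB = 1024 ** 3

data_frags = 42925108    # "number of data blocks" from file(1), assumed fragments
frag_size = 2048         # "fragment size"
minfree_percent = 8      # "minimum percentage of free blocks"
avail_gib = 716 / 1024   # "Avail" column of df
visible_gib = 52         # what du could still reach

total_gib = data_frags * frag_size / GIB
reserved_gib = total_gib * minfree_percent / 100   # held back by minfree
used_gib = total_gib - reserved_gib - avail_gib    # should match df's "Used"
unaccounted_gib = used_gib - visible_gib           # allocated but unreachable

print(f"total {total_gib:.1f}G, reserved {reserved_gib:.1f}G")
print(f"used {used_gib:.1f}G, unaccounted {unaccounted_gib:.1f}G")
```

With these inputs the reserve alone would explain why 82 GB minus
75 GB does not show up as free space. The gap between "Used" and the
du total is the part I hope is my home directory.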

Furthermore, you'll see two further "Bad file descriptor"
warnings inside the archive directory. They don't matter,
but they surely indicate that more than just the inode of
my home directory died. So more problems can occur while
proceeding.

Of course, running fsck_ffs on the copy made with dd, directly
or via the md device, gives the same error message as
already mentioned.

There was a file /.fsck_snapshot of the partition's respective
size. This file could be mounted, too, and within it there
was a very old copy of my home directory. The snapshot has
been taken at the time when I initially installed and configured
this system, so it was very old, too old.





3. What seems to be the reason?
-------------------------------

The reason seems to be that the inode describing my home
directory doesn't exist anymore. This explains why its name
is still there (stored in a directory entry of the root
directory), but no further information about the file type
(here: directory) and its respective content is available.

But after all, this does not explain why fsck_ffs can't
repair the partition any more, nor can any other program.
This is where my trouble understanding what happened starts.





4. What did I try to solve the problem?
---------------------------------------

As I already mentioned, FreeBSD's fsck_ffs is unable to repair
the partition.

fsck_ffs: bad inode number 306176 to nextinode

Using FreeBSD's clri, I tried to clear the inodes that I
thought would cause the problem of fsck_ffs:

% sudo mdconfig -a -t vnode -u 10 -f ad1s1f.dd
% clri 306176 /dev/md10
% sync

This didn't work at all.

I've tried other versions of fsck_ffs, too, running on my
main machine or another one, from FreeBSD 5, 6 and 7. The
only difference was a FreeBSD 5 system where fsck_ffs crashed
within phase 1 with this message:

fsck_ffs: cannot alloc 1073796864 bytes for inoinfo

It seems that this particular machine didn't have enough RAM
installed.

And no matter if I checked the original partition or the
copy I made with dd, the problem would always be the same.

So then I tried an alternative to FreeBSD's dd, hoping that
some "magical translation" would happen. My first choice was
ddrescue from the ports:

% ddrescue -d -r 3 -n /dev/ad1s1f ad1s1f.ddr logfile
Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued: 0 B, errsize: 0 B, errors: 0
Current status
rescued: 90772 MB, errsize: 0 B, current rate: 6815 kB/s
ipos: 90772 MB, errors: 0, average rate: 6723 kB/s
opos: 90772 MB
Finished

The file ad1s1f.ddr was exactly the same as ad1s1f.dd, so no
gain of hope here.

Another idea was to copy data from the original disk using
FreeBSD's fetch program - fetch -rR. Nope.

Even FreeBSD's recoverdisk, done from the partition or its
copy, just brought up another 1:1 copy including the problem.

% recoverdisk ad1s1f.dd ad1s1f.rd
start size block-len state done remaining % done
90771030016 984064 984064 0 90772014080 0 100.00000
Completed

After this, I tried some "hardcore stuff": The Sleuth Kit
from the ports, and first its dls program:

% dls -v -f ufs -i raw ad1s1f.dd > ad1s1f.dls
File system is corrupt (ffs_group_load: Group 12 descriptor offsets too large at 1129104)

Although it didn't help me either, the error message is
interesting: "Group 12 descriptor offsets too large at
1129104". Sadly, I don't know how to interpret this. Is
1129104 an inode? If yes: it's not allocated. What group is
meant? Cylinder group? Maybe you could tell me.

Another program from The Sleuth Kit, fls, allowed me to see
some content of the partition. In fact, it even showed data
that wasn't accessible, so it's within the range of the files
that need to be restored.

% fls -i raw -r ad1s1f.dd

[...]

d/- * 259072(realloc): poly
+ d/d * 3438592(realloc): 2003-05-17

[...]

+++ d/d 5840896: brazil
++++ r/r 5840897: kate_bush_-_brazil.mp3
++++ r/r 5840898: shangrila_towers.mp3
++++ r/r 5840899: singing_telegram.mp3
++++ r/r 5840900: the_first_noel.mp3
Segmentation fault (core dumped)

So I checked:

% fsdb -r ad1s1f.dd
ad1s1f.dd is not a disk device
CONTINUE? [yn] y
** ad1s1f.dd
Editing file system `ad1s1f.dd'
Last Mounted on /export/home/poly/rescue/mnt
fsdb (inum: 2)> inode 3438592
current inode: directory
I=3438592 MODE=40700 SIZE=512
BTIME=Nov 30 14:31:57 2007 [0 nsec]
MTIME=Jun 26 05:06:14 2008 [0 nsec]
CTIME=Jun 26 05:06:14 2008 [0 nsec]
ATIME=Jul 1 21:13:05 2008 [0 nsec]
OWNER=poly GRP=staff LINKCNT=2 FLAGS=0 BLKCNT=4 GEN=4803f917
fsdb (inum: 3438592)> ls
slot 0 ino 3438592 reclen 12: directory, `.'
slot 1 ino 447497 reclen 12: directory, `..'
slot 2 ino 3438593 reclen 24: regular, `.sylpheed_mark'
slot 3 ino 283193 reclen 12: regular, `1'
slot 4 ino 289966 reclen 12: regular, `2'
slot 5 ino 289970 reclen 12: regular, `3'
slot 6 ino 3438620 reclen 24: regular, `.sylpheed_cache'
slot 7 ino 290363 reclen 12: regular, `4'
slot 8 ino 290366 reclen 12: regular, `5'
slot 9 ino 290385 reclen 12: regular, `6'
slot 10 ino 290444 reclen 368: regular, `7'
fsdb (inum: 3438592)> inode 259072
current inode 259072: unallocated inode
fsdb (inum: 259072)> quit
***** FILE SYSTEM STILL DIRTY *****
*** FILE SYSTEM MARKED DIRTY
*** BE SURE TO RUN FSCK TO CLEAN UP ANY DAMAGE
*** IF IT WAS MOUNTED, RE-MOUNT WITH -u -o reload

Although the directory's name "2003-05-17" indicates that
it should hold pictures from the cam/ subtree, its content
seems to be a Sylpheed MH mail directory. According to fls's
output, inodes 3438592 and 259072 have been reallocated.
And remember 259072? This was my home directory, I think.

Another program from the ports, scan_ffs, would only confirm
what I already knew:

% scan_ffs -lv /dev/md10
block 128 id 3f67c4e6,354efde8 size 44322272
block 160 id 3f67c4e6,354efde8 size 44322272
X: 177289088 0 4.2BSD 2048 16384 0 # /export/home/poly/rescue/mnt
block 12032 id 616e732e,c0690070 size 44322272
block 12416 id 3f67c4e6,354efde8 size 44322272
block 13248 id 6e73746a,c3577600 size 44322272
block 376512 id 3f67c4e6,354efde8 size 44322272
block 752864 id 3f67c4e6,354efde8 size 44322272
block 1129216 id 3f67c4e6,354efde8 size 44322272
block 1505568 id 3f67c4e6,354efde8 size 44322272
[...]

The 4.2BSD partition is still there and intact, okay.
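
The scan_ffs hits can also be used for a bit of arithmetic. The
following is a rough sketch in Python, not something scan_ffs does
itself. I picked only the hits that carry the file system's own id
(3f67c4e6,354efde8) and are evenly spaced, and I'm assuming that these
are the backup superblocks of consecutive cylinder groups, counted in
512-byte sectors. The variable names are mine.

```python
# Sector numbers of the evenly spaced scan_ffs hits (assumed backup superblocks).
backup_sectors = [160, 376512, 752864, 1129216, 1505568]
sectors_per_frag = 2048 // 512   # fragment size / sector size

# The spacing between consecutive backups would be one cylinder group.
gaps = {b - a for a, b in zip(backup_sectors, backup_sectors[1:])}
assert len(gaps) == 1, "backup superblocks are not evenly spaced"
frags_per_group = gaps.pop() // sectors_per_frag

# Treat the number from the dls message as a fragment address.
dls_address = 1129104
group, offset = divmod(dls_address, frags_per_group)

print(f"{frags_per_group} fragments per cylinder group")
print(f"{dls_address} = group {group}, fragment offset {offset}")
```

If the assumptions are right, 1129104 lands a few fragments into
cylinder group 12, which is the group named in the dls message. That
would suggest the number is a fragment address and not an inode.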

The program testdisk, also available from the ports, seems
to have the same purpose. But a lost partition is not the real
problem, I think.

Another approach I found would be to avoid looking at the
file system at all, and instead to parse the disk "byte-wise"
and look for magic bytes. A tool to do so is magicrescue from
the ports.

% magicrescue -r /usr/local/share/magicrescue/recipes -d mr_output /dev/md10
Read error on /dev/md10 at 102400 bytes: Invalid argument

It didn't work on the memory disk, but fortunately on the dd
copy:

% magicrescue -r /usr/local/share/magicrescue/recipes -d mr_output ad1s1f.dd

The files recovered by this program contained many different
types, such as JPG images or MP3 files. Furthermore, files from
within the inaccessible home directory had been restored. This
is another hint that the data should still be there. But sadly,
the file structures could not be retrieved, so I got lots of
stuff into one directory.

From the manual of the program ffs2recov from the ports I found
out that it's possible to create an inode where you can explicitly
specify name and number. So I tried:

% cd ~/rescue
% ffs2recov -c 259072 -n poly ad1s1f.dd

This created a file called "poly" in the ~/rescue directory.
Okay, not what I wanted to get. So I tried something really
stupid:

% cd ~/rescue
% sudo mdconfig -a -t vnode -u 10 -f ad1s1f.dd
% mount -o rw /dev/md10 mnt/
% cd mnt
% ffs2recov -c 259072 -n poly ad1s1f.dd
% sync

panic: ffs_write: type 0xc5d37e04 0 (0,16384)
Dumping 136 MB: 121 105 89 73 57 41 (CTRL-C to abort) 25 9
Dump complete
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

[...]

ad0: 305245MB at ata0-master UDMA100
ad1: 305245MB at ata0-slave UDMA100
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: FAILURE - READ_DMA status=51 error=84 LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: FAILURE - READ_DMA status=51 error=84 LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=64
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=64
ad1: FAILURE - READ_DMA status=51 error=84 LBA=64
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: FAILURE - READ_DMA status=51 error=84 LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: FAILURE - READ_DMA status=51 error=84 LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: FAILURE - READ_DMA status=51 error=84 LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=0
ad1: FAILURE - READ_DMA status=51 error=84 LBA=0

savecore: reboot after panic: ffs_write: type 0xc5d37e04 0 (0,16384)

You can imagine my heartbeat going up to 200 at this moment! :-)
Fortunately, no data was lost. I've got no idea what happened, but
I'm sure my approach was wrong. The system would not react in this
way without a proper reason.

And NB that the ad0 and ad1 you see are completely different
things; the original 120 GB Seagate disk is on the shelf. The
new FreeBSD 7 system is on ad0, and ad1 is reserved for backup
purposes. Why does it complain that much? Okay, never mind,
it's not important now.





5. What kind of solution should be possible?
--------------------------------------------

In general, there would be two options:

a) Modify fsck_ffs so it will work.

b) Modify the file system so fsck_ffs will work.

Of course, I've got no good clue how to do this in particular.
Let me first describe what I did to fsck_ffs.

I first took a look at fsck_ffs's source code. Well... it's
not that I did understand very much of it, sadly, but I could
find the position where the error

fsck_ffs: bad inode number 306176 to nextinode

came from: it was /usr/src/sbin/fsck_ffs/inode.c line 319:

if (inumber != nextino++ || inumber > lastvalidinum)
    errx(EEXIT, "bad inode number %d to nextinode", inumber);

Oh how I love disjunctions in exit conditions! :-) So I made
a change to this part, just to see what would happen. (And:
Yes, I know, "trial & error" is not a programming concept.)
I used a copy of the subtrees sbin/fsck_ffs/ + sbin/mount/
and sys/ufs/ffs/ + sys/ufs/ufs/ from /usr/src/, then issued the
command "make" from within ~/rescue/sbin/fsck_ffs/, which would
give me an executable fsck_ffs in this directory. I copied it
to ~/rescue and tested it with the 1:1 dd copy.

if (inumber != nextino++) {
    printf("--- condition: inumber != nextino++\n");
    printf("--- inumber=%d nextino(++)=%d lastinum=%d\n",
        inumber, nextino, lastinum);
    errx(EEXIT, "bad inode number %d to nextinode", inumber);
}
if (inumber > lastvalidinum) {
    printf("--- condition: inumber > lastvalidinum\n");
    printf("--- inumber=%d lastvalidinum=%d, lastinum=%d\n",
        inumber, lastvalidinum, lastinum);
    errx(EEXIT, "bad inode number %d to nextinode", inumber);
}

This was the result:

% ./fsck_ffs -yf ad1s1f.dd

[...]

--- condition: inumber > lastvalidinum
--- inumber=306176 lastvalidinum=306175, lastinum=306176
fsck_ffs: bad inode number 306176 to nextinode

So what's up with inode 306176?

When invoking fsdb on this inode, I could see the content of a
directory, and ils from The Sleuth Kit revealed that it seems
to be a directory within the inaccessible home directory.

slot 150 ino 306176 reclen 20: directory, `hellraiser'
slot 1566 ino 306176 reclen 12: directory, `.'

Strange, isn't it? Finally, I decided to comment out the whole
part. I found fsck_ffs complaining in fsutil.c line 139:

if (inum > maxino)
    errx(EEXIT, "inoinfo: inumber %d out of range", inum);

So I put in another "checkpoint" there:

printf("---> %d\n", inum);
if (inum > maxino) {
    printf("--- condition: inum > maxino\n");
    printf("--- inum=%d maxino=%d\n", inum, maxino);
    errx(EEXIT, "inoinfo: inumber %d out of range", inum);
}

The result was this:

% ./fsck_ffs -yf ad1s1f.dd

[...]

THE FOLLOWING DISK SECTORS COULD NOT BE READ:
177638368, 177638369, 177638370, 177638371, 177638372,
177638373, 177638374, 177638375, 177638376, 177638377,
177638378, 177638379, 177638380, 177638381, 177638382,
177638383, 177638384, 177638385, 177638386, 177638387,
177638388, 177638389, 177638390, 177638391, 177638392,
177638393, 177638394, 177638395, 177638396, 177638397,
177638398, 177638399, 177638400, 177638401, 177638402,
177638403, 177638404, 177638405, 177638406, 177638407,
177638408, 177638409, 177638410, 177638411, 177638412,
177638413, 177638414, 177638415, 177638416, 177638417,
177638418, 177638419, 177638420, 177638421, 177638422,
177638423, 177638424, 177638425, 177638426, 177638427,
177638428, 177638429, 177638430, 177638431, 177638432,
177638433, 177638434, 177638435, 177638436, 177638437,
177638438, 177638439, 177638440, 177638441, 177638442,
177638443, 177638444, 177638445, 177638446, 177638447,
177638448, 177638449, 177638450, 177638451, 177638452,
177638453, 177638454, 177638455, 177638456, 177638457,
177638458, 177638459, 177638460, 177638461, 177638462,
177638463, 177638464, 177638465, 177638466, 177638467,
177638468, 177638469, 177638470, 177638471, 177638472,
177638473, 177638474, 177638475, 177638476, 177638477,
177638478, 177638479, 177638480, 177638481, 177638482,
177638483, 177638484, 177638485, 177638486, 177638487,
177638488, 177638489, 177638490, 177638491, 177638492,
177638493, 177638494, 177638495,
--- condition: inum > maxino
--- inum=11116545 maxino=11116544
fsck_ffs: inoinfo: inumber 11116545 out of range

Seemed to be an important condition. :-) So what's this again?
The answer was in setup.c line 209:

maxino = sblock.fs_ncg * sblock.fs_ipg;

Is there some information retrieved incorrectly from the file
system's superblock causing all the trouble? Well, I did try
checks with fsck_ffs referring to alternate superblocks,
but no luck.

Or does it mean that there are 11116544 inodes on the partition?
This would imply that (not mentioning directories) 11 million
files can be created - or are stored on this disk in total?
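
To get a feeling for these numbers, here is a small sketch in Python.
It assumes that maxino really is fs_ncg * fs_ipg as in the setup.c
line quoted above, with fs_ncg being the "number of cylinder groups"
from the file output, and that inode numbers are handed out group by
group. The helper locate() is my own, not part of fsck_ffs.

```python
ncg = 472            # "number of cylinder groups" from file(1)
maxino = 11116544    # printed by my patched fsck_ffs

# Inodes per cylinder group (fs_ipg), assuming maxino = fs_ncg * fs_ipg.
ipg, rest = divmod(maxino, ncg)

def locate(ino):
    """Return (cylinder group, index within that group) for an inode number."""
    return divmod(ino, ipg)

# The home directory, lastvalidinum, the failing inode, and "2003-05-17".
for ino in (259072, 306175, 306176, 3438592):
    group, index = locate(ino)
    print(f"inode {ino}: group {group}, index {index}")
```

If this reading is right, 11116544 would be the number of inode slots
the file system was created with, not the number of files in use.
Inode 306176 would then be the first inode of cylinder group 13 and
lastvalidinum 306175 the last inode of group 12, the same group that
dls complained about.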

At this point, I decided to give up this way of "fixing"; most
of the conditions seem to be well intended, the defect on the
disk must be that bad that fsck_ffs can't handle it anymore.
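
Looking back at the list of sectors that could not be read, one more
check seemed worth doing. The sketch below (Python, my own variable
names) compares the sector range printed by the patched fsck_ffs with
the size of the image as reported by dd, assuming 512-byte sectors.

```python
image_bytes = 90772014080                    # size of ad1s1f.dd according to dd
sector_size = 512                            # assumed sector size
first_bad, last_bad = 177638368, 177638495   # range listed by fsck_ffs

image_sectors = image_bytes // sector_size   # sectors that exist in the image
beyond_end = first_bad - image_sectors       # distance past the last sector
listed = last_bad - first_bad + 1            # how many sectors were listed

print(f"image has {image_sectors} sectors")
print(f"first unreadable sector is {beyond_end} sectors past the end")
print(f"{listed} sectors listed")
```

Under these assumptions all listed sectors lie beyond the end of the
image. So this message may simply come from fsck_ffs running past the
last cylinder group once the range checks are commented out, and need
not point at bad sectors on the disk.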

And now for the file system. As it is already clear, the inode
of the home directory is gone. So an idea would be to create
a new inode, with the same name and number as it should be.
Good idea? No, obviously not. I tried it in two different
ways, with no luck.

So that seems to be insufficient. I do understand it: The
inode number created would only be a kind of "link entry"
inside the root directory which points to further information.
But how should the new home directory entry know about its
content?

From the friendly FreeBSD questions mailing list I even learned
that there's no way to predict the inode numbers. If I assume
a directory D with its inode number i(D), within D a file F
with its inode number i(F), I can't claim i(D) < i(F), so
I can't expect any special inode number.

I think there's more to establishing an intact directory
structure than a simple "make inode with name". The directory
needs to be populated correctly, but for that, I would need to
know which files are inside it. So it would be necessary
to pick all possible inode numbers and look what's behind
them. This means I would need to "walk back" the .. paths
to see which one finally leads to the home directory, and
then put the 1st instance directory name (or inode number
instead of the name, because the name is lost) into one of
the directory slots; do I call them correctly?

As far as I've already learned, when "walking back" the path
from a file deep within a directory structure, every inode
contains a field "where it comes from", let's say, where CWD
and .. are (as an inode number). Let's assume we're at the
inode 259301 referring to a file bla.txt. Then something like
this structure should exist:

bla.txt dingens/ foo/ poly/ /
259301 -----> 259285 -----> 259140 -----> 259072 -----> 2

This would be /home/poly/foo/dingens/bla.txt on ad0s1f (where /
is then mounted as /home).
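
The "walking back" idea can be sketched like this in Python. The table
parent_of is hypothetical and filled with the example numbers from the
diagram above; a real one would have to be read from the image. As far
as I understand, only directories carry a ".." entry (in their data
blocks), so for a plain file like bla.txt the first step would have to
come from the directory that lists it.

```python
ROOT_INO = 2  # inode number of the root directory, as in the diagram

# Hypothetical "inode -> parent inode" table with the example numbers.
parent_of = {259301: 259285, 259285: 259140, 259140: 259072, 259072: ROOT_INO}

def path_to_root(ino, parent_of, root=ROOT_INO):
    """Follow parent links upwards; stop at the root, at a gap, or at a loop."""
    chain = [ino]
    while ino != root and ino in parent_of and parent_of[ino] not in chain:
        ino = parent_of[ino]
        chain.append(ino)
    return chain

print(path_to_root(259301, parent_of))
```

A chain that ends at 259072 without reaching inode 2 would then mark
something that once lived below my home directory.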

When I can assume that every inode still knows "where it came
from", what would be a useful tool to build poly/ (12345) again?
I think I'll need to construct its content again, because just
by creating poly/ as 12345, how would the file system know
what the content of poly/ is? Is the term "directory slots" I
came across related to that topic? Which sources could give good
hints?

For any considerations, I'll assume that only the inode of
my home directory is gone. I can't tell for sure that it will
be this way, it's possible that other inodes have died, too.
I can't assert it won't be the case.

In general, I think what's needed is a way to reconnect the
"orphan" inodes to "normal" inodes again so they can be accessed.
Because the home directory's inode is gone, any information
about the files and directories on its 1st level is gone, too.
So these would not be restored with their original names, but
with the inode number as names, just like fsck_ffs would do it
with its lost+found/ mechanism. All data within the directories
from the 1st level would of course still have their names because
these inodes are present.

I'm thinking about something like this:

Formerly: / poly/ foo/ bar.c
baz/ boing/ boo.h
boom.h
bla.c
.xchat/ xchat.conf
.fetchmailrc

After restore: / poly/ #123456/ bar.c
#123789/ boing/ boo.h
boom.h
#124785
#127854/ xchat.conf
#128745

^^^^^^^
There are tools that can help to "restore" the 1st level,
for example FreeBSD's file command. There aren't many files
where a problem should occur: File names can usually be
recognized from the data they contain (source, note,
configuration file etc.), and directories can be recognized
by the names of the files they contain. Of course, that's
the thing that would happen if fsck_ffs worked as
initially intended. When I see it, I will remember what
the correct names were.
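
The reconnect step I have in mind could look roughly like the
following Python sketch. Everything in it is hypothetical: the table
dotdot stands for the ".." entries of the surviving directories, and
the inode numbers are the made-up ones from the "After restore"
diagram (130001 for boing/ is invented as well). Plain files such as
.fetchmailrc have no ".." entry, so this would only find the
directories of the 1st level.

```python
LOST_DIR = 259072  # inode number I believe belonged to poly/

# Hypothetical ".." entries of surviving directories: inode -> parent inode.
dotdot = {123456: LOST_DIR, 123789: LOST_DIR, 127854: LOST_DIR, 130001: 123789}

def first_level_orphans(dotdot, lost_dir):
    """Map directories whose '..' is the lost directory to '#<inode>' names."""
    return {
        ino: f"#{ino}"
        for ino, parent in sorted(dotdot.items())
        if parent == lost_dir
    }

for ino, name in first_level_orphans(dotdot, LOST_DIR).items():
    print(f"poly/{name}/  (inode {ino})")
```

The placeholder names follow the lost+found/ convention of using the
inode number; renaming them by hand would be the last step.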





So these were my first thoughts about this problem. I hope
you can help me with some ideas, concepts or suggestions,
or documents or source files worth studying. I don't expect
you to solve my problem, I'm not greedy. :-)



--
Polytropon
From Magdeburg, Germany

Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...