Disk recovery in Linux p1

2014-12-28

So I'm still wrestling with my broken disk. I decided to start writing down my steps for posterity, random googlers, and the few masochists who enjoy reading this stuff.

The situation is that I'm stuck with an SSD that's only half a year old. It started causing random issues at an arbitrary point (I was watching a movie on the computer, nothing strange). After a reboot the machine was stuck in grub rescue mode.

I've managed to recover it once, somehow. At that point I really thought I had just fixed it. But it relapsed pretty quickly and now seems to simply ignore any recovery attempts I throw at it.

This is a Kubuntu (Ubuntu+KDE) installation, single boot (this machine has never seen Windows). The SSD is only an OS drive, so the main data is on other disks. To make matters worse, the home dir is encrypted with the standard Ubuntu home encryption (eCryptfs). This means you can't just mount the drive to access information in /home; you'll need a few extra mount steps and access to other parts of the system in order to get at it.

However, my main problem right now seems to be that the superblock is... well, gone. At first this didn't raise any serious flags for me since I wasn't aware of the term. I knew about partition tables and the MBR but can't remember ever having to deal with a superblock before. I just figured it was a bigger block size or something. Well, no.

So the superblock is the part of an ext filesystem that describes its layout: the block size, where the inode tables live, and so on. Those tables are like an index to the offsets of the actual files on your drive. Files may start at arbitrary points on the disk and may or may not sit back to back. Without this index you don't know where each file starts, making the search essentially, and nearly literally, like looking for a needle in a haystack.
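
For what it's worth: on a healthy ext filesystem you can inspect the superblock with dumpe2fs (hypothetical device name):

Code:
# -h prints just the superblock; drop the -h to also get the group
# descriptors, which list the backup superblock locations
sudo dumpe2fs -h /dev/sdX1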

When these problems first started I ran boot-repair-disk, which every forum post points to. The first thing it asks for is to back up stuff, and so I did. The second thing it does, if you pick the automatic route, is install an MBR on every disk it can find. Great, I can boot off my data drives now :/

Anyways, ever since the boot-repair-disk attempt I've not been able to boot from the drive. It always puts me back in a grub rescue prompt. When I ls there, it lists three drives: two of them ext4 and one of unknown type. For some reason it puts the OS drive last in the list, but that seems irrelevant.

I've run the Linux program testdisk, which seems to be able to recover the partition table. Or at least I can access and recover files through it. However, this limits me to the system files, since the home dir is encrypted and testdisk does not seem to care about that (it merely lists the dirs as ? and otherwise ignores them). At least they're not deemed lost, I guess.

I've tried recovering the drive through testdisk, and while mostly successful (it reported about 7% file copy failures), it only gives me the Linux core files that I don't care about. photorec, a similar and bundled tool, will indiscriminately recover any file from the disk. But it loses the actual file names and just dumps everything under random names (f12234515.txt or whatever). It does seem to cover the encrypted files, but when I aborted the process after an hour it still had at least 10 hours to go, probably more. Plus I'm not convinced I could even use those files anyway, so it seemed a useless exercise.

The main problem I'm facing right now is that testdisk can "write" back the partition table it recovered, but that needs a reboot to take effect, and after the reboot nothing seems to change. I get "error: no such device: 58ABF29C..." and the grub rescue prompt every time.

It's proving difficult to figure out what the actual problem is here. But at this point I'm pretty sure the disk is just broken. In fact I've already got a replacement (an Intel SSD, which seems more robust). But I still want to try and recover some minor files that are stored in my home dir.

I've made an image dump through testdisk, and while searching for how to mount that, I discovered I could have just used a standard program called dd to copy data straight off the disk and clone it. Anyways, I now have an image of the drive to work on.
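
For reference, the straight dd clone looks something like this (device and target path are placeholders; conv=noerror,sync makes dd carry on past read errors and pad unreadable blocks with zeros, which matters on a dying disk):

Code:
# clone the entire disk into an image file, skipping over unreadable blocks
sudo dd if=/dev/sdX of=/media/data/image.dd bs=4M conv=noerror,sync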

Right now I am trying to mount the image. I have a single file of about 480GB stored on one of the data disks. Using mount you can actually mount anything; whether it's a drive or a file is irrelevant (in fact... oh nevermind). Reading this blog post gave me the idea to just try and mount the image at arbitrary sectors. A sector is the smallest read/write boundary (I believe it's actually impossible to store a 1 byte file such that it only takes up 1 byte physically, unless you concat files first). The superblock should start on a sector boundary, and one sector is 512 bytes on my drive.

I've tried some of the default superblock offsets (even the backups, but I'm not so sure there still are any) but they wouldn't stick. So now I'm resorting to a brute force approach, where I use a simple bash script to just try and mount the disk image at any sector boundary (example from here):

Code:
#!/bin/bash
# Brute force (run as root): try mounting the image at every sector boundary.
bsz=512 # or 1024, 2048, 4096; higher = faster but coarser

for i in {2..10000000}; do
  echo "---( $i )---"
  # mount read-only so a near-miss can't damage the image any further
  if mount -o ro,offset=$((bsz * i)) -t ext4 image.dd /media/foo 2> /dev/null; then
    echo "Found! (byte offset $((bsz * i)))"
    break
  fi
done
echo Exit

But it's run a few hundred thousand iterations now without success. I'm pretty sure the superblock starts somewhere early on the disk, as it's the first partition. So it's back to the drawing board for me...

Maybe it's not ext4 but ext3 or even ext2? Why can testdisk find the partition table while I can't use it, or even find it myself? Surely I'm overlooking something and testdisk is telling me exactly where to find it (proper offsets/blocks/etc). Learning a lot about mounting drives, at least, but still very annoyed.
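
For what it's worth, mke2fs -n is supposed to print where the superblock copies would live without writing anything, assuming the filesystem was created with default settings. The -n (dry run) is the crucial part (hypothetical device name again):

Code:
# -n: don't actually create a filesystem, just report what would be done,
# including the "Superblock backups stored on blocks: ..." line
sudo mke2fs -n /dev/sdX1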

---

The script above led to nothing. It's at i=550000 and going strong. In the meantime I've had more luck with testdisk. The response in this thread suggested setting the partition type to None, rather than ext4 or whatever. I've done that, started a full search, and found the partition again. Though I think it's the same one the quick search found in the earlier attempts.

At least I'm fairly confident now that my data is not lost. And since I'm working on the image rather than the physical disk, I'm also fairly certain I can proceed with ditching the original disk and replacing it with the new one, while maintaining access to the data through the image for a recovery later. But not yet.

The testdisk superblock output hints at a fsck.ext4 command to execute in order to fix stuff. I tried this before, but it didn't work. This time, lo and behold, it does work. On the image, no less.

Code:
sudo fsck.ext4 -n -b 98304 -B 4096 sdb/image.dd

This is what I ran, with the image in a file image.dd in a relative dir sdb. The -n answers "no" to all questions, since I don't want to change anything just yet. The numbers come directly from the advanced & superblock screen in testdisk:

Code:

Partition Start End Size in sectors
superblock 98304, blocksize=4096 []
superblock 163840, blocksize=4096 []
superblock 229376, blocksize=4096 []
superblock 294912, blocksize=4096 []
etc.

At first I was worried this was just bogus output because of the empty brackets, especially since the fsck attempts kept failing; I figured the brackets would contain some information on success and that empty just meant fail. I still don't know what the brackets signify, though. There is no hint or legend for them. But like I said, I was able to check the filesystem with these parameters.

Actually, I have the same problem with the partition overview screen of testdisk. It says something like:

Code:
Partition    Start      End            Size in sectors
P ext4       0   0  1   54200 171 51   870733824


Now Partition="ext4", Start="0 0 1", End="54200 171 51" and Size in sectors="870733824". Since it's a 480GB drive we can confirm that the size in sectors times the size of a sector (512 bytes) amounts to the main partition size: 870733824 * 512 / 1024 / 1024 / 1024 = 415GB (and change). The remaining space was swap and whatever, so that seems to add up. It is kind of odd that it doesn't find the other partitions, but I'm not too bothered by that, since I can list the files of the found partition and they are the ones of a Linux root.
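
Quick shell sanity check of that arithmetic:

Code:
# sectors times 512 bytes per sector, expressed in whole GB
echo $((870733824 * 512 / 1024 / 1024 / 1024))
# prints 415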

The start/end numbers are a little more sketchy. They look like "cylinders, heads, sectors", and at first I thought the numbers didn't add up, because I expected the last End number to equal the last column (give or take one depending on whether the partition starts at 0 or 1). But a CHS triple is three coordinates, not a running total, so it was never going to match directly. With the usual fake geometry of 255 heads and 63 sectors per track, the triples convert exactly to the size in the last column (see the check below). Searching for this didn't help much, by the way, as the terms are too generic and the tutorials I did find don't explain the numbers either.
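
Here's the check, assuming that standard fake geometry of 255 heads and 63 sectors per track:

Code:
# LBA = (C * heads + H) * sectors_per_track + (S - 1)
# End = (54200, 171, 51):
echo $(( (54200 * 255 + 171) * 63 + (51 - 1) ))
# prints 870733823; Start (0,0,1) is LBA 0, so the partition
# spans exactly 870733824 sectors, matching the last column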

Thing is: I can run fsck.ext4 on my image, but not on my disk. Maybe because it is an image, already extracted? Sigh. I don't really want to screw up the image by writing fixes to it, because the extraction takes ages, but it seems I have no other choice. On the other hand, it's probably best to mess with an image rather than the actual disk.

On a side note: I wonder whether the name "fsck" was picked deliberately, as I often see it used as a masked way of typing "fuck". Apparently this isn't news...

It's quite annoying to get a bunch of errors without being able to tell what they actually mean. Running fsck with the -p switch doesn't help, because it screams "UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY". Yah. Great. Errors like "Inode xxxxx has imagic flag set. Clear?" or "Inode xxxxx has a extra size (65535) which is invalid. Fix?". The fsck should I know? Googling for these is impossible, as you'll only run into generic file system trouble threads. The man page doesn't mention them. The wiki doesn't either. Siiiighh... There is at least one consistency: most search results on that scream end up guessing the hardware is broken, which ends up being the voted/picked answer. Of course, that doesn't help me either.

So I ran fsck with the -y flag on the image. And again, just in case... and again, because it was still fixing things... and then not again, because, well, there's probably something very wrong. Afterwards I tried mounting it anyways, despite fsck still reporting certain errors (though far fewer than initially). And lo and behold, it mounted.
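
For the record, that boils down to something like this (presumably with the same backup superblock parameters as the -n run; the mount point is a placeholder):

Code:
# same command as before, but -y answers "yes" to every fix
sudo fsck.ext4 -y -b 98304 -B 4096 sdb/image.dd
# then mount the repaired image read-only through a loop device
sudo mount -o loop,ro sdb/image.dd /media/foo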

The lost+found dir is splattered with some 14k files whose names are just numbers. This is where fsck puts recovered files/dirs during its repair process. I'm just hoping this didn't affect anything in the home dir. Doing an ls -al in the root shows that /home is now a link to "libsane-abaton.so.1.0.24". Wruh-oh. That can't be good news for me.

I'm trying a decryption next, anyways. This guide helps. In the end I realize that this is a dead end: the home dir is fubared and there's nothing I can do with it. To decrypt it I need access to it in the first place.
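
For reference, on a non-fubared image the decryption route would presumably be Ubuntu's ecryptfs-recover-private, pointed at the encrypted .Private dir (username and mount point are placeholders):

Code:
# prompts for the login passphrase and mounts a decrypted view of the dir
sudo ecryptfs-recover-private /media/foo/home/.ecryptfs/username/.Private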

Ooookay. Time to ditch the image and use dd to create a clone of the disk. Then remove the disk, replace it with the new one, and start installing stuff onto that. I'll turn my attention to the clone later; I've had enough of it. I grepped the lost+found on the image for certain file contents I knew had to be in /home, and it turned up nothing. So the image is useless to me as long as I have no access to the home dir; at this point I'm not even sure anymore whether I can still access it at all. However, I need to move on. This machine is my work machine and it needs to function. The only reason I was able to take this much time to try and recover the data in the first place was the holidays.
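
The grep was something along these lines, with a string I knew had to exist somewhere in my home dir:

Code:
# -r recurse, -a treat binary files as text, -l just list matching file names
grep -ral "some known string" /media/foo/lost+found/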

So. I'm cloning the drive with dd and storing the image for posterity (and later analysis). Then I'll replace the disk and do a clean install (having no choice). I'll take a look at the image later and try some magic fu on it, but I'm not holding my breath anymore. I'm relying on dd to make a 1:1 copy of the disk, and I'm hoping the bad sectors don't screw me over (as is suggested in this thread).

That thread had two interesting gems, too. One is that you can easily pipe through gzip to create a compressed image. The other is that you can use kill to send a signal to dd to get a progress report, printed in the same terminal where you started it. Which is better than the nothing you get otherwise:

Code:
sudo dd if=/dev/sda | gzip -c > ~/sda.image.gz &
# find dd's pid, then signal it; dd responds by printing I/O stats to stderr
pgrep -x dd
sudo kill -USR1 <pid>

Apparently it's different on OS X, so take care there :p Also take very good care with if vs of, because swapping them may wipe your disk. And the source disk should not be mounted in write mode, by the way: unmounted, or mounted read-only (mount -o ro,remount /dev/sda iirc).
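
For completeness, restoring such an image goes the other way around (the same if/of caution applies, doubly so since this one writes to the disk):

Code:
# decompress the image and stream it back onto the (replacement) disk
gunzip -c ~/sda.image.gz | sudo dd of=/dev/sda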

For now I have to give up on this recovery attempt. While I'm quite sure I can get it working eventually, I simply can't afford to have this machine down any longer. The drive is not wiped, though, and I'll have a clone of it as an image anyways, so I can get back to recovering it later. On my own time. Ok, everything is my own time, but you know what I mean.