How to troubleshoot a corrupt text file in Linux?


Something strange happened to one of my UTF-8 encoded XML files. My Ubuntu 14.04 desktop now thinks it is a binary file, and every editor displays it as full of "strange" characters. Here is my case:

k6ps@laptop520:~/Allalaadimised/File_problem$ ll
kokku 308
drwxrwxr-x 2 k6ps k6ps   4096 dets  15 11:02 ./
drwxr-xr-x 5 k6ps k6ps  20480 dets  15 10:58 ../
-rw-r--r-- 1 k6ps k6ps 134587 dets  15 10:58 bad_file.xml
-rw-r--r-- 1 k6ps k6ps 131930 dets  15 10:58 good_file.xml
k6ps@laptop520:~/Allalaadimised/File_problem$ file -bi good_file.xml 
application/xml; charset=utf-8
k6ps@laptop520:~/Allalaadimised/File_problem$ file -bi bad_file.xml 
application/octet-stream; charset=binary
k6ps@laptop520:~/Allalaadimised/File_problem$ head -n 3 good_file.xml 
<?xml version="1.0" encoding="UTF-8"?>
<logbook>
<threadset name="First">
k6ps@laptop520:~/Allalaadimised/File_problem$ head -n 3 bad_file.xml 
|I��+ˮ���|+��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�"     )��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��]֊ՙ�z�")��l��

... and a lot more characters like these. When I open the file in vi or SciTE, I get lots of characters like these:

|I^[ ß+Ë®ýþö|+ÆÜ^Pl<8c>ò]Ö<8a>Õ<99><98>z»"^\)<9d>Ãl<8c>ò]Ö<8a>Õ<99><98>z»"^
\)<9d>Ãl<8c>ò]Ö<8a>Õ<99><98>z»"^\)<9d>Ãl<8c>ò]Ö<8a>Õ<99><98>z»"^\)<9d>Ãl<8c>ò]Ö<8a>Õ<99>
<98>z»"^\)<9d>Ãl<8c>ò]Ö<8a>Õ<99><98>z»"^\)<9d>Ãl<8c>ò]Ö<8a>Õ<99><98>z»"^<98>z»"^\)<9d 

... and at the bottom it says:

"bad_file.xml" [Incomplete last line][converted] 138 lines, 214920 characters

Hexdump output:

k6ps@laptop520:~/Allalaadimised/File_problem$ hexdump -C bad_file.xml | head -n 15
00000000  7c 49 1b a0 df 2b cb ae  fd fe f6 7c 2b c6 dc 10  ||I...+.....|+...|
00000010  6c 8c f2 5d d6 8a d5 99  98 7a bb 22 1c 29 9d c3  |l..].....z.".)..|
*
00001000  5a ea 54 45 9b f8 9e ce  16 35 89 bd 8f 08 cb 82  |Z.TE.....5......|
00001010  6c 8c f2 5d d6 8a d5 99  98 7a bb 22 1c 29 9d c3  |l..].....z.".)..|
*
00002000  29 b8 f0 21 4a ea 00 19  28 46 53 c5 d1 73 f5 a9  |)..!J...(FS..s..|
00002010  6c 8c f2 5d d6 8a d5 99  98 7a bb 22 1c 29 9d c3  |l..].....z.".)..|
*
00003000  5c 56 80 41 f9 ef 98 3c  e3 7e 7c ee 3a 20 94 82  |\V.A...<.~|.: ..|
00003010  6c 8c f2 5d d6 8a d5 99  98 7a bb 22 1c 29 9d c3  |l..].....z.".)..|
*
00004000  ad cc 1c 5f 40 22 8b f6  9b bb aa ea 45 de 21 ee  |..._@"......E.!.|
00004010  6c 8c f2 5d d6 8a d5 99  98 7a bb 22 1c 29 9d c3  |l..].....z.".)..|
*

I've tried opening the file with various editors, changing the encoding, and converting it with iconv, but no luck so far. Unfortunately I'm very inexperienced with system-level issues, so could anybody please suggest what I could try to recover the text from that file?

k6ps


There are 2 best solutions below

5
On

The way to strip out anything that's not a printable character is

tr -d -c '\11\12\15\40-\176' < bad_file.xml > cleanup.xml

This will create a file called cleanup.xml with everything that isn't a printable ASCII character removed from the file. You should then be able to examine it a little more easily to see if it contains any useful text.

The -d option deletes characters, and -c complements the set, so everything not in the specified set is removed. The set contains \11, \12 and \15 (octal), which are tab, LF and CR, and then everything from \40 to \176, which covers all the printable ASCII characters (space through ~).
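As a minimal sketch of what that filter does, the snippet below wraps the same tr invocation in a helper and feeds it a short made-up byte sequence. The function name strip_nonprintable and the sample bytes are illustrative assumptions, not taken from the file in the question; LC_ALL=C is added on the assumption that you want tr to treat the input strictly as bytes.

```shell
# Hypothetical helper: keep only tab, LF, CR and printable ASCII from stdin.
strip_nonprintable() {
    LC_ALL=C tr -d -c '\11\12\15\40-\176'
}

# Made-up sample: "ab", byte 0x01, byte 0x80, "cd", newline.
# The two non-printable bytes are dropped, so this prints: abcd
printf 'ab\001\200cd\n' | strip_nonprintable
```

On the real file the equivalent call would be strip_nonprintable < bad_file.xml > cleanup.xml.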

But in your case, I'm afraid the answer is to give up. In your hexdump output, the * indicates a repeated line; so you've got a line that looks like garbage, followed by this line

6c 8c f2 5d d6 8a d5 99  98 7a bb 22 1c 29 9d c3  |l..].....z.".)..|

repeated many times, and then the same pattern again. Your file has gone. Sorry about that.
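To check how much of the file consists of that one repeated block, a rough sketch is to print every 16-byte row without the * squeezing and count identical rows. This uses od rather than hexdump (a substitution on my part, since od -v prints every row and -An drops the offsets); the helper name count_blocks is hypothetical.

```shell
# Hypothetical helper: list the five most frequent 16-byte rows in a file.
# od -An  : no offset column
# od -v   : print every row instead of collapsing repeats into "*"
# od -tx1 : one hex byte per column, 16 bytes per row by default
count_blocks() {
    od -An -v -tx1 "$1" | sort | uniq -c | sort -rn | head -n 5
}

# Usage (on the file from the question): count_blocks bad_file.xml
```

If a single row accounts for nearly all of the output, as the hexdump in the question suggests, there is very little of the original text left in the file to recover.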

Actually that might not quite be true: there might be old copies of the file lying around somewhere unmutilated, or if you're on a journalling filesystem, you might be able to recover the file, but those are separate issues, and probably best asked at Super User.

(I won't mention backups, because it's likely to annoy, and not achieve very much...)

0
On

Following the advice of @urzeit to try strings bad_file.xml, I ran cat myfile.txt on the command line, and it simply worked: my text file's contents were reconstituted in the terminal. In my case, it seems that only the "°" and "¤" characters were turned into question marks. If further experience confirms that, I regard this operation as essentially a 99.9% success.