Dumping HTML source using w3m gives unexpected characters/symbols

As a new user of w3m I am trying to do something basic like:

w3m -dump_source nytimes.com > nytimes.html

The output produced gives crazy characters and symbols. However, when I browse using w3m nytimes, it loads properly, and I can even view the HTML using v.

Further, when I try:

w3m -dump_extra nytimes.com > nytimes.html

I get all the extra info associated with the site perfectly, except for the HTML source.

Any help would be appreciated.

There is 1 best solution below

By default, w3m requests compressed output from the server by sending the following HTTP header:

Accept-Encoding: gzip, compress, bzip, bzip2, deflate

The exact value may vary with the version of w3m, but recent versions do request a compressed response through the Accept-Encoding header, which is why the dumped source looks like binary garbage. You can see the exact headers with the following command:

w3m -dump_source -reqlog nytimes.com > /dev/null

The request and response headers will be logged to the ~/.w3m/request.log file.
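
To check quickly whether compression is the cause, it can help to pull only the encoding-related headers out of that log. The following is a minimal sketch: show_encoding_headers is a hypothetical helper name (not part of w3m), and it assumes the log stores each header on its own line in the usual "Name: value" form.

```shell
# show_encoding_headers: hypothetical helper, not part of w3m.
# Prints the Accept-Encoding / Content-Encoding lines found in a log file.
# Assumption: each header is stored on its own line as "Name: value".
show_encoding_headers() {
    # Default to the log location named above when no path is given.
    grep -i -E '^(accept|content)-encoding:' "${1:-$HOME/.w3m/request.log}"
}
```

Running show_encoding_headers after the -reqlog command should show what w3m asked for and, if the server compressed the body, a Content-Encoding line naming the format.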

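You can request an uncompressed response by overriding the header. Neither value below lists an acceptable compression coding, so a server that follows the HTTP specification should fall back to sending the body without any content coding: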

w3m -dump_source nytimes.com -o accept_encoding='identity;q=0'

Or even

w3m -dump_source nytimes.com -o accept_encoding='*;q=0'

Alternatively, decompress the output through a pipe:

w3m -dump_source nytimes.com | gunzip -f

The -f option makes gunzip copy the input unchanged to standard output if the input is not in a format it recognizes, so the pipe still works when the server sends an uncompressed response. According to the documentation, this pass-through applies when the --stdout option is also given; the piped command should print the result to standard output even without it, but passing both (gunzip -cf) matches the documented behavior.

Note that the server may respond with content compressed with bzip2. In that case, pipe the output through bunzip2 -f instead.
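
If you do not know in advance which format the server will pick, one option is to look at the first bytes of the body and choose the decompressor from them. The sketch below is only an illustration: decode_body is a hypothetical helper name, it relies on head -c being available, and it handles only gzip and bzip2 (a raw deflate or compress body would be passed through unchanged).

```shell
# decode_body: hypothetical helper, not part of w3m.
# Reads a response body on stdin, inspects its leading magic bytes,
# and writes the decompressed (or unchanged) body to stdout.
decode_body() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # First three bytes as lowercase hex, e.g. "1f8b08" for gzip.
    magic=$(head -c 3 "$tmp" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        1f8b*)  gzip -dc "$tmp" ;;    # gzip magic bytes: 1f 8b
        425a68) bzip2 -dc "$tmp" ;;   # bzip2 magic bytes: "BZh"
        *)      cat "$tmp" ;;         # assume the body is already plain
    esac
    status=$?
    rm -f "$tmp"
    return $status
}
```

It would be used in the same position as gunzip above:

w3m -dump_source nytimes.com | decode_body > nytimes.html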