I use Python's Requests module to download HTML pages.
For each URL I execute the statement `response = requests.get(URL)`, so the result of every GET request ends up in the `response` variable.
I execute `len(response.text)` to find out the number of bytes of the downloaded HTML page. My idea is to save an HTML page to the hard drive only if there is no page with the same name there yet, or if there is a page with the same name but the sizes differ. If the file exists, I get its size with `Path(filepath).stat().st_size`. The problem arises here.
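In simplified form, the logic looks like this (`URL` and `filepath` stand in for my real values):

```python
from pathlib import Path

import requests

response = requests.get(URL)
downloaded_size = len(response.text)

path = Path(filepath)
# Save only if there is no such file yet, or if the sizes differ.
if not path.exists() or path.stat().st_size != downloaded_size:
    path.write_text(response.text)
```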
For some reason, for every downloaded page the size of the file on disk is always 6 bytes greater than the result of calling `len()` on the `text` attribute of the response object. If `len()` returns 7282, `st_size` is 7288; if `len()` returns 7216, `st_size` is 7222, and so on. I don't understand the reason for this behavior. I could add 6 bytes to the result of `len()` when comparing the sizes, and I guess that would work, but then I still wouldn't know the actual reason; it feels like a hack.
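Concretely, with the numbers above, the check looks like this:

```python
from pathlib import Path

# response and filepath refer to the same downloaded page as above.
downloaded = len(response.text)           # e.g. 7282
on_disk = Path(filepath).stat().st_size   # e.g. 7288
print(on_disk - downloaded)               # always 6
```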
I also tried downloading the page with the `curl` command, and the result is the same: the magical 6 bytes are added. I've checked 10 different pages, and the 6-byte difference stays the same.
Try saving the file in binary mode (`'wb'`) using `response.content` instead of `response.text`. `len(response.text)` counts Unicode characters, not bytes, so any multi-byte characters in the decoded page make the byte count on disk larger than the character count; writing `response.content` in binary mode avoids both the re-encoding and any newline conversion, and should not add any extra bytes. Compare the file size now to see if the discrepancy persists.
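A minimal sketch, reusing `URL` and `filepath` from the question:

```python
from pathlib import Path

import requests

response = requests.get(URL)

# 'wb' writes the raw bytes exactly as received: no newline translation
# and no re-encoding of the decoded text.
with open(filepath, 'wb') as f:
    f.write(response.content)

# len(response.content) is a byte count, so it should now match st_size.
print(len(response.content), Path(filepath).stat().st_size)
```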