I use Python's Requests module to download HTML pages.
For each URL I execute the statement `response = requests.get(URL)`, so the result of every GET request ends up in the `response` variable.
I execute `len(response.text)` to find out the number of bytes of the downloaded HTML page. My idea is to save an HTML page to the hard drive only if there is no page with the same name there yet, or if there is a page with the same name but the sizes differ. If the file exists, I get its size with `Path(filepath).stat().st_size`. The problem arises here.
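In simplified form, the logic looks like this (`URL` and `filepath` stand in for my real values):

```python
from pathlib import Path

import requests

response = requests.get(URL)
downloaded_size = len(response.text)

path = Path(filepath)
# Save only if there is no such file yet, or if the sizes differ.
if not path.exists() or path.stat().st_size != downloaded_size:
    path.write_text(response.text)
```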
For some reason, for every downloaded page the size of the file on disk is always 6 bytes greater than the result of calling `len()` on the `text` attribute of the response object. If `len()` returns 7282, `st_size` is 7288; if `len()` returns 7216, `st_size` is 7222, and so on. I don't understand the reason for this behavior. I could add 6 bytes to the result of `len()` when comparing the sizes, and I guess that would work, but then I still wouldn't know the actual reason; it feels like a hack.
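Concretely, with the numbers above, the check looks like this:

```python
from pathlib import Path

# response and filepath refer to the same downloaded page as above.
downloaded = len(response.text)           # e.g. 7282
on_disk = Path(filepath).stat().st_size   # e.g. 7288
print(on_disk - downloaded)               # always 6
```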
I also tried downloading the page with the `curl` command, and the result is the same: the magical 6 bytes are added. I've checked 10 different pages, and the 6-byte difference stays the same.
Try saving the file in binary mode (`'wb'`) using `response.content` instead of `response.text`. `len(response.text)` counts Unicode characters, not bytes, so any multi-byte characters in the decoded page make the byte count on disk larger than the character count; writing `response.content` in binary mode avoids both the re-encoding and any newline conversion, and should not add any extra bytes. Compare the file size now to see if the discrepancy persists.
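A minimal sketch, reusing `URL` and `filepath` from the question:

```python
from pathlib import Path

import requests

response = requests.get(URL)

# 'wb' writes the raw bytes exactly as received: no newline translation
# and no re-encoding of the decoded text.
with open(filepath, 'wb') as f:
    f.write(response.content)

# len(response.content) is a byte count, so it should now match st_size.
print(len(response.content), Path(filepath).stat().st_size)
```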