How can I export cached files saved in a browser using CacheStorage?


I have a website which uses the CacheStorage API to save various files via a Service Worker. For reasons beyond my control, lots of these files have been lost from the server they get loaded from. However, I have just realised that several hundred of the files are cached locally in a browser which had accessed the site frequently over a period of years (luckily, the site hadn't been clearing up its cache properly). I can preview the files using Chrome's dev tools, but when I click "download" it attempts to fetch a fresh copy from the server (which no longer exists), rather than giving me the locally cached version.

What's the simplest way to do a one-off export of these files (bearing in mind there are a few hundred of them)? I have full access to the computer the browser is running on, and to the domain that the site / service worker is running on. It doesn't need to be a pretty solution; once the files are restored I plan to learn plenty of lessons to prevent something similar happening in future.

2 Answers

lucas (Best Answer):

Responses added to the CacheStorage API are stored on disk. For example, Chrome on macOS stores them in ~/Library/Application Support/Google/Chrome/Default/Service Worker/CacheStorage. Inside this directory there is a directory for each origin, and within those, separate directories for each named cache used by that origin. The names of these directories (at both levels) don't appear to be human-readable, so you may need to search the contents to find the specific cache you're looking for.

Within the directory for each cache, every response is saved in a different file. These are binary files and contain various bits of info, including the URL requested (near the top) and the HTTP response headers (towards the end). Between these, you'll find the body of the HTTP response.

The exact logic for extracting the bodies and saving them to files usable elsewhere will vary based on URL scheme, file format, etc. This bash script worked for me:

#!/bin/bash

mkdir -p export
for file in *_0
do
    # Pull the original filename out of the URL stored near the top of the entry
    output=$(LC_ALL=C sed -nE 's%^.*/music/images/artists/542x305/([^\.]*\.jpg).*%\1%p;/jpg/q' "$file")
    if [ -z "$output" ]
    then
        echo "file $file missing music URL"
        continue
    fi

    # Skip entries whose stored headers indicate the backend returned a 404
    if [[ $(LC_ALL=C sed -n '/x-backend-status.*404/,/.*/p' "$file") ]]
    then
        echo "$file returned a 404"
        continue
    fi

    path="export/$output"

    # Strip everything up to and including the URL prefix, and everything from
    # the trailing request/response metadata onwards, leaving just the body
    LC_ALL=C sed -n '/music\/images\/artists/,$p' "$file" \
        | LC_ALL=C sed 's%^.*/music/images/artists/542x305/[^\.]*\.jpg%%g' \
        | LC_ALL=C sed -n '/GET.*$/q;p' > "$path"
    echo "$file -> $path"
done
M Somerville:

The CacheStorage API can be accessed from normal web page JavaScript, as well as from a service worker, so if you create a web page on the server that accesses window.caches, you should be able to fetch things out of the cache and do whatever you want with them. Once you have cache.keys() you can loop over it and use cache.match(), which returns the cached response for each request. You could then print the contents out for copy and paste (presumably not ideal), POST each one to a server that saves them, or similar.
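The POST approach could be sketched roughly as below. This is an illustration only, not code from the answer: the cache name "pages" and the /save-cache-entry endpoint are assumptions, and the server side (something that writes each POSTed body to disk) is left to you.

```javascript
// Derive a safe filename from a request URL's path, e.g.
// "https://example.com/music/images/foo.jpg" -> "music_images_foo.jpg"
function filenameFromUrl(url) {
  return new URL(url).pathname.replace(/^\//, '').replace(/\//g, '_') || 'index';
}

// Loop over every cached request, look up its stored response, and POST the
// body to a (hypothetical) endpoint that saves it server-side.
async function exportCacheToServer(cacheName) {
  const cache = await caches.open(cacheName);
  for (const request of await cache.keys()) {
    const response = await cache.match(request);
    if (!response) continue;
    const body = await response.blob();
    await fetch('/save-cache-entry?name=' +
        encodeURIComponent(filenameFromUrl(request.url)), {
      method: 'POST',
      body,
    });
  }
}

// Only run where CacheStorage actually exists (i.e. in a browser, on a page
// served from the same origin as the cache you want to export).
if (typeof caches !== 'undefined') {
  exportCacheToServer('pages');
}
```

Because the page and the cache share an origin, no special permissions are needed; the main caveat is that cache.match() only returns entries for that origin's caches.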

Here is some normal JS I have on traintimes.org.uk; it only displays a list of offline pages, but it could presumably fetch the actual cache entries if it needed to.

<script>
// Open the page cache
caches.open("pages")
    // Fetch its keys (cached requests)
    .then(cache => cache.keys())
    // We only want the URLs of each request
    .then(reqs => reqs.map(r => r.url))
    // We want most recent one first (reverse is in-place)
    .then(urls => (urls.reverse(), urls))
    // We don't care about the domain name
    .then(urls => urls.map(u => u.replace(/^.*?uk/, '')))
    // We want them to be clickable links
    .then(urls => urls.map(u => [
        '<a href="', u, '">',
        u.replace(/\?cookie=[^;&]*/, ''),
        '</a>'].join("")))
    // We want them to be visible on the page
    .then(urls =>
        document.getElementById('offline-list').innerHTML =
            '<li>' + urls.join('</li><li>') + '</li>'
    );
</script>
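If you'd rather save the files straight from the browser instead of POSTing them anywhere, the same loop can trigger a download per entry via an object URL. A hedged sketch, again assuming a cache named "pages" (browsers may prompt or throttle when a page fires many downloads, so expect to babysit it for a few hundred files):

```javascript
// Take the last path segment of a URL as the download filename, e.g.
// "https://example.com/a/b/c.jpg" -> "c.jpg"
function lastSegment(url) {
  const parts = new URL(url).pathname.split('/').filter(Boolean);
  return parts.length ? parts[parts.length - 1] : 'index';
}

// For each cached response, wrap the body in a Blob object URL and click a
// temporary <a download> link to save it locally.
async function downloadCacheEntries(cacheName) {
  const cache = await caches.open(cacheName);
  for (const request of await cache.keys()) {
    const response = await cache.match(request);
    if (!response) continue;
    const blob = await response.blob();
    const a = document.createElement('a');
    a.href = URL.createObjectURL(blob);
    a.download = lastSegment(request.url);
    a.click();
    URL.revokeObjectURL(a.href);
  }
}

// Guard so this only runs in a browser with CacheStorage available.
if (typeof caches !== 'undefined') {
  downloadCacheEntries('pages');
}
```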