I'm scraping pages from a website, munging them, then compiling them into an ebook. I'm using Git for both the code and the HTML content.
I have to make manual edits to some pages, and they're often updated upstream. This leaves me with the problem of how to retain my local edits when the site updates.
For example, I download v1 of page A, I delete an invalid "", and commit my changes; later I download v2 of page A, which has new content, but still features "". I want to merge the new content into my copy of page A, but also apply my local changes.
I suspect I'll need to manually resolve conflicts sometimes, but on the whole this should be automatic.
I've experimented with merge strategies, rebasing, and other approaches to no avail. What am I missing?
EDIT:
To help clarify my problem:
git init
wget -O page.html https://example.com/
git add page.html
git commit -a -m "w0"
git checkout -b ebook
sed -i -e 's/http:/https:/' page.html
git commit -a -m "e1"
git checkout master
git merge ebook
wget -O - https://example.com/ | sed -e 's/may/may not/' > page.html
git commit -a -m w1
git checkout ebook
git merge master
At the end, the last local edit is preserved but the first is lost. I know I'm doing something stupid, but...
I would maintain a branch that tracks only the original web pages; let's call it web. Every time you download an update, commit it to the web branch. Then you need an ebook branch for your changes. After updating the web branch, merge it into your ebook branch, resolving any conflicts that arise. The ebook branch is initially created as a branch off the initial web commit.
Scenario: Let's assume you started with W0 as the initial state on the web server, then you made local changes in commits E1 and E2. Then the web server was updated to W1, which you merge into ebook to get E3.
That would give you a history that looks like this:
    W0 --------- W1          (web)
      \            \
       E1 -- E2 -- E3        (ebook)
When you download the next update to web, W2, you'll get this commit graph, assuming you also had E4 as additional reformatting changes required because of W1:
    W0 --------- W1 --------- W2     (web)
      \            \
       E1 -- E2 -- E3 -- E4          (ebook)
When you merge W2 into E4 to get E5, Git should apply only the changes between W1 and W2 to E4, which should do what you want.
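As a rough sketch of the workflow described above, the following script replays the scenario in a throwaway repository. The `download` function and the page contents are invented for illustration and stand in for the real fetch step (for example `wget`); the optional E4 edit is left out for brevity.

```shell
#!/bin/sh
set -e

# Work in a throwaway repository so the sketch is safe to run anywhere.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/web   # name the initial branch "web"
git config user.name "Example"
git config user.email "example@example.com"

# download is a hypothetical stand-in for the real fetch step.
# Only the first line differs between upstream versions.
download() {
  cat > page.html <<EOF
<p>intro $1</p>
<p>body one</p>
<p>body two</p>
<p>ad</p>
<p>body three</p>
<a href="http://example.com/">link</a>
EOF
}

# W0: the raw upstream page is committed untouched on web.
download v0
git add page.html
git commit -q -m "W0"

# E1, E2: local edits live only on ebook.
git checkout -q -b ebook
sed -e 's/http:/https:/' page.html > page.tmp && mv page.tmp page.html
git commit -q -a -m "E1"
grep -v '<p>ad</p>' page.html > page.tmp && mv page.tmp page.html
git commit -q -a -m "E2"

# W1: a new raw download goes on web, then web is merged into ebook (E3).
git checkout -q web
download v1
git commit -q -a -m "W1"
git checkout -q ebook
git merge -q -m "E3: merge W1" web

# W2: same again; this merge only has to apply the W1..W2 difference.
git checkout -q web
download v2
git commit -q -a -m "W2"
git checkout -q ebook
git merge -q -m "E5: merge W2" web

cat page.html
git --no-pager log --graph --oneline --all
```

In this sketch the upstream change and the local edits touch different lines, so both merges complete without conflicts. When upstream rewrites a line that was also edited locally, Git stops and asks for a manual resolution.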
Note: this process only ever merges from web into ebook, never from ebook into web. Merging from ebook back into web would undo the desired effect, as discussed in the comments below this answer.
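One possible way to catch an accidental back-merge before it causes trouble is to test whether the tip of ebook is reachable from web. The helper name `leaked_into_web` below is hypothetical, and the check is only a sketch: it also reports a leak in the moment right after ebook is created, before it has any commits of its own.

```shell
#!/bin/sh
set -e

# leaked_into_web: succeeds if ebook's tip is reachable from web, which is
# the case after ebook has been merged (or fast-forwarded) into web.
leaked_into_web() {
  git merge-base --is-ancestor ebook web
}

# Demo in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/web
git config user.name "Example"
git config user.email "example@example.com"

echo "upstream v0" > page.html
git add page.html
git commit -q -m "W0"

git checkout -q -b ebook
echo "upstream v0, edited locally" > page.html
git commit -q -a -m "E1"

# Correct state: E1 exists only on ebook.
if leaked_into_web; then before=leaked; else before=clean; fi

# The mistake: merging ebook back into web (a fast-forward here).
git checkout -q web
git merge -q ebook
if leaked_into_web; then after=leaked; else after=clean; fi

echo "before back-merge: $before"
echo "after back-merge:  $after"
```

Running the check before each merge of web into ebook gives an early warning that web no longer holds only the raw downloads.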