I'm looking for some help with a rule (a regex) in Varnish that I'm using to ensure that, e.g., UTM tags don't create a new cached entry for every unique query I receive. Here's my rule:
if (req.url ~ "(\?|\&)(utm|gclid|fbclid|mc)(_|=)") {
    set req.url = regsub(req.url, "\?.*", "");
}
This works fine. The problem comes when the query string also contains a parameter that I do not want to get rid of. For example, if the request is to https://example/page/?fbclid=45435, then the rule works fine: irrespective of what's in the fbclid part, the same page is loaded from the backend. But if the fbclid part comes after another parameter (e.g., the request is to https://example/page/?app=346&fbclid=45435), then the regsub strips everything from the ? onward, including app=346, and returns /page/ without the important parameter ever being processed. (Naturally, I have told Varnish not to cache requests that go to ?app=.)
I am not especially good in this area, so I want to make sure that I'm thinking about this the right way. If I were to change my rule so that it didn't look for the &, then the example I gave above would be fixed. But it would also mean that the UTM tags stuck around after the parameter I wanted to keep, and were passed to the backend to be cached.
Basic question: what's the best way for me to strip out the utm, gclid, fbclid, and mc tags irrespective of where they show up in the string, without getting rid of whatever other queries are in there too?
Thanks!
Your question is similar to "Varnish - use the cache when UTM_, gclid and other campaign params are used, otherwise pass if other querystring present".
Here's how I would typically strip campaign parameters from the URL:
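A minimal sketch of one common approach, assuming it runs inside vcl_recv and that the parameters to drop are utm_*, gclid, fbclid and mc_*. The parameter list and the patterns are illustrative assumptions, so adjust them to your setup. Instead of cutting everything after the ?, it removes only the matching key=value pairs with regsuball and then tidies up any leftover ? or &:

```vcl
sub vcl_recv {
    # Only touch URLs that actually carry a campaign parameter.
    # Assumed list: utm_*, gclid, fbclid, mc_* (edit as needed).
    if (req.url ~ "(\?|&)(utm_[a-z_]+|gclid|fbclid|mc_[a-z_]+)=") {
        # Drop campaign parameters that follow an "&".
        set req.url = regsuball(req.url, "&(utm_[a-z_]+|gclid|fbclid|mc_[a-z_]+)=[^&]*", "");
        # Drop a campaign parameter that directly follows the "?", keeping the "?".
        set req.url = regsub(req.url, "\?(utm_[a-z_]+|gclid|fbclid|mc_[a-z_]+)=[^&]*", "?");
        # If another parameter followed the removed one, "?&" is left behind.
        set req.url = regsub(req.url, "\?&", "?");
        # If nothing is left in the query string, drop the trailing "?".
        set req.url = regsub(req.url, "\?$", "");
    }
}
```

With this, /page/?app=346&fbclid=45435 should become /page/?app=346, and /page/?fbclid=45435 should become /page/, so parameters such as app are left for your other rules to handle.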
Please try this code and see whether it works. It may also be worth reading the question I referenced at the top.