I want to create a function to determine if a well-formed URL was dynamically generated (according to this article: https://www.webopedia.com/TERM/D/dynamic_URL.html).
My first attempt was to check whether any of these characters or substrings appear in the URL, for example:
def is_dynamic_url(url):
    # Flag the URL if any "dynamic-looking" character or substring occurs in it.
    for ch in ["?", "&", "%", "+", "=", "$", "cgi-bin", ".cgi"]:
        if ch in url:
            return True
    return False
Is this sufficient or are there edge cases I am not considering?
You can't determine if a URL is 'dynamic' from the characters in the string. Web servers have moved way beyond CGI scripts served from a hard-coded URL path. Even when the article was current, it was never more than a weak heuristic.
A URL is simply an address for a resource; the acronym stands for Uniform Resource Locator. When the URL starts with http: or https:, you have a URL for a web page, but URLs can address far more than just web pages.
For the type of URLs that article talks about, a client (your browser, say) will use the first portion, between // and the next /, to connect to a specific server and exchange messages using the HTTP standard. The client sends everything after the host information (the path component) to the server. For this question, the full URL shown in the browser is https://stackoverflow.com/questions/53230441/do-i-need-a-regex-to-determine-if-a-url-is-a-dynamic-url, so the browser uses an encrypted connection to the server named stackoverflow.com, and sends it a request to serve the /questions/53230441/do-i-need-a-regex-to-determine-if-a-url-is-a-dynamic-url path.
How the server responds is entirely up to the server. An HTTP server is essentially a black box in this exchange. It can do whatever it likes with the information, and within the broad confines of the HTTP standard, it can produce a response by any means it likes.
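The split described above, into the scheme, the host the client connects to, and the path it sends to that host, can be reproduced with Python's standard urllib.parse module. This is only a small illustration; the variable names are my own.

```python
from urllib.parse import urlsplit

url = ("https://stackoverflow.com/questions/53230441/"
       "do-i-need-a-regex-to-determine-if-a-url-is-a-dynamic-url")

parts = urlsplit(url)

# The scheme decides how to connect (https means an encrypted connection).
print(parts.scheme)
# The host is the part between "//" and the next "/".
print(parts.netloc)
# Everything after the host is the path that is sent to the server.
print(parts.path)
```

Nothing in these three components says how the server will produce its response.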
In the very early days of the web, the HTTP server would only ever map the path given directly to a filesystem. For example, based on the server configuration, the path /foo/bar/baz.html would be mapped to the file /var/data/www/foo/bar/baz.html, and if it existed the server would read the contents of that file and return them to the client together with some metadata, and that was it. If you wanted to customise this process, you either wrote your own HTTP server or used some kind of extension mechanism specific to the web server. The NCSA web server had a different mechanism from the Netscape server, which differed from the Apache HTTP server, etc. Not many sites needed this kind of processing, computers powerful enough to run databases were expensive, and programming such exotic behaviour took a lot of time and specialist knowledge.
Then the NCSA HTTP server implementation created a standard for delegating an HTTP request to arbitrary programs (such as scripts), called the Common Gateway Interface, or CGI. Because everything was still centered on mapping URL paths to files, web site administrators were expected to put CGI programs in a dedicated directory, usually named cgi-bin. A path starting with that name would then be mapped to such a configured location, and instead of reading the files found there and serving them back, the server would execute the file and pass back the result it produced. For a while that was the most common way to build a website that didn't consist of just static files.
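As a rough sketch of that old path-to-file mapping, and of the cgi-bin special case, the following uses only the standard library. The document root comes from the example above; the function names and the /cgi-bin/counter.cgi path are hypothetical, and a real server of the era did considerably more than this.

```python
import posixpath
from pathlib import PurePosixPath

# Document root taken from the example path in the text.
DOCUMENT_ROOT = PurePosixPath("/var/data/www")


def normalise(url_path):
    """Collapse "." and ".." segments so a path cannot climb out of the root."""
    return posixpath.normpath("/" + url_path.lstrip("/"))


def map_path_to_file(url_path):
    """Map a URL path to the file a static server would read and send back."""
    return DOCUMENT_ROOT / normalise(url_path).lstrip("/")


def is_cgi_request(url_path):
    """True if the path falls under the conventional cgi-bin directory,
    in which case the server would execute the file instead of reading it."""
    return normalise(url_path).startswith("/cgi-bin/")


print(map_path_to_file("/foo/bar/baz.html"))
print(is_cgi_request("/foo/bar/baz.html"))
print(is_cgi_request("/cgi-bin/counter.cgi"))
```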
And to let you pass information from the client to the server, the most common way to configure a CGI program is to use additional information in the URL, such as the query string (the part starting at ?, if there is one). The standard HTTP server of yore did not let you alter the URL path for a CGI script much, but would pass through the query string unaltered. So adding ?foo=1&bar=2 is a good way to configure such a script.
And that's the kind of URL that article refers to; it gives you a simple heuristic for judging if a URL might map to a CGI script and so might be called dynamic. It was never meant to be a hard and fast rule that you can teach a computer to look for, though.
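To make the query-string mechanism concrete: under the CGI convention the server hands the part after ? to the program in the QUERY_STRING environment variable. The sketch below shows how a CGI program written in Python might read it; the function name and the simulated environment are my own, for illustration only.

```python
import os
from urllib.parse import parse_qs


def read_cgi_parameters(environ=os.environ):
    """Return the query-string parameters a CGI program would see."""
    # The server passes everything after "?" in the QUERY_STRING variable.
    return parse_qs(environ.get("QUERY_STRING", ""))


# Simulate the environment a server would set up for a URL ending in ?foo=1&bar=2
fake_environ = {"QUERY_STRING": "foo=1&bar=2"}
print(read_cgi_parameters(fake_environ))  # {'foo': ['1'], 'bar': ['2']}
```

Each value is a list because a name may appear more than once in a query string.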
These days, we have moved far, far beyond CGI scripts. Modern HTTP web servers make it really easy to map every request, regardless of path, to a (set of) long-running process(es), or are themselves directly handling requests via embedded programming language support. For example, Stack Overflow itself is built using the ASP.NET framework, which runs directly in the Microsoft IIS HTTP server. Every page you see on this site is 'dynamic' in that it shows you information that is combined from different sources (databases, configuration files, templates stored on disk, etc.). The /questions/53230441/do-i-need-a-regex-to-determine-if-a-url-is-a-dynamic-url path is dissected by the Stack Overflow application and mapped to dedicated pieces of code configured to handle patterns in the URL. A path that starts with /questions/ and a series of digits, followed by / and more text, results in database queries for information on the question with number 53230441.
It's trivial these days to build such a site yourself. Take a look at a simple web framework like Flask, for example. With Python and the Flask library installed, I can put a few lines of Python code into a file named site.py, execute the command FLASK_APP=site flask run, point my browser to the URL http://localhost:5000/ and see the text Hello, World! appear, or load http://localhost:5000/Han instead and see Hello, Han! in the browser. Those are dynamic URLs too!
Note: I haven't even touched on using JavaScript in the web browser here, which adds a whole new level of dynamism, where the client is now smart and can change the behaviour of web pages, load additional URLs in the background and keep changing web page content all the time.
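For reference, a minimal site.py that behaves as described in the Flask example above (Hello, World! at / and Hello, Han! at /Han) might look like the following. Treat it as a sketch consistent with that description rather than the exact code; the function name hello and the use of escape are my own choices.

```python
from flask import Flask
from markupsafe import escape

app = Flask(__name__)


# Both routes are handled by the same function; the name comes from the URL path.
@app.route("/")
@app.route("/<name>")
def hello(name="World"):
    # escape() keeps user-supplied path text from being interpreted as HTML.
    return f"Hello, {escape(name)}!"
```

Neither URL contains ?, & or cgi-bin, yet both responses are generated by code on every request.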
All this means that you can’t tell much, if anything, from just the characters in a URL anymore as to what it’ll produce or if that result was built “dynamically”.
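A quick way to see this is to run the character check from the question against two URLs. The first is the URL of this very question, which is served by application code; the second is a hypothetical static PDF whose filename merely contains an encoded space.

```python
MARKERS = ["?", "&", "%", "+", "=", "$", "cgi-bin", ".cgi"]


def is_dynamic_url(url):
    """The heuristic from the question: look for 'dynamic-looking' substrings."""
    return any(marker in url for marker in MARKERS)


# Generated on every request from a database, but contains none of the markers.
generated_page = ("https://stackoverflow.com/questions/53230441/"
                  "do-i-need-a-regex-to-determine-if-a-url-is-a-dynamic-url")

# Hypothetical static file; "%20" is just a percent-encoded space in the name.
static_file = "https://example.com/downloads/report%202018.pdf"

print(is_dynamic_url(generated_page))  # False
print(is_dynamic_url(static_file))     # True
```

The check gets both cases wrong, which is the point: the string alone does not reveal how the response is produced.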