I'm writing a PHP application that accepts a URL from the user and then processes it by making some calls to binaries with system()*. However, to avoid the many complications that arise with this, I'm trying to convert the URL, which may contain Unicode characters, into ASCII characters.
Let's say I have the following URL:
https://täst.de:8118/news/zh-cn/新闻动态/2015/
Here, two parts need to be dealt with: the hostname and the path.
- For the hostname, I can simply call idn_to_ascii().
- However, I can't simply call urlencode() over the path, as each of the characters that need to remain unmodified will also be converted (e.g. news/zh-cn/新闻动态/2015/ -> news%2Fzh-cn%2F%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81%2F2015%2F as opposed to news/zh-cn/%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81/2015/).
How should I approach this problem?
*I'd rather not deal with system() calls and the resulting complexity, but given that the functionality is only available by calling binaries, I unfortunately have no choice.
The following can be used for this transformation:
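As a minimal sketch (the helper names url_to_ascii() and encode_non_ascii() are hypothetical, not taken from the original post): split the URL with parse_url(), convert the hostname with idn_to_ascii(), and percent-encode only the bytes outside printable ASCII in the remaining parts. This is a swapped-in alternative to calling urlencode() on the whole path. It assumes UTF-8 input, the intl extension, and a locale in which parse_url() leaves multibyte characters intact, since parse_url() is not multibyte-aware.

```php
<?php
// Hypothetical helper: percent-encode every byte outside printable ASCII
// (0x21-0x7E). Separators such as "/" and existing %XX escapes are kept.
function encode_non_ascii(string $s): string
{
    // No /u modifier, so the pattern matches single bytes, not code points.
    return preg_replace_callback(
        '/[^\x21-\x7E]/',
        static function (array $m): string {
            return rawurlencode($m[0]);
        },
        $s
    );
}

// Hypothetical helper: rebuild an absolute URL using ASCII characters only.
// Assumption: userinfo (user:pass@) is not needed and is dropped here.
function url_to_ascii(string $url): string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        throw new InvalidArgumentException('Not an absolute URL: ' . $url);
    }

    // Hostname: convert to its Punycode form (requires the intl extension).
    $host = idn_to_ascii($p['host'], IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($host === false) {
        throw new InvalidArgumentException('Invalid hostname: ' . $p['host']);
    }

    $out = $p['scheme'] . '://' . $host;
    if (isset($p['port'])) {
        $out .= ':' . $p['port'];
    }

    // Path, query, and fragment: only non-ASCII bytes are escaped.
    $out .= encode_non_ascii($p['path'] ?? '');
    if (isset($p['query'])) {
        $out .= '?' . encode_non_ascii($p['query']);
    }
    if (isset($p['fragment'])) {
        $out .= '#' . encode_non_ascii($p['fragment']);
    }
    return $out;
}

echo url_to_ascii('https://täst.de:8118/news/zh-cn/新闻动态/2015/'), "\n";
```

For the example URL, the path should come out as /news/zh-cn/%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81/2015/, with the slashes left alone. ASCII shell metacharacters are not touched by this conversion, so the result should still go through escapeshellarg() before it is handed to system().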