Splitting up an IDN URL in PHP


I'm trying to split an IDN URL along the lines of http://exämple.se/path or https://äxämple.se/anotherpath?foo=bar&baf=bas into its components, like so:

[0] http(s)://
[1] äxämple.se
[2] /anotherpath?foo=bar&baf=bas

My first thought was "I'll just use parse_url!". Well, except it doesn't do IDN domains so no luck.

Next I tried a bunch of my own regex tricks but failed to get any useful output (some of them worked to a degree but were still painfully lacking).

Finally I tried various other people's regex patterns, but none of them worked right for me (work right = captured anything useful). One captured the whole URL as its "protocol" part; most others I ran across captured nothing or were clearly functionally identical to ones I'd already tried.

And of course, why am I doing this? I want to run idn_to_ascii on the domain name before piecing the URL back together and storing it in a db.

So, what am I doing wrong here? Is my approach completely wrong or is there some magic invocation of preg_match which will fix my problem?

Edit: Preferably I'd like a solution which doesn't involve downloading a blob of code someone else wrote (like say, a custom class named something like ParseIDNUrl weighing in at 100kB)


There are 2 best solutions below

1
On BEST ANSWER

parse_url should work fine. Using PHP 5.3.4 I've been able to extract just the domain part:

print parse_url('http://äxämple.se/foobar', PHP_URL_HOST);

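If your page or terminal uses ISO-8859-1 rather than UTF-8, you may need to convert the result (note that utf8_decode is deprecated as of PHP 8.2):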

print utf8_decode(parse_url('http://äxämple.se/foobar', PHP_URL_HOST));

The output I get is:

äxämple.se
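
To go from there to the stated goal (running idn_to_ascii on the host and piecing the URL back together), something like the following might work. This is only a sketch: it assumes PHP 7 or later, the intl extension (which provides idn_to_ascii), UTF-8 input, and a locale in which parse_url leaves the multibyte characters intact. The function name idn_url_to_ascii is made up for illustration, and user/password parts are not handled.

```php
<?php
// Hypothetical helper: convert only the host of a URL to its ASCII
// (Punycode) form and reassemble the URL around it.
// Assumes the intl extension is loaded and the input is UTF-8.
function idn_url_to_ascii(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Cannot parse URL: $url");
    }

    // Convert the host only; path and query are left untouched.
    $host = idn_to_ascii($parts['host'], IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($host === false) {
        throw new InvalidArgumentException("Cannot convert host: {$parts['host']}");
    }

    // Reassemble; port, query and fragment are kept when present.
    $result = $parts['scheme'] . '://' . $host;
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $parts['path'] ?? '';
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo idn_url_to_ascii('https://äxämple.se/anotherpath?foo=bar&baf=bas'), "\n";
```

The host in the result should begin with xn--, while the scheme, path and query string come through unchanged.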

Hope that helps!

0
On

Sorry, I didn't read your post fully at first.

Here's a regex I found in the question "Properly Matching a IDN URL":

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
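
That pattern finds URLs inside text rather than splitting one into parts. For the three-part split asked for in the question, a simpler anchored pattern may be enough. The following is a rough sketch, not a full URL parser: the function name split_url is made up for illustration, the input is assumed to be a single UTF-8 URL with a scheme, and any user info or port ends up in the host part.

```php
<?php
// Hypothetical helper: split a URL into [scheme prefix, host, remainder].
// The /u modifier makes PCRE treat pattern and subject as UTF-8, so
// characters such as "ä" are handled as whole characters.
function split_url(string $url): ?array
{
    // 1: scheme plus "://"   2: everything up to the first / ? or #   3: the rest
    $pattern = '~^([a-z][a-z0-9+.-]*://)([^/?#]+)(.*)$~iu';
    if (preg_match($pattern, $url, $m) !== 1) {
        return null; // no match, or invalid UTF-8 in the subject
    }
    return [$m[1], $m[2], $m[3]];
}

print_r(split_url('https://äxämple.se/anotherpath?foo=bar&baf=bas'));
// [0] => https://
// [1] => äxämple.se
// [2] => /anotherpath?foo=bar&baf=bas
```

The second element can then be passed to idn_to_ascii before the three pieces are concatenated again.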