PHP parse_url() returns an empty host for some URLs


I'm trying to get the host from a URL using parse_url(), but for some inputs I get empty results. Here is my function:

function clean_url($urls){
    $good_url=array();
    for ($i=0;$i<count($urls);$i++){
        $url=parse_url($urls[$i]);

       //$temp_string=str_replace("http://", "", $urls[$i]);
       //$temp_string=str_replace("https://", "", $urls[$i]);
       //$temp_string=substr($temp_string, 0,stripos($temp_string,"/"));
       array_push($good_url, $url['host']);
    }
    return $good_url;
}

Input array:

Array ( 
    [0] => https://en.wikipedia.org/wiki/Data 
    [1] => data.gov.ua/ 
    [2] => e-data.gov.ua/ 
    [3] => e-data.gov.ua/transaction 
    [4] => https://api.jquery.com/data/ 
    [5] => https://api.jquery.com/jquery.data/ 
    [6] => searchdatamanagement.techtarget.com/definition/data 
    [7] => www.businessdictionary.com/definition/data.html  
    [8] => https://data.world/ 
    [9] => https://en.oxforddictionaries.com/definition/data 
)

Result array, with empty entries:

Array ( 
    [0] => en.wikipedia.org 
    [1] => 
    [2] => 
    [3] => 
    [4] => api.jquery.com 
    [5] => api.jquery.com 
    [6] => 
    [7] => 
    [8] => data.world 
    [9] => en.oxforddictionaries.com 
)

There are 6 best solutions below


Some of the $urls being parsed do not have a scheme, which causes parse_url to treat their hosts as paths.

For example, parsing the URL data.gov.ua/ returns data.gov.ua/ as the path. Adding a scheme such as https, so that the URL becomes https://data.gov.ua/, allows parse_url to recognise data.gov.ua as the host.
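A minimal sketch of that fix, assuming every scheme-less input is meant to be a web address and that https is an acceptable default. The helper name host_with_default_scheme and its default are illustrative and not part of the question's code:

```php
<?php
// Hypothetical helper: prepend a default scheme when the input has none,
// then ask parse_url() for the host.
function host_with_default_scheme(string $url, string $defaultScheme = 'https'): ?string
{
    if (strncmp($url, '//', 2) === 0) {
        // Scheme-relative input such as "//data.gov.ua/": only the scheme is missing.
        $url = $defaultScheme . ':' . $url;
    } elseif (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        // Assumption: anything that does not start with "scheme://" begins with a host.
        $url = $defaultScheme . '://' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);

    // parse_url() gives null (no host) or false (unparseable); map both to null.
    return is_string($host) ? $host : null;
}

var_dump(host_with_default_scheme('data.gov.ua/'));
var_dump(host_with_default_scheme('https://en.wikipedia.org/wiki/Data'));
```

Relative paths such as /wiki/Data would be misread as hosts under this assumption, so it only suits input that is known to be host-first.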


I ran your script and got a PHP notice:

Notice: Undefined index: host

So the index $url['host'] does not exist. If I var_dump the parsed result for each URL, this is what is returned:

array (size=3)
  'scheme' => string 'https' (length=5)
  'host' => string 'en.wikipedia.org' (length=16)
  'path' => string '/wiki/Data' (length=10)

array (size=1)
  'path' => string 'data.gov.ua/' (length=12)

( ! ) Notice: Undefined index: host


array (size=1)
  'path' => string 'e-data.gov.ua/' (length=14)

( ! ) Notice: Undefined index: host

As you can see, these URLs are interpreted as paths.

Outputs:

  1. $urls[] = 'data.gov.ua/'; Error: no host, the whole string is parsed as a path.
  2. $urls[] = '//data.gov.ua/'; Valid.
  3. $urls[] = 'http://data.gov.ua/'; Valid.

Tip: Use // if you don't know whether it's http or https.

By the way, you can simplify your code :p

function clean_url(array $urls) {
    $good_url = [];
    foreach ($urls as $url) {
        // Add a check on the start of the URL here.

        $parse = parse_url($url);

        // Read the host from the parsed array, not from the original string.
        if (isset($parse['host']))
            array_push($good_url, $parse['host']);
        else
            $good_url[] = 'Invalid Url'; // for example, or trigger an error.
    }
    return $good_url;
}

See foreach and isset
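As a variation on the isset() check, parse_url() also accepts a component constant as its second argument. With PHP_URL_HOST it returns the host string, null when the URL has no host, or false when the URL cannot be parsed, so no array index is read and no notice is raised. A small sketch; the function name clean_url_hosts and the choice of null for missing hosts are assumptions made for illustration:

```php
<?php
// Sketch: collect hosts without touching an array index that may not exist.
function clean_url_hosts(array $urls): array
{
    $hosts = [];
    foreach ($urls as $url) {
        // Returns the host string, null (no host), or false (unparseable URL).
        $host = parse_url($url, PHP_URL_HOST);
        $hosts[] = is_string($host) ? $host : null;
    }
    return $hosts;
}

print_r(clean_url_hosts(['https://api.jquery.com/data/', 'data.gov.ua/']));
```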


The general format of a URL is:

scheme://hostname:port/path?query#fragment

Each part of the URL is optional, and the parser uses the delimiters between the parts to determine which ones have been provided or omitted.
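To make the format concrete, here is a short sketch that parses one URL containing every part and one containing only a path and a query. The address example.com:8080 and the path, query, and fragment values are made up for illustration:

```php
<?php
// Every part present: scheme, host, port, path, query, fragment.
$full = parse_url('https://example.com:8080/some/path?key=value#section');
print_r($full);

// Only a path and a query: the other keys are simply absent from the result.
$partial = parse_url('/some/path?key=value');
print_r($partial);
```

Note that the port comes back as an integer, while the other parts are strings.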

The hostname is the part of the URL after the // prefix. Many of your URLs are missing this prefix, so they don't have a hostname.

For instance, parse_url('data.gov.ua/') returns:

Array
(
    [path] => data.gov.ua/
)

To get what you want, it should be parse_url('//data.gov.ua/'):

Array
(
    [host] => data.gov.ua
    [path] => /
)

This frequently confuses programmers because browsers are very forgiving about incomplete URLs typed into the location field: they use heuristics to decide whether something is a hostname or a path. APIs like parse_url() are stricter about it.
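One way to apply the // prefix to a mixed list is to add it only where it is missing. The following is a rough sketch, assuming that any input without a leading // or scheme:// starts with a hostname; the function name hosts_from_mixed_urls is hypothetical:

```php
<?php
// Sketch: prepend "//" to scheme-less input so parse_url() sees a hostname.
function hosts_from_mixed_urls(array $urls): array
{
    $hosts = [];
    foreach ($urls as $url) {
        // Matches "//host..." and "scheme://host..."; anything else gets the prefix.
        if (!preg_match('#^(?:[a-z][a-z0-9+.-]*:)?//#i', $url)) {
            $url = '//' . $url;
        }
        $host = parse_url($url, PHP_URL_HOST);
        // Keep an empty string for input that still has no host.
        $hosts[] = is_string($host) ? $host : '';
    }
    return $hosts;
}

print_r(hosts_from_mixed_urls([
    'https://en.wikipedia.org/wiki/Data',
    'data.gov.ua/',
    'e-data.gov.ua/transaction',
    'https://data.world/',
]));
```

This is the same kind of guess a browser makes, so it is only appropriate when the input is known to be a list of web addresses rather than relative paths.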



Some time ago I developed a solution to a similar problem.
I made some changes to my original code to meet your specification.
It is functional but not very elegant.

function clean_url($urls)
{
    $good_url=array();
    for ($i=0;$i<count($urls);$i++){
        $domain=$urls[$i];

        $domain = str_replace("www.","",$domain);
        $domain = str_replace("https://","",$domain);
        $domain = str_replace("http://","",$domain);
        $domain=explode("/", $domain);

       array_push($good_url, $domain[0]);
    }
    return $good_url;
}

$urls=array( 
"0" => "https://en.wikipedia.org/wiki/Data" ,
"1" => "data.gov.ua/" ,
"2" => "e-data.gov.ua/",
"3" => "e-data.gov.ua/transaction",
"4" => "https://api.jquery.com/data/",
"5" => "https://api.jquery.com/jquery.data/" ,
"6" => "searchdatamanagement.techtarget.com/definition/data" ,
"7" => "www.businessdictionary.com/definition/data.html"  ,
"8" => "https://data.world/",
"9" => "https://en.oxforddictionaries.com/definition/data");

echo "<pre>";
print_r(clean_url($urls));
echo "</pre>";

Best regards,
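One caveat with the str_replace() calls above is that they remove "www." and "http://" wherever they occur in the string, not only at the start. A hedged alternative is to anchor the stripping with a regular expression. The function name strip_to_domain and the example.com address are assumptions for illustration, and ports, query strings, and user info are deliberately not handled:

```php
<?php
// Sketch: strip an optional scheme and an optional leading "www." only at
// the very start of the string, then keep everything before the first slash.
function strip_to_domain(string $url): string
{
    $domain = preg_replace('#^(?:https?://)?(?:www\.)?#i', '', $url);
    return explode('/', $domain, 2)[0];
}

var_dump(strip_to_domain('www.businessdictionary.com/definition/data.html'));
var_dump(strip_to_domain('https://awww.example.com/page'));
```

With the unanchored str_replace() version, the second input would lose the "www." inside "awww.example.com".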


The problem was the missing http scheme. I added http:// to all the URLs and now it works.


I made this simple function, which gives me the host (for the link text) and the full URL (for the href):

public static function parseUrl($target_url)
{
    $url = "";
    $url_full = "";

    if (!empty($target_url)) {
        $parser = @parse_url($target_url);
        if (!empty($parser['host'])) {
            $url = $parser['host'];
            if (!empty($parser['scheme'])) {
                $url_full = $parser['scheme'] . "://" . $parser['host'];
            } else {
                $url_full = "//" . $parser['host'];
            }
        } else {
            if (!empty($parser['path'])) {
                return self::parseUrl("//".$parser['path']);
            }
        }
    }

    return array('url' => $url, 'url_full' => $url_full);
}

which works well with the example input:

Array
(
    [url] => en.wikipedia.org
    [url_full] => https://en.wikipedia.org
)
Array
(
    [url] => data.gov.ua
    [url_full] => //data.gov.ua
)
Array
(
    [url] => e-data.gov.ua
    [url_full] => //e-data.gov.ua
)
Array
(
    [url] => e-data.gov.ua
    [url_full] => //e-data.gov.ua
)
Array
(
    [url] => api.jquery.com
    [url_full] => https://api.jquery.com
)
Array
(
    [url] => api.jquery.com
    [url_full] => https://api.jquery.com
)
Array
(
    [url] => searchdatamanagement.techtarget.com
    [url_full] => //searchdatamanagement.techtarget.com
)
Array
(
    [url] => www.businessdictionary.com
    [url_full] => //www.businessdictionary.com
)
Array
(
    [url] => data.world
    [url_full] => https://data.world
)
Array
(
    [url] => en.oxforddictionaries.com
    [url_full] => https://en.oxforddictionaries.com
)

You can then use the result like this:

<a href="{$url['url_full']}" target="_blank">{$url['url']}</a>