I've written a product-syncing script that runs between a local server running a merchant application and a remote web server hosting the store's eshop...
For the full sync option I need to sync 5,000+ products, with their images etc... Even though size variations of the same product (different shoe sizes, for example) share the same product image, I still need to check the existence of around 3,500 images...
So, for the first run, I uploaded all product images through FTP except for a couple of them, and let the script run to check whether it would upload those few missing images...
The problem is that the script ran for 4 hours, which is unacceptable... I mean, it didn't re-upload every image... It just checked every single image to determine whether to skip it or upload it (through ftp_put()).
I was performing the check like this:
if (stripos(get_headers(DESTINATION_URL . "{$path}/{$file}")[0], '200 OK') === false) {
which is pretty fast per request, but obviously not fast enough for the whole sync to finish in a reasonable amount of time...
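For reference, here's that check pulled out into a self-contained snippet (a minimal sketch; the remote_file_exists() wrapper and the HEAD-request context are just illustrative tweaks on top of the one-liner above, and the constant/variable values are placeholders):

// DESTINATION_URL, $path and $file come from the sync script; placeholders here.
define('DESTINATION_URL', 'https://example.com/'); // placeholder
$path = 'images/products';                         // placeholder
$file = '12345.jpg';                               // placeholder

// Sending a HEAD request via a stream context avoids transferring the body,
// but it is still one full HTTP round trip per image.
function remote_file_exists(string $url): bool
{
    $context = stream_context_create(['http' => ['method' => 'HEAD']]);
    $headers = @get_headers($url, false, $context);

    return $headers !== false && stripos($headers[0], '200') !== false;
}

if (!remote_file_exists(DESTINATION_URL . "{$path}/{$file}")) {
    // the image is missing on the remote server, so upload it with ftp_put()
}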
How do you people handle such situations, where you have to check the existence of a HUGE number of remote files?
As a last resort, I'm left with using ftp_nlist() to download a list of the remote files and then writing an algorithm to more or less do a file comparison between the local and remote files...
I tried it, and it takes ages, literally 30+ minutes, for the recursive algorithm to build the file list... You see, my files are not in one single folder... The whole tree spans 1,956 folders, and the file list consists of 3,653 product image files and growing... Also note that I didn't even use the size "trick" (used in conjunction with ftp_nlist()) to determine whether an entry is a file or a folder, but rather used the newer ftp_mlsd(), which explicitly returns a type field that holds that info (a rough sketch of my recursive listing is below)... You can read more here: PHP FTP recursive directory listing
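For what it's worth, here's roughly the recursive listing I ended up with (a minimal sketch, assuming an already-connected and logged-in FTP handle; the function name is just illustrative):

// Recursively collect all file paths under $dir, using the "type" fact
// returned by ftp_mlsd() to tell files from folders (no size "trick" needed).
function ftp_list_recursive($ftp, string $dir, array &$files = []): array
{
    $entries = ftp_mlsd($ftp, $dir);
    if ($entries === false) {
        return $files;
    }

    foreach ($entries as $entry) {
        if ($entry['type'] === 'cdir' || $entry['type'] === 'pdir') {
            continue; // skip "." and ".."
        }
        $path = rtrim($dir, '/') . '/' . $entry['name'];
        if ($entry['type'] === 'dir') {
            ftp_list_recursive($ftp, $path, $files); // descend into subfolder
        } else {
            $files[] = $path; // regular file
        }
    }

    return $files;
}

Even so, that's one MLSD round trip per folder, so with 1,956 folders the listing alone takes ages.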
curl_multi is probably the fastest way. Unfortunately, curl_multi is rather difficult to use, so an example helps a lot, IMO. Checking URLs between two 1 Gbps dedicated servers in two different datacenters in Canada, this script managed to check around 3,000 URLs per second using 500 concurrent TCP connections (and it can be made even faster by re-using curl handles instead of opening and closing them).
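Here's a rough sketch of that pattern (not the exact script from the benchmark above; check_urls() and the exact options are just one reasonable setup). It keeps a rolling window of concurrent HEAD requests and returns the HTTP status code for every URL:

// Check a list of URLs with HEAD requests, keeping up to $concurrency
// transfers in flight at once; returns [url => HTTP status code].
function check_urls(array $urls, int $concurrency = 500): array
{
    $results  = [];
    $mh       = curl_multi_init();
    $queue    = $urls;
    $inflight = 0;

    $add = function () use (&$queue, &$inflight, $mh) {
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,  // HEAD request, skip the body
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 10,
            CURLOPT_PRIVATE        => $url,  // remember which URL this handle checks
        ]);
        curl_multi_add_handle($mh, $ch);
        $inflight++;
    };

    // Fill the initial window of concurrent transfers.
    while ($inflight < $concurrency && $queue) {
        $add();
    }

    do {
        curl_multi_exec($mh, $running);
        if (curl_multi_select($mh) === -1) {
            usleep(1000); // nothing to wait on yet, avoid a busy loop
        }

        // Harvest finished transfers and top the window back up from the queue.
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $inflight--;
            if ($queue) {
                $add();
            }
        }
    } while ($inflight > 0);

    curl_multi_close($mh);
    return $results;
}

Feed it the full list of image URLs and upload only the ones that don't come back as 200.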