I have FTP access to a single directory that holds all images for all of the vendor's products. Each product has multiple images: variations in size and variations in how the product is displayed.
There is no "list" (XML, CSV, database, ...) that tells me "what's new". For now, the only way I see is to grab all filenames and compare them with the ones in my DB.
The last check counted 998,283 files in that directory. Each product has multiple variations, and there is no documentation of how they are named.
I did an initial grab of the filenames, compared them with my products, and saved them in a database table for "images", along with their date modified (from the file).
The next step is to check for "new ones".
What I am doing now is:
// load the export product codes once, instead of on every iteration
$export_codes = $this->export->getProductIds();
// get the file list
foreach ($this->getFilenamesFromFtp() as $key => $image_data) {
    // I extract data from filenames (product code, size, variation number, extension..) so I can store them in table
    // and later use that as reference (i.e. I want to use only large images of a variation, not all sizes)
    $data = self::extractDataFromImage($image_data);
    // checking if filename already exists in DB images
    // if there is a DB entry (TRUE) it does nothing and moves on to the next file
    if ($this->checkForFilenameInDb($data['filename'])) {
        continue;
    }
    // check if product code is in export table - that is, do we really need this image
    if ($this->functions->in_array_r($data['product_code'], $export_codes)) {
        self::insertImageDataInDb($data);
    } // end if
} // end foreach
and my method getFilenamesFromFtp() looks like this:
$filenames = array();
$i = 1;
$ftp = $this->getFtpConfiguration();
// set up basic connection
$conn_id = ftp_ssl_connect($ftp['host']);
// Connection OK ? (check before using $conn_id in any other ftp_* call)
if (!$conn_id) {
    die("FTP connection has failed !");
}
// login with username and password
$login_result = ftp_login($conn_id, $ftp['username'], $ftp['pass']);
ftp_set_option($conn_id, FTP_USEPASVADDRESS, false);
$mode = ftp_pasv($conn_id, TRUE);
ftp_set_option($conn_id, FTP_TIMEOUT_SEC, 180);
// Login OK ?
if ((!$login_result) || (!$mode)) {
    die("FTP connection has failed !");
}
// I get all filenames and store them in array
$files = ftp_nlist($conn_id, ".");
// ftp_nlist() returns false on failure
if ($files === false) {
    die("FTP listing has failed !");
}
// I count the number of files in array = the number of files on FTP
$nofiles = count($files);
foreach ($files as $filename) {
    // the limit I implemented while developing or testing, but in production (current mode) it has to run without limit
    if (self::LIMIT > 0 && $i == self::LIMIT) {
        break;
    }
    // I get date modified from the file
    $date_modified = ftp_mdtm($conn_id, $filename);
    // I create new array for filenames and date modified so I can return it and store it in DB
    $filenames[] = array(
        "filename" => $filename,
        "date_modified" => $date_modified
    );
    $i++;
} // end foreach
// close the connection
ftp_close($conn_id);
return $filenames;
The problem is that the script takes a long time.
The longest period I have detected so far is in getFilenamesFromFtp(), where I create the array:
$filenames[]= array(
"filename" => $filename,
"date_modified" => $date_modified
);
That part has been running for 4 hours so far and is still not done.
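Appending to the array is cheap by itself; the slow step in that loop is more likely `ftp_mdtm()`, which sends one MDTM command to the server per file, so close to a million round trips here. If the server supports the MLSD command (an assumption, not every server does), PHP's `ftp_mlsd()` (available since PHP 7.2) returns names and modification times in a single listing. A minimal sketch of turning such a listing into the same filename/date_modified rows; `mlsdEntriesToRows()` is a hypothetical helper name, not part of the code above:

```php
<?php
declare(strict_types=1);

/**
 * Convert raw MLSD entries into filename/date_modified rows.
 * Expects the array shape returned by ftp_mlsd(): each entry has a
 * "name" key plus server-provided facts such as "type" and "modify".
 */
function mlsdEntriesToRows(array $entries): array
{
    $rows = [];
    foreach ($entries as $entry) {
        // Keep regular files only; skip directories and the "." / ".." entries.
        if (strtolower((string) ($entry['type'] ?? '')) !== 'file') {
            continue;
        }
        // "modify" is a UTC value like 20240131120000, optionally with fractions.
        $stamp = (string) substr((string) ($entry['modify'] ?? ''), 0, 14);
        $date  = DateTimeImmutable::createFromFormat('!YmdHis', $stamp, new DateTimeZone('UTC'));
        $rows[] = [
            'filename'      => $entry['name'],
            // null when the server did not send a usable "modify" fact
            'date_modified' => $date ? $date->getTimestamp() : null,
        ];
    }
    return $rows;
}
```

With a live connection this would be called as `mlsdEntriesToRows(ftp_mlsd($conn_id, '.') ?: [])`, replacing both the `ftp_nlist()` call and the per-file `ftp_mdtm()` calls.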
While writing this, I had an idea: drop "date modified" from the initial listing and fetch it later, only if I am actually going to store that image in the DB.
I will update this question as soon as I am done with this change and test :)
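A rough sketch of that idea, under some assumptions: the filenames already stored in the DB can be loaded into a plain array up front, and the expensive lookup is passed in as a callback. `collectNewImages()` and its parameters are hypothetical names used for illustration only.

```php
<?php
declare(strict_types=1);

/**
 * Return rows only for filenames that are not known yet, calling the
 * (expensive) $fetchModified callback just for those.
 */
function collectNewImages(iterable $filenames, array $knownFilenames, callable $fetchModified): array
{
    // Flip the list so that a membership test becomes a key lookup
    // instead of one DB query per filename.
    $known = array_flip($knownFilenames);
    $new = [];
    foreach ($filenames as $filename) {
        if (isset($known[$filename])) {
            continue; // already stored, no FTP call needed
        }
        $new[] = [
            'filename'      => $filename,
            'date_modified' => $fetchModified($filename),
        ];
    }
    return $new;
}
```

Against the real server the callback could be `function ($f) use ($conn_id) { return ftp_mdtm($conn_id, $f); }`, so `ftp_mdtm()` runs only for files that are actually new.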
Processing a million filenames will take time. However, I see no reason to store those filenames (and date_modified) in an array; why not process each filename directly? Also, instead of completely processing a filename, why not store it in a database table first? Then you can do the real processing later. By splitting the task in two, retrieval and processing, it becomes more flexible. For instance, you don't need to do a new retrieval if you want to change the processing.
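A minimal sketch of that two-step split, assuming PDO with the SQLite driver; the table name `ftp_files_staging` and both function names are made up for illustration. On MySQL the insert would be spelled `INSERT IGNORE` instead.

```php
<?php
declare(strict_types=1);

/** Step 1 (retrieval): store raw filenames as they arrive, without processing. */
function stageFilenames(PDO $db, iterable $filenames, int $batchSize = 1000): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS ftp_files_staging (
        filename  TEXT PRIMARY KEY,
        processed INTEGER NOT NULL DEFAULT 0
    )');
    // SQLite syntax: filenames that are already staged are skipped.
    $insert = $db->prepare('INSERT OR IGNORE INTO ftp_files_staging (filename) VALUES (?)');
    $seen = 0;
    $db->beginTransaction();
    foreach ($filenames as $filename) {
        $insert->execute([$filename]);
        // Commit in batches so one transaction does not hold a million rows.
        if (++$seen % $batchSize === 0) {
            $db->commit();
            $db->beginTransaction();
        }
    }
    $db->commit();
}

/** Step 2 (processing): handle staged rows later, in chunks; returns rows handled. */
function processStaged(PDO $db, callable $process, int $chunk = 1000): int
{
    $select = $db->prepare('SELECT filename FROM ftp_files_staging WHERE processed = 0 LIMIT ' . $chunk);
    $mark   = $db->prepare('UPDATE ftp_files_staging SET processed = 1 WHERE filename = ?');
    $done = 0;
    while (true) {
        $select->execute();
        $rows = $select->fetchAll(PDO::FETCH_COLUMN);
        if (!$rows) {
            break; // nothing left to process
        }
        foreach ($rows as $filename) {
            $process($filename);
            $mark->execute([$filename]);
            $done++;
        }
    }
    return $done;
}
```

To re-run a changed processing step without touching the FTP server, resetting `processed` to 0 is enough; `stageFilenames()` also accepts a generator, so the full file list never has to be built as one array first.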