PHP - can my script for fetching filenames and finding new files be faster?


I have FTP access to 1 directory that holds all images for all products of the vendor. 1 product has multiple images: variations in size and variations in display of the product.

There is no "list" (XML, CSV, database..) by which I am able to know "what's new". For now the only way I see is to grab all filenames and compare them with the ones in my DB.

The last check counted 998,283 files in that directory. Each product has multiple variations, and there is no documentation of how they are named.

I did an initial grab of the filenames, compared them with my products, and saved them in a database table for "images" with their filenames and modification dates (taken from the files).

The next step is to check for "new ones".

What I am doing now is:

// get the file list
foreach ($this->getFilenamesFromFtp() as $key => $image_data) {
  // I extract data from filenames (product code, size, variation number, extension, ...) so I can store them in a table and use that later as a reference (i.e. I want to use only the large images of a variation, not all sizes)
  $data=self::extractDataFromImage($image_data);
  // checking if filename already exists in DB images
  // if there is DB entry (TRUE) it will do nothing, and if there is none it will continue with insertion in DB
  if($this->checkForFilenameInDb($data['filename'])){
  }
  else{
    $export_codes=$this->export->getProductIds();
    // check if product code is in export table - that is do we really need this image
    if($this->functions->in_array_r($data['product_code'],$export_codes)){
      self::insertImageDataInDb($data);
    } // end if                     
  } // end if check if filename is already in DB
} // end foreach

and my method getFilenamesFromFtp() looks like this:

$filenames = array();
$i=1;
$ftp = $this->getFtpConfiguration();

// set up basic connection
$conn_id = ftp_ssl_connect($ftp['host']);

// login with username and password
$login_result = ftp_login($conn_id, $ftp['username'], $ftp['pass']);

ftp_set_option($conn_id, FTP_USEPASVADDRESS, false);
$mode = ftp_pasv($conn_id, TRUE);
ftp_set_option($conn_id, FTP_TIMEOUT_SEC, 180);

//Login OK ?
if ((!$conn_id) || (!$login_result) || (!$mode)) { //  || (!$mode)
   die("FTP connection has failed !");
}
else{
  // I get all filenames and store them in array
  $files=ftp_nlist($conn_id, ".");
  // I count the number of files in array = the number of files on FTP 
  $nofiles=count($files);
  foreach($files as $filename){
  // the limit I implemented while developing or testing, but in production (current mode) it has to run without limit
  if(self::LIMIT>0 && $i==self::LIMIT){ //!empty(self::LIMIT) &&    
      break;
    }
    else{
      // I get the modification date from the file
      $date_modified = ftp_mdtm($conn_id, $filename);
      
      // I create new array for filenames and date modified so I  can return it and store it in DB
      $filenames[]= array(
         "filename" => $filename,
         "date_modified" => $date_modified
      );
    } // end if LIMIT empty
    $i++;
  } // end foreach
  // close the connection
  ftp_close($conn_id);
  return $filenames;
}

The problem is that the script takes a long time. The slowest part I have found so far is in getFilenamesFromFtp(), where I build the array:

      $filenames[]= array(
         "filename" => $filename,
         "date_modified" => $date_modified
      );

That part has been running for 4 hours so far and is still not done.

While writing this I had an idea: drop the "date modified" lookup from the initial pass and fetch it later, only for images I actually plan to store in the DB.

I will update this question as soon as I am done with this change and test :)


There are 2 best solutions below

2
On

Processing a million filenames will take time. However, I see no reason to store those filenames (and date_modified) in an array; why not process each filename directly?

Also, instead of completely processing a filename, why not store it in a database table first? Then you can do the real processing later. By splitting the task in two, retrieval and processing, it becomes more flexible. For instance, you don't need to do a new retrieval if you want to change the processing.
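One way to read the suggestion above, as a rough sketch rather than the asker's actual code: stream the names through a generator, check each one against an in-memory set of filenames already stored in the DB, and look up the modification time only for names that are actually new. The function name `findNewImages`, the `$knownFilenames` set, and the `$fetchMtime` callback (which would wrap something like `ftp_mdtm()`) are assumptions made for illustration.

```php
<?php

/**
 * Yield only filenames that are not yet known, one at a time.
 *
 * $remoteNames    - any iterable of filenames (e.g. the result of ftp_nlist()).
 * $knownFilenames - hypothetical set built from the DB: filename => anything non-null.
 * $fetchMtime     - hypothetical callback returning the modification time;
 *                   it is only called for new files, so known files cost
 *                   no extra lookup.
 */
function findNewImages(iterable $remoteNames, array $knownFilenames, callable $fetchMtime): Generator
{
    foreach ($remoteNames as $filename) {
        // isset() on an array key is a hash lookup, not a DB query per file.
        if (isset($knownFilenames[$filename])) {
            continue;
        }
        yield [
            'filename'      => $filename,
            'date_modified' => $fetchMtime($filename),
        ];
    }
}

// Usage with stand-in data instead of a real FTP connection.
$known  = array_flip(['a_1.jpg', 'b_1.jpg']);   // filenames already in the DB
$remote = ['a_1.jpg', 'b_1.jpg', 'c_1.jpg'];    // what the server lists now
$calls  = 0;
$fetchMtime = function (string $name) use (&$calls): int {
    $calls++;              // count how often the "expensive" lookup runs
    return 1700000000;     // placeholder timestamp
};

$newImages = iterator_to_array(findNewImages($remote, $known, $fetchMtime), false);
```

In this sketch the expensive lookup runs once, for `c_1.jpg` only. Whether loading all known filenames into memory is acceptable depends on the table size; a staging table with a unique index on the filename would be the database-side equivalent.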

6
On

If the objective is to just display new files on the webpage:

  • Store only the highest file created/modified time in the DB.
  • For the next batch, fetch that stored time and compare it against the created/modified time of each file; anything newer is new. This keeps the app lightweight. You can use filemtime for this.
  • Then take the highest filemtime among the files in the current iteration, store it in the DB, and repeat the steps above on the next run.
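The steps above could be sketched roughly as follows. The function name `filesNewerThan` and the array shape are assumptions for illustration, and the timestamps are stand-in values rather than real output from `filemtime()` or `ftp_mdtm()`.

```php
<?php

/**
 * Return the files modified after $lastSeen, plus the new high-water mark.
 *
 * $files    - iterable of ['filename' => string, 'date_modified' => int].
 * $lastSeen - highest modification time stored after the previous run
 *             (hypothetical value that would be loaded from the DB).
 */
function filesNewerThan(iterable $files, int $lastSeen): array
{
    $newFiles = [];
    $highest  = $lastSeen;

    foreach ($files as $file) {
        if ($file['date_modified'] > $lastSeen) {
            $newFiles[] = $file['filename'];
        }
        // Track the highest time seen in this run for the next comparison.
        $highest = max($highest, $file['date_modified']);
    }

    return ['new' => $newFiles, 'last_seen' => $highest];
}

// Stand-in listing with made-up timestamps.
$listing = [
    ['filename' => 'a_1.jpg', 'date_modified' => 100],
    ['filename' => 'b_1.jpg', 'date_modified' => 250],
    ['filename' => 'c_1.jpg', 'date_modified' => 300],
];

$result = filesNewerThan($listing, 200);
// $result['new'] holds b_1.jpg and c_1.jpg; $result['last_seen'] is 300.
```

Note that this still needs a modification time for every file, so over FTP it only helps if those times can be obtained cheaply. The strict `>` comparison also means a file whose timestamp equals the stored mark would be skipped; whether that matters depends on the timestamp resolution of the server.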

Suggestions:

foreach ($this->getFilenamesFromFtp() as $key => $image_data) {

If the above snippet gets all filenames in an array, you can discard this strategy, since it would consume a lot of memory. Instead, read the files one by one using directory functions as mentioned in this answer; these maintain an internal pointer for the handle and don't load all files at once. Of course, you need to make the linked answer recurse as well if there are nested directories.
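Since the linked answer is not reproduced here, the following is only a minimal sketch of the idea, assuming a locally readable directory: `opendir()` and `readdir()` hand back one entry at a time, and a generator recurses into nested directories. The function name `iterateFiles` and the temporary directory layout are made up for the example; whether the same approach works against the remote FTP directory is not shown here.

```php
<?php

/**
 * Walk a directory tree lazily, yielding one file path at a time.
 * readdir() advances an internal pointer on the handle, so the full
 * listing is never collected into a PHP array by this function.
 */
function iterateFiles(string $dir): Generator
{
    $handle = opendir($dir);
    if ($handle === false) {
        return; // unreadable directory: yield nothing
    }
    try {
        while (($entry = readdir($handle)) !== false) {
            if ($entry === '.' || $entry === '..') {
                continue;
            }
            $path = $dir . DIRECTORY_SEPARATOR . $entry;
            if (is_dir($path)) {
                yield from iterateFiles($path); // recurse into nested directories
            } else {
                yield $path;
            }
        }
    } finally {
        closedir($handle);
    }
}

// Usage against a throwaway local directory (hypothetical layout).
$root = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'img_demo_' . uniqid();
mkdir($root . DIRECTORY_SEPARATOR . 'nested', 0777, true);
touch($root . DIRECTORY_SEPARATOR . 'a_1.jpg');
touch($root . DIRECTORY_SEPARATOR . 'nested' . DIRECTORY_SEPARATOR . 'b_1.jpg');

$found = [];
foreach (iterateFiles($root) as $path) {
    $found[] = basename($path); // process each file as it arrives
}
sort($found);
```

The loop body is where each filename would be checked against the DB, so memory use stays flat regardless of how many files the directory holds.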