Ok, this is the scenario: I need to parse my logs to find how many times image thumbnails have been downloaded without the "large image" page actually being viewed. This is basically a hotlink protection system based on the ratio of "thumb" to "full" image views.
Since the server is constantly bombarded with thumbnail requests, the most efficient solution seems to be buffered Apache logs that are written to disk once every, say, 1 MB, and then parsed periodically.
My question is this: how do I parse an apache log in PHP to save the data, with the following being true:
- The log will be used and updated in real time, and my PHP script needs to be able to read it while this is happening
- The php script will have to "remember" which parts of the log it read, so as not to read the same part twice and skew data
- Memory consumption should be kept to a minimum, since the logs can easily reach 10 GB of data in a few hours
The PHP logger script would be called once every 60 seconds and process as many log lines as it can during that time.
I've tried hacking some code together, but I have trouble keeping memory use to a minimum and keeping track of the file pointer while the file size keeps "moving".
Here's a part of the log:
212.180.168.244 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441268.jpg HTTP/1.1" 200 3072 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
122.53.168.123 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441276.jpg HTTP/1.1" 200 3007 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
143.22.203.211 - - [18/Jan/2012:20:06:57 +0100] "GET /t/0/11/11441/11441282.jpg HTTP/1.1" 200 4670 "-" "Opera/9.80 (Windows NT 6.1; U; pl) Presto/2.10.229 Version/11.60" "-"
Attaching the code for your review here:
<?php
//limit for running it every minute
error_reporting(E_ALL);
ini_set('display_errors',1);
set_time_limit(0);
include(dirname(__FILE__).'/../kframework/kcore.class.php');
$aj = new kajaxpage;
$aj->use_db=1;
$aj->init();
$db=kdbhandler::getInstance();
$d=kdebug::getInstance();
$d->debug=TRUE;
$d->verbose=TRUE;
$log_file = "/var/log/nginx/access.log"; //full path to log file when run by cron
$pid_file = dirname(__FILE__)."/../kframework/cron/cron_log.pid";
//$images_id = array("8308086", "7485151", "6666231", "8343336");
if (file_exists($pid_file)) {
$pid = file_get_contents($pid_file);
$temp = explode(" ", $pid);
$pid_timestamp = $temp[0];
$now_timestamp = strtotime("now");
//if (($now_timestamp - $pid_timestamp) < 90) return;
$pointer = $temp[1];
if ($pointer > filesize($log_file)) $pointer = 0;
}
else $pointer = 0;
$pattern = "/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[^\[]*\[([^\]]*)\][^\"]*\"([^\"]*)\"\s([0-9]*)\s([0-9]*)(.*)/";
$last_time = 0;
$lines_processed=0;
if ($fp = fopen($log_file, "r")) { // read-only is enough; "r+" would require write access to the log
fseek($fp, $pointer);
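while (($raw_line = fgets($fp)) !== false) {
//if ($lines_processed>100) exit;
// a line without a trailing newline is still being written: rewind and pick it up on the next run
if (substr($raw_line, -1) !== "\n") { fseek($fp, -strlen($raw_line), SEEK_CUR); break; }
$lines_processed++;
$log_line = trim($raw_line);
// reset per line so a non-image request is not counted as the previous line's image
$type = null;
$imgid = 0;
// skip empty lines and lines that do not match the expected log format
if (!empty($log_line) && preg_match_all($pattern, $log_line, $matches)) {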
//print_r($matches);
$size = $matches[5][0];
$matches[3][0] = str_replace("GET ", "", $matches[3][0]);
$matches[3][0] = str_replace("HTTP/1.1", "", $matches[3][0]);
$matches[3][0] = str_replace(".jpg/", ".jpg", $matches[3][0]);
if (substr($matches[3][0],0,3) == "/t/") {
$path_parts = explode("/", $matches[3][0]); // end() needs a variable, not a function result
$get = explode("-", end($path_parts));
$imgid = $get[0];
$type='thumb';
}
elseif (substr($matches[3][0], 0, 5) == "/img/") {
$get1 = explode("/", $matches[3][0]);
$get2 = explode("-", $get1[2]);
$imgid = $get2[0];
$type='raw';
}
echo $matches[3][0];
// put here your sql insert or update
$imgid=(int) $imgid;
if (isset($type) && $imgid!=1) {
switch ($type) {
case 'thumb':
//use the second slave in the registry
$sql=$db->slave_query("INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1 ",2);
echo "INSERT INTO hotlink SET thumbviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE thumbviews=thumbviews+1";
break;
case 'raw':
//use the second slave in the registry
$sql=$db->slave_query("INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1",2);
echo "INSERT INTO hotlink SET rawviews=1, imageid=".$imgid." ON DUPLICATE KEY UPDATE rawviews=rawviews+1";
break;
}
}
// $imgid - image ID
// $size - image size
$timestamp = strtotime("now");
if (($timestamp - $last_time) > 30) {
file_put_contents($pid_file, $timestamp . " " . ftell($fp));
$last_time = $timestamp;
}
}
}
file_put_contents($pid_file, (strtotime("now") - 95) . " " . ftell($fp));
fclose($fp);
}
?>
I'd personally send the log entries to a running script instead. Apache allows this if you start the log filename with a pipe (|). If that doesn't work, you can create a FIFO as well (see mkfifo).
The running script (whatever it is) can buffer x lines and do what it needs to do based on that. Reading the data isn't all that hard, and shouldn't be where your bottleneck will be.
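As a rough sketch of that buffering idea, something along these lines could work. The helper names classifyLine() and consumeStream() are made up for illustration, and the URL layout (/t/.../<id>.jpg for thumbs, /img/<id>-... for full images) is assumed from the question's code. With a piped log, Apache writes the entries to the script's standard input, so the stream would be STDIN.

```php
<?php
// Sketch only: classifyLine() and consumeStream() are hypothetical helpers,
// not part of the question's framework.

/**
 * Classify one access-log line.
 * Returns ['thumb'|'raw', imageId], or null if the line is not an image hit.
 * Assumes the URL layout used in the question's script.
 */
function classifyLine(string $line): ?array
{
    if (!preg_match('~"GET (/(?:t|img)/\S*)~', $line, $m)) {
        return null;
    }
    $path = $m[1];
    if (strncmp($path, '/t/', 3) === 0) {
        $id = (int) basename($path);      // "11441268.jpg" -> 11441268
        return $id > 0 ? ['thumb', $id] : null;
    }
    $parts = explode('/', $path);         // ["", "img", "<id>-..."]
    $id = (int) ($parts[2] ?? 0);
    return $id > 0 ? ['raw', $id] : null;
}

/**
 * Read lines from $stream, tally views in memory, and pass the tallies
 * to $flush after every $batchSize lines and once more at end of input.
 */
function consumeStream($stream, int $batchSize, callable $flush): void
{
    $counts = [];   // [imageId => ['thumb' => n, 'raw' => n]]
    $seen = 0;
    while (($line = fgets($stream)) !== false) {
        $hit = classifyLine($line);
        if ($hit !== null) {
            [$type, $id] = $hit;
            $counts[$id][$type] = ($counts[$id][$type] ?? 0) + 1;
        }
        if (++$seen >= $batchSize) {
            if ($counts) {
                $flush($counts);
            }
            $counts = [];
            $seen = 0;
        }
    }
    if ($counts) {
        $flush($counts);   // flush whatever is left when the input ends
    }
}

// Possible usage when Apache pipes the log into this script:
// consumeStream(STDIN, 1000, function (array $counts) { /* write to the database */ });
```

Only one line and the current batch's tallies are held in memory at a time, so the size of the log itself no longer matters, and there is no file pointer to remember.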
I do suspect that you will run into issues with your INSERT statements on the database.
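One way to ease that, sketched here under assumptions, is to write each batch of tallies with a single multi-row statement instead of one query per log line. buildUpsert() is a made-up helper name. The table and column names (hotlink, imageid, thumbviews, rawviews) are taken from the queries in the question, and imageid is assumed to be a unique key, as the original ON DUPLICATE KEY UPDATE already implies.

```php
<?php
// Sketch only: buildUpsert() is a hypothetical helper. Table and column
// names come from the question's queries; imageid is assumed to be unique.

/**
 * Build one multi-row upsert from in-memory tallies shaped like
 * [imageId => ['thumb' => n, 'raw' => n]]. Returns null for an empty batch.
 */
function buildUpsert(array $counts): ?string
{
    if (!$counts) {
        return null;
    }
    $rows = [];
    foreach ($counts as $imageId => $views) {
        // %d forces every value to an integer, so nothing unescaped reaches the SQL
        $rows[] = sprintf('(%d, %d, %d)', $imageId, $views['thumb'] ?? 0, $views['raw'] ?? 0);
    }
    return 'INSERT INTO hotlink (imageid, thumbviews, rawviews) VALUES '
        . implode(', ', $rows)
        . ' ON DUPLICATE KEY UPDATE thumbviews = thumbviews + VALUES(thumbviews),'
        . ' rawviews = rawviews + VALUES(rawviews)';
}

// Possible usage with tallies collected from a batch of log lines:
// $sql = buildUpsert([11441268 => ['thumb' => 3], 7 => ['raw' => 1]]);
// if ($sql !== null) { /* run $sql through your database handler */ }
```

A thousand log lines that hit the same few hundred images then turn into one statement with a few hundred rows rather than a thousand separate queries.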