PERL / PHP Parsing APACHE Access log


Hi, I already have a WordPress plugin, Strictly System Check (https://wordpress.org/plugins/strictly-system-check/), that lets me know when my server or site goes down. It reports things such as: server load is 2.50, swap is X, RAM usage is X, page load took too long, a non-200 status code when crawling the desired test page, text not found on the page, too many DB connections, too many slow queries, queries run, open connections, queries with no indexes, memory usage, PHP memory usage and so on.

However, I want to be able to parse my Apache access and error log files and link them together to get a clearer picture of what was going on at the time of the downtime, e.g. this page being hit X times, this IP hitting too many times and so on. Then, for a downtime when the server had a load of 3.00, was swapping to disk, used X RAM and the page took 60 seconds to load with an average query wait time of 20 seconds, I could ALSO see:

- The top ten IP addresses hitting the site (with reverse IP and geo)
- The top ten referrers, if possible
- The top ten non-SERP IPs (ignoring a list of known safe bot IPs)
- The last ten errors within the timespan of the downtime, e.g. 10 mins +/-

So I have these questions (and I am a Perl noob, though I can do PHP):

Taking this article on parsing Apache log files as an example: http://www.leancrew.com/all-this/2013/07/parsing-my-apache-logs/

  1. Can I just paste the Perl script straight into Bash to get results?
  2. Can I save it as a file and then build it into the plugin to run on demand, e.g. by calling usage.pl?

The reason for my confusion is that at the top he says he calls it by passing in the number of days, e.g. top5log 25 < apache.log

But then the example of the script is just a paste into BASH

#!/usr/bin/python

import re
import sys

So, as a newbie, how do I take my new .pl Perl script and save it somewhere before running it, and how do I run it on demand?

  3. How do I find out my own log file format, as none of the ones I can see (e.g. the common log format) match the one I have?

An example line from my log file is

12.201.2.12 - - [25/Nov/2014:03:20:01 +0000] "GET /wp-cron.php?doing_wp_cron HTTP/1.1" 200 26 "-" "StrictlyCron" 2/2971379

And how do I find a) where my format is defined (checked in Apache config) and b) what each part relates to? E.g. my guess at the fields, followed by 2 lines from the Apache log file:

Remote IP - - [Date of Request] [VERB Requested page/file] [status] ? [?] [user-agent] secs/ms (guess)

207.46.13.19 - - [25/Nov/2014:03:20:36 +0000] "GET /2014/08/somepage-of-mine/ HTTP/1.1" 200 18956 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" 1/1457264
5.9.40.98 - - [25/Nov/2014:03:23:44 +0000] "GET /2014/11/somepage/ HTTP/1.1" 200 16653 "-" "Mozilla/5.0 (Windows NT 6.0; rv:13.0) Gecko/20100101 Firefox/13.0.1" 0/901549

So once I have found the format I need to convert and know what each segment means, I just need to modify his script's regular expression:

# Regex for the Apache common log format.
parts = [
r'(?P<host>\S+)',                   # host %h 
r'\S+',                             # ident %l (unused)
r'(?P<user>\S+)',                   # user %u
r'\[(?P<time>.+)\]',                # time %t
r'"(?P<request>.*)"',               # request "%r"
r'(?P<status>[0-9]+)',              # status %>s
r'(?P<size>\S+)',                   # size %b (careful, can be '-')
r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

Now I am used to using regex in most languages but never in Perl, so does r'(?P<name>\S+)' equate to r'(\S+)'? That is:

- ( ) == capture group, i.e. the insides between the ( and the )
- ?P == store group?
- <name> == name to reference the group by, OR can you do it by index e.g. [0] or [2]?
- (?P<name>.*) == the contents of the group, so really "(.*)" is everything between the " and the "

Once I can re-shuffle his regex pattern about to my own format which isn't the common one then I reckon I can work the rest of the code out - just need some pointers on saving and running .pl or PERL scripts.

Also, if I can run shell_exec from my web server, what is the best way of running the Perl script: by the name of the file, or as a long line-by-line delimited file like in the example?

This looks like a good script if I can get it working, seeing as I have AWStats turned off because of CGI security leaks.

Any help would be much appreciated.

Thanks

Rob


There are 2 best solutions below


First, the script in the article is Python, not Perl. You can tell by the #!/usr/bin/python line at the top.

Second, what the article is suggesting is to save the script as a file named "top5log" somewhere in your $PATH, say /usr/local/bin/top5log, and then mark it executable, which you can do by running chmod +x /usr/local/bin/top5log. Once you've done that, you can run the script from anywhere on your system by typing "top5log".

Next, the author is suggesting you run the script like this:

top5log 25 < apache.log

That tells the shell to give the number "25" to the script as the first argument, and sends the contents of apache.log to the script as the script's STDIN.
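To make that concrete, here is a minimal Python sketch of how a script picks up both pieces. It is not the article's script; the function name read_args is made up for illustration, and it assumes the first argument is a whole number of days.

```python
#!/usr/bin/env python3
import sys

def read_args(argv, stream):
    """Return (days, lines) from an argv-style list and an iterable of lines.

    read_args is a hypothetical helper, not part of the article's script.
    """
    days = int(argv[1])                              # the "25" in: top5log 25 < apache.log
    lines = [line.rstrip("\n") for line in stream]   # everything redirected in via "<"
    return days, lines

# Only run when called with an argument, e.g.: top5log 25 < apache.log
if __name__ == "__main__" and len(sys.argv) > 1:
    days, lines = read_args(sys.argv, sys.stdin)
    print("days=%d, lines read=%d" % (days, len(lines)))
```

From PHP you would then call the saved file by name (for example through shell_exec) rather than pasting its contents line by line.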

That should be helpful information about saving and running Python (and Perl) scripts. As far as understanding the regular expression, here's an article about Python and named capturing groups: http://www.regular-expressions.info/named.html.
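As a rough sketch of how the article's pattern could be extended to the sample line in the question: the block below adds one extra part for the trailing "2/2971379" field. Treating that field as seconds/microseconds (Apache's %T/%D) is only a guess based on the sample values, and the group names secs and usecs are made up here. The LogFormat directive in your Apache config is what actually defines the fields. It also shows that a named group can be read either by name or by index.

```python
import re

# The article's parts for the common/combined format, plus one assumed extra field.
parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # ident %l (unused, not captured)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
    r'(?P<secs>\d+)/(?P<usecs>\d+)',    # ASSUMPTION: %T/%D, seconds/microseconds
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

# Sample line copied from the question.
line = ('12.201.2.12 - - [25/Nov/2014:03:20:01 +0000] '
        '"GET /wp-cron.php?doing_wp_cron HTTP/1.1" 200 26 "-" "StrictlyCron" 2/2971379')

m = pattern.match(line)
if m:
    print(m.group('host'))     # access by name
    print(m.group(1))          # access by index: groups are numbered from 1, not 0
    print(m.groupdict())       # all named groups as a dict
```

So (?P<name>...) is an ordinary capture group that also gets a name. group(0) is the whole match, and the plain \S+ for %l has no parentheses, so it is not counted.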

Good luck!


There are plenty of Perl modules on CPAN to parse logs in various formats, for example Logfile::Access:

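use strict;
use warnings;
use Logfile::Access;

# Path to the access log, taken from the first command-line argument.
my $filename = shift @ARGV or die "Usage: $0 access.log\n";

my $log = Logfile::Access->new;

open(my $in, '<', $filename) or die "Cannot open $filename: $!";
while (my $line = <$in>)
{
    $log->parse($line);         # parse one line of the access log
    warn $log->remote_host;     # print the remote host to STDERR
}
close $in;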
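For comparison, the "top ten IP addresses" report from the question can be sketched in Python, the language of the article's script. This is only an illustration: top_hosts is a made-up helper, and it assumes the remote host is the first whitespace-separated field on each line, as in the sample lines above.

```python
from collections import Counter

def top_hosts(lines, n=10):
    """Return the n most frequent remote hosts as (host, hits) pairs.

    Hypothetical helper; assumes the host is the first field of each line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:                      # skip blank lines
            counts[fields[0]] += 1      # first field is the remote host (%h)
    return counts.most_common(n)
```

Any iterable of lines works, so you can pass an open log file, for example top_hosts(open('access.log')), with the file name being a placeholder for your own log path.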