Monday, September 12, 2011

Making a scalable Apache access logs analyzer in Java

I'm thinking about writing a system for handling log files, especially Apache Tomcat access logs with the processing time field enabled. It will mainly be used to analyze request processing time and requests per second.

It doesn't have to be real time, as its primary goal is to show how performance changes over time.

The system might get hold of the access logs over HTTP, where an agent on the server watches a directory for newly rotated log files and submits them to the analyzer using HTTP PUT or POST.

Making it a REST web service with path parameters for system, server and log file per submission allows segmentation per server and system. It also provides a well-known ID for each log file entry, allowing the agent to resubmit a log file in case of failure or updates.
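As a sketch of what that service could look like (assuming JAX-RS; the path layout, class name and method are just illustrations), the path parameters map directly to the segmentation I want:

import java.io.InputStream;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

// Hypothetical JAX-RS resource; the system/server/logfile path segments
// give each submitted log file a stable, well-known ID, so the agent can
// simply PUT the same file again on failure or update.
@Path("/logs/{system}/{server}/{logfile}")
public class LogSubmissionResource {

    @PUT
    public Response submitLogFile(@PathParam("system") String system,
                                  @PathParam("server") String server,
                                  @PathParam("logfile") String logfile,
                                  InputStream body) {
        // store the raw (possibly gzipped) log file under system/server/logfile
        // and flag it as pending analysis
        return Response.status(Response.Status.CREATED).build();
    }
}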

For all this, I need some functionality:

For the log entry parser, I can use StringTokenizer to read each field of a log line.
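A minimal sketch of such a parser could look like this, assuming the log pattern is %h %l %u %t "%r" %s %b %D with Tomcat's processing time (in milliseconds) appended as the last field; the class and field names are my own invention:

import java.util.StringTokenizer;

// Minimal parser sketch for one access log line in the assumed pattern
// %h %l %u %t "%r" %s %b %D. Brackets and quotes are left in place here.
public class AccessLogEntry {
    public final String host;
    public final String timestamp;
    public final String request;
    public final int status;
    public final long processingMillis;

    private AccessLogEntry(String host, String timestamp, String request,
                           int status, long processingMillis) {
        this.host = host;
        this.timestamp = timestamp;
        this.request = request;
        this.status = status;
        this.processingMillis = processingMillis;
    }

    public static AccessLogEntry parse(String line) {
        StringTokenizer tok = new StringTokenizer(line, " ");
        String host = tok.nextToken();
        tok.nextToken();                                              // identd, usually "-"
        tok.nextToken();                                              // remote user, usually "-"
        String timestamp = tok.nextToken() + " " + tok.nextToken();  // [date:time zone]
        String request = tok.nextToken() + " " + tok.nextToken()
                + " " + tok.nextToken();                              // "METHOD URI PROTOCOL"
        int status = Integer.parseInt(tok.nextToken());
        tok.nextToken();                                              // bytes sent, may be "-"
        long processingMillis = Long.parseLong(tok.nextToken());     // %D, last field
        return new AccessLogEntry(host, timestamp, request, status, processingMillis);
    }
}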

I also need some statistics reporters, multiple per log file. For graphing, I have some Perl scripts from an earlier project that can come in handy. I might store raw output data in RRD files, for example with rrd4j, for later reporting. This is needed since I can't anticipate every report up front and will most likely need new ones after log files start flowing in.
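For the RRD part, a minimal rrd4j sketch might look like the following; the file name, datasource name and archive layout are just assumptions to illustrate feeding one aggregated response time value per minute:

import org.rrd4j.ConsolFun;
import org.rrd4j.DsType;
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.RrdDef;
import org.rrd4j.core.Sample;

public class ResponseTimeRrd {
    public static void main(String[] args) throws Exception {
        // define an RRD with a 60-second step and one GAUGE datasource for average response time
        RrdDef def = new RrdDef("responsetime.rrd", 60);
        def.addDatasource("avg_ms", DsType.GAUGE, 120, 0, Double.NaN);
        def.addArchive(ConsolFun.AVERAGE, 0.5, 1, 1440);  // one day of per-minute averages
        def.addArchive(ConsolFun.AVERAGE, 0.5, 60, 720);  // 30 days of hourly averages
        RrdDb rrd = new RrdDb(def);

        // feed one aggregated value per time step, derived from the parsed log entries
        Sample sample = rrd.createSample();
        sample.setTime(System.currentTimeMillis() / 1000);
        sample.setValue("avg_ms", 123.0);
        sample.update();

        rrd.close();
    }
}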

Processing log files and outputting different analysis reports is the perfect case for MapReduce, and Hadoop might be a nice choice as a distributed architecture. The Cascading framework that sits on top of Hadoop looks very cool and works more like the UNIX way of piping and filtering.
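As an illustration (not a full job setup), a mapper/reducer pair in plain Hadoop could emit the processing time per request and average it; AccessLogEntry is the hypothetical parser sketched above, and the two classes would live in their own files:

// ProcessingTimeMapper.java -- emits (request, processing time) per access log line
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProcessingTimeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        AccessLogEntry entry = AccessLogEntry.parse(line.toString()); // parser sketched earlier
        context.write(new Text(entry.request), new LongWritable(entry.processingMillis));
    }
}

// AverageTimeReducer.java -- averages the processing time per request
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageTimeReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text request, Iterable<LongWritable> times, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (LongWritable t : times) { sum += t.get(); count++; }
        context.write(request, new LongWritable(count == 0 ? 0 : sum / count));
    }
}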

To control the analysis jobs I need a scheduler, and Quartz might do the job. New reports are generated periodically, but only if new log files have been submitted. I have to find some way to persist the job details, to allow downtime without losing data.
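A minimal Quartz sketch could look like this; the job and trigger names are made up, and persisting job details across downtime would mean configuring a JDBC job store in quartz.properties instead of the default in-memory store:

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class ReportScheduler {

    // Hypothetical job: check whether new log files have been submitted and,
    // if so, kick off the analysis and report generation.
    public static class GenerateReportsJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // look up newly submitted log files and generate reports here
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(GenerateReportsJob.class)
                .withIdentity("generate-reports", "log-analyzer")
                .build();

        // run at the top of every hour
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("hourly-reports", "log-analyzer")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 * * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}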

I am a little unsure about the storage backend. The log files can be stored compressed and decompressed on the fly when streamed to consumers. For storing raw log files, a simple NoSQL store might do the job very well, such as CouchDB, a document store with streaming attachment support. Parsing, filtering, grouping and mapping data can be done in Java code in separate jobs. If the intermediate values from those jobs are persisted, I don't need advanced queries from the storage backend.
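Storing a rotated, gzipped log file as a CouchDB attachment needs nothing more than an HTTP PUT; here is a rough sketch using plain HttpURLConnection, with the database name and document ID scheme as assumptions:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch: store a gzipped log file as an attachment on a CouchDB document.
// The database name "logfiles" and the docId scheme (e.g. systemA-server1-access_2011-09-12)
// are assumptions for illustration.
public class CouchDbLogStore {
    public static void storeLogFile(String docId, File gzippedLog) throws Exception {
        URL url = new URL("http://localhost:5984/logfiles/" + docId + "/access.log.gz");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/gzip");
        // note: updating an existing document requires passing its current _rev as ?rev=...

        InputStream in = new FileInputStream(gzippedLog);
        OutputStream out = conn.getOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.close();
        in.close();

        if (conn.getResponseCode() >= 300) {
            throw new RuntimeException("CouchDB PUT failed: " + conn.getResponseCode());
        }
        conn.disconnect();
    }
}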

OK, so these were some thoughts on how a log analyzer system might work.

The alternative: Google Analytics

I have also considered using Google Analytics to process the logs for me, but it might not be the best choice since the access logs will contain machine-to-machine communication of XML documents and the like. Even if I choose to do this, I still need to detect rotated log files, parse them and submit them to GA. Also, there are not many good examples of server-side tracking with Google Analytics. I found one with example code, but it focuses on browser-based traffic to track users. I can't seem to find a way to track server processing time in a stable manner with Google Analytics, so it's pretty much a non-option for me.

After writing all these words, it seems to me that it might be a little more complicated than initially hoped for. I might have to take some shortcuts. For example, the sending of log files could be scheduled with a cron job instead of a daemon agent. I have not worked much with RRD files yet, so I don't know if they can hold all the data I need.

I guess the system might be scalable as well, backed by a horizontally scalable database, asynchronous analyzer jobs and standalone log agents on each server.
