Currently I’m writing a couple of munin plugins at work, and for certain components I need to parse log files to extract the numbers munin needs to draw its graphs. My particular task is to parse a Pound log file and generate several custom graphs from it, so the examples in this post relate to that problem. I thought this might be useful anyway, as text / log file parsing is quite a common task.
As you might know, munin asks its nodes for new data every 5 minutes, so the simplest approach would be to iterate over all the lines, trying to find the right ones. The first examples are a bit more general though: instead of the last 5 minutes, you specify a start time and the parser emits the following 5 minutes of log output.
For these experiments I concatenated log files from the last 5 days, starting with Feb. 8th, which resulted in 218 MB or 1,351,440 lines of text. Here is the code for the simple parser:
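In essence it looks like this (a minimal sketch; the 15-character timestamp slice and the syslog-style format handled by Time.parse are assumptions about my Pound log, adjust them to whatever your lines actually look like):

require 'time'

# Naive approach: walk the whole file line by line, parse each timestamp
# and emit everything that falls into the 5 minutes after the given start.
class SimpleParser
  INTERVAL = 5 * 60 # seconds

  def initialize(file, start_time)
    @file = file
    @from = Time.parse(start_time)
    @to   = @from + INTERVAL
  end

  def emit
    File.open(@file) do |f|
      f.each_line do |line|
        # assumed: lines start with something like "Feb  9 01:10:00"
        time = Time.parse(line[0, 15]) rescue next
        next  if time < @from
        break if time > @to
        yield line
      end
    end
  end
end

SimpleParser.new(ARGV[0], ARGV[1]).emit { |line| puts line } if __FILE__ == $0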
Here is how it is invoked:
time ruby parser_1.rb big.log "2011-02-09 01:10:00"
real 0m11.620s
user 0m11.551s
sys 0m0.066s
Note that I took a very early date and still it takes over 11 seconds to emit the right lines. This is far beyond reasonable. Even if I set the starting time to 2011-02-08 20:00:00 it still takes 4.6 seconds at 100% CPU. This is not something you want to run every 5 minutes on your production server.
The next approach is to use a state file. Every day after logrotate has done its duty, the log file is small and the initial search is fast and cheap. On each run the last useful line number is written to the state file, so that on the next run this many lines can be skipped and only the last 5 minutes of log output have to be searched instead of the entire log file over and over again.
This is the corresponding code which looks quite similar to the first approach:
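A sketch along those lines; the state file name and the timestamp handling are again assumptions for illustration:

require 'time'

# State-file approach: a small file next to the log remembers the last useful
# line number, so the next run can fast-forward over everything already seen.
class StateFileParser
  INTERVAL = 5 * 60

  def initialize(file, start_time, state_file = "#{file}.state")
    @file       = file
    @state_file = state_file
    @from       = Time.parse(start_time)
    @to         = @from + INTERVAL
  end

  def emit
    skip        = File.exist?(@state_file) ? File.read(@state_file).to_i : 0
    last_useful = skip
    lineno      = 0
    File.open(@file) do |f|
      f.each_line do |line|
        lineno += 1
        next if lineno <= skip                     # skip lines from earlier runs
        time = Time.parse(line[0, 15]) rescue next # timestamp format is an assumption
        next  if time < @from
        break if time > @to
        last_useful = lineno                       # remember the last emitted line
        yield line
      end
    end
    File.open(@state_file, 'w') { |s| s.write(last_useful) }
  end
end

StateFileParser.new(ARGV[0], ARGV[1]).emit { |line| puts line } if __FILE__ == $0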
It is invoked just like the first one, and the result for the first run is also quite similar:
time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real 0m11.861s
user 0m11.789s
sys 0m0.069s
The second run is a lot faster though, thanks to the line skipping:
time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real 0m0.124s
user 0m0.096s
sys 0m0.025s
This is considerably faster and almost usable for production. In other examples you see the use of gets() or readline(), but in my benchmarks I couldn’t find any performance difference compared to the each_line() method. Also, skipping lines via the next keyword is fast and nicer to read than lengthy if-clauses.
The next thing I tried was to remember the byte offset instead of the number of lines to skip. I already knew about the IO#lineno() method, which returns the current line number of a given file, but just recently I found out about the IO#tell() method, which returns the current byte offset. Maybe seeking byte-wise would increase the speed some more, I thought – but unfortunately it didn’t.
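A sketch of that variant; the state file name and the timestamp slice are the same assumptions as before:

require 'time'

# Byte-offset approach: the state file stores the result of IO#tell and the
# next run jumps straight there with IO#seek instead of counting lines.
class ByteOffsetParser
  INTERVAL = 5 * 60

  def initialize(file, start_time, state_file = "#{file}.offset")
    @file       = file
    @state_file = state_file
    @from       = Time.parse(start_time)
    @to         = @from + INTERVAL
  end

  def emit
    offset      = File.exist?(@state_file) ? File.read(@state_file).to_i : 0
    last_offset = offset
    File.open(@file) do |f|
      f.seek(offset)                               # jump to where the last run stopped
      f.each_line do |line|
        time = Time.parse(line[0, 15]) rescue next
        next  if time < @from
        break if time > @to
        last_offset = f.tell                       # byte position after the emitted line
        yield line
      end
    end
    File.open(@state_file, 'w') { |s| s.write(last_offset) }
  end
end

ByteOffsetParser.new(ARGV[0], ARGV[1]).emit { |line| puts line } if __FILE__ == $0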
First run:
time ruby parser_3.rb big.log "2011-02-09 01:10:00"
real 0m11.343s
user 0m11.296s
sys 0m0.054s
Second run:
time ruby parser_3.rb big.log "2011-02-09 01:10:00"
real 0m0.113s
user 0m0.101s
sys 0m0.009s
It turns out that seeking and skipping line-wise is almost as fast as byte-wise. The time-consuming part is the creation of Time objects and their comparison. As we all know, object creation is expensive in Ruby, and this is really the bottleneck here. So unfortunately this can’t be optimized in this case, as I need to compare times. I had another log file where I could just compare Strings to filter for the proper lines, and that was a lot faster, but the Apache common log timestamp doesn’t work for this.
After this insight I tried to optimize the search so that the first run would be a lot faster and wouldn’t take more than a second. Obviously searching line by line wasn’t the fastest way, so I tried to read bigger chunks at once, matching for one of the correct timestamps and then narrowing it down to the first line of the time range. The lines of code increased for this example, but the important stuff to look at is in the search method.
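A sketch of the chunked search; the buffer size, the minute-granular timestamp prefix and the fact that timestamps straddling a chunk boundary are not handled are all simplifications:

require 'time'

# Chunked search: read the file in large blocks, look for the first occurrence
# of the wanted start timestamp, then emit line by line from that position.
class ChunkedParser
  INTERVAL   = 5 * 60
  CHUNK_SIZE = 512 * 1024 # a rather large buffer performed best for me

  def initialize(file, start_time)
    @file = file
    @from = Time.parse(start_time)
    @to   = @from + INTERVAL
  end

  # Returns the byte offset of the first line carrying the start timestamp.
  def search
    # assumed: syslog-style minute prefix such as "Feb  9 01:10"
    pattern = Regexp.new(Regexp.escape(@from.strftime('%b %e %H:%M')))
    File.open(@file, 'rb') do |f|
      chunk_start = 0
      while chunk = f.read(CHUNK_SIZE)
        if match = pattern.match(chunk)
          # narrow down to the beginning of the matching line within the chunk
          line_start = chunk.rindex("\n", match.begin(0)) || -1
          return chunk_start + line_start + 1
        end
        chunk_start += chunk.bytesize
      end
    end
    nil
  end

  def emit
    offset = search or return
    File.open(@file) do |f|
      f.seek(offset)
      f.each_line do |line|
        time = Time.parse(line[0, 15]) rescue next
        break if time > @to
        yield line if time >= @from
      end
    end
  end
end

ChunkedParser.new(ARGV[0], ARGV[1]).emit { |line| puts line } if __FILE__ == $0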
There is probably lots of room to make this a little shorter and more beautiful, but it worked for now and I didn’t want to spend the whole day on this (which I did in the end). Anyway, how well does it perform?
First run:
time ruby parser_4.rb big.log "2011-02-09 01:10:00"
real 0m0.291s
user 0m0.255s
sys 0m0.032s
As you can see, the initial search is more than 35 times faster than the previous attempts. I experimented with the parameters, and choosing a large buffer size increased the performance. Usually you find values like 4096 or 8192 bytes. I haven’t tried many combinations, but somebody suggested taking a rather large value and it worked well enough.
Second run:
time ruby parser_4.rb big.log "2011-02-09 01:10:00"
real 0m0.173s
user 0m0.155s
sys 0m0.016s
Now the second run is slower than before, but I suspect the search implementation isn’t narrowing down to the exact byte, so the emit method has more lines left to go through. As mentioned before, I’m sure this can be optimized a little further.
The last step for me was to make it work for my munin task. In the end I’m only interested in the last 5 minutes of the log file, no matter how large it is, and therefore searching backwards should boost the overall performance to a desirable degree and render the state file obsolete. Again, this is prototype code and a little too simple when it comes to matching timestamps, but it nonetheless shows how effective this method is. My log file ends at 2011-02-12 09:10:00.
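A sketch of the backwards search; as above, the chunk size, the minute-granular prefix and ignoring chunk boundaries are simplifications of what the real plugin has to handle:

require 'time'

# Backwards search: start at the end of the file and read chunks towards the
# beginning until the wanted start timestamp shows up, then emit from the
# beginning of that line. Prototype-level code.
class ReverseParser
  INTERVAL   = 5 * 60
  CHUNK_SIZE = 512 * 1024

  def initialize(file, start_time)
    @file = file
    @from = Time.parse(start_time)
    @to   = @from + INTERVAL
  end

  def emit
    pattern = Regexp.new(Regexp.escape(@from.strftime('%b %e %H:%M')))
    File.open(@file, 'rb') do |f|
      pos = f.stat.size
      loop do
        pos = [pos - CHUNK_SIZE, 0].max
        f.seek(pos)
        chunk = f.read(CHUNK_SIZE)
        if match = pattern.match(chunk)
          line_start = chunk.rindex("\n", match.begin(0)) || -1
          f.seek(pos + line_start + 1)
          f.each_line do |line|
            time = Time.parse(line[0, 15]) rescue next
            break if time > @to
            yield line if time >= @from
          end
          return
        end
        break if pos == 0   # reached the beginning without finding the timestamp
      end
    end
  end
end

ReverseParser.new(ARGV[0], ARGV[1]).emit { |line| puts line } if __FILE__ == $0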
time ruby parser_5.rb big.log "2011-02-12 09:10:00"
real 0m0.043s
user 0m0.026s
sys 0m0.014s
Even without a state file, the search and the output now only take 43 milliseconds. Best of all, this class yields the lines to a block, which is very useful for my munin plugins.
The most valuable lessons here came straight from reading through the documentation of IO (the same with a prettier template). Before this I had no idea about seeking through files in any direction. I learned about a couple of new useful methods and found a good use case for StringIO. So I encourage you to do the same! Lastly, it’s also great to know that Ruby is capable enough for tasks like these.
On my way I also found a gem called file-tail, which works like the command line tool “tail” but is written entirely in Ruby. Its code is also a good source of inspiration (and probably better factored than my prototyping code).
Please let me know if you have additions, corrections or further optimizations for my examples! I’d really like to know about even better ways to do that.
[UPDATE]
After having another look at the task at hand and considering the hints from the comments, I wrote another version today which is a lot shorter again, requires fewer dependencies and executes very efficiently. This is because I was able to get a more usable timestamp into the log file, and I’m now able to compare strings instead of times. So here is the code:
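A minimal sketch of the string-comparison idea (not the gem’s actual code); the ISO-style timestamp at the start of each line and the 19-character slice are assumptions about the adjusted log format, and in practice this combines with the seeking shown above:

# With a timestamp like "2011-02-12 09:10:00" at the start of each line,
# lexicographic order equals chronological order, so plain String comparison
# replaces all the Time object creation.
class FastParser
  def initialize(file, from, to)
    @file, @from, @to = file, from, to
  end

  def each_line_in_range
    File.open(@file) do |f|
      f.each_line do |line|
        stamp = line[0, 19]        # "YYYY-MM-DD HH:MM:SS"
        next  if stamp < @from
        break if stamp > @to
        yield line
      end
    end
  end
end

# e.g. the 5 minutes leading up to the end of my test file:
# FastParser.new('big.log', '2011-02-12 09:05:00', '2011-02-12 09:10:00')
#           .each_line_in_range { |line| puts line }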
I made a gem out of this, which isn’t really that useful for the general public at the moment as it’s not very flexible.
DISCLAIMER: I have an SSD – so accessing and seeking files of that size may take longer on regular disks, although tests on our server do not suggest a meaningful slowdown. Also, my test log file is way larger than it would ever get in reality thanks to logrotate. I just wanted to test it on a really large file.
[UPDATE]
If you have to parse lots of timestamps, it can slow down your code considerably. There is however a Ruby gem called home_run, a C implementation of Ruby’s Date and DateTime classes, which claims to be 20-200 times faster than the standard library versions.