Parsing large log files with ruby

Currently I’m writing a couple of munin plugins at work and for certain components I need to parse log files to get out the numbers so munin can draw its graphs. My particular task is to parse a Pound log file and generate several custom graphs out of it and so the examples in this post will relate to this particular problem. I thought this might be useful anyway as text / log file parsing is quite a common task.

As you might know, munin asks its nodes every 5 minutes for new data and so the simplest approach would be to iterate over all the lines, trying to find the right ones. The first examples are a bit more general though. Instead of the last 5 minutes you can specify a start time and it will emit the next 5 minutes of log output.

For these experiments I concatenated log files from the last 5 days, starting with Feb. 8th which resulted in 218MB or 1.351.440 lines of text. Here is the code for the simple parser:

Here is how it is invoked:

time ruby parser_1.rb big.log "2011-02-09 01:10:00"
real	0m11.620s
user	0m11.551s
sys	0m0.066s

Note that I took a very early date and still it takes up 11 seconds to emit the right lines. This is far beyond reasonable. Even if I set the the starting time to 2011-02-08 20:00:00 it still takes 4.6 seconds at 100% cpu time. This is not something you want to run every 5 minutes on your production server.

The next approach is to use a state file. Every day after logrotate has done its duty, the logfile is small and the initial search is fast and cheap. On each run the last useful line number is written to the state file so that on the next run, this amount of lines can be skipped and only last 5 minutes of log output has to be searched instead of the entire log file over and over again.

This is the corresponding code which looks quite similar to the first approach:

It is invoked just like the first one and also the result for the first run is quite similar:

Here is how it is invoked:

time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real	0m11.861s
user	0m11.789s
sys	0m0.069s

On the second run its quite faster though because of the line skipping:

time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real	0m0.124s
user	0m0.096s
sys	0m0.025s

This considerably faster and almost useable for production. In other examples you see the use of gets() or readline() but in my benchmarks I couldn’t find any difference to the each_line() method in terms of performance. Also skipping on lines via the next keyword is fast and nice to read compared to lengthy if-clauses.

The next thing I tried was to remember the byte offset instead of the number of lines to skip. I already knew about the IO#lineno() method which would return the current line number of a given file but just recently I found out about thr IO#tell() method which returns the current byte offset. Maybe seeking byte wise would increase the speed some more I though – but unfortunately it didn’t.

First run:

time ruby parser_3.rb big.log "2011-02-09 01:10:00"
real	0m11.343s
user	0m11.296s
sys	0m0.054s

Second run:

time ruby parser_3.rb big.log "2011-02-09 01:10:00"
real	0m0.113s
user	0m0.101s
sys	0m0.009s

It turns out that seeking and skipping line wise is almost just as fast as byte wise. The time consuming part is the creation of Time objects and their comparison. As we all know, object creation is expensive in ruby and this is really the bottleneck here. So unfortunately this can’t be optimized in this case as I need to compare times. I had another log file where I could just compare Strings to filter for the proper lines of code and that was a lot faster but the apache common log timestamp doesn’t work for this.

After this insight I tried to optimize the search so that the first run would be a lot faster and wouldn’t take more than a second. Obviously searching line by line wasn’t the fastest way and so I tried to read in bigger chunks at once, matching for one of the correct timestamps and then narrowing it down to the the first line of the the time range. The lines of code increased for this example but the important stuff to look at is in the search method.

There is probably lots of room for making this a little shorter and more beautiful but it worked for now and I didn’t want to spend the whole day with this (which I did in the end). Anyway, how well does it perform?

First run:

time ruby parser_4.rb big.log "2011-02-09 01:10:00"
real	0m0.291s
user	0m0.255s
sys	0m0.032s

As you can see, the initial search is more than 35 times faster than the previous attempts. I experimented with the parameters and choosing a large buffer size increased the performance. Usually you find values like 4096 or 8192 bytes. I haven’t tried many combinations but somebody suggested to take a rather large value and it worked well enough.

Second run:

time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real	0m0.173s
user	0m0.155s
sys	0m0.016s

Now the second run is slower than before but I suspect that the search implementation isn’t narrowing down to the exact byte and so the emit method has more lines left to go through. As mentioned before, I’m sure this can be optimized a little further.

The last step for me was to make it work for my munin task. In the end I’m just interested in the last 5 minutes of the log file and I don’t care how large it is and therefor searching backwards should boost the overall performance to a desirable degree and render the state file obsolete. Again this is prototype code and a little too simple when it comes to matching timestamps but nonetheless shows how effective this method is. My log file ends at 2011-02-12 09:10:00.

time ruby parser_5.rb big.log "2011-02-12 09:10:00"
real	0m0.043s
user	0m0.026s
sys	0m0.014s

Even without a state file the search and the output now only take 43 milliseconds. Best of all this class yields the lines to a block which is very useful for my munin modules.

The most valuable lessons for this came right from reading through the documention of IO (same with prettier template). Before this I had no idea about seeking through files in any direction. I learned about a couple of new useful methods and found a good use case for StringIO. So I encourage you to do the same! Lastly is also great to know that Ruby is capable enough for tasks like these.

On my way I also found a gem called file-tail which works like the command line tool “tail” but is written entirely in ruby. Its code is also a good source of inspiration (and probably better factored than my prototyping code).

Please let me know if you have additions, corrections or further optimizations for my examples! I’d really like to know about even better ways to do that.

[UPDATE]

After having another look at the task at hand and considering the hints from the comments I wrote another version today which is a lot shorter again, requires less dependencies and is executing very efficiently. This is because I was able to get a more usable timestamp in the log file and I’m now able to compare strings instead of times. So here is the code:

I made a gem out of this which isn’t really that useful for the general public at the moment as its not really that flexible.

DISCLAIMER: I have an SSD – so accessing and seeking files of that size may take longer on regular disks although tests on our server do not suggest a meaningful slowdown. Also, my test log file is way larger than it ever gets in reality thanks to logrotate. I just wanted to test it on a really large file.

[UPDATE]

If you have to parse lots of timestamps it can slow down your code considerably. There is however a ruby gem called home_run which claims to be 20-200 times faster than Rubys own Time class.

5 Responses to Parsing large log files with ruby

  1. Nice write up! I wish I had known all of this about a year ago – had a project at my day job involving parsing large files and I didn’t take the time to look into the details as you have here.

    It would be fun to look at the MRI interpreter C code and see what exactly is happening and why exactly there is such a large improvement from your first version to the last. cheers.

  2. For exercises like these perftools.rb (https://github.com/tmm1/perftools.rb/) is your friend. I never try to optimize before running my code through a profiler, it would just be random guessing.

    Out of curiosity, I ran your code through perftools.rb and this is what I found:

    * 33% of the time was spend in IO#tell
    * 23% of the time was spent in IO#each_line
    * 18% of the time was spent in IO#write
    * 18% of the time was spent in the garbage collector, that is expected (see below)
    * the rest is insignificant (at this point, but will be interesting if you optimize away the ones above)

    Basically, your code spends almost all of it’s time coordinating the reading of data from disk. One way of interpreting this is that it’s slow to read from disk, but that is not the case (and you have an SSD, so you shouldn’t expect Ruby to be IO bound). What it actually says is that IO#tell is incredibly slow, and reading a file line by line is also very slow.

    What I’ve learnt from parsing lots and lots of log files is:

    * Don’t read files line by line (and definitely not byte by byte), gobble it all up in memory with File.read and use split to split into lines using String#split. Unless you have a seriously underpowered machine, temporarily using a gig or so of memory is not an issue. On the other hand, in your particular use case the machine isn’t used primarily for log parsing, so you might not want to do that. Reading chunks of 20 or 50 megs shouldn’t be much of a problem though. Avoid IO#tell, #seek, etc. unless called once or twice.

    * Avoid Date and DateTime like the plague (along with Time.parse). If you really have to parse dates use Time, or home_run. Most of the time you don’t even need to parse dates, so think about how you’re going to use the information, is a string like “20110213″ just as good? Else try to extract the date, month, etc. and use Time.new.

    * Avoid creating lots of objects. In a log file parser the GC will take around 20-30% of the running time, because of all the temporary objects being created (mostly strings). Move literal strings to constants, if you run a piece of code like str.split(‘,’) a million times, that will result in a million temporary strings with the content ‘,’. The same goes for literal regexes, ranges and floats (but not fixnums and symbols).

    * Don’t use regular expressions. Matching a regex to each line is pretty slow. It’s not the slowest thing you can do, but it’s not cheap in any way. Never use an unanchored regex like in your code above, that is really bad. Use String#split or #index and #slice to extract the parts of the line you’re interested in.

    * Know your algorithms. If, like in your case, for example, need to search in a sorted data set, use binary search. Start in the middle and continue checking the middle of the remaining lines until you’ve found what you’re looking for.

    * If you can, control the input. If you can change the log format, opt for things like Unix timestamps instead of complicated date formats (the one in the Apache common log format is especially atrocious). It can make a ton of difference.

    Some of these points go against each other (for example, using String#split will create lots of temporary strings), and it’s down to balance, they don’t apply in all cases. Use a profiler and go for the hotspots.

    I think you should be able to cut the time by an order of magnitude. Read the whole file into memory, use binary search to find the position to start reading from, extract the date with an anchored regex or String#split, don’t use Time.parse to parse the date (extract the year, month, etc. using String#split, use a hash to map the date name to a number and use Time.new(year, month, …) instead), make sure not to use too many temporary objects (including literal strings).

    • Hey, thank you so much for the input! I’m definitely going to try to optimize it considering some of your suggestions. For the particular munin use case, where I’m just going for the last 5 minutes the 128kB buffer size is sufficient most of the which means I’m only seeking / telling one or two times. Also in this case I’m really bound to the apache timestamp here which makes it a little messy. As stated in the post, I had other log files where comparing strings of dates was sufficient and therefor orders of magnitudes faster.

      I will post my results! Thanks again!

  3. Just a tip: File#each is more than fast enough. Your bottleneck here is the parsing.

    Here’s some benchmarks for reading a 500k logfile with NO parsing:

    [File#each]
    elapsed time: 0.00252s
    [File#reverse_each]
    elapsed time: 0.00285s
    [File#read with 128k blocks, then enumerate each line in the data]
    elapsed time: 0.00262s

    Basically, you shouldn’t do the block reading. It’s pointless. Plus, with block reading, you have to worry about the edge case where a block read cuts a line in half, so you have to splice the last line of the previous block read onto the beginning of the next block.

    Also note: reverse_each (reading a file backwards) is almost as fast as each. :)

  4. For some deep craziness, check out Tim Bray’s Widefinder I & II compentions, parsing large log files, the first was just a GB or two, the second one was 45GB, not reading that in to memory (well, back then), on a Sun T1000 (32 hw threads) with a big disk array. Tim is a big fan of Ruby, but not the idea choice here.

    http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder

    http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>