Parsing large log files with Ruby

Currently I’m writing a couple of munin plugins at work, and for certain components I need to parse log files to extract the numbers munin needs to draw its graphs. My particular task is to parse a Pound log file and generate several custom graphs out of it, so the examples in this post will relate to this particular problem. I thought this might be useful anyway, as text / log file parsing is quite a common task.

As you might know, munin asks its nodes for new data every 5 minutes, so the simplest approach would be to iterate over all the lines, trying to find the right ones. The first examples are a bit more general though: instead of the last 5 minutes, you can specify a start time and they will emit the next 5 minutes of log output.

For these experiments I concatenated the log files from the last 5 days, starting with Feb 8th, which resulted in 218 MB or 1,351,440 lines of text. Here is the code for the simple parser:

Here is how it is invoked:

time ruby parser_1.rb big.log "2011-02-09 01:10:00"
real	0m11.620s
user	0m11.551s
sys	0m0.066s

Note that I took a very early date and still it takes 11 seconds to emit the right lines. This is far beyond reasonable. Even if I set the starting time to 2011-02-08 20:00:00 it still takes 4.6 seconds at 100% CPU. This is not something you want to run every 5 minutes on your production server.

The next approach is to use a state file. Every day after logrotate has done its duty, the logfile is small and the initial search is fast and cheap. On each run the last useful line number is written to the state file, so that on the next run this amount of lines can be skipped and only the last 5 minutes of log output have to be searched instead of the entire log file over and over again.

This is the corresponding code which looks quite similar to the first approach:

It is invoked just like the first one, and the result for the first run is also quite similar:

time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real	0m11.861s
user	0m11.789s
sys	0m0.069s

The second run is a lot faster though, because of the line skipping:

time ruby parser_2.rb big.log "2011-02-09 01:10:00"
real	0m0.124s
user	0m0.096s
sys	0m0.025s

This is considerably faster and almost usable for production. In other examples you see the use of gets() or readline(), but in my benchmarks I couldn’t measure any performance difference compared to the each_line() method. Also, skipping lines via the next keyword is fast and nicer to read than lengthy if-clauses.

The next thing I tried was to remember the byte offset instead of the number of lines to skip. I already knew about the IO#lineno() method, which returns the current line number of a given file, but just recently I found out about the IO#tell() method, which returns the current byte offset. Maybe seeking byte-wise would increase the speed some more, I thought – but unfortunately it didn’t.

First run:

time ruby parser_3.rb big.log "2011-02-09 01:10:00"
real	0m11.343s
user	0m11.296s
sys	0m0.054s

Second run:

time ruby parser_3.rb big.log "2011-02-09 01:10:00"
real	0m0.113s
user	0m0.101s
sys	0m0.009s

It turns out that seeking and skipping line-wise is almost as fast as byte-wise. The time-consuming part is the creation of Time objects and their comparison. As we all know, object creation is expensive in Ruby, and this is really the bottleneck here. So unfortunately this can’t be optimized away in this case, as I need to compare times. I had another log file where I could just compare strings to filter for the proper lines, and that was a lot faster, but the Apache common log timestamp doesn’t work for this.

After this insight I tried to optimize the search so that the first run would be a lot faster and wouldn’t take more than a second. Obviously searching line by line wasn’t the fastest way, so I tried to read in bigger chunks at once, matching for one of the correct timestamps and then narrowing it down to the first line of the time range. The lines of code increased for this example, but the important stuff to look at is in the search method.

There is probably lots of room for making this a little shorter and more beautiful, but it worked for now and I didn’t want to spend the whole day on this (which I did in the end). Anyway, how well does it perform?

First run:

time ruby parser_4.rb big.log "2011-02-09 01:10:00"
real	0m0.291s
user	0m0.255s
sys	0m0.032s

As you can see, the initial search is more than 35 times faster than the previous attempts. I experimented with the parameters, and choosing a large buffer size increased the performance. Usually you find values like 4096 or 8192 bytes. I haven’t tried many combinations, but somebody suggested taking a rather large value and it worked well enough.

Second run:

time ruby parser_4.rb big.log "2011-02-09 01:10:00"
real	0m0.173s
user	0m0.155s
sys	0m0.016s

Now the second run is slower than before, but I suspect the search implementation isn’t narrowing down to the exact byte, so the emit method has more lines left to go through. As mentioned before, I’m sure this can be optimized a little further.

The last step for me was to make it work for my munin task. In the end I’m only interested in the last 5 minutes of the log file, no matter how large it is, and therefore searching backwards should boost the overall performance to a desirable degree and render the state file obsolete. Again, this is prototype code and a little too simple when it comes to matching timestamps, but it nonetheless shows how effective this method is. My log file ends at 2011-02-12 09:10:00.

time ruby parser_5.rb big.log "2011-02-12 09:10:00"
real	0m0.043s
user	0m0.026s
sys	0m0.014s

Even without a state file, the search and the output now only take 43 milliseconds. Best of all, this class yields the lines to a block, which is very useful for my munin modules.

The most valuable lessons came right from reading through the documentation of IO (the same docs are also available with a prettier template). Before this I had no idea about seeking through files in any direction. I learned about a couple of new useful methods and found a good use case for StringIO. So I encourage you to do the same! Lastly, it is also great to know that Ruby is capable enough for tasks like these.

On my way I also found a gem called file-tail, which works like the command line tool “tail” but is written entirely in Ruby. Its code is also a good source of inspiration (and probably better factored than my prototyping code).

Please let me know if you have additions, corrections or further optimizations for my examples! I’d really like to know about even better ways to do that.

[UPDATE]

After having another look at the task at hand and considering the hints from the comments, I wrote another version today which is a lot shorter again, requires fewer dependencies and executes very efficiently. This is because I was able to get a more usable timestamp into the log file, so I can now compare strings instead of times. So here is the code:

I made a gem out of this, which isn’t really useful for the general public at the moment as it’s not very flexible yet.

DISCLAIMER: I have an SSD – so accessing and seeking through files of that size may take longer on regular disks, although tests on our server do not suggest a meaningful slowdown. Also, my test log file is way larger than it ever gets in reality, thanks to logrotate. I just wanted to test it on a really large file.

[UPDATE]

If you have to parse lots of timestamps, it can slow down your code considerably. There is however a Ruby gem called home_run which claims to be 20-200 times faster than Ruby’s own Date/DateTime classes.

4x4x4 Cube – Things are getting more serious

After all those blog posts on how to build an LED cube, it was time to actually do it. A few weeks ago I built a 4x4x4 cube. Although my goal is to build an 8x8x8 or even larger cube, I still wanted another prototype to play around with. In case something goes wrong, I would only lose 64 instead of 512 or more LEDs.

Basically I followed the steps from one of my previous posts for this, building a cube from vertical layers. After building the four layers I drilled 16 holes into a 150x150x15mm piece of MDF and fixed the layers in them using hot glue. This is how it looked:

Side

Side

The next task was the wiring. My favorite for things like this is a good old 40-pin IDE cable. There are four layers with 16 LEDs each, meaning only 20 wires would be needed, but these specific wires are quite thin and, as I mentioned in earlier posts, bigger cubes need more power. Each layer could draw up to 320mA, and while that may even have worked with single wires, I just used two wires for each pin. Take a look at the picture of the final case; you can see the pairs of wires.

Versuchswürfel 01

With some more MDF I built the rest of the case. The cube and the top plate are attached to the rest of the case with two screws, allowing me to take it apart again in case a wire falls off or something.

The best part about it: It worked right away!

The 40-pin IDE connector fits onto my breadboard, which makes it easy to build a test setup. Everything was fine until I tried to multiplex the layers. Since every LED is switched on for only 25% of the time, they appear a lot darker. Darker than I expected. Usually people use ultra-bright LEDs which are way too bright at 100% but still bright enough when switched on for only 1/8th of the time. My LEDs should have been just fine for a 4x4x4 cube though, and so I was getting quite annoyed by this whole issue.

I played around a lot with different circuits, different transistors and MOSFETs. I tried to read up on the involved parts and to figure out what was wrong. Unfortunately I didn’t get answers right away and didn’t have the right tools at the time. Then I had work to do, and the »Versuchswuerfel 01« caught some dust.

Now I’m back with an oscilloscope to figure out what exactly the problem is. I know there are tons of guides out there about building an LED cube, and there was a new guide on Instructables recently which looked quite impressive, but I want to understand and control every aspect of it myself: the circuit / board layout, parts, software etc. I want maximum brightness. The ultimate goal is to use an FPGA for this, but that seems like a long road ahead considering the basic problems I’m currently dealing with. On the other hand, I finally get the chance to get acquainted with an oscilloscope. The current setup looks like this, dealing with only one LED for now.

Side

Since I’m using MOSFETs to switch the layers on and off, my first guess was that they weren’t opening up wide enough. This turned out to be true, so I tried logic-level MOSFETs, which can be driven directly from the pins of an Arduino, but the result was the same. With the oscilloscope I could now see that the MOSFETs weren’t opening up fast enough, resulting in a non-rectangular wave, so I tried MOSFET drivers. You can see the result in the picture above: a quite clean, almost rectangular wave.

While that already improved the situation, I could now also increase the voltage on the layers to allow more current in the same amount of time. LEDs can handle up to 10 times their regular current, but only if they are pulsed. Since they are on only 25% of the time, I want to try to increase the voltage until the LEDs get 80mA, so that they appear equally bright to LEDs that are on all the time at 20mA.
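The 80mA figure is simple duty-cycle averaging (to a first approximation, since perceived brightness roughly tracks average current): with four layers, each LED is on a quarter of the time, so matching a continuously driven 20mA LED takes four times the pulse current.

```latex
I_{\mathrm{avg}} = D \cdot I_{\mathrm{pulse}} = 0.25 \cdot 80\,\mathrm{mA} = 20\,\mathrm{mA}
```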

To switch the individual LEDs on a layer, I’m using a constant-current sink LED driver.

Maybe a list of current parts is also helpful:

  • STP16CP05 Constant current LED driver
  • ICL 7667 MOSFET Driver
  • IRL2905 Logic Level MOSFETs

I will experiment some more before I light up the cube again, and I will report on my results. If you have any questions, leave a comment or write me an email to cubino [at] smyck [dot] org

Your own “DNS server”

DNS is one of the key ingredients of the internet and the world wide web. Without it you would have to type long numbers into the browser’s address bar instead of convenient domain names like example.org. Humans seem to remember words better than numbers. In that sense, DNS is like the internet’s “phonebook”. To look up those numbers, which are the real addresses of (web) servers on the internet, every computer asks a so-called name server to resolve a given domain name to its actual address.

But where do you get a name server address if you don’t have a name server yet? Well, most internet service providers just tell your computer to use their name server if you don’t have one already. Recently Google announced their name server with a number which is quite easy to remember (8.8.8.8). So wherever you go, you shouldn’t have to care about configuring a name server.

On the other hand, the name server provider can refuse to resolve certain addresses, to an extent where it could be called »censorship«. A provided name server can also be very slow and degrade your overall internet performance. Fun fact: most of the time when people cry out “My internet is not working”, faulty or unreachable name servers are involved.

Luckily DNS is very decentralized. Everyone can have their own name server. There is no magic to it, and it’s very easy to do as well. Most systems even have one pre-installed; it’s called »bind«. It’s quite easy to set up, but there are others which are more convenient. I use »unbound« locally on my laptop. Wherever I go, I bring my name server with me. If something is not resolving or working, I have a log file to look into. It allows me to block Google tracking domains easily and offers speed and lots of flexibility. The best thing: it’s super easy to set up:

On Mac OS X with Macports run the following in your Terminal:

sudo port install unbound
sudo launchctl load -w /Library/LaunchDaemons/org.macports.unbound.plist

That’s it. Test it in the terminal via:

dig www.slashdot.org @127.0.0.1

Now configure it in the Network Preferences as your DNS server and move on.
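For the domain blocking mentioned above, a few lines in unbound.conf are enough; this is a sketch and the domain name is just an example:

```
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    # Answer REFUSED for this zone and everything below it.
    local-zone: "google-analytics.com" refuse
```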

Packages for other operating systems should be available.

Why isn’t everybody using a local DNS resolver/cache?

Using public and well-known DNS servers has a big advantage: since all their users contribute to the cache, 99% of the domains that need to get resolved are already cached. This saves time and bandwidth. As karsten pointed out in the comments, it wouldn’t be wise to switch completely to locally installed DNS resolvers. On the other hand, there is no disadvantage to having a DNS resolver installed which only runs when you need it. Have a look at the comments for a more elaborate reply.

Instead of using only GoogleDNS or OpenDNS, look for other public DNS servers or set up one of your own for you and everybody else.

Even if you still don’t care to run your own name server, do yourself a favor and inform yourself about DNS. It’s something that doesn’t hurt to know and can be really useful.

Videos now without flash

I bought the Vimeo Plus membership for one year and added mobile versions to all videos, so iOS users should be able to watch them without the need for Flash. Also, if you have the latest YouTube5 Safari extension installed, you should be able to watch the videos with the HTML5 player.

Back to self-hosted blogging

After quitting this blog and saying goodbye to wordpress for a while, I’m back. First I didn’t want to blog at all. Then I started an electronics blog on tumblr. Then I got very annoyed by tumblr. Now tumblr has been down for more than 16 hours and I can’t stand it any longer. I was planning to write my own blog software, but since that would take more time than I can afford right now, I’ll use wordpress again. This will be only a temporary solution, as wordpress is still very annoying itself (…but still less annoying than tumblr).

I found my old SQL dumps, but upgrading to the latest wordpress version and fixing all the encoding issues took two hours of my time. Luckily I found this post, which solved the problem quickly and easily.

Some things still need adjustments, like the theme and the search. Then the tumblr posts need to be migrated, and I have to figure out the flattr setup for wordpress. Let’s see how long that takes.