Quick GC Histograms

2013-04-18
3 min read

Some times its hard to find a way of explaining to the technically less literate the power that the Unix shell hands you. The original ideas behing the combination wordprocessor and programmers work bench combined with the power of pipes make text processing easy.

So to today’s problem - analysing GC logs for latency spikes. As we run the jvm with the following properties;

-XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime 
-XX:+PrintTenuringDistribution -Xloggc:/path/to/gc.logfile

The output looks like this;

2013-04-18T13:30:33.491+0000: 97108.543: [GC 97108.543: [ParNew
Desired survivor size 26836992 bytes, new threshold 3 (max 4)
- age 1: 12848568 bytes, 12848568 total
- age 2: 7033104 bytes, 19881672 total
- age 3: 7791880 bytes, 27673552 total
- age 4: 5567656 bytes, 33241208 total
: 459080K->45463K(471872K), 0.0147270 secs] 2571295K->2162244K(4968704K) 
icms_dc=0 , 0.0148850 secs] [Times: user=0.27 sys=0.00, real=0.02 secs] 
Total time for which application threads were stopped: 0.0154030 seconds
Total time for which application threads were stopped: 0.0002950 seconds
Total time for which application threads were stopped: 0.0002120 seconds
Total time for which application threads were stopped: 0.0003010 seconds
Total time for which application threads were stopped: 0.0002510 seconds

(This system is not on the critical path by the way).

So, looking for an idea of spikes/deviation from normal or just a view of how the pauses are distributed and the length of our tail a histogram is a good start.

We want to see something like this;

Count Millis
 1301 0.000
   71 0.001
    2 0.002
    1 0.003
    3 0.005
    8 0.006
    8 0.007
   15 0.008
   29 0.009
   33 0.010
   32 0.011
   45 0.012
   47 0.013
   50 0.014
   52 0.015
   54 0.016
   60 0.017
   30 0.018
   39 0.019
   37 0.020
   24 0.021
   24 0.022
   18 0.023
    8 0.024
   10 0.025
    4 0.026
    8 0.027
    5 0.028
    3 0.029
    4 0.030
    3 0.031
    1 0.032

Which we get by using this;

grep 'stopped:' ems2-gc.log | cut -d':' -f 2 | cut -d' ' -f 2 | sort -n 
| awk '{ printf "%1.3fn", $1 }' | (echo " Count Millis" ; uniq -c )

From left to right we’re;

  • pulling out the stopped lines using grep
  • extracting the seconds using the two cut statements
  • numerically sorting the output
  • formatting the output to print out in the buckets we’re interested in - in this case milliseconds using awk
  • and lastly creating a histogram using uniq

There are much cleverer bash one liners out there, and each linux admin has their own favourite set of tools, so the above can be done in many different ways no doubt. There are also those who reach for perl/python for anything more complex than grepping the output of ps. I think they’re missing out, as almost every day (and there have been many) there’s a new flag or command to surprise and delight me with bash.