Posts

We have had some new starters within the test team at LMAX over the past year. Here are their thoughts on their first year.
2023-12-08
7 min read
What started as a simple bug report soon caused a pair of developers to question maths itself.
2023-05-03
14 min read
The best thing coverage can tell you is that code is unused, and should therefore be deleted.
2023-05-01
5 min read
After a move to new hardware and a new kernel, a pair of hosts in two of our production environments started seeing out of order traffic. Had we set something up wrong, or was the hardware/kernel change the cause?
2023-03-24
9 min read
Our new production exchange recently produced an impossible-looking NullPointerException. At the same time, we saw another application in the same deployment throw an OutOfMemoryError. Both problems turned out to have the same root cause. This post tells the story of how we found that out.
2022-06-15
13 min read
A section on our big office whiteboard has this mysterious series of markings on it: How did it get there? What does it mean?
2022-05-03
6 min read
A good old fashioned ringbuffer getting full and dropping messages problem. Brilliant. This is definitely what we want to happen in our production ITCH market data feed. You know, the delta stream where we broadcast every price change and trade to clients? The one where they expect low latency and to receive every event? Yep, definitely a problem.
2018-02-20
7 min read
Some cute round trip test tricks: In the last post, we looked at layering our deserialization code to keep things simple. This time, we’ll enjoy the delightful testing benefits this effort yields.
2018-01-22
11 min read
…indeed, your life might get simpler if you don’t. This post will talk through two examples where clever serialization would have been an option, but stupid alternatives actually turned out to be preferable.
2017-12-18
9 min read
We have recently added an extra (optional) callback to the disruptor library. This post will walk through one of our motivations for doing this: monitoring. Before we start – what are we monitoring, and why? At LMAX Exchange, the vast majority of our applications look a bit like this: Here, events are received from the network on a single thread, and placed in an input (or application) ring buffer. Another thread (we’ll call it the application thread) processes these messages, and may emplace events in several output (or ‘publisher’) ring buffers.
2017-09-15
7 min read
Recently at LMAX Exchange we’ve had a couple of services suffer from memory leaks. In both cases, we noticed the problem much later than we’d like. One stricken application started to apply backpressure on (really rather important) upstream components. Another caused an account related action to be intermittently unavailable. What we’d like to be able to do is detect when any of our Java services is in acute heap distress as soon as it enters that state.
2017-07-04
7 min read
Previously we’ve talked about how we use Nagios / Icinga for three broad types of monitoring at LMAX Exchange: alerting, metrics, and validation. The difference between our definitions of alerting and validation is a fine one, and it has more to do with the importance of the state of the thing we are checking and the frequency with which we check it. An example of what I consider an “Alert” is whether or not Apache is running on a web server.
2017-06-12
8 min read
At LMAX Exchange we use LVM for snapshotting volumes for two use cases: 1. Take a snapshot of a slave database so it can catch up quickly while the work happens on the snapshotted volume. 2. Backups in case we need to roll back. Every now and then in our CI environment, as we soak tested the integration of this with our in-house deployment tool, scotty, we found that we would get a merge error.
2017-05-30
4 min read
At LMAX Exchange, Nagios is one of our essential tools for monitoring and verifying the operation of our systems. We use it for three distinct purposes: alerting when things break; recording trends so that we can predict when problems will occur and then mitigate them; and verifying the overall structure of our environments. Things have broken: Using Nagios to monitor things breaking down is perhaps the most common use case. These checks need to run often, perhaps every few seconds.
2017-03-31
3 min read
Building great stuff fast: I’ll be writing some posts over the coming weeks about how we run our technology department, including the processes and procedures we use to keep moving fast but continue to work within our regulatory constraints and the demands we put upon ourselves for operational excellence. This post will be covering how we get things done in IT using an agile mentality with minimal process. The People: Before talking about stories, iterations and retrospectives (processes) or boards and cards (tools), let us stop to consider the people.
2017-02-28
3 min read
Just before New Year 2017 a leap second was inserted into Coordinated Universal Time (UTC). At LMAX Exchange we had some luxury to play with how we handled the leap second. January 1st is a public holiday, there’s no trading, so we are free to do recovery if something didn’t go according to plan. This blog post is an analysis of the results of various time synchronisation clients (NTP and PTP) using different methods to handle the leap second.
2017-01-30
9 min read
In part one, we discovered that our multicast receipt thread was being stalled by page faults. In part two, we’ll dig down into the causes of those page faults, and with some help from our friends at Informatica, get to the bottom of things. Systems Cavalry Arrives: Let’s focus on just the first stack trace to begin with. We can have a bit of a guess at what’s going on just by looking at the symbol names.
2016-11-04
9 min read
We recently fixed a long standing performance issue at LMAX Exchange. The path we followed to fixing it was sufficiently windy to merit a couple of posts. In this first post we’ll define our issue and then attempt to figure out its cause. Problem Identified: At LMAX Exchange we have an application named the market data service. Its multicast receipt thread is not keeping up – it occasionally stalls for around 150ms. This causes data loss; the service has to request retransmission (it NAKs) from upstream.
2016-11-04
12 min read
One reason that automated UI tests can be unreliable is that they tend to be sensitive to what else is on screen at the time and even things like the current screen size. Developers running the tests locally also find it annoying to have windows opening and closing on their machine while the test runs and are unable to do anything else because their clicking might interfere with the test. At LMAX Exchange we solve that by isolating tests in their own X session, created using vncserver.
2016-10-25
1 min read
A month or two ago I was asked by someone in our Operations team what clock synchronisation is and why we need to do it. I gave them a very basic few sentence answer. That got me thinking that I never read an easy explanation when I myself got started in this area, and the terminology used can be confusing if it’s the first time you come across it. Below is a copy-paste out of our internal documentation where I attempt to explain computer clock synchronisation and the reason for it.
2016-10-05
6 min read
Ever since I read some initial blog posts about the upcoming eBPF tracing functionality in the 4.x Linux kernel, I have been looking for an excuse to get to grips with this technology. With a planned kernel upgrade in progress at LMAX Exchange, I now have access to an interesting environment and workload in order to play around with BCC. BPF Compiler Collection: BCC is a collection of tools that allows the curious to express programs in C or Lua, and then load those programs as optimised kernel modules, hooked in to the runtime via a number of different mechanisms.
2016-08-12
9 min read
In my last couple of posts, I’ve been looking at how UDP network packets are received by the Linux kernel. While diving through the source code, it has been shown that there are a number of statistics available for monitoring receive errors, buffer overruns, and queue depths. In the course of investigating network throughput issues in our systems at LMAX Exchange, we have written some tooling for monitoring the available statistics. The result of that work is a small utility that provides an interface for monitoring system-wide or socket-specific statistics from a Java program.
2016-06-23
3 min read
Background: At work we practice continuous integration in terms of performance testing alongside different stages of functional testing. In order to do this, we have a performance environment that fully replicates the hardware and software used in our production environments. This is necessary in order to be able to find the limits of our system in terms of throughput and latency, and means that we make sure that the environments are identical, right down to the network cables.
2016-05-06
12 min read
In this series we are attempting to solve a clock synchronisation problem to a degree of accuracy in order to satisfy MiFID II regulations, and we’re trying to do it without spending a lot of money. So far we have: talked about the regulations and how we might solve this with Linux software; built a “PTP Bridge” with Puppet; started recording metrics with collectd and InfluxDB, and finished recording metrics; drawn lots of graphs with Grafana and found contention on our firewall; and tried a dedicated firewall for PTP. The start of 2016 opened up a few new avenues for this project.
2016-04-08
20 min read
Last time we implemented a minimal detector, and I presented the code for the detector as a fait accompli. Let’s take a closer look at it.

    import java.nio.file.Files;

    import edu.umd.cs.findbugs.BugInstance;
    import edu.umd.cs.findbugs.BugReporter;
    import edu.umd.cs.findbugs.BytecodeScanningDetector;
    import edu.umd.cs.findbugs.classfile.ClassDescriptor;
    import edu.umd.cs.findbugs.classfile.DescriptorFactory;
    import edu.umd.cs.findbugs.classfile.MethodDescriptor;

    public class FilesLinesDetector extends BytecodeScanningDetector {
        private static final ClassDescriptor JAVA_NIO_FILES = DescriptorFactory.createClassDescriptor(Files.class);

        final BugReporter bugReporter;

        public FilesLinesDetector(final BugReporter bugReporter) {
            this.bugReporter = bugReporter;
        }

        @Override
        public void sawMethod() {
            MethodDescriptor invokedMethod = getMethodDescriptorOperand();
            ClassDescriptor invokedObject = getClassDescriptorOperand();
            if(invokedMethod !
2016-04-01
5 min read
FindBugs is an incredibly powerful tool, and it supports running custom detectors. However, the API for writing custom detectors is not well documented, at least as far as I’ve been able to find. So, as I started writing detectors, I’ve been working primarily off a process of trial and error. It’s likely there are better ways of doing things: what follows, however, at least works. Let’s start off with something easy: a detector which labels invocations of a method as bugs.
2016-04-01
3 min read
Continuing on from my last post, here we’ll be looking at flags used to control the C2 or server compiler of the HotSpot JVM. In writing this article, I discovered that the C2 compiler flags did not operate as I expected, and I’ve drawn some possibly incorrect conclusions about how to achieve the required effects. Any enlightenment from those in the know would be welcomed… Configuration: In order to reduce the noise created in the compilation logs, we’ll be disabling tiered compilation so that only the server compiler will be used.
2016-03-30
7 min read
Once an application goes live, it is absolutely essential that any future changes are able to work with the existing data in production, typically by migrating it as changes are required. That existing data and the migrations applied to it are often the riskiest and least tested functions in the system. Mistakes in a migration will at best cause a multi-hour outage while backups are restored, and more likely will subtly corrupt data, producing incorrect results that may go unnoticed for long periods, making it impossible to roll back.
2016-03-12
5 min read
I recently wanted to use flowtype.js with WordPress to create a liquid layout for the text as well as the images. Integrating it required a bit of research into how to load JavaScript into WordPress the ‘correct’ way. Here’s how it fits together. flowtype.js (http://simplefocus.com/flowtype/) is a lovely way to scale font size responsively to ensure readable blocks of text. It allows capping of the maximum and minimum font size and allows you to control the steepness of the scaling curve.
2016-03-09
4 min read
In this post, we will explore some of the various flags that can affect the operation of the JVM’s JIT compiler. Anything demonstrated in this post should come with a public health warning - these options are explored for reference only, and modifying them without being able to observe and reason about their effects should be avoided. You have been warned. The two compilers: The JVM that ships with OpenJDK contains two compiler back-ends: C1, also known as ‘client’, and C2, also known as ‘server’. The C1 compiler has a number of different modes, and will alter its response to a compilation request given a number of system factors, including, but not limited to, the current workload of the C1 & C2 compiler thread pool.
2016-03-05
9 min read