Testing@LMAX – Time Travel and the TARDIS

2014-04-01

Testing time-related functions is always a challenge. Generally it involves adding some form of abstraction over the system clock, which can then be stubbed, mocked or otherwise controlled by unit tests. At LMAX we like the confidence that end-to-end acceptance tests give us but, like most financial systems, a significant amount of our functionality is highly time dependent. So we need the same kind of control over time, but in a way that works even when the system is running as a whole (which means it’s running in multiple different JVMs or possibly even on different servers).

We’ve achieved that by building on the same abstracted clock as is used in unit tests but exposing it in a system-wide, distributed way. To stay as close as possible to real-world conditions we have some reduced control in acceptance tests; in particular, time always progresses forward – there’s no pause button. However, we do have the ability to travel forward in time so that we can test scenarios that span multiple days, weeks or even months quickly.

When running acceptance tests, the system clock uses a “time travel” implementation. Initially this clock simply returns the current system time, but it also listens for special time messages on the system’s messaging bus. When one of these time messages is received, the clock calculates the difference between the time specified in the message and the current system clock time and records it. From then on, when it’s asked for the time, the clock adds that difference to the current system time. As a result, when a time message is received, time immediately jumps forward to the requested time and then continues advancing at the same rate as the system clock.
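The offset-based clock described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the class name `TimeTravelClock` and the method `onTimeMessage` are hypothetical, the messaging bus subscription is left out, and ignoring messages that would move time backwards is an assumption made here to keep time moving forward only.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a clock that can jump forward but never pauses. */
final class TimeTravelClock {
    private final Clock systemClock;
    // Difference between "travelled" time and the real system time.
    private final AtomicReference<Duration> offset = new AtomicReference<>(Duration.ZERO);

    TimeTravelClock(Clock systemClock) {
        this.systemClock = systemClock;
    }

    /** Would be called by the (omitted) messaging bus listener when a time message arrives. */
    void onTimeMessage(Instant requestedTime) {
        Duration requestedOffset = Duration.between(systemClock.instant(), requestedTime);
        // Assumption: a message that would move time backwards is ignored.
        offset.accumulateAndGet(requestedOffset,
                (current, candidate) -> candidate.compareTo(current) > 0 ? candidate : current);
    }

    /** Current system time plus the recorded difference. */
    Instant now() {
        return systemClock.instant().plus(offset.get());
    }
}
```

Because only the offset is stored, the clock keeps ticking at the normal rate after each jump, which is what keeps the acceptance tests close to real-world conditions.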

Like all good schedulers, ours are written in a way that ensures that events fire in the correct order even if time suddenly jumps forward past the point at which an event should have triggered. So receiving a time message not only moves the clock forward, it also triggers all the events that should have fired during the period we skipped, allowing us to test that they did their job correctly.
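To illustrate the catch-up behaviour, here is a simplified, single-threaded sketch of a scheduler that fires overdue jobs in due-time order. The class `CatchUpScheduler` and its methods are hypothetical names for illustration only, and breaking ties by the order in which jobs were scheduled is an assumption.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical sketch: fires every overdue job, earliest first, after a time jump. */
final class CatchUpScheduler {
    private static final class Job {
        final Instant dueAt;
        final long sequence;
        final Runnable action;

        Job(Instant dueAt, long sequence, Runnable action) {
            this.dueAt = dueAt;
            this.sequence = sequence;
            this.action = action;
        }
    }

    // Ordered by due time; ties keep the order in which jobs were scheduled.
    private final PriorityQueue<Job> jobs = new PriorityQueue<>(
            Comparator.comparing((Job job) -> job.dueAt).thenComparingLong(job -> job.sequence));
    private long nextSequence;

    void schedule(Instant dueAt, Runnable action) {
        jobs.add(new Job(dueAt, nextSequence++, action));
    }

    /** Runs all jobs due at or before {@code now} and returns how many fired. */
    int runDueJobs(Instant now) {
        int fired = 0;
        // Re-check the queue each iteration so jobs scheduled by other jobs are also caught up.
        while (!jobs.isEmpty() && !jobs.peek().dueAt.isAfter(now)) {
            jobs.poll().action.run();
            fired++;
        }
        return fired;
    }
}
```

After a jump, calling `runDueJobs` with the new time drains everything that became due in the skipped period, in order, while leaving later jobs queued.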

The time messages are published by a time travel service which is only run in our acceptance test environment – it exposes a JMX method which our acceptance tests use to set the current system time. Each service that uses time also exposes its current time and the time its schedulers have reached via JMX, so when a test time travels we can wait until the message has been received by each service and all the scheduled events have finished running.
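The "wait until every service has caught up" step might look something like the sketch below. The JMX plumbing is deliberately left out: `ServiceTimes` is a hypothetical stand-in for the two values each service exposes, and the polling interval and timeout handling are assumptions rather than a description of our actual test framework.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical sketch: poll each service until its clock and schedulers reach the destination. */
final class TimeTravelWaiter {
    /** Stand-in for the values a service would expose via JMX. */
    interface ServiceTimes {
        Instant currentTime();
        Instant schedulerTime();
    }

    /** Returns true once all services have arrived, or false if the timeout expires first. */
    static boolean awaitArrival(List<ServiceTimes> services, Instant destination, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            boolean allArrived = true;
            for (ServiceTimes service : services) {
                if (service.currentTime().isBefore(destination)
                        || service.schedulerTime().isBefore(destination)) {
                    allArrived = false;
                    break;
                }
            }
            if (allArrived) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(10); // Assumed polling interval.
        }
    }
}
```

Checking the scheduler time as well as the clock time matters: a service may have received the time message while its scheduled jobs are still running.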

The TARDIS

The trouble with controlling time like this is that it affects the entire system so we can’t run multiple tests at the same time or they would interfere with each other. Having to run tests sequentially significantly increases the feedback cycle. To solve this we added the TARDIS to the DSL that runs our acceptance tests. The TARDIS provides a central point of control for multiple test cases running in parallel, coordinating time travel so that the tests all move forward together, without the actual test code needing to care about any of the details or synchronisation.

The TARDIS hooks into the DSL at two points – when a test asks to time travel and when a test finishes (by either passing or failing). When a test asks to time travel, the TARDIS records the requested destination time and blocks the test until all tests are either ready to time travel or have completed. It then time travels to the earliest requested time and wakes up any tests that requested that time point so they can continue running. Tests that requested a time point further in the future remain paused, waiting for the next time travel.
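The coordination described above can be sketched with a simple monitor. This is an illustrative approximation rather than the real TARDIS: the class shape, the constructor taking the number of running tests, and the `Consumer<Instant>` standing in for the call to the time travel service are all assumptions, and error handling (for example a test interrupted while waiting) is omitted.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of coordinating time travel across parallel tests in one JVM. */
final class Tardis {
    private final Consumer<Instant> timeTravelService; // Stand-in for the real JMX call.
    private final List<Instant> requestedTimes = new ArrayList<>();
    private int runningTests;
    private Instant currentTime;

    Tardis(int runningTests, Consumer<Instant> timeTravelService) {
        this.runningTests = runningTests;
        this.timeTravelService = timeTravelService;
    }

    /** Hook 1: a test asks to time travel and blocks until that time is reached. */
    synchronized void timeTravelTo(Instant destination) throws InterruptedException {
        if (hasReached(destination)) {
            return;
        }
        requestedTimes.add(destination);
        travelIfAllTestsAreReady();
        while (!hasReached(destination)) {
            wait();
        }
    }

    /** Hook 2: a test has passed or failed, so nobody needs to wait for it any more. */
    synchronized void testFinished() {
        runningTests--;
        travelIfAllTestsAreReady();
    }

    private boolean hasReached(Instant destination) {
        return currentTime != null && !currentTime.isBefore(destination);
    }

    private void travelIfAllTestsAreReady() {
        if (requestedTimes.isEmpty() || requestedTimes.size() < runningTests) {
            return; // Some test is still running and has not asked to travel yet.
        }
        Instant earliest = Collections.min(requestedTimes);
        timeTravelService.accept(earliest);
        currentTime = earliest;
        requestedTimes.removeIf(requested -> !requested.isAfter(earliest));
        notifyAll(); // Tests whose destination is still in the future go back to waiting.
    }
}
```

With two tests requesting one day and two days ahead, the first jump goes to the one-day point; the second jump happens only once the first test either finishes or asks to travel again.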

Since we had a lot of time travel tests already written before we invented the TARDIS, this approach allowed us to start running them in parallel without having to rewrite them – the TARDIS is simply integrated into the DSL framework we use for all tests.

Currently the TARDIS only works for tests running in the same JVM, so essentially it allows test cases to run in parallel with other cases from the same test suite, but it doesn’t allow multiple test suites on separate Romero agents to run in parallel. The next step in its evolution will be to move the TARDIS out of the test’s DSL and provide it as an API from the time travel service on the server. At that point we can run multiple test suites in parallel against the same server. However, we haven’t yet done the research to determine what, if any, benefit we’d get from that change as different test suites may have very different time travel patterns and thus spend most of their time at interim time points waiting for other tests. Also the load on servers during time travel is quite high due to the number of scheduled jobs that can fire so running multiple test suites at once may not be viable.

* Time being in sync is actually a more complex concept than it first appears. The overall architecture of our system meant this approach to time actually did provide very accurate time sources relative to our main “source of all truth”, the exchange venue itself, which is what we really cared about. Even so, anything that had to be strictly “in-sync” generated a timestamp in the service that triggered it and then included it in the outgoing event, which is the only sane way to do such things.