The engineering behind Meter’s speed test
When a customer's network goes down, their first thought is a speed test. Our new speed test feature is built on an architecture that ensures a test always runs to completion: even if a customer loses their connection midstream, the critical metrics they need to understand the health of their WAN connection are still captured.
In this article, we will dive deeper into the technical decisions we made to help our customers monitor their ISP performance.
The need for remote speed tests
In a hybrid world where centralized IT teams monitor many remote locations, they need a reliable way to measure each network's performance from afar. They need to run one-off speed tests and also schedule recurring ones so they can confidently know whether they are getting the performance they expect. Meter may also need to run one-off speed tests at any location at any time.
Users often reach for a speed test when there is a possible outage, which can cause many speed tests to be started at once. To avoid saturating the network, we needed each device to run only one test at a time, viewable by many clients.
With these requirements in mind, we focused on the following:
- Triggering tests remotely and reliably.
- Configuring schedules for recurring tests.
- Ensuring only one active speed test per device.
Architecting for reliability: running speed tests offline
The goal was to run a speed test as simply as possible, both for our users and for our implementation. We also wanted to ensure that they run to completion even if the user loses connection to the Dashboard while the test is running.
In Dashboard, a user can select an uplink port (a port connected to WAN) to begin a speed test. They are then redirected to a live stream of the current job. Here the user can see real-time data from the device, such as upload and download speed on a live chart, jitter, and a breakdown of these statistics further down the page.
This flow solves both our customers' and Meter's problems by providing real-time statistics for a speed test executed remotely on any device in their network. It also solves the reliability issue by turning speed tests into offline jobs handled by our API. Here is the backend architecture required to provide reliable, remote speed tests:
The architecture above introduces the main entities in our backend:
- Dashboard: User interface
- API: Meter’s backend, which orchestrates the offline speed test job and polls for new data; this also encapsulates the database backing it
- Device: A Meter F-series device whose firmware holds the logic to run a speed test against Cloudflare
- Job System: River Queue serves as our job system and can run specific job types either on a schedule or on demand
- ClickHouse: ClickHouse is our system of choice for handling the large volume of metrics sent by our devices in the field, which require analytical and time-series-based queries
- Speed Test Job: The driver that communicates with the device and stores results in ClickHouse
Our speed test architecture starts with a request to our API (1); the API then asks the device whether a speed test is already running (2). Our device firmware ensures that only one speed test can run at a time. We made this choice because a shortcoming of online speed tests like fast.com is that people can accidentally DDoS themselves.
The box receives the large payloads, ranging from 100 KB to 25 MB, sent by these services en masse, leaving the network completely saturated. If there is no ongoing job, the API uses InsertTx to create a new Speed Test Job (3). Because the speed test job runs offline and is managed by River Queue, we can guarantee that even if the Dashboard's connection to the API is severed, the job will run to completion. Once the state of the device is established, we either direct the Dashboard to the ongoing speed test job or to a fresh one spun up by the job system (4).
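The "check for an ongoing test, otherwise create one" step can be sketched with a minimal in-memory stand-in (our real backend does this with River's InsertTx inside a database transaction; the type and method names here are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// JobStore is an in-memory stand-in for the API database plus the job
// system: it atomically checks for an ongoing test before creating one.
type JobStore struct {
	mu      sync.Mutex
	ongoing map[string]string // deviceID -> jobID
	nextID  int
}

func NewJobStore() *JobStore {
	return &JobStore{ongoing: make(map[string]string)}
}

// StartOrJoin returns the ongoing job for the device if one exists
// (steps 2 and 4 above); otherwise it creates a fresh job (step 3).
func (s *JobStore) StartOrJoin(deviceID string) (jobID string, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.ongoing[deviceID]; ok {
		return id, false // join the test already in flight
	}
	s.nextID++
	id := fmt.Sprintf("speedtest-%d", s.nextID)
	s.ongoing[deviceID] = id
	return id, true
}

// Complete frees the device once its test finishes.
func (s *JobStore) Complete(deviceID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.ongoing, deviceID)
}

func main() {
	store := NewJobStore()
	a, createdA := store.StartOrJoin("controller-1")
	b, createdB := store.StartOrJoin("controller-1") // a second admin arrives
	fmt.Println(a, createdA) // the first caller creates the job
	fmt.Println(b, createdB) // the second caller joins the same job
}
```

The key property is that the check and the insert happen under one lock (one transaction, in the real system), so two simultaneous requests can never both create a job for the same device.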
To sync the test between the device and our backend, we use the API database for the current, polled status and ClickHouse as the final system of record for test completion. Now that we have analyzed the first flow of the architecture, let's examine how we synchronize the speed test executing on the device's firmware with the state of the job in our API (5).
The primary piece of our synchronization is that the job first begins the speed test on the device, then constantly polls the device (6) for the statistics it has received from Cloudflare (7), updating the current state of the job (including the statistics) in the API database. The speed test job then checks whether the test is complete by querying ClickHouse (8) for the full set of statistics. The job is complete when the device finally writes its statistics to ClickHouse (9)¹.
Finally, to provide the fluid-looking speed test graphs (10), we take advantage of the fact that we store the results of each poll (6) in the API database. The Dashboard polls the API for the current state of the job and uses it to populate the live statistics counter and chart.
Protecting the network: managing concurrent tests
Meter protects our customers from dubious results in scenarios where multiple admins or users start measuring ISP performance at the same time. Rather than blocking the subsequent runs, we have those users join and view the results of the ongoing run. We believe this is a better way to handle the scenario, and it keeps the measurements close to accurate.
There were two ways we could approach this:
- Set up websockets to stream data directly to the Dashboard
- Simply poll our API
We settled on the second approach because it naturally fit the polling mechanism the API already uses to monitor the device. Ultimately, choosing to poll sessions for results allowed us to focus on what matters most to our customers, reliability and diagnostic simplicity, rather than the heavy engineering cost of maintaining true real-time websockets. Implementing sessions was relatively trivial since we modeled each session as a speed test job on our backend. All each client needs to do is poll our API for the current state of the job.
The scheduler: continuous, low-impact ISP monitoring
Now that we had a solid kernel for executing a single test we could move our focus to designing a scheduling mechanism for executing speed tests offline. Our approach was to define a set of rules that our customers can configure to decide what is tested (network or port), what time range, and how often (daily or weekly).
Using these rules we constructed a new offline cron job that spawns all the jobs in the scheduled range for the hour the job was run at. Let’s take a brief look at the architecture of our continuously monitored speedtest to see how it differs from the previous one:
The components of the architecture for running the continuous speed tests are largely the same as in the previous architecture, with Redis being the only major addition. We introduced Redis for its fast, atomic operations, using them to implement a token bucket rate limiter that relies on key TTLs to clean up old state efficiently.
In this case the initiator of these tests is a cron job that runs every 15 minutes (1). The cron job spawns a recurring speed test job (2). The recurring speed test job, referred to as the scheduler, will query the automation rules from the API database and determine what jobs are scheduled to run in the current hour, alongside some internal validations to ensure the device is in a good state.
The scheduler will then check whether there is capacity to start a new speed test job by consulting our rate limiter (3). We introduced a token bucket rate limiter to reduce the risk of our speed tests being mistaken for a botnet, given that many networks might schedule their speed test jobs at the same time (12 a.m.). Assuming that Cloudflare has bot detection capabilities, we want to avoid getting our devices blocklisted if our traffic pattern catches its attention. If there is capacity, we start the speed test (5)(6). If there is no capacity, the job waits until capacity frees up or the job times out. The process for executing the speed test job is shared with the previous architecture.
Handling schedule contention for multiple WAN ports
There is one requirement the observant reader might notice is unaddressed by our scheduling job: a device can run only one test, but it may have multiple connections to the WAN. This means we need to modify our scheduler job to queue speed test jobs for devices that are already running one.
This requires contention logic so that any job started on a device blocks other jobs until the device is free again. An example can be seen below: the first test completes on port 4, and then the second test begins on port 5 of the primary controller.
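The contention logic amounts to a per-device FIFO. A sketch, with illustrative names (the real logic lives in the scheduler job):

```go
package main

import "fmt"

// DeviceQueue serializes speed tests per device so a controller with
// multiple WAN uplinks still runs only one test at a time.
type DeviceQueue struct {
	busy    bool
	pending []string // ports waiting for their turn
}

// Enqueue starts the test immediately if the device is idle; otherwise
// the port queues behind the running test.
func (q *DeviceQueue) Enqueue(port string) (startedNow bool) {
	if q.busy {
		q.pending = append(q.pending, port)
		return false
	}
	q.busy = true
	return true
}

// Finish marks the current test done and returns the next port to
// test, if any port is waiting.
func (q *DeviceQueue) Finish() (next string, ok bool) {
	if len(q.pending) == 0 {
		q.busy = false
		return "", false
	}
	next, q.pending = q.pending[0], q.pending[1:]
	return next, true
}

func main() {
	q := &DeviceQueue{}
	fmt.Println(q.Enqueue("port4")) // true: device idle, test starts now
	fmt.Println(q.Enqueue("port5")) // false: queued behind port4
	fmt.Println(q.Finish())         // port4 done, port5 starts next
}
```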
With all of these pieces in place, we were able to provide a scheduled speed test feature that gives users the flexibility they need, plays nicely with Cloudflare, and never breaks the one-test-per-device constraint.
What's next
The next step of product development involves building out state-of-the-art tooling for customers to dive deep into the observability of their network. We are working on real-time views of packet routes through complex topologies and proactive, real-time anomaly detection in our customers' networks.
If you are ready to own complex problems and shape the future of network engineering, we're hiring! Find your next role on our careers page.
¹ Our devices do not write directly to ClickHouse; instead, we send messages to our Kafka instance and our stream processors handle writing the statistics to ClickHouse. This is simplified in the diagram since it is out of scope. Check out a future blog post on our Kafka service if that interests you.