Evolution of our queuing technology
A year or so later, Microsoft stopped supporting Service Bus under Windows Server so it was time to move on again. There being no obvious alternatives in the Microsoft .NET stack, we opted for RabbitMQ in our VM environment while continuing with Service Bus in Azure.
RabbitMQ was probably one of the largest technology mistakes we have ever made. While it’s a superb product, it was an exceptionally poor match for our workloads, deployment model and environment.
After a year of staving off queue partitions, we decided to move on (again). This time, we picked MSMQ (Microsoft Message Queuing) for our South African environment. Simplistic but proven, it served us well. It required no additional components and no complex configuration and, best of all, it was stable.
The next inflection point came as our Azure utilisation picked up and we built more instrumentation into the platform: we started noticing spurious spikes in processing latency. We tracked the issue down to Azure Service Bus. Azure Service Bus comes in a few different flavours but, broadly, one type is shared and the other is dedicated. We were using shared, and moving to dedicated meant a 60x increase in cost, which was untenable.
As we dug deeper into the problem, we found that not only were there insanely long delays in processing messages on the queues (think upwards of 120 seconds), but the average processing times were unacceptably high too (often upwards of 100ms).
Over the course of a few weeks we experimented with some mitigation techniques that would deal with the massive latency spikes. None of them worked well enough and we never deployed them into production.
What we realised about queues
As we were deciding how to tackle this problem, we realised we had been missing the wood for the trees and began to question why we were using queues in the first place.
Queues are intended to process messages where asynchronous communication is required. Our mistake had been to use them as a mechanism for remote procedure calls (RPC). This insight was critical because we realised we didn’t actually need or want a middleman to process messages – in Flowgear all communication is essentially synchronous.
Flowgear’s new architecture
We developed a new architecture that allowed direct role-to-role communication (identical in both Azure PaaS and VM environments). We implemented this by publishing a small RESTful service within each role. Any role can then communicate with any other role via the REST API and a simple role-discovery service permits the consumer to locate a desired target role.
As a bonus, the role discovery service also tracks load levels in each role which enabled us to implement orchestrator-independent load balancing.
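To make the idea concrete, here is a minimal sketch of how a load-aware role-discovery service might pick a target. It is not Flowgear's actual implementation: the class and method names (RoleDiscovery, report, locate), the role names, the URLs and the 0.0 to 1.0 load scale are all assumptions made for illustration, and the registry lives in memory rather than behind a REST endpoint.

```python
from dataclasses import dataclass


@dataclass
class RoleInstance:
    role: str       # logical role name (hypothetical, e.g. "worker")
    base_url: str   # where this instance's REST service listens
    load: float     # last reported load, assumed 0.0 (idle) to 1.0 (saturated)


class RoleDiscovery:
    """In-memory registry of role instances and their last reported load."""

    def __init__(self):
        self._instances = {}  # (role, base_url) -> RoleInstance

    def report(self, role, base_url, load):
        # An instance calls this to register itself or to update its load.
        self._instances[(role, base_url)] = RoleInstance(role, base_url, load)

    def locate(self, role):
        # Return the least-loaded instance of the requested role.
        candidates = [i for i in self._instances.values() if i.role == role]
        if not candidates:
            raise LookupError(f"no instance registered for role {role!r}")
        return min(candidates, key=lambda i: i.load)


registry = RoleDiscovery()
registry.report("worker", "http://10.0.0.1:8080", load=0.7)
registry.report("worker", "http://10.0.0.2:8080", load=0.2)
registry.report("api", "http://10.0.0.3:8080", load=0.5)

# The caller would then issue its REST request directly to target.base_url.
target = registry.locate("worker")
print(target.base_url)
```

Because the consumer asks the registry for a target and then calls it directly, the load-balancing decision sits with the discovery service rather than with a queue or an external orchestrator.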
The new architecture went live in August and we’ve been very happy with the results. The extreme latency spikes we had seen under Azure Service Bus were eradicated and our average message delivery time dropped by between 20ms and 80ms (one environment actually saw a reduction of 300ms!).
If you’re looking for a way to communicate across roles in your service, consider whether you’re actually looking for RPC or queues. A simple heuristic for this determination is to consider whether you want asynchronous or synchronous communication. If you need synchronous communication, queues are probably the wrong approach.