How to design an Azure Topic subscriber for downtime and not lose the messages?


I have an Azure topic, and for every event of type X that gets added to the topic I need one subscriber that logs the event to one location and another subscriber that actually does some processing on the message.

What is the correct design to handle the failure of my audit or processing subscriber for a few minutes and ensure that I don't miss out on any topic message and create a data corruption scenario?

I could run three versions of each instance and it is then unlikely that all three will ever be down at the same time but that isn't a perfect scenario. What other options are out there for this? Am I missing something as part of the API?


There is 1 best solution below


I may not be understanding the failure you are attempting to solve for. If I understand your scenario correctly, you have an Audit subscription and a Processing subscription, both subscribed to an "Event Topic". This means you'll have two logical consumers: one for the audit and one for the processing (I say logical because each consumer could have multiple instances reading from the same subscription for throughput and redundancy).

If you are using PeekLock (the default) as the receive mode on your subscription client, a failure or exception in your consumer while recording the audit message or processing the event does not lose the message. As long as Complete was not called, the message reappears once its lock expires (or as soon as it is abandoned) and can be processed by another consumer instance. If your audit and processing consumers perform idempotent operations, then even if they fail they can catch up when they come back online, and no messages will be missed, though some may be picked up more than once. This doesn't change if you run multiple instances of each consumer as you suggested above. Having multiple instances of each consumer running does reduce the amount of possible downtime, but you shouldn't miss any messages even if you have a single instance processing. The subscription will hold on to them until the consumers are back up, subject to the message time-to-live configured for the subscription.
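To make the catch-up behaviour concrete, here is a minimal in-memory sketch in Python. It is not the Azure SDK: `PeekLockSubscription`, `run_consumer`, and the `fail_on` parameter are hypothetical names invented for this illustration, and lock expiry is modelled simply as an explicit abandon. It only shows why an uncompleted message comes back and why an idempotency check makes redelivery harmless.

```python
from collections import deque


class PeekLockSubscription:
    """Toy stand-in for a subscription read in peek-lock mode (not the Azure SDK)."""

    def __init__(self):
        self._queue = deque()
        self._locked = {}

    def send(self, message_id, body):
        self._queue.append((message_id, body))

    def receive(self):
        """Lock and return the next message, or None if nothing is available."""
        if not self._queue:
            return None
        message = self._queue.popleft()
        self._locked[message[0]] = message
        return message

    def complete(self, message_id):
        """Consumer finished: the message is removed for good."""
        del self._locked[message_id]

    def abandon(self, message_id):
        """Consumer failed (or its lock expired): the message becomes available again."""
        self._queue.appendleft(self._locked.pop(message_id))


def run_consumer(subscription, handled_ids, audit_log, fail_on=None):
    """Drain the subscription; stop as if the process died when fail_on is hit."""
    while (message := subscription.receive()) is not None:
        message_id, body = message
        try:
            if message_id == fail_on:
                raise RuntimeError("simulated crash before Complete")
            if message_id not in handled_ids:  # idempotency check
                audit_log.append(body)
                handled_ids.add(message_id)
            subscription.complete(message_id)
        except RuntimeError:
            subscription.abandon(message_id)
            return  # the consumer is now "down"


subscription = PeekLockSubscription()
for i in range(3):
    subscription.send(i, f"event-{i}")

handled_ids, audit_log = set(), []
run_consumer(subscription, handled_ids, audit_log, fail_on=1)  # goes down on message 1
run_consumer(subscription, handled_ids, audit_log)             # restarted, catches up
print(audit_log)
```

After the restart the log contains all three events exactly once, even though message 1 was delivered twice.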

If you used the ReceiveAndDelete receive mode, then you have the possibility of losing messages, because a message is removed from the subscription as soon as it is handed to the consumer. The article Best Practices for Performance Improvements Using Service Bus Brokered Messaging is a great read on this.
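For contrast, the following Python sketch models the receive-and-delete style with a plain in-memory queue. Again, this is an illustration rather than Service Bus code; `consume_receive_and_delete` and `fail_on` are made-up names. The point is that the message is already gone from the queue when the consumer fails.

```python
from collections import deque


def consume_receive_and_delete(queue, processed, fail_on=None):
    """Pop each message before handling it, as a receive-and-delete reader would.

    Returns the message that was in flight when the simulated crash happened,
    or None if the queue was drained without a failure.
    """
    while queue:
        message = queue.popleft()  # the broker has already forgotten the message
        if message == fail_on:
            return message  # simulated crash: nothing will redeliver this message
        processed.append(message)
    return None


queue = deque(["event-0", "event-1", "event-2"])
processed = []
lost = consume_receive_and_delete(queue, processed, fail_on="event-1")
consume_receive_and_delete(queue, processed)  # restarted consumer
print(processed, lost)
```

Here `event-1` never reaches `processed`, and no amount of restarting the consumer brings it back.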

There are all sorts of options for deployment based on how resource intensive the audit and processing operations are. You could have a worker role or process that handles both audit and processing messages on different threads as a pair, and deploy multiple instances of it. This would mean that each instance can process both types of messages, and there is redundancy in that if one of the machines goes down, another running instance can keep processing.

You'd also need to check for dead-lettered messages (such as poison messages), since those messages weren't processed, or perhaps weren't fully processed.
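As a rough picture of how a poison message ends up dead-lettered, here is a small Python simulation. It assumes a maximum delivery count of 3 purely for the example (on a real subscription this is a configurable setting), and `drain` and `handler` are hypothetical helpers, not SDK calls.

```python
from collections import deque

MAX_DELIVERY_COUNT = 3  # assumed value for this sketch only


def drain(messages, handler, max_delivery_count=MAX_DELIVERY_COUNT):
    """Deliver each message until the handler succeeds or its attempts run out."""
    pending = deque((message, 0) for message in messages)
    completed, dead_lettered = [], []
    while pending:
        message, deliveries = pending.popleft()
        deliveries += 1
        try:
            handler(message)
            completed.append(message)
        except ValueError:
            if deliveries >= max_delivery_count:
                # Give up and set the message aside for inspection.
                dead_lettered.append((message, deliveries))
            else:
                # Not completed, so it becomes available for another delivery.
                pending.append((message, deliveries))
    return completed, dead_lettered


def handler(message):
    """Hypothetical handler that can never process the 'poison' message."""
    if message == "poison":
        raise ValueError("cannot parse message")


completed, dead_lettered = drain(["a", "poison", "b"], handler)
print(completed, dead_lettered)
```

The healthy messages complete, while the poison message is set aside after its third delivery. Something still has to read that dead-letter list and decide what to do with it, which is the check described above.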

Now, you do mention data corruption, so I'm assuming you mean the possibility that an audit log entry gets written but the actual event fails to process. This is a little trickier. These are two distinct operations that you are attempting to marry up. The simple answer is that you can't guarantee they won't get out of sync. There is no transaction across both of these operations (nor would you want there to be in a distributed system). Think of the audit as recording the intent to perform the operation, not that the operation was actually completed. You can't assume that the processing will complete successfully simply because the message was provided to the system. Once the processing occurs, it can log that the operation was in fact completed, or perhaps send out a message for another auditor to record. This gives you a better metric to analyze for your system: the number of requested operations versus the number of actually completed operations. Viewed over a period of time, this metric shows the actual successful throughput of your system.
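A small Python sketch of that requested-versus-completed comparison follows. It assumes each event carries an id that both the audit log and the completion log record; `completion_report` and the sample ids are invented for the illustration.

```python
def completion_report(requested_ids, completed_ids):
    """Compare audited (requested) event ids with ids logged as completed.

    Returns the completion rate and the sorted ids that were requested
    but have no completion record yet.
    """
    requested = set(requested_ids)
    completed = set(completed_ids) & requested  # ignore completions we never audited
    outstanding = sorted(requested - completed)
    rate = len(completed) / len(requested) if requested else 1.0
    return rate, outstanding


# Ids written by the audit consumer vs. ids the processing consumer reported done.
audited = ["e1", "e2", "e3", "e4"]
finished = ["e1", "e3", "e4"]

rate, outstanding = completion_report(audited, finished)
print(rate, outstanding)
```

An id that stays in `outstanding` for longer than your normal processing delay is a candidate for investigation, for example by looking in the dead-letter queue.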