Samza: Delay processing of messages until timestamp

1.1k Views Asked by At

I'm processing messages from a Kafka topic with Samza. Some of the messages come with a timestamp in the future and I'd like to postpone the processing until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.

What I tried to do is make my Task queue the messages and implement the WindowableTask to periodically check the messages if their timestamp allows to process them. The basic idea looks like this:

public class MyTask implements StreamTask, WindowableTask {

    private HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);

        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }

    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        for (MyMessage message : waitingMessages) {
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing and remove the message from the set
            }
        }
    }
}

This obviously has some downsides. I'd be losing my waiting messages in memory when I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to reemit the messages to the same topic again and again until I can finally process them? We're talking about delaying the processing for a few minutes up to 1-2 hours here.

2

There are 2 best solutions below

0
On BEST ANSWER

I think you could use key-value store of Samza to keep state of your task instance instead of in-memory Set. It should look something like:

public class MyTask implements StreamTask, WindowableTask, InitableTask {

  private KeyValueStore<String, MyMessage> waitingMessages;


  @SuppressWarnings("unchecked")
  @Override
  public void init(Config config, TaskContext context) throws Exception {
    this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
  }

  @Override
  public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
      TaskCoordinator taskCoordinator) {
    byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
    MyMessage parsedMessage = MyMessage.parseFrom(message);

    if (parsedMessage.getValidFromDateTime().isBefore(LocalDate.now())) {
      // Do the processing
    } else {
      waitingMessages.put(parsedMessage.getId(), parsedMessage);
    }

  }

  @Override
  public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
    KeyValueIterator<String, MyMessage> all = waitingMessages.all();
    while(all.hasNext()) {
      MyMessage message = all.next().getValue();
      // Do the processing and remove the message from the set
    }
  }

}

If you redeploy you task Samza should recreate state of key-value store (Samza keeps values in special kafka topic related to key-value store). You need of course provide some extra configuration of your store (in above example for messages-store).

You could read about key-value store here (for the latest Samza version): https://samza.apache.org/learn/documentation/0.14/container/state-management.html

0
On

It's important to keep in mind, when dealing with message queues, is that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. It is expected that a properly-functioning message queue will deliver messages on demand. What this implies is that as soon as a message reaches the head of the queue, the next pull on the queue will yield the message.

Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.

So, in a system where a delay is necessary (for example, to join/wait for a parallel operation to complete), you should be looking at other methods. Typically a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database - a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.