AWS Simple Queue Service - Visibility Timeout
I had an application that pulled asynchronous work by receiving messages from a queue. The setup is simple.
A producer pushes a message to the queue, and a single consumer pulls it and processes it.


The problem I hit was this: a consumer pulled a message and started processing it, and partway through, auto scaling scaled in and killed the instance. No error was thrown, and the message stayed invisible for the whole configured VisibilityTimeout.
In my specific case that was a lot: X hours.
The result: the message does not become visible in the queue again for X hours. In a high-demand message processing system, you cannot afford that.
The solution
Create a class that acts as an orchestrator: it takes care of the message and keeps signaling to SQS that the message is still being processed, by extending its visibility timeout a little at a time.
Below is the implementation:
using System;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public interface ISqsMessageVisibilityHeartBeat
{
    Task BeatAsync();
}

public class SqsMessageVisibilityHeartBeat : ISqsMessageVisibilityHeartBeat
{
    private readonly IAmazonSQS _amazonSqsClient;
    private readonly string _queueUrl;
    private readonly string _messageReceipt;
    private readonly TimeSpan _newVisibilityTimeout;
    private readonly TimeSpan _refreshAfter;

    // MinValue guarantees that the first BeatAsync call always reaches SQS.
    private DateTime _lastBeat = DateTime.MinValue;

    public SqsMessageVisibilityHeartBeat(
        string queueUrl,
        string messageReceipt,
        TimeSpan newVisibilityTimeout,
        TimeSpan refreshAfter)
    {
        _amazonSqsClient = new AmazonSQSClient();
        _queueUrl = queueUrl;
        _messageReceipt = messageReceipt;
        _newVisibilityTimeout = newVisibilityTimeout;
        _refreshAfter = refreshAfter;
    }

    private TimeSpan GetTimeBetweenLastCall() => DateTime.UtcNow.Subtract(_lastBeat);

    public async Task BeatAsync()
    {
        // Only call SQS when at least refreshAfter has passed since the last beat.
        if (GetTimeBetweenLastCall() >= _refreshAfter)
        {
            var req = new ChangeMessageVisibilityRequest
            {
                QueueUrl = _queueUrl,
                ReceiptHandle = _messageReceipt,
                VisibilityTimeout = (int)_newVisibilityTimeout.TotalSeconds
            };
            await _amazonSqsClient.ChangeMessageVisibilityAsync(req);
            // Record the beat only after SQS accepted it, so a failed call is retried on the next beat.
            _lastBeat = DateTime.UtcNow;
        }
    }
}
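You construct the object like this:
SqsMessageVisibilityHeartBeat messageHeartBeat = new ("queueUrl", "receipt", TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(10));
Note that 'refreshAfter' must be shorter than 'newVisibilityTimeout', otherwise the message becomes visible again between two beats. Here the timeout is reset to 30 seconds at most once every 10 seconds.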
Then change the processing code to call the heartbeat inside the loop, and await it so that failures are not silently lost:
foreach(...)
{
    await messageHeartBeat.BeatAsync();
}
Internally, BeatAsync skips the call until the 'refreshAfter' interval you passed to the constructor has elapsed. This throttling matters because calls to the AWS API are not free, and you do not want to make one for every item in your foreach loop, only from time to time.
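To see the throttling on its own, here is a minimal sketch with the SQS call replaced by an arbitrary action and the clock made injectable. The class name ThrottledBeat and its constructor parameters are hypothetical, chosen only for this illustration; they are not part of the AWS SDK.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: runs an action at most once per refreshAfter interval.
public sealed class ThrottledBeat
{
    private readonly Func<Task> _action;      // e.g. a call to ChangeMessageVisibilityAsync
    private readonly TimeSpan _refreshAfter;  // minimum time between two real calls
    private readonly Func<DateTime> _utcNow;  // injectable clock, so timing can be simulated
    private DateTime _lastBeat = DateTime.MinValue;

    public ThrottledBeat(Func<Task> action, TimeSpan refreshAfter, Func<DateTime> utcNow)
    {
        _action = action;
        _refreshAfter = refreshAfter;
        _utcNow = utcNow;
    }

    // Returns true when the action actually ran, false when the call was skipped.
    public async Task<bool> BeatAsync()
    {
        if (_utcNow() - _lastBeat < _refreshAfter)
        {
            return false;
        }

        await _action();
        _lastBeat = _utcNow();
        return true;
    }
}
```

With a 10 second interval and a loop that calls BeatAsync once per simulated second for 100 seconds, the action runs 10 times and the other 90 calls return immediately.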
If anything goes wrong and the application is killed before it deletes the message, the message becomes visible again after at most 30 seconds.
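Putting the pieces together, the full life cycle of one message could look like the sketch below: receive, beat while working, delete at the end. It assumes the SqsMessageVisibilityHeartBeat class from above and the AWSSDK.SQS package. QueueConsumer, splitIntoItems and processItemAsync are hypothetical names standing in for your own application code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class QueueConsumer
{
    // splitIntoItems and processItemAsync are placeholders for the application's own work.
    public static async Task ConsumeOnceAsync(
        IAmazonSQS sqs,
        string queueUrl,
        Func<Message, IEnumerable<string>> splitIntoItems,
        Func<string, Task> processItemAsync,
        CancellationToken ct = default)
    {
        var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl = queueUrl,
            MaxNumberOfMessages = 1,
            WaitTimeSeconds = 20 // long polling: wait up to 20 seconds for a message
        }, ct);

        // Some SDK versions return null instead of an empty list when nothing was received.
        foreach (var message in response.Messages ?? new List<Message>())
        {
            var heartBeat = new SqsMessageVisibilityHeartBeat(
                queueUrl,
                message.ReceiptHandle,
                TimeSpan.FromSeconds(30),   // visibility timeout set on every beat
                TimeSpan.FromSeconds(10));  // at most one SQS call every 10 seconds

            foreach (var item in splitIntoItems(message))
            {
                await heartBeat.BeatAsync();
                await processItemAsync(item);
            }

            // Deleting the message is what tells SQS the work is finished.
            await sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle, ct);
        }
    }
}
```

One limitation of this sketch: the heartbeat only runs between items, so a single item that takes longer than the visibility timeout would still let the message reappear. If your items can be that slow, a background timer that beats independently of the loop is probably a better fit.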
In my case I joined a project where this was already running with a visibility timeout of X hours (because nobody knew how long processing would take: hours in some cases, minutes in others). That blocked a lot of messages and caused the whole system to crash.
With this solution in place, the queue no longer holds invisible messages for that long, and consumers can pick them up again almost immediately.
This can be even worse with a FIFO queue. FIFO queues deliver messages in order within a message group, so while the first message is in flight, the messages that arrived after it in the same group cannot be processed. With a visibility timeout that is too high, the stuck message stays at the head of the queue and keeps the others waiting, so the queue fills up with blocked messages.
You should also always configure a DLQ (dead-letter queue), usually named after the main queue:
messages-processing-queue
messages-processing-queue-errors
The second queue stores the messages that failed, so you can troubleshoot them in isolation, separate from the main queue.
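As a rough sketch of how the two queues are linked: SQS reads a queue attribute named RedrivePolicy, a small JSON document holding the ARN of the dead-letter queue and a maxReceiveCount. Once a message has been received that many times without being deleted, SQS moves it to the dead-letter queue. The class and method names below are hypothetical, and the ARN and count are values you supply yourself.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public static class DeadLetterQueueSetup
{
    // Builds the JSON value for the RedrivePolicy queue attribute.
    public static string BuildRedrivePolicy(string deadLetterQueueArn, int maxReceiveCount) =>
        JsonSerializer.Serialize(new Dictionary<string, string>
        {
            ["deadLetterTargetArn"] = deadLetterQueueArn,
            ["maxReceiveCount"] = maxReceiveCount.ToString()
        });

    // Attaches the dead-letter queue to the main queue.
    public static Task AttachAsync(
        IAmazonSQS sqs,
        string mainQueueUrl,
        string deadLetterQueueArn,
        int maxReceiveCount) =>
        sqs.SetQueueAttributesAsync(new SetQueueAttributesRequest
        {
            QueueUrl = mainQueueUrl,
            Attributes = new Dictionary<string, string>
            {
                ["RedrivePolicy"] = BuildRedrivePolicy(deadLetterQueueArn, maxReceiveCount)
            }
        });
}
```

This works well with the short visibility timeout: a message that keeps killing its consumer reappears quickly, reaches maxReceiveCount after a few attempts, and lands in messages-processing-queue-errors instead of being retried forever.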