Parallel polling of the AWS SQS standard queue - Message processing is too slow



























I have a module that polls an AWS SQS queue at a fixed interval, fetching one message at a time with a ReceiveMessageRequest. This is the method:



public static ReceiveMessageResult receiveMessageFromQueue() {
    // sqsClient is a pre-configured AmazonSQS client shared by the module
    String targetedQueueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();
    ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(targetedQueueUrl)
            .withWaitTimeSeconds(10).withMaxNumberOfMessages(1);
    return sqsClient.receiveMessage(receiveMessageRequest);
}


Once a message has been received and processed, it is deleted from the queue; the delete call returns a DeleteMessageResult:



public static DeleteMessageResult deleteMessageFromQueue(String receiptHandle) {
    log.info("Deleting Message with receipt handle - [{}]", receiptHandle);
    String targetedQueueUrl = sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();
    return sqsClient.deleteMessage(new DeleteMessageRequest(targetedQueueUrl, receiptHandle));
}


I've packaged this as an executable JAR that is deployed to around 40 EC2 instances, all actively polling the queue, and I can see that each of them receives messages.
However, the AWS SQS console only ever shows 0, 1, 2 or 3 in the 'Messages in Flight' column. Why is that, when 40+ different consumers are receiving messages from the queue? The number of messages available in the queue also goes down very slowly.



Following are the configuration parameters of the queue.



Default Visibility Timeout: 30 seconds
Message Retention Period: 4 days
Maximum Message Size: 256 KB
Receive Message Wait Time: 0 seconds
Messages Available (Visible): 4,776
Delivery Delay: 0 seconds
Messages in Flight (Not Visible): 2
Queue Type: Standard
Messages Delayed: 0
Content-Based Deduplication: N/A


Why are the messages not being processed quickly even though there are multiple consumers? Do I need to change any of the queue parameters, or something in the receive/delete message requests? Please advise.



UPDATE:



All the EC2 instances and the SQS queue are in the same region. The consumer (the JAR that polls the queue) is started from each EC2 instance's start-up script and runs a scheduled task that polls the queue every 12 seconds. Before I push messages to the queue I spin up 2-3 instances (there may already be some running instances at that time, which adds to the number of receivers, capped at 50 for the queue). On receiving a message, the consumer does some work (DB operations, data analysis and calculations, report file generation, uploading the report to S3, etc.) that takes approximately 10-12 seconds, and then deletes the message from the queue. The image below is a screenshot of the SQS metrics for the last week (from the SQS monitoring console).



[Screenshot: SQS metrics for the targeted queue over the last week]

Tags: java, amazon-ec2, aws-sdk, amazon-sqs

asked Nov 23 '18 at 13:28 by Master Po; edited Nov 27 '18 at 13:29

  • Hm - assuming a host processes one message every 30 seconds, a single host should be able to handle around 3,000 messages per day. Based on your ApproximateAgeOfOldestMessage metric, it looks like there are long periods during which little to no work is being done (the gradual rises) - what's going on there?

    – Krease
    Nov 27 '18 at 17:24
1 Answer

I'll do the best I can with the information given. More details about your processing loop logic, region setup, and metrics (see below) would help improve this answer.




I've packaged this as an executable JAR that is deployed to around 40 EC2 instances, all actively polling the queue, and I can see that each of them receives messages. However, the AWS SQS console only ever shows 0, 1, 2 or 3 in the 'Messages in Flight' column. Why is that, when 40+ different consumers are receiving messages from the queue? The number of messages available in the queue also goes down very slowly.



Why are the messages not being processed quickly even though there are multiple consumers? Do I need to change any of the queue parameters, or something in the receive/delete message requests?




The fact that you're not seeing in-flight numbers that correspond more closely with the number of hosts you have processing messages definitely points to a problem - either your message processing is blazing fast (which doesn't seem to be the case) or your hosts aren't doing the work you think they are.



In general, fetching and deleting a single message from SQS should take on the order of a few milliseconds. Without more detail on your setup, the following should get you started on troubleshooting. (Some of these steps may seem obvious, but every single one of them has been the source of real-life problems I've seen developers run into.)




  1. If you're launching a new process for each receive-process-delete cycle, this overhead will slow you down substantially. I'll assume you're not doing this, and that each host is running a loop within a single process.

  2. Verify your processing loop isn't crashing and restarting on you (effectively turning it into the above case).


    • I assume you've also verified that your processes aren't doing a bunch of work outside of message processing.



  3. You should generate some client-side metrics to show how long the SQS requests are taking on each host (a timing sketch follows this list).


    • CloudWatch will partly do this for you, but actual client-side metrics are always useful.

    • Recommended basic metrics: (1) receive latency, (2) processing latency, (3) delete latency, (4) end-to-end message-loop latency, and (5) success/failure counters.



  4. Your EC2 instances (the hosts doing the processing) should be in the same region as the SQS queue. If you're doing cross-region calls, this will impact your latency.


    • Make sure these hosts have adequate CPU/memory resources to do the processing

    • As an optimization, I recommend using more threads per host and fewer hosts - reusing client connections and maximizing usage of your compute resources is always better.



  5. Verify there wasn't some outage or ongoing issue when you were running your test

  6. Perform getQueueUrl just once for the lifetime of your app, during some initialization step. You don't need to call it repeatedly, as it'll be the same URL (a caching sketch follows this list).



    • This was actually the first thing I noticed in your code, but it's way down here because the above issues will have more impact if they are the cause.



  7. If your message processing is incredibly short (less time than it takes to retrieve and delete a message), then you will end up with your hosts spending most of their time fetching messages. Metrics on this are important too.


    • In this case, you should probably do batch fetching instead of one-at-a-time (a batch sketch follows this list).

    • Based on the number of messages in your queue and the comment that it's going slowly, it sounds like this isn't the case.



  8. Verify all of your hosts are actually hitting the same queue (and not some beta/gamma version, or an older version you used for testing at one point).

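To make point 3 concrete, here is a minimal timing sketch (not from the original post). It reuses the question's receiveMessageFromQueue/deleteMessageFromQueue helpers, its static sqsClient and log, and a hypothetical processMessage(Message) method standing in for the real work:

public static void pollOnceWithTiming() {
    // Message/ReceiveMessageResult come from com.amazonaws.services.sqs.model, as in the question's code.
    long t0 = System.nanoTime();
    ReceiveMessageResult result = receiveMessageFromQueue();
    long receiveMs = (System.nanoTime() - t0) / 1_000_000;

    for (Message message : result.getMessages()) {
        long t1 = System.nanoTime();
        processMessage(message);                     // hypothetical: the 10-12 s of real work
        long processMs = (System.nanoTime() - t1) / 1_000_000;

        long t2 = System.nanoTime();
        deleteMessageFromQueue(message.getReceiptHandle());
        long deleteMs = (System.nanoTime() - t2) / 1_000_000;

        log.info("receive={}ms process={}ms delete={}ms", receiveMs, processMs, deleteMs);
    }
}

If the receive or delete timings come out far above a few milliseconds, that points at networking or cross-region issues rather than SQS itself.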

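For point 6, a sketch of resolving the queue URL once and reusing it (this assumes the question's static AmazonSQS sqsClient is already initialized when the class loads; otherwise resolve the URL in an explicit init step instead of a constant):

// Resolve the queue URL once; it never changes for a given queue name.
private static final String QUEUE_URL =
        sqsClient.getQueueUrl("myAWSqueueName").getQueueUrl();

public static ReceiveMessageResult receiveMessageFromQueue() {
    ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(QUEUE_URL)
            .withWaitTimeSeconds(10).withMaxNumberOfMessages(1);
    return sqsClient.receiveMessage(receiveMessageRequest);
}

public static DeleteMessageResult deleteMessageFromQueue(String receiptHandle) {
    log.info("Deleting Message with receipt handle - [{}]", receiptHandle);
    return sqsClient.deleteMessage(new DeleteMessageRequest(QUEUE_URL, receiptHandle));
}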

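And for point 7, a sketch of batch fetching and batch deleting, reusing the QUEUE_URL constant from the previous sketch. DeleteMessageBatchRequest and DeleteMessageBatchRequestEntry are standard classes in the v1 SDK's com.amazonaws.services.sqs.model package; processMessage is again a hypothetical stand-in:

public static void pollBatch() {
    ReceiveMessageRequest request = new ReceiveMessageRequest(QUEUE_URL)
            .withWaitTimeSeconds(10).withMaxNumberOfMessages(10);   // 10 is the per-call maximum
    List<Message> messages = sqsClient.receiveMessage(request).getMessages();
    if (messages.isEmpty()) {
        return;
    }

    List<DeleteMessageBatchRequestEntry> entries = new ArrayList<>();
    for (Message message : messages) {
        processMessage(message);                                    // hypothetical processing
        entries.add(new DeleteMessageBatchRequestEntry(message.getMessageId(),
                message.getReceiptHandle()));
    }
    sqsClient.deleteMessageBatch(new DeleteMessageBatchRequest(QUEUE_URL, entries));
}

As noted above, with 10-12 seconds of processing per message this is unlikely to be the bottleneck in your case, but it cuts per-message request overhead when processing is cheap.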

Further note:




  • The other answer suggests the visibility timeout as a potential cause - this is flat-out wrong. The visibility timeout does not block the queue; it only controls how long a message remains "in flight" (invisible) before another ReceiveMessageRequest can receive it again.

  • You'd consider reducing it only if you wanted messages to become visible for reprocessing sooner in the event of errors or slow processors.






answered Nov 23 '18 at 23:54 by Krease

  • Thanks a lot for the detailed explanation. I've updated the question with the SQS queue metrics and some consumer-module details. As you suggested, I'll modify the receiver module to fetch the queue URL only once, and I'll also try to capture these metrics in the consumer.

    – Master Po
    Nov 27 '18 at 13:31













  • Your update mentions "a scheduled task that polls the queue every 12 seconds" - does that mean this is a new process each time? I suggest having your main process run in a while loop - while you continue to receive a non-empty message from SQS, process that message. Otherwise, sleep for X seconds/minutes and try again. Keep the process (and all the initializations / connections / etc) alive as much as possible.

    – Krease
    Nov 27 '18 at 18:20
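To illustrate the suggestion in the comment above, together with point 4 of the answer (one long-lived process per host, several polling threads), a minimal sketch might look like the following. This is not the original poster's code: the thread count is an arbitrary assumption to tune per instance, processMessage again stands in for the DB/report/S3 work, and the receive/delete helpers are the ones from the question.

// Uses java.util.concurrent.Executors/ExecutorService plus the question's SQS helpers and log.
public class SqsWorker {

    private static final int POLLER_THREADS = 4;   // assumption: tune to the instance size

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(POLLER_THREADS);
        for (int i = 0; i < POLLER_THREADS; i++) {
            pool.submit(SqsWorker::pollForever);
        }
    }

    private static void pollForever() {
        while (true) {
            try {
                // WaitTimeSeconds(10) long-polls, so an empty queue costs one cheap request
                // every ~10 seconds instead of a tight loop of empty receives.
                ReceiveMessageResult result = receiveMessageFromQueue();
                for (Message message : result.getMessages()) {
                    processMessage(message);                    // the 10-12 s of real work
                    deleteMessageFromQueue(message.getReceiptHandle());
                }
            } catch (Exception e) {
                log.error("Polling iteration failed; continuing", e);
            }
        }
    }
}

Compared with waking up from a scheduler every 12 seconds, this keeps the SQS client, connections and JVM warm, and with ~40 hosts running a few threads each the 'Messages in Flight' count should track the number of busy workers far more closely.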











