design of a scalable, stateful python RabbitMQ consumer
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.
One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.
Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.
- For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.
- For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.
Therefore, for [2] I can see two options:
- use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit. - use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.
Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.
Any advice on how to best implement a scalable architecture?
python multithreading rabbitmq multiprocessing
add a comment |
I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.
One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.
Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.
- For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.
- For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.
Therefore, for [2] I can see two options:
- use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit. - use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.
Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.
Any advice on how to best implement a scalable architecture?
python multithreading rabbitmq multiprocessing
add a comment |
I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.
One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.
Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.
- For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.
- For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.
Therefore, for [2] I can see two options:
- use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit. - use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.
Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.
Any advice on how to best implement a scalable architecture?
python multithreading rabbitmq multiprocessing
I am implementing a Natural Language Processing product consisting of multiple applications on cloud foundry.
These applications are CPU-intensive (as opposed to I/O intensive), communicate over rabbitMQ and are mostly developed in Python3.
One of these Python3 applications [1] consumes a rabbit queue and simply performs a given task based on the payload of the message received.
This payload is self-contained, i.e. it contains all the information required by the application to perform its job.Another application [2] needs to wait for certain conditions to be met: for instance it needs first to receive a message for 'topic1', then one for 'topic2', and then the app can perform its job. The processing will make use of both payloads from the two messages.
Now the question is how to ensure that both [1] and [2] can be scaled up to keep up with increased message volume.
- For [1], it should be just enough to deploy the same application multiple times (i.e. multiple workers), and let each consume the same queue in round robin fashion.
- For [2], state (i.e. seen messages) should be maintained, so it's not enough to simply adopt the same approach as above. That is because the message about 'topic1' could be processed by the first worker, 'topic2' by the second worker and each worker would not know about the other message, in absence of additional data structures.
Therefore, for [2] I can see two options:
- use multiprocessing (not multithreading because of the GIL found in several Python runtimes): only a single worker of [2] consumes the rabbit queue, each message is dispatched to a different process.
Statefulness is achieved through thread-safe in-memory data structures, shared between different processes. In this case the parallelism is transparent to rabbit. - use an external storage (like Redis or an SQL db): multiple workers of [2] consume the queue in round-robin fashion, but share the same storage (similar role as the in-memory data structures above).
In this case the parallelism is known to rabbit because the queue will have multiple consumers.
Messages will mostly consists of plain text and serialized numpy 300-dimensional word embeddings.
Volume could reach ~10k messages per second. Processing would consist of running Neural Network or NLP libraries.
Any advice on how to best implement a scalable architecture?
python multithreading rabbitmq multiprocessing
python multithreading rabbitmq multiprocessing
asked Nov 23 '18 at 17:15
MoZZoZoZZoMoZZoZoZZo
10810
10810
add a comment |
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450708%2fdesign-of-a-scalable-stateful-python-rabbitmq-consumer%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53450708%2fdesign-of-a-scalable-stateful-python-rabbitmq-consumer%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown