Process of cleaning data that is in a bad state
I understand that topics are immutable.
Let's say your topic is in a bad state. Sections of data that are out of order, duplicate records, etc. What is the process of cleaning up that data? How does this process impact downstream consumers?
I see a few different ways to handle this:
The consumers don't listen to that first topic, but rather listen to a cleaned up derivative.
Version the topic and rewrite the data with the de-dupe logic applied. Then have the consumers change which topic they listen to. But then I run into the situation where records are either buffered or interleaved with older records while new records continue to come in.
What are some other ways this situation is handled?
apache-kafka stream-processing
add a comment |
I understand that topics are immutable.
Let's say your topic is in a bad state. Sections of data that are out of order, duplicate records, etc. What is the process of cleaning up that data? How does this process impact downstream consumers?
I see a few different ways to handle this:
The consumers don't listen to that first topic, but rather listen to a cleaned up derivative.
Version the topic and rewrite the data with the de-dupe logic applied. Then have the consumers change which topic they listen to. But then I run into the situation where records are either buffered or interleaved with older records while new records continue to come in.
What are some other ways this situation is handled?
apache-kafka stream-processing
add a comment |
I understand that topics are immutable.
Let's say your topic is in a bad state. Sections of data that are out of order, duplicate records, etc. What is the process of cleaning up that data? How does this process impact downstream consumers?
I see a few different ways to handle this:
The consumers don't listen to that first topic, but rather listen to a cleaned up derivative.
Version the topic and rewrite the data with the de-dupe logic applied. Then have the consumers change which topic they listen to. But then I run into the situation where records are either buffered or interleaved with older records while new records continue to come in.
What are some other ways this situation is handled?
apache-kafka stream-processing
I understand that topics are immutable.
Let's say your topic is in a bad state. Sections of data that are out of order, duplicate records, etc. What is the process of cleaning up that data? How does this process impact downstream consumers?
I see a few different ways to handle this:
The consumers don't listen to that first topic, but rather listen to a cleaned up derivative.
Version the topic and rewrite the data with the de-dupe logic applied. Then have the consumers change which topic they listen to. But then I run into the situation where records are either buffered or interleaved with older records while new records continue to come in.
What are some other ways this situation is handled?
apache-kafka stream-processing
apache-kafka stream-processing
edited Nov 12 '18 at 23:43
cricket_007
79.5k1142109
79.5k1142109
asked Nov 12 '18 at 18:37
BDig
334
334
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
Sounds like the data flow architecture is not idempotent. Data is never out of order or duplicated by Kafka, there would be a problem with the producer. Kafka automatically removed data from topics post retention period, so just wait till that period for cleanup if you are worried about existing data only. Once data is deleted by Kafka, any consumer lagging in reads (i.e. wants to read from deleted offset) will have to set auto.offset.reset from earliest or latest otherwise consumer will issue OffsetOutOfRange error.
Meanwhile, if you can skip records and start polling for specific offset/partition by using consumer.seek(partition, offset)
Solution would depend upon your business logic and incoming data pattern, but you will be better off by solving producer issues rather than handling it in consumer.
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53268164%2fprocess-of-cleaning-data-that-is-in-a-bad-state%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Sounds like the data flow architecture is not idempotent. Data is never out of order or duplicated by Kafka, there would be a problem with the producer. Kafka automatically removed data from topics post retention period, so just wait till that period for cleanup if you are worried about existing data only. Once data is deleted by Kafka, any consumer lagging in reads (i.e. wants to read from deleted offset) will have to set auto.offset.reset from earliest or latest otherwise consumer will issue OffsetOutOfRange error.
Meanwhile, if you can skip records and start polling for specific offset/partition by using consumer.seek(partition, offset)
Solution would depend upon your business logic and incoming data pattern, but you will be better off by solving producer issues rather than handling it in consumer.
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
add a comment |
Sounds like the data flow architecture is not idempotent. Data is never out of order or duplicated by Kafka, there would be a problem with the producer. Kafka automatically removed data from topics post retention period, so just wait till that period for cleanup if you are worried about existing data only. Once data is deleted by Kafka, any consumer lagging in reads (i.e. wants to read from deleted offset) will have to set auto.offset.reset from earliest or latest otherwise consumer will issue OffsetOutOfRange error.
Meanwhile, if you can skip records and start polling for specific offset/partition by using consumer.seek(partition, offset)
Solution would depend upon your business logic and incoming data pattern, but you will be better off by solving producer issues rather than handling it in consumer.
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
add a comment |
Sounds like the data flow architecture is not idempotent. Data is never out of order or duplicated by Kafka, there would be a problem with the producer. Kafka automatically removed data from topics post retention period, so just wait till that period for cleanup if you are worried about existing data only. Once data is deleted by Kafka, any consumer lagging in reads (i.e. wants to read from deleted offset) will have to set auto.offset.reset from earliest or latest otherwise consumer will issue OffsetOutOfRange error.
Meanwhile, if you can skip records and start polling for specific offset/partition by using consumer.seek(partition, offset)
Solution would depend upon your business logic and incoming data pattern, but you will be better off by solving producer issues rather than handling it in consumer.
Sounds like the data flow architecture is not idempotent. Data is never out of order or duplicated by Kafka, there would be a problem with the producer. Kafka automatically removed data from topics post retention period, so just wait till that period for cleanup if you are worried about existing data only. Once data is deleted by Kafka, any consumer lagging in reads (i.e. wants to read from deleted offset) will have to set auto.offset.reset from earliest or latest otherwise consumer will issue OffsetOutOfRange error.
Meanwhile, if you can skip records and start polling for specific offset/partition by using consumer.seek(partition, offset)
Solution would depend upon your business logic and incoming data pattern, but you will be better off by solving producer issues rather than handling it in consumer.
answered Nov 12 '18 at 23:15
AbhishekN
23616
23616
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
add a comment |
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
– BDig
Nov 13 '18 at 22:05
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53268164%2fprocess-of-cleaning-data-that-is-in-a-bad-state%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown