Process of cleaning data that is in a bad state












I understand that topics are immutable.

Let's say your topic is in a bad state: sections of data are out of order, there are duplicate records, and so on. What is the process for cleaning up that data? How does this process impact downstream consumers?

I see a few different ways to handle this:

  1. The consumers don't listen to the original topic, but rather to a cleaned-up derivative.

  2. Version the topic and rewrite the data with the de-dupe logic applied, then have the consumers change which topic they listen to. But then I run into the situation where records are either buffered or interleaved with older records while new records continue to come in.

What are some other ways this situation is handled?
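To make the "cleaned-up derivative" idea concrete, here is a minimal sketch of the de-dupe and re-ordering logic in plain Python, with no Kafka client involved. It assumes each record carries a unique `record_id` and an `event_time`; both fields, the `Record` type, and the `clean` function are hypothetical names used only for illustration. Re-ordering is limited to a bounded window, so records that arrive more than `window` positions late would still come out of order.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Set, Tuple


@dataclass(frozen=True)
class Record:
    record_id: str   # hypothetical unique id assigned by the producer
    event_time: int  # hypothetical event timestamp used for ordering
    value: str


def clean(records: Iterable[Record], window: int = 1000) -> Iterator[Record]:
    """Yield records with duplicates dropped and order repaired within a window."""
    seen: Set[str] = set()                      # grows with the number of distinct ids
    heap: List[Tuple[int, int, Record]] = []    # (event_time, arrival index, record)
    for index, record in enumerate(records):
        if record.record_id in seen:
            continue                            # duplicate: skip it
        seen.add(record.record_id)
        heapq.heappush(heap, (record.event_time, index, record))
        if len(heap) > window:
            # The buffer is full, so emit the oldest buffered record.
            yield heapq.heappop(heap)[2]
    while heap:
        # Input is exhausted: drain the buffer in event-time order.
        yield heapq.heappop(heap)[2]
```

In a real pipeline the input would be the records read from the original topic and the output would be written to the derived topic; the unbounded `seen` set is acceptable for a one-off rewrite of a finite backlog but would need an expiry policy for a long-running job.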










      apache-kafka stream-processing






edited Nov 12 '18 at 23:43 by cricket_007

asked Nov 12 '18 at 18:37 by BDig
























          1 Answer

It sounds like the data flow architecture is not idempotent. Kafka itself does not reorder or duplicate records within a partition once they are written, so the problem lies with the producer. Kafka automatically removes data from topics after the retention period, so if you are only worried about the existing data, you can simply wait until that period has passed. Once Kafka has deleted the data, any consumer that is lagging behind (i.e. wants to read from a deleted offset) must have auto.offset.reset set to earliest or latest; otherwise the consumer will raise an OffsetOutOfRange error.

Meanwhile, if you can afford to skip records, you can start polling from a specific partition and offset by using consumer.seek(partition, offset).

The solution depends on your business logic and incoming data pattern, but you will be better off fixing the producer issues than handling them in the consumer.
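As a rough illustration of what "idempotent" means here, the sketch below applies records to an in-memory state in such a way that replaying a duplicate or a stale record leaves the state unchanged. It assumes every record carries a per-key `version` that increases with each update; that field and the `apply_record` function are hypothetical and not part of any Kafka API.

```python
from typing import Dict, Tuple

# key -> (version, value); a stand-in for whatever store the data flows into
State = Dict[str, Tuple[int, str]]


def apply_record(state: State, key: str, version: int, value: str) -> bool:
    """Store value for key only if version is newer than what is already stored.

    Returns True when the state changed, False when the record was a
    duplicate or arrived after a newer version had been applied.
    """
    current = state.get(key)
    if current is not None and current[0] >= version:
        return False  # duplicate or stale record: ignore it
    state[key] = (version, value)
    return True
```

With a rule like this, duplicated or late records in the topic are harmless to the final state, although they still cost storage and processing, which is one reason fixing the producer remains preferable.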






          • Yes, definitely a problem with the producers. The producer has duplicated data and in some cases has produced records significantly out of order. Is there a generally accepted process for repairing the topic? Or do you just version the topic and move the consumers to the new clean topic? In this case, the topic is expected to contain all the data as a source of truth.
            – BDig
            Nov 13 '18 at 22:05











answered Nov 12 '18 at 23:15 by AbhishekN











