Analysis of real-time streaming data



























This is a relatively broad question, and I am aware of the tools I would possibly need for a problem like this (for example Spark, Kafka and Hadoop), but I am looking for a concrete vision from an experienced professional's perspective.



Here's what the problem at hand looks like:



We are using a Google Analytics-like service, which sends us a stream of events. An event is an action performed on the page: it could be a click on a button, a mouse movement, a page scroll, or a custom event defined by us.



{
"query_params":[

],
"device_type":"Desktop",
"browser_string":"Chrome 47.0.2526",
"ip":"62.82.34.0",
"screen_colors":"24",
"os":"Mac OS X",
"browser_version":"47.0.2526",
"session":1,
"country_code":"ES",
"document_encoding":"UTF-8",
"city":"Palma De Mallorca",
"tz":"Europe/Madrid",
"uuid":"A37F2D3A4B99FF003132D662EFEEAFCA",
"combination_goals_facet_term":"c2_g1",
"ts":1452015428,
"hour_of_day":17,
"os_version":"10.11.2",
"experiment":465,
"user_time":"2016-01-05T17:37:10.675000",
"direct_traffic":false,
"combination":"2",
"search_traffic":false,
"returning_visitor":false,
"hit_time":"2016-01-05T17:37:08",
"user_language":"es",
"device":"Other",
"active_goals":[
1
],
"account":196,
"url”:”http://someurl.com”,
“action”:”click”,
"country":"Spain",
"region":"Islas Baleares",
"day_of_week":"Tuesday",
"converted_goals":[

],
"social_traffic":false,
"converted_goals_info":[

],
"referrer”:”http://www.google.com”,
"browser":"Chrome",
"ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
"email_traffic":false
}


Now we need to build a solution to analyse this data: a reporting platform that can aggregate, filter, and slice and dice it.



One example of a report we need to build is:



Show me all the users who are coming from the United States, are using the Chrome browser, and are using that browser on an iPhone.



or



Show me the sum of clicks on a particular button for all the users who come from referrer = "http://www.google.com", are based in India, and are using a desktop.
In one day this service sends out millions of such events, amounting to gigabytes of data per day.



Here are the specific doubts I have:




  • How should we store this huge amount of data?

  • How should we enable ourselves to analyse the data in real time?

  • How should the query system work here? (I am relatively clueless about this part.)

  • If we are looking at maintaining about 4 TB of data, which we estimate to accumulate over 3 months, what should be the strategy for retaining it? When and how should we delete it?










Tags: hbase, bigdata, streaming, spark-streaming, apache-kafka-streams






asked Nov 23 '18 at 15:14 by Simran kaur

2 Answers

1. How should we store this huge amount of data?


Use one of the cloud storage providers (e.g. S3 or GCS). Partition the data by date and hour (date=2018-11-25/hour=16); this will reduce the amount of data read per query. Store the data in a binary columnar format like Parquet or ORC, which will give you better performance and a better compression ratio.
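
A minimal sketch of that layout with PySpark, assuming raw JSON events with the unix `ts` field from the sample payload; the bucket paths are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-batch-writer").getOrCreate()

# Hypothetical location of raw JSON events, one object per line
events_df = spark.read.json("s3a://my-bucket/raw-events/")

partitioned = (events_df
    .withColumn("event_time", F.from_unixtime(F.col("ts")).cast("timestamp"))
    .withColumn("date", F.to_date("event_time"))
    .withColumn("hour", F.hour("event_time")))

# Compressed Parquet, laid out as .../date=2018-11-25/hour=16/...
(partitioned.write
    .mode("append")
    .partitionBy("date", "hour")
    .parquet("s3a://my-bucket/events-parquet/"))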




2. How should we enable ourselves to analyse the data in real time?


You can run multiple applications listening on a Kafka topic. First, store the events to durable storage using Spark Structured Streaming (2.3 adds a continuous processing mode). This gives you the option to query and analyze historical data and to re-process events if required. You have two options here (a minimal ingestion sketch follows the list):




1. Store in HDFS/S3/GCP storage etc. Build a Hive catalog on the stored data to get a live view of the events, and use Spark/Hive/Presto to query it. Note: compaction will be required if small files are being generated.


2. Store in a wide-column store like Cassandra or HBase. I would prefer this option for this use case.
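
A minimal ingestion sketch for the first option, assuming a hypothetical Kafka broker, topic name and output paths, and using only a few of the fields from the sample event:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Requires the spark-sql-kafka package on the classpath
spark = SparkSession.builder.appName("event-ingest").getOrCreate()

# Partial schema of the event JSON shown in the question
event_schema = StructType([
    StructField("uuid", StringType()),
    StructField("ts", LongType()),
    StructField("action", StringType()),
    StructField("browser", StringType()),
    StructField("device_type", StringType()),
    StructField("country", StringType()),
    StructField("referrer", StringType()),
])

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
    .option("subscribe", "page-events")                # hypothetical topic
    .load())

events = (raw
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", F.from_unixtime(F.col("ts")).cast("timestamp"))
    .withColumn("date", F.to_date("event_time"))
    .withColumn("hour", F.hour("event_time")))

# Append partitioned Parquet; a Hive/Presto catalog can be built over this path
ingest = (events.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/events-parquet/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/ingest/")
    .partitionBy("date", "hour")
    .outputMode("append")
    .start())

ingest.awaitTermination()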



Run another Spark application in parallel for real-time analysis. If you know the dimensions and metrics on which you have to aggregate the data, use Spark Structured Streaming with windowing: group by those columns over a 1- or 5-minute window and write the results to one of the above-mentioned stores, where they can be queried in near real time.
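
A windowed-aggregation sketch, reusing the `events` streaming DataFrame and imports from the ingestion sketch above; the console sink is only there to keep the example self-contained, in practice you would write to Cassandra/HBase or another serving store:

# Clicks per 5-minute window, broken down by a few example dimensions
clicks_per_window = (events
    .withWatermark("event_time", "10 minutes")
    .filter(F.col("action") == "click")
    .groupBy(F.window("event_time", "5 minutes"),
             "country", "browser", "device_type")
    .count())

aggregates = (clicks_per_window.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start())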




3. How should the query system work here?


As mentioned in point 2, build a Hive catalog on the stored data to get a live view of the events. For reporting purposes, use Spark/Hive/Presto to query the data.
For queries on real-time data, use Cassandra or HBase as low-latency stores.
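
For example, the first report from the question could be expressed as plain SQL over the Parquet data. This assumes the `spark` session and output path from the sketches above; exact column semantics (e.g. `device` vs `device_type`) depend on your events:

# Expose the partitioned Parquet data as a queryable table
spark.read.parquet("s3a://my-bucket/events-parquet/").createOrReplaceTempView("events")

# "All users coming from the United States, using Chrome on an iPhone"
spark.sql("""
    SELECT DISTINCT uuid
    FROM events
    WHERE country = 'United States'
      AND browser = 'Chrome'
      AND device  = 'iPhone'
""").show()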




4. If we are looking at maintaining about 4 TB of data, which we estimate to accumulate over 3 months, what should be the strategy to retain this data? When and how should we delete it?


If you have partitioned the data properly, you can archive it to cold storage based on a periodic archiving rule. For example, the dimensions and metrics generated from the events can be kept online, while the raw events are archived after one month.
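
A sketch of what such a rule could look like under the date-partitioned layout assumed above; the paths and the 30-day cut-off are illustrative, and the commands are only printed so the script is a dry run:

import datetime as dt

BASE_PATH = "s3a://my-bucket/events-parquet"        # hypothetical hot path
ARCHIVE_PATH = "s3a://my-archive-bucket/events"     # hypothetical cold path
RETENTION_DAYS = 30                                 # keep raw events online for a month

cutoff = dt.date.today() - dt.timedelta(days=RETENTION_DAYS)

# One daily partition per command pair: copy to cold storage, then remove the hot copy.
for offset in range(1, 61):                         # sweep a further 60 days back
    day = cutoff - dt.timedelta(days=offset)
    src = f"{BASE_PATH}/date={day.isoformat()}"
    dst = f"{ARCHIVE_PATH}/date={day.isoformat()}"
    print(f"hadoop distcp {src} {dst}")
    print(f"hadoop fs -rm -r -skipTrash {src}")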






answered Nov 25 '18 at 6:45 by Naren

Let me attempt an answer; the best approach I know is to use Hadoop, Kafka and Spark.




1. How should we store this huge amount of data? It's real-time data, so you can stream it directly through Kafka to HDFS. More insights in point (4).


2. How should we enable ourselves to analyse the data in real time? Learn Spark. Since you are talking about terabytes, make sure you have a cluster with a good number of data nodes, and set up the Spark cluster separately if you can. Spark DStreams are very good at analyzing real-time data feeds, and Spark handles this kind of JSON data without complications.


3. How should the query system work here? Spark's SQLContext lets you write simple SQL-like queries on top of your semi-structured data for your use cases. It's as simple as SQL.
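
A minimal sketch of that idea with PySpark (in Spark 2.x the SparkSession wraps the older SQLContext; the HDFS path and the example query are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-queries").getOrCreate()

# Load the semi-structured events (one JSON object per line) straight into a DataFrame
events = spark.read.json("hdfs:///data/events/")
events.createOrReplaceTempView("events")

# Plain SQL over the JSON fields
spark.sql("""
    SELECT country, browser, COUNT(*) AS hits
    FROM events
    GROUP BY country, browser
    ORDER BY hits DESC
""").show(20)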


4. If we are looking at maintaining about 4 TB of data accumulated over 3 months, what should the retention strategy be, and when and how should we delete it? I would advise moving the data from HDFS to a larger warehouse after accumulating and analyzing around 10 days of it, and then repeating this backup process. Otherwise, if you can buy more hardware for your Hadoop cluster, well and good: store it in HDFS itself.



All the metrics you have mentioned above can be processed by Spark in a few lines. Trust me, it's as simple as SQL. Further, for dashboards you can send the data to a QlikView front end.
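
For instance, the second report from the question (sum of clicks coming from the Google referrer, from India, on desktop) is a short DataFrame expression, reusing the `events` DataFrame loaded above; the exact button-level filter would depend on a field your custom events carry:

from pyspark.sql import functions as F

clicks = (events
    .filter((F.col("referrer") == "http://www.google.com") &
            (F.col("country") == "India") &
            (F.col("device_type") == "Desktop") &
            (F.col("action") == "click"))
    .count())

print(f"Clicks from Google referrals in India on desktop: {clicks}")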






answered Nov 23 '18 at 19:27 by Jim Todd























