Analysis on real time streaming data
This is a relatively broad question, and I am aware of the tools I would likely need for a problem like this (e.g. Spark, Kafka and Hadoop), but I am looking for a concrete vision from an experienced professional's perspective.
Here's what the problem at hand looks like:
We are using a Google Analytics-like service which sends us a stream of events. An event is an action performed on the page: a click on a button, a mouse movement, a page scroll, or a custom event defined by us. A sample event:
{
"query_params":[
],
"device_type":"Desktop",
"browser_string":"Chrome 47.0.2526",
"ip":"62.82.34.0",
"screen_colors":"24",
"os":"Mac OS X",
"browser_version":"47.0.2526",
"session":1,
"country_code":"ES",
"document_encoding":"UTF-8",
"city":"Palma De Mallorca",
"tz":"Europe/Madrid",
"uuid":"A37F2D3A4B99FF003132D662EFEEAFCA",
"combination_goals_facet_term":"c2_g1",
"ts":1452015428,
"hour_of_day":17,
"os_version":"10.11.2",
"experiment":465,
"user_time":"2016-01-05T17:37:10.675000",
"direct_traffic":false,
"combination":"2",
"search_traffic":false,
"returning_visitor":false,
"hit_time":"2016-01-05T17:37:08",
"user_language":"es",
"device":"Other",
"active_goals":[
1
],
"account":196,
"url”:”http://someurl.com”,
“action”:”click”,
"country":"Spain",
"region":"Islas Baleares",
"day_of_week":"Tuesday",
"converted_goals":[
],
"social_traffic":false,
"converted_goals_info":[
],
"referrer”:”http://www.google.com”,
"browser":"Chrome",
"ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
"email_traffic":false
}
Now we need to build a solution to analyse this data: a reporting platform that can aggregate, filter, and slice and dice it.
Examples of the reports we need to build:
Show me all the users who come from the United States, use the Chrome browser, and use it on an iPhone.
or
Show me the sum of clicks on a particular button for all users who come from referrer = "http://www.google.com", are based in India, and use a desktop.
This service sends out millions of such events per day, amounting to GBs of data.
Here are my specific questions:
- How should we store this huge amount of data?
- How should we enable ourselves to analyse the data in real time?
- How should the query system work? (I am relatively clueless about this part.)
- If we expect to accumulate about 4 TB of data over 3 months, what should the retention strategy be? When and how should we delete data?
hbase bigdata streaming spark-streaming apache-kafka-streams
asked Nov 23 '18 at 15:14 by Simran kaur
2 Answers
- How should we store this huge amount of data?
Use one of the cloud storage providers. Partition the data by date and hour (e.g. date=2018-11-25/hour=16); this reduces the amount of data read per query. Store the data in a columnar binary format such as Parquet or ORC, which gives better query performance and compression ratios.
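A minimal PySpark sketch of this layout, assuming the bucket paths are placeholders and deriving the partition columns from the sample event's epoch-seconds "ts" field:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-batch-writer").getOrCreate()

# Read a batch of raw JSON events (the input path is a placeholder).
events = spark.read.json("s3a://my-bucket/raw-events/")

# Derive the partition columns from the "ts" field of each event.
events = (events
          .withColumn("event_time", F.from_unixtime("ts").cast("timestamp"))
          .withColumn("date", F.to_date("event_time"))
          .withColumn("hour", F.hour("event_time")))

# Write Parquet partitioned by date and hour so queries can prune partitions.
(events.write
       .mode("append")
       .partitionBy("date", "hour")
       .parquet("s3a://my-bucket/events-parquet/"))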
- How should we enable ourselves to analyse the data in real time?
You can run multiple applications listening on the Kafka topic. First, persist the events to storage using a Spark Structured Streaming (2.3+, continuous processing mode) application. This gives you the option to query and analyse historical data and to re-process events if required. You have two choices of sink (an ingestion sketch follows this list):
- Store in HDFS/S3/GCS etc., build a Hive catalog over the stored data to get a live view of the events, and use Spark/Hive/Presto to query it. Note that compaction will be required if many small files are generated.
- Store in a wide-column store like Cassandra or HBase. I would prefer this option for this use case.
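For the first option, a minimal Structured Streaming sketch of the Kafka-to-Parquet ingestion; the broker address, topic name, paths and the trimmed-down schema are assumptions. Note that as of Spark 2.3/2.4 the continuous trigger supports only Kafka/console-style sinks, so the file sink here uses a micro-batch trigger.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("events-ingest").getOrCreate()

# Only a handful of the event fields, for brevity; extend the schema as needed.
schema = StructType([
    StructField("ts", LongType()),
    StructField("action", StringType()),
    StructField("country", StringType()),
    StructField("device_type", StringType()),
    StructField("referrer", StringType()),
    StructField("browser", StringType()),
])

raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "kafka:9092")
            .option("subscribe", "events")
            .load())

# Kafka delivers bytes; parse the JSON payload and derive partition columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(F.from_json("json", schema).alias("e"))
             .select("e.*")
             .withColumn("event_time", F.from_unixtime("ts").cast("timestamp"))
             .withColumn("date", F.to_date("event_time"))
             .withColumn("hour", F.hour("event_time")))

# Append to partitioned Parquet; the checkpoint makes the write fault tolerant.
ingest = (events.writeStream
                .format("parquet")
                .option("path", "s3a://my-bucket/events-parquet/")
                .option("checkpointLocation", "s3a://my-bucket/checkpoints/ingest/")
                .partitionBy("date", "hour")
                .trigger(processingTime="1 minute")
                .start())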
Run another Spark application in parallel for real-time analysis. If you know the dimensions and metrics on which you have to aggregate the data, use Spark Structured Streaming with windowing: group by the relevant columns over a 1- or 5-minute window and write the results to one of the storage systems mentioned above, where they can be queried in real time. A windowed-aggregation sketch follows.
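A sketch of such a windowed aggregation, continuing from the `events` streaming DataFrame in the ingestion sketch above; the console sink stands in for a real Cassandra/HBase writer.

from pyspark.sql import functions as F

# `events` is the parsed streaming DataFrame from the ingestion sketch above.
clicks_per_window = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"),
             "country", "device_type", "browser")
    .agg(F.count(F.lit(1)).alias("event_count"),
         F.sum(F.when(F.col("action") == "click", 1).otherwise(0)).alias("clicks")))

# Emit updated aggregates every minute; swap the console sink for a
# foreachBatch writer into Cassandra/HBase in a real deployment.
agg = (clicks_per_window.writeStream
       .outputMode("update")
       .format("console")
       .option("checkpointLocation", "/tmp/checkpoints/agg/")
       .trigger(processingTime="1 minute")
       .start())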
- How should the query system work here?
As mentioned above, build a Hive catalog over the stored data to get a live view of the events, and use Spark/Hive/Presto to query it for reporting.
For queries over real-time data, use Cassandra or HBase as the low-latency serving layer.
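For instance, the second report from the question could be expressed like this against the Hive catalog described above (the table name `events` is an assumption). Much the same SQL would run on Hive or Presto over the same catalog.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("reporting").getOrCreate()

# Sum of clicks for users from the Google referrer, based in India, on desktop.
spark.sql("""
    SELECT COUNT(*) AS clicks
    FROM events
    WHERE action = 'click'
      AND referrer = 'http://www.google.com'
      AND country = 'India'
      AND device_type = 'Desktop'
""").show()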
- If we are looking at maintaining about 4 TB of data accumulated over 3 months, what should the retention strategy be? When and how should we delete it?
If you have partitioned the data properly, you can archive it to cold storage on a periodic schedule. For example, the dimensions and metrics derived from the events can be kept indefinitely while the raw events are archived after 1 month.
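One way to enforce such a rule, assuming the Hive table and date/hour partitioning from the sketches above (the table name and 90-day window are placeholders):

from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("retention-job").getOrCreate()

# Drop partitions that fall outside the retention window. For a managed table
# this also removes the files; for an external table, archive or delete the
# underlying date=.../hour=... directories separately (e.g. to cold storage).
retention_days = 90
today = date.today()
for offset in range(retention_days, retention_days + 30):  # sweep a month past the cutoff
    d = (today - timedelta(days=offset)).isoformat()
    for h in range(24):
        spark.sql(f"ALTER TABLE events DROP IF EXISTS PARTITION (date='{d}', hour={h})")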
answered Nov 25 '18 at 6:45 by Naren
Let me attempt an answer based on what I know best: Hadoop, Kafka and Spark.
1. How should we store this huge amount of data? It's real-time data, so you can stream it directly through Kafka into HDFS. More insights in point (4).
2. How should we enable ourselves to analyse the data in real time? Learn Spark. Since you mention sizes in TB, ensure you have a cluster with a good number of data nodes, and set up the Spark cluster separately if you can. Spark DStream is very good at analysing real-time data feeds, and it handles this kind of JSON data without complications.
3. How should the query system work here? Spark's SQLContext lets you write simple SQL-like queries on top of your semi-structured data for these use cases (see the sketch after this list). It's as simple as SQL.
4. If we are looking at maintaining about 4 TB of data accumulated over 3 months, what should the retention strategy be? When and how should we delete it? I would advise moving the data from HDFS to a larger warehouse after accumulating and analysing roughly 10 days of it, and then repeating this backup process. Otherwise, if you can buy hardware for your Hadoop cluster, well and good: keep it in HDFS itself.
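A small sketch of point (3), using the DataFrame/SQL API (in recent Spark versions SparkSession has replaced SQLContext); the HDFS path and the 'iPhone' device value are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-reports").getOrCreate()

# Spark infers the schema of the semi-structured JSON events automatically.
events = spark.read.json("hdfs:///data/events/")
events.createOrReplaceTempView("events")

# First example report from the question: US users on Chrome on an iPhone.
spark.sql("""
    SELECT DISTINCT uuid
    FROM events
    WHERE country = 'United States'
      AND browser = 'Chrome'
      AND device = 'iPhone'
""").show()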
All of the metrics you mention above can be processed by Spark in a few lines like these; it really is as simple as SQL. For the dashboard, you can feed the results into a QlikView front end.
answered Nov 23 '18 at 19:27 by Jim Todd