Analysis of real-time streaming data



























This is a relatively broad question, and I am aware of the tools I would possibly need for a problem like this (for example Spark, Kafka and Hadoop), but I am looking for a concrete vision from an experienced professional's perspective.



Here's what the problem at hand looks like:



We are using a Google Analytics-like service, which sends us a stream of events. An event is an action performed on the page: it could be a click on a button, a mouse movement, a page scroll, or a custom event defined by us.



{
"query_params":[

],
"device_type":"Desktop",
"browser_string":"Chrome 47.0.2526",
"ip":"62.82.34.0",
"screen_colors":"24",
"os":"Mac OS X",
"browser_version":"47.0.2526",
"session":1,
"country_code":"ES",
"document_encoding":"UTF-8",
"city":"Palma De Mallorca",
"tz":"Europe/Madrid",
"uuid":"A37F2D3A4B99FF003132D662EFEEAFCA",
"combination_goals_facet_term":"c2_g1",
"ts":1452015428,
"hour_of_day":17,
"os_version":"10.11.2",
"experiment":465,
"user_time":"2016-01-05T17:37:10.675000",
"direct_traffic":false,
"combination":"2",
"search_traffic":false,
"returning_visitor":false,
"hit_time":"2016-01-05T17:37:08",
"user_language":"es",
"device":"Other",
"active_goals":[
1
],
"account":196,
"url”:”http://someurl.com”,
“action”:”click”,
"country":"Spain",
"region":"Islas Baleares",
"day_of_week":"Tuesday",
"converted_goals":[

],
"social_traffic":false,
"converted_goals_info":[

],
"referrer”:”http://www.google.com”,
"browser":"Chrome",
"ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36",
"email_traffic":false
}


Now we need to build a solution to analyse this data: a reporting platform that can aggregate, filter, and slice and dice it.



One example of a report we need to build is:



Show me all the users who are coming from the United States, are using the Chrome browser, and are using that browser on an iPhone.



or



Show me the sum of clicks on a particular button for all the users who come from referrer = "http://www.google.com", are based in India, and are using a desktop.
In one day this service sends out millions of such events, amounting to gigabytes of data per day.



Here are the specific doubts I have:




  • How should we store this huge amount of data?

  • How should we enable ourselves to analyse the data in real time?

  • How should the query system work here? (I am relatively clueless about this part.)

  • If we are looking at maintaining about 4 TB of data, which we estimate to accumulate over 3 months, what should be the strategy for retaining it? When and how should we delete it?










Tags: hbase, bigdata, streaming, spark-streaming, apache-kafka-streams






asked Nov 23 '18 at 15:14 by Simran kaur

2 Answers

1. How should we store this huge amount of data?


Use one of the cloud storage providers (e.g. S3 or GCS). Partition the data by date and hour (date=2018-11-25/hour=16); this will reduce the amount of data read per query. Store the data in a binary columnar format like Parquet or ORC, which will give you better performance and a better compression ratio.
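
A minimal sketch of that layout with PySpark, assuming raw JSON events with the unix `ts` field from the sample payload; the bucket paths are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-batch-writer").getOrCreate()

# Hypothetical location of raw JSON events, one object per line
events_df = spark.read.json("s3a://my-bucket/raw-events/")

partitioned = (events_df
    .withColumn("event_time", F.from_unixtime(F.col("ts")).cast("timestamp"))
    .withColumn("date", F.to_date("event_time"))
    .withColumn("hour", F.hour("event_time")))

# Compressed Parquet, laid out as .../date=2018-11-25/hour=16/...
(partitioned.write
    .mode("append")
    .partitionBy("date", "hour")
    .parquet("s3a://my-bucket/events-parquet/"))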




2. How should we enable ourselves to analyse the data in real time?


You can run multiple applications listening on a Kafka topic. First, store the events to durable storage using Spark Structured Streaming (2.3 adds a continuous processing mode). This gives you the option to query and analyze historical data and to re-process events if required. You have two options here (a minimal ingestion sketch follows the list):




1. Store in HDFS/S3/GCP storage etc. Build a Hive catalog on the stored data to get a live view of the events, and use Spark/Hive/Presto to query it. Note: compaction will be required if small files are being generated.


2. Store in a wide-column store like Cassandra or HBase. I would prefer this option for this use case.
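
A minimal ingestion sketch for the first option, assuming a hypothetical Kafka broker, topic name and output paths, and using only a few of the fields from the sample event:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Requires the spark-sql-kafka package on the classpath
spark = SparkSession.builder.appName("event-ingest").getOrCreate()

# Partial schema of the event JSON shown in the question
event_schema = StructType([
    StructField("uuid", StringType()),
    StructField("ts", LongType()),
    StructField("action", StringType()),
    StructField("browser", StringType()),
    StructField("device_type", StringType()),
    StructField("country", StringType()),
    StructField("referrer", StringType()),
])

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
    .option("subscribe", "page-events")                # hypothetical topic
    .load())

events = (raw
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", F.from_unixtime(F.col("ts")).cast("timestamp"))
    .withColumn("date", F.to_date("event_time"))
    .withColumn("hour", F.hour("event_time")))

# Append partitioned Parquet; a Hive/Presto catalog can be built over this path
ingest = (events.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/events-parquet/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/ingest/")
    .partitionBy("date", "hour")
    .outputMode("append")
    .start())

ingest.awaitTermination()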



Run another Spark application in parallel for real-time analysis. If you know the dimensions and metrics on which you have to aggregate the data, use Spark Structured Streaming with windowing: group by those columns over a 1- or 5-minute window and write the results to one of the above-mentioned stores, where they can be queried in near real time.
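
A windowed-aggregation sketch, reusing the `events` streaming DataFrame and imports from the ingestion sketch above; the console sink is only there to keep the example self-contained, in practice you would write to Cassandra/HBase or another serving store:

# Clicks per 5-minute window, broken down by a few example dimensions
clicks_per_window = (events
    .withWatermark("event_time", "10 minutes")
    .filter(F.col("action") == "click")
    .groupBy(F.window("event_time", "5 minutes"),
             "country", "browser", "device_type")
    .count())

aggregates = (clicks_per_window.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start())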




3. How should the query system work here?


As mentioned in point 2, build a Hive catalog on the stored data to get a live view of the events. For reporting purposes, use Spark/Hive/Presto to query the data.
For queries on real-time data, use Cassandra or HBase as low-latency stores.
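
For example, the first report from the question could be expressed as plain SQL over the Parquet data. This assumes the `spark` session and output path from the sketches above; exact column semantics (e.g. `device` vs `device_type`) depend on your events:

# Expose the partitioned Parquet data as a queryable table
spark.read.parquet("s3a://my-bucket/events-parquet/").createOrReplaceTempView("events")

# "All users coming from the United States, using Chrome on an iPhone"
spark.sql("""
    SELECT DISTINCT uuid
    FROM events
    WHERE country = 'United States'
      AND browser = 'Chrome'
      AND device  = 'iPhone'
""").show()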




4. If we are looking at maintaining about 4 TB of data, which we estimate to accumulate over 3 months, what should be the strategy to retain this data? When and how should we delete it?


If you have partitioned the data properly, you can archive it to cold storage based on a periodic archiving rule. For example, the dimensions and metrics generated from the events can be kept online, while the raw events are archived after one month.
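
A sketch of what such a rule could look like under the date-partitioned layout assumed above; the paths and the 30-day cut-off are illustrative, and the commands are only printed so the script is a dry run:

import datetime as dt

BASE_PATH = "s3a://my-bucket/events-parquet"        # hypothetical hot path
ARCHIVE_PATH = "s3a://my-archive-bucket/events"     # hypothetical cold path
RETENTION_DAYS = 30                                 # keep raw events online for a month

cutoff = dt.date.today() - dt.timedelta(days=RETENTION_DAYS)

# One daily partition per command pair: copy to cold storage, then remove the hot copy.
for offset in range(1, 61):                         # sweep a further 60 days back
    day = cutoff - dt.timedelta(days=offset)
    src = f"{BASE_PATH}/date={day.isoformat()}"
    dst = f"{ARCHIVE_PATH}/date={day.isoformat()}"
    print(f"hadoop distcp {src} {dst}")
    print(f"hadoop fs -rm -r -skipTrash {src}")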






answered Nov 25 '18 at 6:45 by Naren

Let me attempt an answer; the best approach I know is to use Hadoop, Kafka and Spark.




1. How should we store this huge amount of data? It's real-time data, so you can stream it directly through Kafka to HDFS. More insights in point (4).


2. How should we enable ourselves to analyse the data in real time? Learn Spark. Since you are talking about terabytes, make sure you have a cluster with a good number of data nodes, and set up the Spark cluster separately if you can. Spark DStreams are very good at analyzing real-time data feeds, and Spark handles this kind of JSON data without complications.


3. How should the query system work here? Spark's SQLContext lets you write simple SQL-like queries on top of your semi-structured data for your use cases. It's as simple as SQL.
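
A minimal sketch of that idea with PySpark (in Spark 2.x the SparkSession wraps the older SQLContext; the HDFS path and the example query are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-queries").getOrCreate()

# Load the semi-structured events (one JSON object per line) straight into a DataFrame
events = spark.read.json("hdfs:///data/events/")
events.createOrReplaceTempView("events")

# Plain SQL over the JSON fields
spark.sql("""
    SELECT country, browser, COUNT(*) AS hits
    FROM events
    GROUP BY country, browser
    ORDER BY hits DESC
""").show(20)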


4. If we are looking at maintaining about 4 TB of data accumulated over 3 months, what should the retention strategy be, and when and how should we delete it? I would advise moving the data from HDFS to a larger warehouse after accumulating and analyzing around 10 days of it, and then repeating this backup process. Otherwise, if you can buy more hardware for your Hadoop cluster, well and good: store it in HDFS itself.



All the metrics you have mentioned above can be processed by Spark in a few lines. Trust me, it's as simple as SQL. Further, for dashboards you can send the data to a QlikView front end.
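
For instance, the second report from the question (sum of clicks coming from the Google referrer, from India, on desktop) is a short DataFrame expression, reusing the `events` DataFrame loaded above; the exact button-level filter would depend on a field your custom events carry:

from pyspark.sql import functions as F

clicks = (events
    .filter((F.col("referrer") == "http://www.google.com") &
            (F.col("country") == "India") &
            (F.col("device_type") == "Desktop") &
            (F.col("action") == "click"))
    .count())

print(f"Clicks from Google referrals in India on desktop: {clicks}")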






answered Nov 23 '18 at 19:27 by Jim Todd























