How can I return the number of distinct user logins over a 30 day period for any given day?
up vote
0
down vote
favorite
Let's say I have the following dataset in a table called UserEvents:
LogIn UserID
2018-09-30 1
2018-09-30 3
2018-10-01 1
2018-10-01 2
2018-10-01 3
2018-10-02 2
2018-10-02 3
For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:
Day PastMonthUsers
2018-09-30 2
2018-10-01 3
2018-10-02 3
Here's what I've tried:
Select
UserEvents1.LogIn as Day,
count (distinct UserEvents1.UserID) as PastMonthUsers
from UserEvents as UserEvents1
inner join
(
Select
LogIn,
UserID
from UserEvents
) as UserEvents2
on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn
group UserEvents1.Day
The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?
sql google-bigquery
add a comment |
up vote
0
down vote
favorite
Let's say I have the following dataset in a table called UserEvents:
LogIn UserID
2018-09-30 1
2018-09-30 3
2018-10-01 1
2018-10-01 2
2018-10-01 3
2018-10-02 2
2018-10-02 3
For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:
Day PastMonthUsers
2018-09-30 2
2018-10-01 3
2018-10-02 3
Here's what I've tried:
Select
UserEvents1.LogIn as Day,
count (distinct UserEvents1.UserID) as PastMonthUsers
from UserEvents as UserEvents1
inner join
(
Select
LogIn,
UserID
from UserEvents
) as UserEvents2
on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn
group UserEvents1.Day
The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?
sql google-bigquery
making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
Let's say I have the following dataset in a table called UserEvents:
LogIn UserID
2018-09-30 1
2018-09-30 3
2018-10-01 1
2018-10-01 2
2018-10-01 3
2018-10-02 2
2018-10-02 3
For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:
Day PastMonthUsers
2018-09-30 2
2018-10-01 3
2018-10-02 3
Here's what I've tried:
Select
UserEvents1.LogIn as Day,
count (distinct UserEvents1.UserID) as PastMonthUsers
from UserEvents as UserEvents1
inner join
(
Select
LogIn,
UserID
from UserEvents
) as UserEvents2
on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn
group UserEvents1.Day
The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?
sql google-bigquery
Let's say I have the following dataset in a table called UserEvents:
LogIn UserID
2018-09-30 1
2018-09-30 3
2018-10-01 1
2018-10-01 2
2018-10-01 3
2018-10-02 2
2018-10-02 3
For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:
Day PastMonthUsers
2018-09-30 2
2018-10-01 3
2018-10-02 3
Here's what I've tried:
Select
UserEvents1.LogIn as Day,
count (distinct UserEvents1.UserID) as PastMonthUsers
from UserEvents as UserEvents1
inner join
(
Select
LogIn,
UserID
from UserEvents
) as UserEvents2
on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn
group UserEvents1.Day
The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?
sql google-bigquery
sql google-bigquery
edited Nov 7 at 19:48
asked Nov 7 at 18:47
Dylan Miller
32
32
making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32
add a comment |
making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32
making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32
making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know
#standardSQL
SELECT DISTINCT Day,
(SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
SELECT Day,
ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
FROM (
SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
FROM `project.dataset.UserEvents`
)
)
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
You have a low rate. Important on SO - you canmark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important tovote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know
#standardSQL
SELECT DISTINCT Day,
(SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
SELECT Day,
ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
FROM (
SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
FROM `project.dataset.UserEvents`
)
)
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
You have a low rate. Important on SO - you canmark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important tovote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday
add a comment |
up vote
0
down vote
Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know
#standardSQL
SELECT DISTINCT Day,
(SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
SELECT Day,
ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
FROM (
SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
FROM `project.dataset.UserEvents`
)
)
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
You have a low rate. Important on SO - you canmark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important tovote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday
add a comment |
up vote
0
down vote
up vote
0
down vote
Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know
#standardSQL
SELECT DISTINCT Day,
(SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
SELECT Day,
ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
FROM (
SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
FROM `project.dataset.UserEvents`
)
)
Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know
#standardSQL
SELECT DISTINCT Day,
(SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
SELECT Day,
ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
FROM (
SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
FROM `project.dataset.UserEvents`
)
)
answered Nov 7 at 20:27
Mikhail Berlyant
53.8k43064
53.8k43064
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
You have a low rate. Important on SO - you canmark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important tovote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday
add a comment |
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
You have a low rate. Important on SO - you canmark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important tovote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01
You have a low rate. Important on SO - you can
mark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!– Mikhail Berlyant
yesterday
You have a low rate. Important on SO - you can
mark accepted answer
by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer
. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!– Mikhail Berlyant
yesterday
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53195869%2fhow-can-i-return-the-number-of-distinct-user-logins-over-a-30-day-period-for-any%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32