How can I return the number of distinct user logins over a 30 day period for any given day?

Let's say I have the following dataset in a table called UserEvents:



LogIn       UserID
2018-09-30  1
2018-09-30  3
2018-10-01  1
2018-10-01  2
2018-10-01  3
2018-10-02  2
2018-10-02  3


For each day, I'd like to return the number of distinct users who logged in during the past month. Here are the expected results for the data above:



Day        PastMonthUsers
2018-09-30 2
2018-10-01 3
2018-10-02 3


Here's what I've tried:



Select
  UserEvents1.LogIn as Day,
  count(distinct UserEvents2.UserID) as PastMonthUsers
from UserEvents as UserEvents1
inner join
(
  Select
    LogIn,
    UserID
  from UserEvents
) as UserEvents2
  on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn
group by UserEvents1.LogIn


The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?

sql google-bigquery
asked Nov 7 at 18:47 by Dylan Miller, edited Nov 7 at 19:48

  • Making your table partitioned would probably help; right now you read the whole table when you only need the past month / 30 days.
    – rtbf
    Nov 8 at 9:32
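For illustration, here is a rough sketch of what that partitioning suggestion could look like in BigQuery. The project, dataset, and table names are assumed from the answer below, and it presumes LogIn is a DATE column; treat it as a sketch rather than a drop-in statement:

#standardSQL
-- Hypothetical: materialize a date-partitioned copy of the table so queries that
-- filter on LogIn scan only the partitions they need instead of the whole table.
CREATE TABLE `project.dataset.UserEvents_partitioned`
PARTITION BY LogIn AS
SELECT LogIn, UserID
FROM `project.dataset.UserEvents`;

-- A per-day count can then prune partitions via the filter on LogIn:
SELECT COUNT(DISTINCT UserID) AS PastMonthUsers
FROM `project.dataset.UserEvents_partitioned`
WHERE LogIn BETWEEN DATE_SUB(DATE '2018-10-02', INTERVAL 29 DAY) AND DATE '2018-10-02';

Whether this helps depends on how much of the table falls outside the dates you actually query; partitioning mainly reduces scanned bytes rather than join fan-out.
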

1 Answer

Below is for BigQuery Standard SQL; it avoids the self-JOIN in favor of an analytic (window) function. It can still run into issues depending on your data, so try it and let us know.



#standardSQL
SELECT DISTINCT Day,
  -- count the distinct users collected in each day's trailing-30-day array
  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
  SELECT Day,
    -- gather every UserID seen on the current day and the 29 preceding calendar days
    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
  FROM (
    -- UNIX_DATE turns LogIn into a day number so RANGE windows over calendar days
    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
    FROM `project.dataset.UserEvents`
  )
)

answered Nov 7 at 20:27 by Mikhail Berlyant

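One hedged usage note, not part of the answer above: if you only need the rolling count for recent days, you can limit how much history the query reads by filtering the inner scan and then the reported days. The 60-day and 30-day cutoffs below are assumptions, chosen so the earliest reported day still sees its full trailing window:

#standardSQL
SELECT DISTINCT Day,
  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers
FROM (
  SELECT Day,
    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users
  FROM (
    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days
    FROM `project.dataset.UserEvents`
    -- assumed cutoff: keep 30 extra days so the oldest reported day has full history
    WHERE LogIn >= DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY)
  )
)
-- report only the most recent 30 days
WHERE Day >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
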
  • Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
    – Dylan Miller
    Nov 7 at 21:50

  • "still running for 20 min" does not make them similar :o)
    – Mikhail Berlyant
    Nov 7 at 23:14

  • I assume the run-time complexity stems from joining 30 days' worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by month, but a rolling monthly count would be much more accurate.
    – Dylan Miller
    Nov 8 at 14:59

  • The solution I proposed does not use a join; it uses analytic window functions instead. But as I mentioned, depending on your data it can still run into issues :o(
    – Mikhail Berlyant
    Nov 8 at 15:01

  • You have a low rate. Important on SO: you can mark an accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! It is also important to vote on answers: vote up answers that are helpful. ... You can check what to do when someone answers your question at stackoverflow.com/help/someone-answers. By following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) Please consider!
    – Mikhail Berlyant
    yesterday
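Riffing on the earlier comment about settling for a rough number: if an approximate distinct count is acceptable, one alternative sketch (not from this thread, names assumed from the question) is to pre-aggregate one HLL sketch per day and then merge sketches over the trailing 30 days, so the self-join only touches one row per calendar day:

#standardSQL
WITH daily AS (
  -- one HyperLogLog++ sketch of the day's user IDs per calendar day
  SELECT LogIn AS Day, HLL_COUNT.INIT(UserID) AS sketch
  FROM `project.dataset.UserEvents`
  GROUP BY Day
)
SELECT
  d1.Day,
  -- merge the 30 daily sketches into an approximate distinct-user count
  HLL_COUNT.MERGE(d2.sketch) AS PastMonthUsers
FROM daily d1
JOIN daily d2
  ON d2.Day BETWEEN DATE_SUB(d1.Day, INTERVAL 29 DAY) AND d1.Day
GROUP BY d1.Day

The result is an estimate rather than an exact count, which may or may not be acceptable for this use case.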