How can I return the number of distinct user logins over a 30 day period for any given day?

up vote
0
down vote

favorite

Let's say I have the following dataset in a table called UserEvents:

  LogIn     UserID

2018-09-30    1

2018-09-30    3

2018-10-01    1

2018-10-01    2

2018-10-01    3

2018-10-02    2

2018-10-02    3

For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:

Day        PastMonthUsers

2018-09-30      2

2018-10-01      3

2018-10-02      3

Here's what I've tried:

Select 

  UserEvents1.LogIn as Day, 

  count (distinct UserEvents1.UserID) as PastMonthUsers

from UserEvents as UserEvents1

inner join 

  (

    Select 

      LogIn,

      UserID

    from UserEvents

  ) as UserEvents2 

on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn 

group UserEvents1.Day

The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?

edited Nov 7 at 19:48

asked Nov 7 at 18:47

Dylan Miller

making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32

add a comment |

up vote
0
down vote

favorite

Let's say I have the following dataset in a table called UserEvents:

  LogIn     UserID

2018-09-30    1

2018-09-30    3

2018-10-01    1

2018-10-01    2

2018-10-01    3

2018-10-02    2

2018-10-02    3

For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:

Day        PastMonthUsers

2018-09-30      2

2018-10-01      3

2018-10-02      3

Here's what I've tried:

Select 

  UserEvents1.LogIn as Day, 

  count (distinct UserEvents1.UserID) as PastMonthUsers

from UserEvents as UserEvents1

inner join 

  (

    Select 

      LogIn,

      UserID

    from UserEvents

  ) as UserEvents2 

on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn 

group UserEvents1.Day

The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?

edited Nov 7 at 19:48

asked Nov 7 at 18:47

Dylan Miller

making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32

add a comment |

up vote
0
down vote

favorite

Let's say I have the following dataset in a table called UserEvents:

  LogIn     UserID

2018-09-30    1

2018-09-30    3

2018-10-01    1

2018-10-01    2

2018-10-01    3

2018-10-02    2

2018-10-02    3

For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:

Day        PastMonthUsers

2018-09-30      2

2018-10-01      3

2018-10-02      3

Here's what I've tried:

Select 

  UserEvents1.LogIn as Day, 

  count (distinct UserEvents1.UserID) as PastMonthUsers

from UserEvents as UserEvents1

inner join 

  (

    Select 

      LogIn,

      UserID

    from UserEvents

  ) as UserEvents2 

on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn 

group UserEvents1.Day

The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?

edited Nov 7 at 19:48

asked Nov 7 at 18:47

Dylan Miller

Let's say I have the following dataset in a table called UserEvents:

  LogIn     UserID

2018-09-30    1

2018-09-30    3

2018-10-01    1

2018-10-01    2

2018-10-01    3

2018-10-02    2

2018-10-02    3

For each day, I'd like to return the number of users logged-in in the past month. Here are the expected results for the above:

Day        PastMonthUsers

2018-09-30      2

2018-10-01      3

2018-10-02      3

Here's what I've tried:

Select 

  UserEvents1.LogIn as Day, 

  count (distinct UserEvents1.UserID) as PastMonthUsers

from UserEvents as UserEvents1

inner join 

  (

    Select 

      LogIn,

      UserID

    from UserEvents

  ) as UserEvents2 

on UserEvents2.LogIn between DATE_SUB(UserEvents1.LogIn, interval 1 month) and UserEvents1.LogIn 

group UserEvents1.Day

The above query ran for 20 minutes on my actual dataset before I cancelled it. Any thoughts on how I can simplify?

sql google-bigquery

edited Nov 7 at 19:48

asked Nov 7 at 18:47

Dylan Miller

edited Nov 7 at 19:48

asked Nov 7 at 18:47

Dylan Miller

edited Nov 7 at 19:48

asked Nov 7 at 18:47

Dylan Miller

asked Nov 7 at 18:47

Dylan Miller

asked Nov 7 at 18:47

Dylan Miller

making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32

add a comment |

making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32

making your table partitioned would probably help, now you read whole table where you just need only past month/30 days
– rtbf
Nov 8 at 9:32

add a comment |

1 Answer
1

active

oldest

votes

up vote
0
down vote

Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know

#standardSQL

SELECT DISTINCT Day, 

  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers

FROM (

  SELECT Day, 

    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users

  FROM (

    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days

    FROM `project.dataset.UserEvents`

  )

)

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50

"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14

The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59

solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01

You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53195869%2fhow-can-i-return-the-number-of-distinct-user-logins-over-a-30-day-period-for-any%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
0
down vote

Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know

#standardSQL

SELECT DISTINCT Day, 

  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers

FROM (

  SELECT Day, 

    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users

  FROM (

    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days

    FROM `project.dataset.UserEvents`

  )

)

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50

"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14

The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59

solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01

You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday

add a comment |

up vote
0
down vote

Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know

#standardSQL

SELECT DISTINCT Day, 

  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers

FROM (

  SELECT Day, 

    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users

  FROM (

    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days

    FROM `project.dataset.UserEvents`

  )

)

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50

"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14

The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59

solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01

You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday

add a comment |

up vote
0
down vote

Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know

#standardSQL

SELECT DISTINCT Day, 

  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers

FROM (

  SELECT Day, 

    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users

  FROM (

    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days

    FROM `project.dataset.UserEvents`

  )

)

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

Below is for BigQuery Standard SQL and avoids doing self-JOIN in favor of using analytical function - but still can get into issue depends on your data - try and let us know

#standardSQL

SELECT DISTINCT Day, 

  (SELECT COUNT(DISTINCT user) FROM UNNEST(users) user) PastMonthUsers

FROM (

  SELECT Day, 

    ARRAY_AGG(UserID) OVER(ORDER BY days RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) users

  FROM (

    SELECT LogIn Day, UserID, UNIX_DATE(LogIn) days

    FROM `project.dataset.UserEvents`

  )

)

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

answered Nov 7 at 20:27

Mikhail Berlyant

53.8k43064

Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50

"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14

The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59

solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01

You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday

add a comment |

Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50

"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14

The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59

solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01

You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday

Thanks! Ran this on my actual dataset, and it was still processing after 20 minutes (similar to original query).
– Dylan Miller
Nov 7 at 21:50

"still running for 20 min" does not make them similar :o)
– Mikhail Berlyant
Nov 7 at 23:14

The run-time complexity I'm assuming stems from the join of 30 days worth of data for each record. I was hoping someone would have another approach that avoids that join. I can get a rough number in an acceptable runtime by grouping by the month, but a rolling monthly count would be much more accurate.
– Dylan Miller
Nov 8 at 14:59

solution I proposed does not use join and rather uses analytical windowed functions, but as I mentioned - depends on your data still can have issue :o(
– Mikhail Berlyant
Nov 8 at 15:01

You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!
– Mikhail Berlyant
yesterday

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Wsrtjtyk