Adding a grouped, aggregate nunique column to pandas dataframe












3















I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.



my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.



something like this isn't working:



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()


nor is



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)


this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)


in R this is easily done in data.table with



df[, n_unique_id := uniqueN(id), by = c('track', 'type')]


thanks!










share|improve this question

























  • Can paste a copy of your source dataframe and what the expected output should look like?

    – Scott Boston
    May 1 '17 at 21:27
















3















I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.



my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.



something like this isn't working:



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()


nor is



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)


this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)


in R this is easily done in data.table with



df[, n_unique_id := uniqueN(id), by = c('track', 'type')]


thanks!










share|improve this question

























  • Can paste a copy of your source dataframe and what the expected output should look like?

    – Scott Boston
    May 1 '17 at 21:27














3












3








3


1






I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.



my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.



something like this isn't working:



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()


nor is



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)


this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)


in R this is easily done in data.table with



df[, n_unique_id := uniqueN(id), by = c('track', 'type')]


thanks!










share|improve this question
















I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.



my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.



something like this isn't working:



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()


nor is



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)


this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):



df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)


in R this is easily done in data.table with



df[, n_unique_id := uniqueN(id), by = c('track', 'type')]


thanks!







python pandas dataframe aggregate pandas-groupby






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 22 '18 at 11:52









jpp

102k2165115




102k2165115










asked May 1 '17 at 21:14









wbartswbarts

142




142













  • Can paste a copy of your source dataframe and what the expected output should look like?

    – Scott Boston
    May 1 '17 at 21:27



















  • Can paste a copy of your source dataframe and what the expected output should look like?

    – Scott Boston
    May 1 '17 at 21:27

















Can paste a copy of your source dataframe and what the expected output should look like?

– Scott Boston
May 1 '17 at 21:27





Can paste a copy of your source dataframe and what the expected output should look like?

– Scott Boston
May 1 '17 at 21:27












1 Answer
1






active

oldest

votes


















5














df.groupby(['track', 'type'])['id'].transform(nunique)


Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.



As pointed out by @root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.



For example transform('sum') should be preferred over transform(sum).



Try this instead



df.groupby(['track', 'type'])['id'].transform('nunique')


demo



df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)

id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B

df.groupby(['track', 'type'])['id'].transform('nunique')

0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64





share|improve this answer





















  • 4





    I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

    – root
    May 1 '17 at 21:31













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f43726631%2fadding-a-grouped-aggregate-nunique-column-to-pandas-dataframe%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









5














df.groupby(['track', 'type'])['id'].transform(nunique)


Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.



As pointed out by @root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.



For example transform('sum') should be preferred over transform(sum).



Try this instead



df.groupby(['track', 'type'])['id'].transform('nunique')


demo



df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)

id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B

df.groupby(['track', 'type'])['id'].transform('nunique')

0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64





share|improve this answer





















  • 4





    I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

    – root
    May 1 '17 at 21:31


















5














df.groupby(['track', 'type'])['id'].transform(nunique)


Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.



As pointed out by @root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.



For example transform('sum') should be preferred over transform(sum).



Try this instead



df.groupby(['track', 'type'])['id'].transform('nunique')


demo



df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)

id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B

df.groupby(['track', 'type'])['id'].transform('nunique')

0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64





share|improve this answer





















  • 4





    I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

    – root
    May 1 '17 at 21:31
















5












5








5







df.groupby(['track', 'type'])['id'].transform(nunique)


Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.



As pointed out by @root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.



For example transform('sum') should be preferred over transform(sum).



Try this instead



df.groupby(['track', 'type'])['id'].transform('nunique')


demo



df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)

id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B

df.groupby(['track', 'type'])['id'].transform('nunique')

0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64





share|improve this answer















df.groupby(['track', 'type'])['id'].transform(nunique)


Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.



As pointed out by @root, often the method that pandas will utilize to perform a transformation indicated by these strings are optimized and should generally be preferred to passing your own functions. This is True even for passing numpy functions in some cases.



For example transform('sum') should be preferred over transform(sum).



Try this instead



df.groupby(['track', 'type'])['id'].transform('nunique')


demo



df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)

id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B

df.groupby(['track', 'type'])['id'].transform('nunique')

0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64






share|improve this answer














share|improve this answer



share|improve this answer








edited May 1 '17 at 21:31

























answered May 1 '17 at 21:27









piRSquaredpiRSquared

158k23154298




158k23154298








  • 4





    I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

    – root
    May 1 '17 at 21:31
















  • 4





    I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

    – root
    May 1 '17 at 21:31










4




4





I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

– root
May 1 '17 at 21:31







I was just about to comment with this. It might be useful to mention that the string aliases should usually be preferred if available, as they'll reference optimal functions, e.g. .transform(sum) uses python sum, whereas .transfrom('sum') would use numpy and be more computationally efficient.

– root
May 1 '17 at 21:31






















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f43726631%2fadding-a-grouped-aggregate-nunique-column-to-pandas-dataframe%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

Tangent Lines Diagram Along Smooth Curve

Yusuf al-Mu'taman ibn Hud

Zucchini