Pandas - How to groupby and remove specific rows


I have a DF like this:

id   company         duration
0    Other Company   5
0    Other Company   19
0    X Company       7
1    Other Company   24
1    Other Company   6
1    X Company       12
2    X Company       9
3    Other Company   30
3    X Company       16

I need to group the DF by id and company and then sum the duration within each group. In the end I only need the rows for 'X Company'. This is what I did:

import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)

And got this:

id   company         duration
0    Other Company   24
0    X Company       7
1    Other Company   30
1    X Company       12
2    X Company       9
3    Other Company   30
3    X Company       16

Now I need to remove all entries for 'Other Company'. I already tried time_in_company.drop('Any Company'), which raises KeyError: 'Any Company'.

I tried .set_index('company') in order to try something else, but it tells me 'Series' object has no attribute 'set_index'.

I tried using .filter() on the groupby, but I need the .agg(sum) (and it didn't work anyway).

Can someone shed some light on the issue for me? Thanks in advance.

asked Nov 23 '18 at 19:23

Anyone

225

time_in_company [time_in_company ['company']!="Other Company"]

– Ken Dekalb
Nov 23 '18 at 19:26

Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:

– Anyone
Nov 23 '18 at 19:35

Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string "Other Company" has failed.

– slackline
Nov 23 '18 at 20:16
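To see why both time_in_company['company'] and .set_index('company') fail, it can help to inspect what the groupby actually returned. The sketch below rebuilds the sample data inline (an assumption, since data/jobs.csv is not available here). It shows that the result is a Series whose index has two levels, id and company, so company is an index level rather than a column. In this sketch the company labels are still ordinary strings, and the filter can be applied to the index level directly.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from data/jobs.csv.
jobs = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

time_in_company = jobs.groupby(["id", "company"])["duration"].sum()

# The result is a Series; id and company live in a two-level index.
print(type(time_in_company).__name__)
print(list(time_in_company.index.names))

# Filter on the index level itself, without resetting the index.
is_x = time_in_company.index.get_level_values("company") == "X Company"
x_only = time_in_company[is_x]
print(x_only)
```

Because there is no column named company on a Series, a lookup such as time_in_company['company'] is treated as an index label lookup, which would explain the KeyError.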

python pandas

2 Answers

Does this help?

time_in_company = time_in_company.reset_index(level='company')
time_in_company[time_in_company['company'] != "Other Company"]

answered Nov 23 '18 at 19:46

Ken Dekalb

317112

It didn't raise any error, but it also didn't change anything in the df. I adjusted it to time_in_company = time_in_company[time_in_company['company'] != "Other Company"], and it worked. Thank you.

– Anyone
Nov 23 '18 at 20:03

Also, do you mind explaining the first line?

– Anyone
Nov 23 '18 at 20:04
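On the first line: the groupby result is a Series indexed by (id, company), and reset_index(level='company') moves the company level out of the index into an ordinary column, which turns the Series into a DataFrame. After that, time_in_company['company'] is a normal column lookup and the boolean filter works. A minimal sketch, with the sample data built inline as an assumption and the names summed, flat and x_only chosen only for illustration:

```python
import pandas as pd

# Sample data from the question, built inline for illustration.
jobs = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

summed = jobs.groupby(["id", "company"])["duration"].sum()

# Move the 'company' index level into a regular column; 'id' stays as the index.
flat = summed.reset_index(level="company")
print(flat.columns.tolist())

# Boolean filtering returns a new object, so the result has to be assigned.
x_only = flat[flat["company"] != "Other Company"]
print(x_only)
```

The filter line on its own only displays the filtered rows, which is why nothing seemed to change until the result was assigned back.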


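First use DataFrame.query() to keep only the rows for one company, then group the remaining rows. The example below keeps 'Other Company' (that is, it drops the 'X Company' rows); swap the name in the condition to keep 'X Company' instead: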

import pandas as pd

# Sample data matching the question
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]

df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})

# Filter the rows first, then group and sum
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg('sum')

You get:

ids  company      
0    Other Company    24
1    Other Company    30
3    Other Company    30
Name: duration, dtype: int64
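The question asks for the 'X Company' rows, so the same pattern with the condition flipped may be closer to what is needed. A sketch on the same sample data, using the column names from this answer (ids, company, duration); the name x_totals is only illustrative:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

# Keep only the 'X Company' rows, then group and sum.
x_totals = (
    df.query("company == 'X Company'")
      .groupby(["ids", "company"])["duration"]
      .sum()
)
print(x_totals)  # totals 7, 12, 9, 16 for ids 0, 1, 2, 3
```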

EDIT: Additionally, you can combine DataFrame.where(), dropna() and pivot_table():

df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')

You get:

                   duration
ids company                
0.0 Other Company      24.0
1.0 Other Company      30.0
3.0 Other Company      30.0

Nonetheless, the first one is faster:

2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
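The 0.0 and 24.0 in the pivot_table output above come from where(): rows that fail the condition are replaced with NaN before dropna() removes them, which upcasts the integer columns to float. A plain boolean mask avoids that round trip. A sketch on the same sample data, with keep, via_where, via_mask and totals as illustrative names:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "ids": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

keep = df["company"] == "Other Company"

# where() fills non-matching rows with NaN, so the numeric columns become float.
via_where = df.where(keep).dropna()
print(via_where["ids"].dtype)

# A boolean mask only selects rows, so the integer dtype is preserved.
via_mask = df[keep]
print(via_mask["ids"].dtype)

# The same pivot_table call on the masked frame keeps integer totals.
totals = via_mask.pivot_table("duration", index=["ids", "company"], aggfunc="sum")
print(totals)
```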

edited Nov 24 '18 at 9:34

answered Nov 24 '18 at 9:20

2Obe

1,04021027
