Pandas - How to groupby and remove specifc rows
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'
Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'
Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..
Can someone shed some light in the issue for me? Thanks in advance.
python pandas
add a comment |
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'
Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'
Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..
Can someone shed some light in the issue for me? Thanks in advance.
python pandas
time_in_company [time_in_company ['company']!="Other Company"]
– Ken Dekalb
Nov 23 '18 at 19:26
Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:
– Anyone
Nov 23 '18 at 19:35
Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string"Other Company"
has failed.
– slackline
Nov 23 '18 at 20:16
add a comment |
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'
Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'
Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..
Can someone shed some light in the issue for me? Thanks in advance.
python pandas
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'
Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'
Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..
Can someone shed some light in the issue for me? Thanks in advance.
python pandas
python pandas
asked Nov 23 '18 at 19:23
AnyoneAnyone
225
225
time_in_company [time_in_company ['company']!="Other Company"]
– Ken Dekalb
Nov 23 '18 at 19:26
Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:
– Anyone
Nov 23 '18 at 19:35
Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string"Other Company"
has failed.
– slackline
Nov 23 '18 at 20:16
add a comment |
time_in_company [time_in_company ['company']!="Other Company"]
– Ken Dekalb
Nov 23 '18 at 19:26
Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:
– Anyone
Nov 23 '18 at 19:35
Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string"Other Company"
has failed.
– slackline
Nov 23 '18 at 20:16
time_in_company [time_in_company ['company']!="Other Company"]
– Ken Dekalb
Nov 23 '18 at 19:26
time_in_company [time_in_company ['company']!="Other Company"]
– Ken Dekalb
Nov 23 '18 at 19:26
Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:
– Anyone
Nov 23 '18 at 19:35
Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:
– Anyone
Nov 23 '18 at 19:35
Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string
"Other Company"
has failed.– slackline
Nov 23 '18 at 20:16
Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string
"Other Company"
has failed.– slackline
Nov 23 '18 at 20:16
add a comment |
2 Answers
2
active
oldest
votes
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
add a comment |
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452008%2fpandas-how-to-groupby-and-remove-specifc-rows%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
add a comment |
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
add a comment |
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
answered Nov 23 '18 at 19:46
Ken DekalbKen Dekalb
317112
317112
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
add a comment |
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Didn't raise any error, but also didn't change anything in the df. I adjusted to time_in_company = time_in_company [time_in_company ['company']!="Other Company" - And it worked. Thank you.
– Anyone
Nov 23 '18 at 20:03
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
Also, do you mind explaining the first line?
– Anyone
Nov 23 '18 at 20:04
add a comment |
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
add a comment |
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
add a comment |
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
edited Nov 24 '18 at 9:34
answered Nov 24 '18 at 9:20
2Obe2Obe
1,04021027
1,04021027
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53452008%2fpandas-how-to-groupby-and-remove-specifc-rows%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
time_in_company [time_in_company ['company']!="Other Company"]
– Ken Dekalb
Nov 23 '18 at 19:26
Got a pretty big traceback, but basically this: KeyError: 'company' AND this: TypeError: 'str' object cannot be interpreted as an integer During handling of the above exception, another exception occurred:
– Anyone
Nov 23 '18 at 19:35
Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string
"Other Company"
has failed.– slackline
Nov 23 '18 at 20:16