Pandas - How to groupby and remove specific rows





I have a DF like this:



id   company          duration
0    Other Company    5
0    Other Company    19
0    X Company        7
1    Other Company    24
1    Other Company    6
1    X Company        12
2    X Company        9
3    Other Company    30
3    X Company        16


I need to group the DataFrame by id and company and then sum the duration within each group. In the end I need only the rows for 'X Company'. This is what I did:



import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)


And got this:



id   company          duration
0    Other Company    24
0    X Company        7
1    Other Company    30
1    X Company        12
2    X Company        9
3    Other Company    30
3    X Company        16


Now I need to remove all entries for 'Other Company'. I already tried time_in_company.drop('Any Company'), which raises KeyError: 'Any Company'.

I tried .set_index('company') in order to try something else, but it tells me: 'Series' object has no attribute 'set_index'.

I also tried using .filter() in the groupby, but I need the .agg(sum) (and it didn't work anyway).

Can someone shed some light on the issue for me? Thanks in advance.







python pandas
















asked Nov 23 '18 at 19:23









Anyone













  • time_in_company [time_in_company ['company']!="Other Company"]

    – Ken Dekalb
    Nov 23 '18 at 19:26













  • Got a pretty big traceback, but basically this: KeyError: 'company' and this: TypeError: 'str' object cannot be interpreted as an integer, joined by the message "During handling of the above exception, another exception occurred".

    – Anyone
    Nov 23 '18 at 19:35











  • Your company variable has, in the background during the aggregation, been converted to a categorical variable and encoded, rather than remaining as a string, hence why the comparison to the string "Other Company" has failed.

    – slackline
    Nov 23 '18 at 20:16
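
One way to check what the aggregation actually returns is to inspect the result directly. The sketch below is only an illustration: it rebuilds the sample data from the question by hand (the CSV file is not available here) and uses .sum() in place of .agg(sum). On this data the result appears to be a Series whose group keys sit in a MultiIndex, so company is an index level rather than a column, which would explain the KeyError: 'company'.

```python
import pandas as pd

# Sample data copied from the question; the original CSV is not available.
jobs = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})

time_in_company = jobs.groupby(["id", "company"])["duration"].sum()

# The result is a Series, and the group keys live in its MultiIndex.
result_type = type(time_in_company).__name__
index_levels = list(time_in_company.index.names)

# Each index label is an (id, company) tuple, not a plain string.
first_label = time_in_company.index[0]

print(result_type, index_levels, first_label)
```

Because there is no column named company on the Series, both time_in_company['company'] and .set_index('company') have nothing to work with until the index level is turned back into a column.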































2 Answers
Does this help?

time_in_company = time_in_company.reset_index(level='company')
time_in_company = time_in_company[time_in_company['company'] != "Other Company"]
















    answered Nov 23 '18 at 19:46









Ken Dekalb













    • Didn't raise any error, but also didn't change anything in the df. I adjusted it to time_in_company = time_in_company[time_in_company['company'] != "Other Company"], and it worked. Thank you.

      – Anyone
      Nov 23 '18 at 20:03











    • Also, do you mind explaining the first line?

      – Anyone
      Nov 23 '18 at 20:04
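
On the first line: as far as I understand it, reset_index(level='company') moves the company index level out of the MultiIndex and into a regular column, which turns the Series into a DataFrame that can then be filtered by column. A minimal sketch, assuming the sample data from the question (the names flat, x_only and x_only_series are only illustrative and not from the original post):

```python
import pandas as pd

# Sample data copied from the question; the original CSV is not available.
jobs = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 3, 3],
    "company": ["Other Company", "Other Company", "X Company",
                "Other Company", "Other Company", "X Company",
                "X Company", "Other Company", "X Company"],
    "duration": [5, 19, 7, 24, 6, 12, 9, 30, 16],
})
time_in_company = jobs.groupby(["id", "company"])["duration"].sum()

# Move the "company" index level back into an ordinary column.
# The Series becomes a DataFrame indexed by id with columns company and duration.
flat = time_in_company.reset_index(level="company")

# Boolean indexing returns a new object, so the result has to be assigned.
x_only = flat[flat["company"] != "Other Company"]

# Alternative that skips reset_index: select on the index level directly.
x_only_series = time_in_company.xs("X Company", level="company")

print(x_only)
print(x_only_series)
```

The xs variant keeps the result as a Series indexed by id, which may be handier if the company column is no longer needed afterwards.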



















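First use DataFrame.query() to keep only the rows for one company, then group the remaining rows. The example below keeps 'Other Company'; change the string to 'X Company' to get the rows the question asks for: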



import pandas as pd

# Sample data from the question
ids = [0, 0, 0, 1, 1, 1, 2, 3, 3]
company = ['Other Company', 'Other Company', 'X Company', 'Other Company', 'Other Company', 'X Company', 'X Company', 'Other Company', 'X Company']
duration = [5, 19, 7, 24, 6, 12, 9, 30, 16]

df = pd.DataFrame({'ids': ids, 'company': company, 'duration': duration})

# Filter the rows first, then group and sum what is left
df.query("company == 'Other Company'").groupby(['ids', 'company'])['duration'].agg('sum')


You get:

ids  company
0    Other Company    24
1    Other Company    30
3    Other Company    30
Name: duration, dtype: int64
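
Since the question asks for the 'X Company' rows, the same filter-then-aggregate idea can be sketched with the other company name. This is only a sketch that reuses the sample data from this answer; the name x_company is illustrative, and .sum() is called directly, which should be equivalent to the aggregation used here.

```python
import pandas as pd

# Sample data from the question
ids = [0, 0, 0, 1, 1, 1, 2, 3, 3]
company = ['Other Company', 'Other Company', 'X Company',
           'Other Company', 'Other Company', 'X Company',
           'X Company', 'Other Company', 'X Company']
duration = [5, 19, 7, 24, 6, 12, 9, 30, 16]

df = pd.DataFrame({'ids': ids, 'company': company, 'duration': duration})

# Keep only the 'X Company' rows, then sum duration per (ids, company) group.
x_company = (
    df.query("company == 'X Company'")
      .groupby(['ids', 'company'])['duration']
      .sum()
)

print(x_company)
```

Filtering before grouping also means the groups for the unwanted company are never computed.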


EDIT: Additionally, you can use a combination of DataFrame.where(), dropna() and pivot_table():

df.where(df['company'] == 'Other Company').dropna().pivot_table(['duration'], index=['ids', 'company'], aggfunc='sum')


You get:

                   duration
ids company
0.0 Other Company      24.0
1.0 Other Company      30.0
3.0 Other Company      30.0
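
The float values (0.0, 24.0) in this output most likely come from where(): it fills the non-matching rows with NaN before dropna() removes them, and integer columns are upcast to float to hold the NaN. A hedged sketch of a variant that uses boolean indexing instead, so the dtypes stay as they were (the names masked, selected and table are only illustrative):

```python
import pandas as pd

# Sample data from the question
ids = [0, 0, 0, 1, 1, 1, 2, 3, 3]
company = ['Other Company', 'Other Company', 'X Company',
           'Other Company', 'Other Company', 'X Company',
           'X Company', 'Other Company', 'X Company']
duration = [5, 19, 7, 24, 6, 12, 9, 30, 16]

df = pd.DataFrame({'ids': ids, 'company': company, 'duration': duration})

# where() keeps the shape and puts NaN in non-matching rows,
# which forces the integer columns to float.
masked = df.where(df['company'] == 'Other Company')

# Boolean indexing drops the non-matching rows instead, so dtypes are unchanged.
selected = df[df['company'] == 'Other Company']

table = selected.pivot_table(values='duration', index=['ids', 'company'], aggfunc='sum')

print(masked.dtypes)
print(table)
```

The same mask works with 'X Company' in place of 'Other Company'.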


Nonetheless, the first one is faster (timings for the query/groupby version and the where/pivot_table version, respectively):

2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)















        edited Nov 24 '18 at 9:34

























        answered Nov 24 '18 at 9:20









2Obe





























