Split into train and test by group + sklearn cross_val_score























I have a DataFrame in Python, as shown below:



data    labels    group
aa 1 x
bb 1 x
cc 2 y
dd 1 y
ee 3 y
ff 3 x
gg 3 z
hh 1 z
ii 2 z


A random 70:30 split into training and test sets is straightforward. Here, however, I need the split to be made within each group, so that 70% of each group's data goes to the training set and the remaining 30% goes to the test set. I then want to predict on the test data and compute the accuracy within each group.

I found that cross_val_score handles the splitting, model fitting and prediction, as in the example below:



>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_score
>>> model = LogisticRegression(random_state=0)
>>> scores = cross_val_score(model, data, labels, cv=5)
>>> scores


The documentation of cross_val_score describes a groups parameter as follows:



groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set.


Again, what I need is a 70:30 train/test split within each group, followed by prediction and per-group accuracy on the test data. Does passing the groups parameter as shown below split the data within each group into training and test sets and make the predictions?

>>> scores = cross_val_score(model, data, labels, groups=group, cv=5)


Any help is appreciated.










python scikit-learn














edited Nov 11 at 18:42

























asked Nov 7 at 19:07









chas













  • You can use pandas to filter by group and then split the filtered data, right?
    – jujuBee
    Nov 7 at 19:20






  • Are you looking for the stratify parameter of train_test_split?
    – G. Anderson
    Nov 7 at 19:26










  • Let's say we split the data randomly into 70:30. The split should ensure that 70% of the data from each group goes to the training set and 30% of the data from each group goes to the test set, rather than producing a training set (70%) that contains values from only a few groups. Does stratify do that?
    – chas
    Nov 7 at 19:27


































2 Answers
The stratify parameter of train_test_split takes the array of labels to stratify on, so that each label keeps roughly the same proportion in the training and test sets. Passing the group column keeps the groups balanced across the split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['data'], df['labels'], stratify=df['group'])

On your toy dataset this seems to be what you want, but I would try it on your full dataset and verify that the groups are balanced by checking the counts in your train and test sets.
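
One way to run the check suggested above is to count the rows per group on each side of the split. The sketch below is only illustrative: it rebuilds the toy frame from the question, and `test_size=0.3` and `random_state=0` are assumptions added here to get a reproducible 70:30 split (the original call uses the default test size).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame copied from the question.
df = pd.DataFrame({
    "data": ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii"],
    "labels": [1, 1, 2, 1, 3, 3, 3, 1, 2],
    "group": ["x", "x", "y", "y", "y", "x", "z", "z", "z"],
})

# Assumption: test_size=0.3 and random_state=0 are illustrative choices.
# Stratifying on the group column keeps each group's share equal in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["group"], random_state=0
)

# Count rows per group on each side of the split.
train_counts = train_df["group"].value_counts().sort_index()
test_counts = test_df["group"].value_counts().sort_index()
print(pd.DataFrame({"train": train_counts, "test": test_counts}))
```

With only three rows per group, the closest the split can get to 70:30 is two training rows and one test row per group; on a full dataset the counts should approach the requested ratio.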

















    answered Nov 7 at 19:40









    G. Anderson













    • OP updated with cross_val_score function.
      – chas
      Nov 11 at 13:13










    • cross_val_score also has a groups parameter to which, according to the docs, you pass "Group labels for the samples used while splitting the dataset into train/test set", so that should work the same way.
      – G. Anderson
      Nov 12 at 16:15
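
A note on the `groups` parameter discussed in these comments: in scikit-learn, `groups` is only consumed by group-aware splitters such as `GroupKFold`, which keep every group entirely on one side of each split, and an integer `cv` such as `cv=5` selects `KFold` or `StratifiedKFold`, which do not use it. So it does not behave like `stratify`. A 70:30 split inside every group can instead be obtained by stratifying on the group column, for example with `StratifiedShuffleSplit`. The sketch below contrasts the two on synthetic stand-in data; the arrays `X`, `y` and `groups` are made up for illustration and are not the asker's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in data (an assumption): 3 groups of 10 samples each,
# two balanced classes, and features shifted by the class label.
rng = np.random.RandomState(0)
groups = np.repeat(["x", "y", "z"], 10)
y = np.tile([0, 1], 15)
X = rng.normal(size=(30, 2)) + y[:, None]

model = LogisticRegression(random_state=0)

# GroupKFold uses `groups` to keep each group whole: every test fold
# contains samples from exactly one group here.
gkf = GroupKFold(n_splits=3)
gkf_test_groups = [set(groups[test]) for _, test in gkf.split(X, y, groups)]
group_scores = cross_val_score(model, X, y, groups=groups, cv=gkf)

# Stratifying on the group labels instead gives a 70:30 split inside
# every group. The precomputed index pairs are passed through `cv`.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
splits = list(sss.split(X, groups))
scores = cross_val_score(model, X, y, cv=splits)

print("test groups per GroupKFold fold:", gkf_test_groups)
print("accuracy per stratified split:", scores)
```

Passing the group labels as the second argument of `sss.split` is what makes the splitter stratify on groups rather than on the class labels.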


















As far as I know, there is no way to do this directly with the function, but you could apply train_test_split to each group and then concatenate the splits with pd.concat, like this:

import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_group(x):
    # Split the rows of a single group (default test size).
    X_train, X_test, y_train, y_test = train_test_split(x['data'], x['labels'])
    return pd.Series([X_train, X_test, y_train, y_test], index=['X_train', 'X_test', 'y_train', 'y_test'])

# Split every group separately, then concatenate the per-group pieces.
final = df.groupby('group').apply(train_test_split_group).apply(lambda x: pd.concat(x.tolist()))
final['X_train'].dropna()

1 bb
3 dd
4 ee
5 ff
6 gg
7 hh
Name: X_train, dtype: object
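
The per-group idea in this answer can also be written as an explicit loop, which makes it easy to go on to the second half of the question, the accuracy of the test data within each group. The following is a minimal sketch under stated assumptions: the frame is synthetic, the `feature` column, `test_size=0.3` and `random_state=0` are hypothetical choices, and logistic regression is borrowed from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 3 groups of 20 rows,
# two balanced classes, one numeric feature shifted by the label.
rng = np.random.RandomState(0)
labels = np.tile([0, 1], 30)
df = pd.DataFrame({
    "feature": labels + rng.normal(scale=0.5, size=60),
    "labels": labels,
    "group": np.repeat(["x", "y", "z"], 20),
})

# Split each group 70:30 on its own, then concatenate the pieces.
train_parts, test_parts = [], []
for _, part in df.groupby("group"):
    part_train, part_test = train_test_split(part, test_size=0.3, random_state=0)
    train_parts.append(part_train)
    test_parts.append(part_test)
train_df = pd.concat(train_parts)
test_df = pd.concat(test_parts)

# Fit one model on all training rows and predict on all test rows.
model = LogisticRegression(random_state=0)
model.fit(train_df[["feature"]], train_df["labels"])
test_df = test_df.assign(pred=model.predict(test_df[["feature"]]))

# Accuracy of the test data within each group.
per_group_accuracy = {
    name: accuracy_score(part["labels"], part["pred"])
    for name, part in test_df.groupby("group")
}
print(per_group_accuracy)
```

Fitting a single model on the combined training rows is a design choice made for this sketch; fitting one model per group would be an equally plausible reading of the question.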
















        answered Nov 7 at 19:47









        Franco Piccolo
