Split into train and test by group + sklearn cross_val_score

I have a DataFrame in Python, as shown below:

data    labels    group
 aa       1         x
 bb       1         x
 cc       2         y
 dd       1         y
 ee       3         y
 ff       3         x
 gg       3         z
 hh       1         z
 ii       2         z

Randomly splitting into 70:30 training and test sets is straightforward. Here, however, I need the split to be made within each group: 70% of the rows in each group should go to training and the remaining 30% of each group to test. I then want to predict and compute the accuracy on the test data within each group.

I found that cross_val_score handles the splitting, model fitting, and prediction in one call:

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_score
>>> model = LogisticRegression(random_state=0)
>>> scores = cross_val_score(model, data, labels, cv=5)
>>> scores

The documentation of cross_val_score lists a groups parameter, described as:

groups : array-like, with shape (n_samples,), optional
    Group labels for the samples used while splitting the dataset into train/test set.

Does passing the groups parameter as shown below split the data within each group into training (70%) and test (30%) sets, and then make the predictions on each group's test portion?

>>> scores = cross_val_score(model, data, labels, groups=group, cv=5)

Any help is appreciated.

asked Nov 7 at 19:07 by chas, edited Nov 11 at 18:42

You can use pandas to filter by group and then split the filtered data, right?
– jujuBee
Nov 7 at 19:20

Are you looking for the stratify parameter in the train_test_split?
– G. Anderson
Nov 7 at 19:26

Let's say we split the data randomly into 70:30. The split should ensure that 70% of the data from each group goes to the training set and 30% from each group goes to the test set, rather than producing a training set (70%) that contains values from only a few groups. Does stratify do that?
– chas
Nov 7 at 19:27

python scikit-learn

2 Answers

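The stratify parameter of train_test_split takes an array of labels and preserves their proportions in both the train and test sets. It is normally used to maintain class balance, but you can pass the group column instead, so that each group is split in the requested ratio:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['data'], df['labels'], test_size=0.3, stratify=df['group'])

On your toy dataset this seems to be what you want, but I would try it on your full dataset and verify that the groups are balanced by checking the per-group counts in your train and test sets.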
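To make the suggested count check concrete, here is a minimal sketch that rebuilds the toy frame from the question, splits whole rows with stratify on the group column, and counts rows per group in each split. The names train_df, test_df, train_counts and test_counts, as well as test_size=0.3 and random_state=0, are my own choices for illustration and are not from the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame copied from the question.
df = pd.DataFrame({
    "data":   ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii"],
    "labels": [1, 1, 2, 1, 3, 3, 3, 1, 2],
    "group":  ["x", "x", "y", "y", "y", "x", "z", "z", "z"],
})

# Stratify on the group column (not on the labels) so that every group
# contributes to both sides of the split in roughly the same ratio.
train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["group"], random_state=0
)

# Row counts per group in each split.
train_counts = train_df["group"].value_counts().sort_index()
test_counts = test_df["group"].value_counts().sort_index()
print(train_counts.to_dict())
print(test_counts.to_dict())
```

With only three rows per group, the closest the split can get to 70:30 is two training rows and one test row per group, so on real data the per-group ratio will only be approximately 70:30.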

answered Nov 7 at 19:40 by G. Anderson

OP updated with cross_val_score function.
– chas
Nov 11 at 13:13

cross_val_score also has a groups parameter, to which, per the docs, you pass "Group labels for the samples used while splitting the dataset into train/test set.", so that should also work the same way.
– G. Anderson
Nov 12 at 16:15
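A note on the groups parameter discussed in the comment above: as far as I can tell from the scikit-learn documentation, groups is only consumed by group-aware splitters such as GroupKFold, which keep all samples of a group on the same side of a split, and it is ignored when cv is a plain integer. That is the opposite of a 70:30 split inside each group. One possible alternative, sketched below under assumptions, is to generate the splits with StratifiedShuffleSplit using the group column as the stratification target and pass those splits to cross_val_score through cv. The arrays X, y and groups are made-up numeric stand-ins, not the asker's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

rng = np.random.RandomState(0)

# Hypothetical data: 30 samples, 3 groups of 10, binary labels.
y = np.tile([0, 1], 15)
X = y[:, None] + rng.normal(scale=0.1, size=(30, 2))
groups = np.repeat(["x", "y", "z"], 10)

# Five random 70:30 splits, stratified on the group column instead of y,
# so every split takes about 70% of each group for training.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
splits = list(splitter.split(X, groups))

# cv accepts an iterable of (train_indices, test_indices) pairs.
model = LogisticRegression(random_state=0)
scores = cross_val_score(model, X, y, cv=splits)
print(scores)
```

Note that cross_val_score returns one score per split, not one per group, so accuracy within each group still has to be computed separately from the predictions.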


I don't know of a way to do this directly with train_test_split, but you could apply it to each group separately and then concatenate the per-group splits with pd.concat, like this:

import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_group(x):
    # Split one group's rows; the default test_size of 0.25 applies here
    X_train, X_test, y_train, y_test = train_test_split(x['data'], x['labels'])
    return pd.Series([X_train, X_test, y_train, y_test], index=['X_train', 'X_test', 'y_train', 'y_test'])

final = df.groupby('group').apply(train_test_split_group).apply(lambda x: pd.concat(x.tolist()))
final['X_train'].dropna()

1    bb
3    dd
4    ee
5    ff
6    gg
7    hh
Name: X_train, dtype: object
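For the last part of the question (accuracy on the test data within each group), a per-group split can also be sketched with groupby(...).sample, which is available in pandas 1.1 and later. This is a swapped-in technique rather than the code above, and the numeric "feature" column, the frame itself, and the names train_df, test_df and per_group_acc are hypothetical stand-ins for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Hypothetical frame: one numeric feature, binary labels, 3 groups of 10.
y = np.tile([0, 1], 15)
df = pd.DataFrame({
    "feature": y + rng.normal(scale=0.1, size=30),
    "labels": y,
    "group": np.repeat(["x", "y", "z"], 10),
})

# Take 70% of the rows of each group for training; the rest is the test set.
train_df = df.groupby("group").sample(frac=0.7, random_state=0)
test_df = df.drop(train_df.index)

# Fit on the combined training rows and predict the held-out rows.
model = LogisticRegression(random_state=0)
model.fit(train_df[["feature"]], train_df["labels"])
test_df = test_df.assign(pred=model.predict(test_df[["feature"]]))

# Accuracy computed separately for each group of the test set.
correct = test_df["pred"] == test_df["labels"]
per_group_acc = correct.groupby(test_df["group"]).mean()
print(per_group_acc)
```

The design choice here is to train a single model on all groups and only break the evaluation down by group; fitting one model per group would be a different setup.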

answered Nov 7 at 19:47 by Franco Piccolo
