split into train and test by group + sklearn cross_val_score
I have a DataFrame in Python as shown below:
data labels group
aa 1 x
bb 1 x
cc 2 y
dd 1 y
ee 3 y
ff 3 x
gg 3 z
hh 1 z
ii 2 z
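For reference, a minimal reconstruction of this toy DataFrame in pandas (the column names come from the layout above; the exact dtypes are assumptions):
import pandas as pd

df = pd.DataFrame({
    'data':   ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh', 'ii'],
    'labels': [1, 1, 2, 1, 3, 3, 3, 1, 2],
    'group':  ['x', 'x', 'y', 'y', 'y', 'x', 'z', 'z', 'z'],
})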
It is straightforward to randomly split into 70:30 training and test sets. Here, I need to split so that 70% of the data within each group goes to training and 30% of the data within each group goes to test, and then predict and find the accuracy on the test data within each group.
I find that cross_val_score does the splitting, model fitting and predicting with the function below:
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_score
>>> model = LogisticRegression(random_state=0)
>>> scores = cross_val_score(model, data, labels, cv=5)
>>> scores
The documentation of cross_val_score has a groups parameter which says:
groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set.
As above, I need 70% of the data within each group in training and 30% of the data within each group in test, and then to predict and find the accuracy on the test data within each group. Does using the groups parameter as below split the data within each group into training and test sets and make the predictions?
>>> scores = cross_val_score(model, data, labels, groups=group, cv=5)
Any help is appreciated.
python scikit-learn
You can use pandas to filter based on group, and then use the filtered data to split, right?
– jujuBee
Nov 7 at 19:20
Are you looking for the stratify parameter in train_test_split?
– G. Anderson
Nov 7 at 19:26
Let's say we split the data randomly into 70:30. The split should ensure that 70% of the data from each group goes to training and 30% of the data from each group goes to the test set, instead of a training set (70%) that contains values from only a few groups. Does stratify do the same?
– chas
Nov 7 at 19:27
edited Nov 11 at 18:42, asked Nov 7 at 19:07 by chas
2 Answers
The stratify parameter of train_test_split takes the labels on which to stratify the selection, to maintain proper class balance:
X_train, X_test, y_train, y_test = train_test_split(df['data'], df['labels'], stratify=df['group'])
On your toy dataset it seems to be what you want, but I would try it on your full dataset and verify that the split is balanced by checking the counts of each group in your train and test sets.
answered Nov 7 at 19:40 – G. Anderson
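A minimal sketch of that verification step, assuming the toy df reconstructed near the top of the question (test_size=0.3 and random_state=0 are added assumptions to match the 70:30 goal):
from sklearn.model_selection import train_test_split

# Stratify on the group column so each group is represented proportionally
# in both sets; test_size=0.3 targets the 70:30 ratio asked for.
X_train, X_test, y_train, y_test = train_test_split(
    df['data'], df['labels'], test_size=0.3, stratify=df['group'], random_state=0)

# Count how many rows of each group landed in train vs. test.
print(df.loc[X_train.index, 'group'].value_counts())
print(df.loc[X_test.index, 'group'].value_counts())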
OP updated with cross_val_score function.
– chas
Nov 11 at 13:13
cross_val_score also has the groups parameter which, from the docs, takes "Group labels for the samples used while splitting the dataset into train/test set", so that should also work the same way.
– G. Anderson
Nov 12 at 16:15
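For reference, a sketch of how groups is typically passed (the splitter choice and the numeric encoding of the data column are assumptions for illustration, not from the thread): in scikit-learn, groups is consumed by group-aware splitters such as GroupKFold, which keep each group entirely in either train or test rather than splitting 70:30 within groups, so it is worth checking whether that matches the goal above.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Encode the string 'data' column numerically just so the model can fit it;
# in practice the real feature matrix would go here.
X = pd.factorize(df['data'])[0].reshape(-1, 1)
y = df['labels']

model = LogisticRegression(random_state=0)
# GroupKFold keeps each group wholly in either the train or the test fold.
scores = cross_val_score(model, X, y, groups=df['group'], cv=GroupKFold(n_splits=3))
print(scores)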
There is no way that I know of straight from the function, but you could apply train_test_split to the groups and then concatenate the splits with pd.concat, like:
import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_group(x):
    # Split each group's rows independently (default test_size is 0.25).
    X_train, X_test, y_train, y_test = train_test_split(x['data'], x['labels'])
    return pd.Series([X_train, X_test, y_train, y_test], index=['X_train', 'X_test', 'y_train', 'y_test'])

final = df.groupby('group').apply(train_test_split_group).apply(lambda x: pd.concat(x.tolist()))
final['X_train'].dropna()
1 bb
3 dd
4 ee
5 ff
6 gg
7 hh
Name: X_train, dtype: object
answered Nov 7 at 19:47 – Franco Piccolo
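As a follow-up sketch (an added assumption, not part of the answer above): passing test_size=0.3 to the same helper targets the 70:30 ratio the question asks for, and the per-group splits can then be concatenated into full train and test sets.
import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_group(x):
    # test_size=0.3 keeps roughly 70% of each group's rows in train and 30% in test.
    X_train, X_test, y_train, y_test = train_test_split(x['data'], x['labels'], test_size=0.3, random_state=0)
    return pd.Series([X_train, X_test, y_train, y_test], index=['X_train', 'X_test', 'y_train', 'y_test'])

splits = df.groupby('group').apply(train_test_split_group)
train_data = pd.concat(splits['X_train'].tolist())   # ~70% of each group
test_data = pd.concat(splits['X_test'].tolist())     # ~30% of each group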