Fitting sklearn GridSearchCV model
I am trying to solve a regression problem on the Boston dataset with a random forest regressor. I am using GridSearchCV to select the best hyperparameters.
Problem 1
Should I fit the GridSearchCV on some X_train, y_train and then get the best parameters?
OR
Should I fit it on X, y to get the best parameters? (X, y = entire dataset)
Problem 2
Say I fit it on X, y, get the best parameters, and then build a new model with these best parameters.
What should I train this new model on?
Should I train the new model on X_train, y_train or on X, y?
Problem 3
If I train the new model on X, y, how will I validate the results?
My code so far
# Dataframes
feature_cols = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train Test Split of Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Grid Search to get best hyperparameters
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Base estimator for the search (assumed; it must exist before GridSearchCV is built)
RFReg = RandomForestRegressor(random_state=1)
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
# {'max_depth': 10, 'n_estimators': 100}
Train a model with max_depth: 10, n_estimators: 100
RFReg = RandomForestRegressor(max_depth=10, n_estimators=100, random_state=1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
RMSE: 2.8139766730629394
I just want some guidance on what the correct steps would be.
python machine-learning scikit-learn random-forest grid-search
asked Nov 23 '18 at 15:28 by Rookie_123 (edited Nov 23 '18 at 15:41)
This is a question about methodology, and not programming, hence more appropriate for Cross Validated (and arguably off-topic here).
– desertnaut
Nov 23 '18 at 16:22
2 Answers
In general, to tune the hyperparameters, you should always train your model on X_train and use X_test to check the results. You have to tune the parameters based on the results obtained on X_test.
You should never tune hyperparameters on the whole dataset, because that would defeat the purpose of the train/test split (as you correctly ask in Problem 3).
answered Nov 23 '18 at 15:31 by FMarazzi (edited Nov 23 '18 at 15:36)
But then I feel the hyperparameters obtained will be biased toward the samples present in that X_train.
– Rookie_123
Nov 23 '18 at 15:36
Rookie_123, you have a valid concern, but any model/hyperparameters will inherently be biased toward the train set. If they were biased toward a test set, then you technically couldn't speak of a test set to begin with.
– cantdutchthis
Nov 23 '18 at 15:45
This is a valid concern indeed.
Problem 1
GridSearchCV does indeed use cross-validation to find a proper set of hyperparameters. But you should still have a held-out validation set to check that the optimal set of parameters is sound on it (so in the end you have train, test, and validation sets).
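A minimal sketch of that split, under stated assumptions: the data is a synthetic stand-in from make_regression (not the Boston data from the question), and the grid is deliberately tiny so the example runs quickly. The search only ever sees X_train; the held-out part is used once at the end to check the selected model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption; not the Boston dataset from the question)
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=1)

# Hold out data that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Small illustrative grid (assumption; the question used a larger one)
param_grid = {"n_estimators": [20, 50], "max_depth": [4, 8]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # cross-validation happens inside X_train only

best_params = search.best_params_
# Square root of the mean cross-validated MSE of the best parameter set
cv_rmse = float(np.sqrt(-search.best_score_))

# Check the selected model once on the held-out data
test_rmse = float(np.sqrt(mean_squared_error(y_test, search.predict(X_test))))
print(best_params, cv_rmse, test_rmse)
```

If the held-out RMSE is much worse than the cross-validated one, that suggests the selected parameters do not generalize as well as the search implied.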
Problem 2
GridSearchCV already gives you the best estimator; you don't need to train a new one. But CV is really just a check that the model-building procedure is sound, so you can then train on the full dataset (see https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation for a detailed discussion).
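As a rough illustration of both options, the sketch below reuses the fitted search's best_estimator_ and then, as an optional last step, trains a fresh model with the same hyperparameters on every sample. The synthetic data, the small grid, and the names tuned_model and final_model are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption; not the Boston dataset from the question)
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [20, 50], "max_depth": [4, 8]},
    cv=3,
)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is already trained on all of X_train
tuned_model = search.best_estimator_
test_r2 = tuned_model.score(X_test, y_test)  # R^2 on the held-out data

# Optional final step: same hyperparameters, trained on every available sample
final_model = RandomForestRegressor(random_state=1, **search.best_params_)
final_model.fit(X, y)
```

Note that final_model has no untouched data left to be scored on; the held-out score of tuned_model is the estimate to report.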
Problem 3
What you have already validated is the way you trained your model (i.e. you have already validated that the hyperparameters you found are sound and that training works as expected for the data you have).
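One way to make "validating the way you trained" concrete is nested cross-validation, where the whole grid search is treated as the model and scored on outer folds. This is a swapped-in technique for illustration, not something prescribed above; the synthetic data and the small grid are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (assumption; not the Boston dataset from the question)
X, y = make_regression(n_samples=150, n_features=11, noise=10.0, random_state=1)

inner_search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [20, 40], "max_depth": [4, 8]},
    cv=3,
)

# Each outer fold reruns the grid search on its training part
# and scores the refit best estimator on the part left out
outer_scores = cross_val_score(inner_search, X, y, cv=3)
print(outer_scores.mean(), outer_scores.std())
```

The outer scores estimate how well the tuning procedure generalizes, which is why a final model trained on all of X, y can be used without a further held-out check.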
answered Nov 23 '18 at 16:01 by Matthieu Brucher
This clarification is with respect to your answer for Problem 1: when you say GridSearchCV does cross-validation, its cross-validation will still be limited to X_train and y_train; correct me if I am wrong.
– Rookie_123
Nov 24 '18 at 7:30
This clarification is with respect to your answer for Problem 3: so there is no need to validate the model created on the entire dataset with the best parameters obtained by GridSearchCV?
– Rookie_123
Nov 24 '18 at 7:34
Of course, CV will be done on the train dataset. Then you can validate the CV (best estimator) on the test dataset.
– Matthieu Brucher
Nov 24 '18 at 9:24
And indeed, you can use the whole dataset for the final training, as indicated on the datascience question.
– Matthieu Brucher
Nov 24 '18 at 9:24