Confusion matrix for training and validation sets

What is the purpose of the "newdata" argument?
Why don't we need to specify newdata = tlFLAG.t in the first case?

pred <- predict(tree1, type = "class")

confusionMatrix(pred, factor(tlFLAG.t$TERM_FLAG)) 



pred.v <- predict(tree1, type = "class", newdata = tlFLAG.v)

confusionMatrix(pred.v, factor(tlFLAG.v$TERM_FLAG))

asked Nov 15 '18 at 5:09 by FIC

edited Nov 15 '18 at 7:50 by RLave

It's not really clear what you're asking, or at least the point you're trying to make with the code you showed. The "newdata" argument is for making predictions from a model with a new data set.

– mickey
Nov 15 '18 at 5:13

It depends on which data you are going to use in the confusion matrix. Try searching for more information; the documentation linked here has more details: rdocumentation.org/packages/caret/versions/3.45/topics/…

– sai saran
Nov 15 '18 at 5:22

r confusion-matrix


1 Answer

In almost every supervised machine learning task (here, a classification problem), you should split your data into a training set and a test set.

This lets you fit the model on the first set and evaluate it on the second.

If you fit and evaluate on the same data, you expose yourself to overfitting, because most algorithms fit the data they are given as closely as they can.

You may end up with a model that looks perfect on your data but predicts poorly on new data it has not yet seen.
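A minimal sketch of this effect, assuming the built-in iris data and the rpart package. The 100/50 split, the seed, and the control settings (cp = 0, minsplit = 2, minbucket = 1) are illustrative choices, not something taken from the question. A tree that is allowed to grow until its leaves are nearly pure will usually score at least as well on the rows it was fitted on as on held-out rows:

```r
library(rpart)

set.seed(42)                                  # arbitrary seed, for reproducibility
idx   <- sample(nrow(iris), 100)              # 100 random row indices out of 150
train <- iris[idx, ]
test  <- iris[-idx, ]

# Deliberately unrestricted tree: no complexity penalty, splits allowed down to single rows
deep_tree <- rpart(Species ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Accuracy = share of rows where the predicted class matches the true class
train_acc <- mean(predict(deep_tree, type = "class") == train$Species)
test_acc  <- mean(predict(deep_tree, newdata = test, type = "class") == test$Species)

print(c(train = train_acc, test = test_acc))
```

The gap between the two numbers, if any, is a rough indication of how much the tree has adapted to quirks of the training rows.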

For this reason, the predict function has a newdata= argument that lets you evaluate your model on data it was not fitted on.

In your first case, newdata= is not specified, so predict returns predictions for the data the model was trained on. You are "testing" performance on data the model has already seen, so the confusionMatrix could be over-optimistic.

In the second case you pass your validation set (newdata = tlFLAG.v in your code), so the predictions are made for data the model has not seen. The resulting performance estimate is more realistic, and it is the more interesting of the two.
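To make the default concrete, here is a small sketch. It assumes iris and rpart as stand-ins for the asker's tlFLAG.t and tree1, which are not available here. When newdata is omitted, predict() on an rpart model returns the fitted values for the rows the tree was trained on, so passing the training data explicitly should give the same classes:

```r
library(rpart)

set.seed(1)                                   # arbitrary seed, for reproducibility
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]                          # stand-in for tlFLAG.t

fit <- rpart(Species ~ ., data = train, method = "class")   # stand-in for tree1

pred_default  <- predict(fit, type = "class")                   # newdata omitted
pred_explicit <- predict(fit, newdata = train, type = "class")  # training data passed explicitly

# Do both calls give the same predicted class for every training row?
same <- all(as.character(pred_default) == as.character(pred_explicit))
print(same)
```

This is why the first call in the question works without newdata: the model object already knows which leaf each training row fell into.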

Here is an example of the classic approach:

data <- iris # iris dataset

# first split the data
set.seed(123) # for reproducibility
# iris is sorted by species, so draw the 100 row indices from all 150 rows
pos <- sample(nrow(data), 100)

train <- data[pos, ] # random pick of 100 obs
test <- data[-pos, ] # remaining 50

# now you can start with your model - please note that this is a dummy example
library(rpart)

tree <- rpart(Species ~ ., data = train) # fit tree on train data

# make prediction on train data (no need to specify newdata= ) # NOT very useful
pred <- predict(tree, type = "class")
caret::confusionMatrix(pred, train$Species)

# make prediction on test data (remove the response)
pred <- predict(tree, type = "class", newdata = test[, -5]) # I removed Species (5th column in test)

# build confusion matrix from predictions against the truth (i.e. test$Species)
caret::confusionMatrix(pred, test$Species)

Compare the two confusion matrices. Accuracy on the training data tends to be the more flattering of the two, while the test-set figure is the more honest estimate of how the tree will perform on new data.
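If caret is not installed, a similar table can be sketched in base R. This is only an illustration, assuming iris and rpart as in the example; the seed and split size are arbitrary. table() cross-tabulates predicted against actual classes, and overall accuracy is the share of counts on the diagonal. It does not reproduce the additional statistics that caret::confusionMatrix reports.

```r
library(rpart)

set.seed(123)                                 # arbitrary seed, for reproducibility
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- rpart(Species ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Rows: predicted class, columns: true class
cm <- table(Prediction = pred, Reference = test$Species)
print(cm)

# Overall accuracy: correctly classified rows divided by all rows
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```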

answered Nov 15 '18 at 8:10 by RLave (edited Nov 15 '18 at 8:18)
