Confusion matrix for training and validation sets












What is the purpose of the "newdata" argument?
Why don't we need to specify newdata = tlFLAG.t in the first case?



pred <- predict(tree1, type = "class")
confusionMatrix(pred, factor(tlFLAG.t$TERM_FLAG))

pred.v <- predict(tree1, type = "class", newdata = tlFLAG.v)
confusionMatrix(pred.v, factor(tlFLAG.v$TERM_FLAG))









  • 2





    It's not really clear what you're asking, or at least the point you're trying to make with the code you showed. The "newdata" argument is for making predictions from a model with a new data set.

    – mickey
    Nov 15 '18 at 5:13











  • It depends on which data you are going to use in the confusion matrix. Try searching for more information; this documentation page also has more details: rdocumentation.org/packages/caret/versions/3.45/topics/…

    – sai saran
    Nov 15 '18 at 5:22
















r confusion-matrix






edited Nov 15 '18 at 7:50









RLave

asked Nov 15 '18 at 5:09









FICFIC

1 Answer
In almost every supervised machine learning workflow (here, a classification problem), you split your data into a training set and a test set.



You train the algorithm on the first set and evaluate it on the second.



This matters because, if you fit and evaluate on the same data, you cannot detect overfitting: almost every algorithm tries to fit the data it is given as closely as possible.



You can end up with a model that looks nearly perfect on the data it was trained on, but that predicts poorly on new data it has not yet seen.



This is why the predict function has the newdata= argument: it lets you supply a data set other than the training data, so you can check how good the model is on unseen observations.



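In your first case you leave out newdata=, so predict falls back to the data the model was fitted on (tlFLAG.t, assuming that is what tree1 was trained on). That is why you don't need newdata = tlFLAG.t there, and also why that confusionMatrix could be over-optimistic.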



In the second case you pass your held-out set through newdata= (tlFLAG.v in your code), so the predictions are made on data the model has not seen. The resulting performance estimate is more realistic, and it is the more interesting of the two.
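To make the role of newdata= concrete, here is a minimal sketch. It assumes an rpart classification tree like the one the question appears to use; the object names fit, fitted_pred and explicit_pred are mine. It checks that leaving newdata out gives the same classes as passing the training data explicitly:

```r
library(rpart)

set.seed(123)                              # for reproducibility
idx   <- sample(nrow(iris), 100)           # 100 random row indices out of 150
train <- iris[idx, ]                       # training rows

fit <- rpart(Species ~ ., data = train)    # tree fitted on the training rows

# no newdata: predict() returns the fitted classes for the training data
fitted_pred <- predict(fit, type = "class")

# newdata given explicitly: the same rows are passed through the tree again
explicit_pred <- predict(fit, type = "class", newdata = train)

# TRUE when both calls assign the same class to every training row
same <- all(as.character(fitted_pred) == as.character(explicit_pred))
print(same)
```

So for the training set the argument is redundant, and it only becomes necessary once you want predictions for rows the model was not fitted on.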



Here is an example of the classic approach:



data <- iris # iris dataset

# first split the data
set.seed(123) # for reproducibility
pos <- sample(nrow(data), 100) # 100 row indices drawn at random from all 150 rows

train <- data[pos, ] # random pick of 100 obs
test <- data[-pos, ] # remaining 50

# now you can start with your model - please note that this is a dummy example
library(rpart)

tree <- rpart(Species ~ ., data = train) # fit tree on train data

# make prediction on train data (no need to specify newdata= ) # NOT very useful
pred <- predict(tree, type = "class")
caret::confusionMatrix(pred, train$Species)

# make prediction on test data (remove the response)
pred.test <- predict(tree, type = "class", newdata = test[, -5]) # Species (5th column in test) removed
# build confusion matrix from predictions against the truth (i.e. test$Species)
caret::confusionMatrix(pred.test, test$Species)


Compare the two confusion matrices: the one built on the training data tends to be the more flattering, while the one built on the test data is the more honest estimate of how the model will perform on unseen data.
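If caret is not installed, the same train-versus-test comparison can be sketched with base R alone. The helper below (accuracy is a name I made up, not a caret or rpart function) cross-tabulates predictions against the truth with table() and reports the share of observations on the diagonal:

```r
library(rpart)

set.seed(123)                              # for reproducibility
idx   <- sample(nrow(iris), 100)           # random training indices
train <- iris[idx, ]
test  <- iris[-idx, ]                      # the 50 rows not used for training

fit <- rpart(Species ~ ., data = train)

# hypothetical helper: proportion of predictions that match the truth
accuracy <- function(pred, truth) {
  cm <- table(Predicted = pred, Actual = truth)  # confusion matrix as a plain table
  sum(diag(cm)) / sum(cm)                        # correct predictions / all predictions
}

# confusion matrix on the held-out rows
cm_test <- table(Predicted = predict(fit, type = "class", newdata = test),
                 Actual    = test$Species)

acc_train <- accuracy(predict(fit, type = "class"), train$Species)
acc_test  <- accuracy(predict(fit, type = "class", newdata = test), test$Species)

print(cm_test)
print(c(train = acc_train, test = acc_test))
```

caret::confusionMatrix reports this accuracy along with several other statistics, so the base R version is only meant to show what the headline number is made of.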






        edited Nov 15 '18 at 8:18

























        answered Nov 15 '18 at 8:10









        RLave
