Confusion matrix for training and validation sets
What is the purpose of the "newdata" argument?
Why don't we need to specify newdata = tlFLAG.t in the first case?
pred <- predict(tree1, type = "class")                        # first case: no newdata
confusionMatrix(pred, factor(tlFLAG.t$TERM_FLAG))
pred.v <- predict(tree1, type = "class", newdata = tlFLAG.v)  # second case: newdata supplied
confusionMatrix(pred.v, factor(tlFLAG.v$TERM_FLAG))
r confusion-matrix
edited Nov 15 '18 at 7:50 by RLave
asked Nov 15 '18 at 5:09 by FIC
It's not really clear what you're asking, or at least the point you're trying to make with the code you showed. The "newdata" argument is for making predictions from a model with a new data set.
– mickey
Nov 15 '18 at 5:13
It depends on the data you are going to use in the confusion matrix. Try searching for more information, or see the caret documentation, which has more details: rdocumentation.org/packages/caret/versions/3.45/topics/…
– sai saran
Nov 15 '18 at 5:22
1 Answer
In every machine learning workflow (in this case a classification problem), you have to split your data into a train set and a test set.
This is useful because you can train your algorithm on the first set and test it on the second.
This has to be done: otherwise (if you use all the data) you expose yourself to overfitting, because almost every algorithm will try to fit the data you feed it as closely as it can.
You can end up with a model that looks perfect on your data but predicts very poorly on new data it has never seen.
For this reason, the predict function lets you point your model at new data, through the newdata= argument, to "test" its goodness on unseen observations.
So in your first case, by not specifying the newdata= argument, you "test" the performance on the very data the model was trained on, and the resulting confusionMatrix can be over-optimistic.
In your second case you specify newdata = tlFLAG.v, so the predictions are made on held-out data: that performance estimate is more trustworthy, and it is the more interesting of the two.
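In other words, the two calls in your code differ only in which rows get scored (assuming tlFLAG.t is the data tree1 was fit on and tlFLAG.v is your held-out validation set):
# no newdata: predict() falls back to the rows tree1 was fit on -> optimistic estimate
pred <- predict(tree1, type = "class")
confusionMatrix(pred, factor(tlFLAG.t$TERM_FLAG))
# newdata = tlFLAG.v: predictions on rows the model never saw -> honest estimate
pred.v <- predict(tree1, type = "class", newdata = tlFLAG.v)
confusionMatrix(pred.v, factor(tlFLAG.v$TERM_FLAG))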
To see the classic approach end to end, here's a complete example:
data <- iris # iris dataset
# first split the data
set.seed(123) # for reproducibility
pos <- sample(nrow(data), 100) # random pick of 100 of the 150 row indices
train <- data[pos, ] # 100 training obs
test <- data[-pos, ] # remaining 50
# now you can start with your model - please note that this is a dummy example
library(rpart)
tree <- rpart(Species ~ ., data = train) # fit tree on train data
# make predictions on train data (no need to specify newdata=) # NOT very useful
pred <- predict(tree, type = "class")
caret::confusionMatrix(pred, train$Species)
# make predictions on test data (remove the response)
pred <- predict(tree, type = "class", newdata = test[, -5]) # I removed Species (5th column in test)
# build the confusion matrix from predictions against the truth (i.e. test$Species)
caret::confusionMatrix(pred, test$Species)
Note how the performance on the test data is typically a bit worse than on the train data, where it is almost perfect: that gap is exactly the optimism that evaluating on held-out data protects you against.
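Two quick follow-ups, as a minimal sketch reusing the tree, train, and test objects from the example above: you can verify that predict() without newdata is simply scoring the training rows, and you can pull the accuracy out of each confusionMatrix to quantify the train/test gap. (The equality check assumes train has no missing values, where rpart's NA handling could make the two calls differ.)
# predict() with no newdata should match an explicit call on the training data
p1 <- predict(tree, type = "class")
p2 <- predict(tree, type = "class", newdata = train)
all.equal(p1, p2)   # TRUE: both score the rows the model was fit on

# confusionMatrix() stores its summary statistics in the named vector $overall;
# "Accuracy" lets you compare the two estimates directly
acc_train <- caret::confusionMatrix(p1, train$Species)$overall["Accuracy"]
pred_test <- predict(tree, type = "class", newdata = test[, -5])
acc_test  <- caret::confusionMatrix(pred_test, test$Species)$overall["Accuracy"]
acc_train - acc_test   # a positive gap is the optimism of the training-set estimate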
edited Nov 15 '18 at 8:18
answered Nov 15 '18 at 8:10 by RLave