How can I look at a specific train and test set generated in a for loop?
My program divides my dataset into a train and a test set, builds a decision tree based on the train and test sets, and calculates the accuracy, sensitivity and specificity from the confusion matrix.
I added a for loop to rerun my program 100 times, so I get 100 train and test sets. The output of the for loop is result_df, with columns for accuracy, sensitivity and specificity.
This is the for loop:
    result_df <- matrix(ncol = 3, nrow = 100)
    colnames(result_df) <- c("Acc", "Sens", "Spec")
    for (g in 1:100) {
      # Divide into train and test set
      smp_size <- floor(0.8 * nrow(mydata1))
      train_ind <- sample(seq_len(nrow(mydata1)), size = smp_size)
      train <- mydata1[train_ind, ]
      test <- mydata1[-train_ind, ]
      # REST OF MY CODE
    }
My result_df (first 20 rows) looks like this:
    > result_df[1:20,]
       Acc Sens Spec id
    1   26   22   29  1
    2   10   49   11  2
    3   37   43   36  3
    4    4   79    4  4
    5   21   21   20  5
    6   31   17   34  6
    7   57    4   63  7
    8   33    3   39  8
    9   56   42   59  9
    10  65   88   63 10
    11   6   31    7 11
    12  57   44   62 12
    13  25   10   27 13
    14  32   24   32 14
    15  19    8   19 15
    16  27   27   29 16
    17  38   89   33 17
    18  54   32   56 18
    19  35   62   33 19
    20  37    6   40 20
I use ggplot() to plot the specificity and the sensitivity as a scatterplot.
What I want to do:
I want to see, for example, the train and test set behind data point 17.
I think I can do this with the set.seed function, but I am very unfamiliar with it.
r random
asked Nov 7 at 8:58
pineapple
486
1 Answer
First, if your code stored the estimated models, e.g., in a list, you could recover the data from those models. However, it doesn't look like that's the case.
With your current code, all you can see is the last train and test set (number 100). That is because you keep redefining the test, train, and train_ind variables on every iteration. The cheapest way (in terms of memory) to achieve what you want is to store train_ind from each iteration. For instance, you could use:
    train_inds <- vector("list", 100)
    for (g in 1:100) {
      smp_size <- floor(0.8 * nrow(mydata1))
      # Store this iteration's training indices so the split can be rebuilt later
      train_inds[[g]] <- sample(seq_len(nrow(mydata1)), size = smp_size)
      train <- mydata1[train_inds[[g]], ]
      test <- mydata1[-train_inds[[g]], ]
      # The rest
    }
This way you always know which observations were in which set. If you are interested in only one specific iteration, you could save only that one.
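As a minimal sketch of how the stored indices could be used afterwards, the following rebuilds the train and test set for iteration 17. The built-in iris data stands in for mydata1, which is not shown in the question, and the names n_runs, train_17 and test_17 are made up for illustration.

```r
# Assumption: iris is only a stand-in for the asker's `mydata1`.
mydata1 <- iris

n_runs <- 100
smp_size <- floor(0.8 * nrow(mydata1))
train_inds <- vector("list", n_runs)

for (g in seq_len(n_runs)) {
  # Keep the row indices of every training sample
  train_inds[[g]] <- sample(seq_len(nrow(mydata1)), size = smp_size)
  # ... fit the model and fill result_df here ...
}

# Rebuild the split that produced row 17 of result_df
train_17 <- mydata1[train_inds[[17]], ]
test_17 <- mydata1[-train_inds[[17]], ]
```

Only the integer indices are kept, so the memory cost is 100 short vectors rather than 100 copies of the data.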
Lastly, set.seed isn't really going to help here. If all you were doing was running rnorm(1) a hundred times, then yes, by using set.seed you could quickly recover the n-th generated value later. In your case, however, you are not only using sample for train_ind; the model estimation functions are also very likely generating random values.
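The difference can be illustrated with a small sketch. The seed value 42 and the runif(1) call are assumptions chosen for the demonstration; runif(1) merely stands in for whatever random draws a model-fitting function might make.

```r
# Case 1: only rnorm() consumes random numbers, so the stream can be replayed.
set.seed(42)
first_run <- numeric(100)
for (i in 1:100) first_run[i] <- rnorm(1)

set.seed(42)
replay <- rnorm(17)
# The 17th value is recovered by resetting the seed and drawing 17 values.
same_value <- identical(replay[17], first_run[17])

# Case 2: something else draws random numbers in each iteration.
set.seed(42)
second_run <- numeric(100)
for (i in 1:100) {
  second_run[i] <- rnorm(1)
  runif(1)  # hypothetical stand-in for randomness inside model fitting
}
# The extra draws shift the stream, so the 17th value no longer matches.
shifted <- !identical(second_run[17], first_run[17])
```

In the second case, recovering iteration 17 from the seed alone would require replaying every random draw of the first 16 iterations, including those made inside the model code.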
@pineapple, does it answer your question?
– Julius Vainora
Nov 7 at 15:31
answered Nov 7 at 10:42
Julius Vainora
26.5k75877