Compare multiple boolean columns in R
A little crossword puzzle. As always, I think I'm missing something. I have a data frame like this:
id  creator  att1  att2   att3   att...  att500
a1  person1  TRUE  TRUE   FALSE  ...
a2  person2  TRUE  TRUE   TRUE   ...
a3  person1  TRUE  FALSE  FALSE  ...
a4  person1  TRUE  TRUE   FALSE  ...
a5  person2  TRUE  TRUE   FALSE  ...
And so on. I want to count the occurrences of the same attribute combination (about 500 boolean values) by different creators, and do this for each row, adding the count to the respective row. In the example above, I want count = 1 for the first row (a1), because in a5 a different person has made the very same attribute combination. Note that a4 does not count, because it is the same combination but by the same person. Think of self-mixed cocktails and how often they are mixed by different persons independently of each other. Row a2 shall have a count of 0, and so shall a3 (no matching attribute combination); a4 has count = 1 because of a5. a5 has a count of 1 too. However, if other persons mix the same cocktail several times, this shall be counted. I don't want to simply remove duplicates.
My plan is hence to loop through the rows, exclude all cocktails by the same creator of the row, take the attribute combination and compare it with all the rows in the temporary dataset:
for (row in 1:nrow(data)){
  # for each row in data
  creator <- row$creator
  # get creator
  attr_tupel <- row[1, 3:500]
  # return the attribute combination of the row
  data[row]$count <- nrow(data[data$creator != creator & data[3:500] == attr_tupel])
  # into the column $count of the current row write the number of observations that are not from the same creator and match the exact tuple of my ~500 attributes (equal cocktails by different persons)
}
Unfortunately, I can't compare the tuple of the reference row with the other rows, as
‘==’ only defined for equally-sized data frames
And now I'm stuck. I could of course write out each column separately, but that would take ages. Do I need to cast the data frame into a list or vector or something else? (Vector and list didn't work.) Is it at all possible to compare one row of values with many other rows for equality? I don't think duplicating the row would be the solution; besides, R usually just recycles the shorter operand when it runs out of entries to compare. Why not here?
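For reference, here is one way such a row-against-all-rows comparison could look. It is only a sketch, assuming every attribute column is logical; the toy data frame `df` mirrors the five example rows above and all names in it are made up for illustration. The idea is to leave data frames behind for the comparison: `==` on two data frames insists on equal dimensions, whereas a logical matrix compared with a plain vector is recycled. Because recycling runs down the columns, the matrix is transposed first so that each column holds one cocktail.

```r
# Toy data mirroring the five example rows (illustrative names only)
df <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  creator = c("person1", "person2", "person1", "person1", "person2"),
  att1    = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  att3    = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

att_cols <- 3:ncol(df)
m <- as.matrix(df[att_cols])   # logical matrix, one row per cocktail

# t(m) has one column per cocktail, so the reference row (a plain vector)
# is recycled once per cocktail; zero mismatches means an identical combination
same_as_a1    <- colSums(t(m) != m[1, ]) == 0
other_creator <- df$creator != df$creator[1]

# Rows with the same combination as a1 that were made by someone else
count_a1 <- sum(same_as_a1 & other_creator)
```

Wrapping the last three lines in a loop over the rows would fill a count column, although with many rows a grouping approach is likely to be faster.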
I read several threads about comparing several columns with each other, but did not succeed in transferring the solutions to my problem. For example, one wants to look up a single value for the boolean value, whereas I have multiple TRUE values; another is the same; a third wants to convert to a c(), which I could do too and then compare those, but that is kind of a hard way, isn't it?
Lastly (from that last link), I was even thinking of converting the boolean values to numbers and adding up the indices, so that we have
id  creator  att1  att2  att3  ...  index
a1  person1  1     2     0     ...  3
a2  person2  1     2     3     ...  6
and compare that index. Should work. But I feel that is an ugly workaround. Also, thinking of data other than boolean, like several strings, in the long run I'd still like to be able to compare a tuple of columns against each other independent of their content.
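A caveat on the index idea, sketched under the assumption that column j simply contributes j when it is TRUE (the helper names `index_sum` and `row_key` are invented for this illustration): a plain sum of indices is not unique, because different combinations can add up to the same number. A pasted key keeps every position and also carries over to non-boolean columns.

```r
# Hypothetical encoding: column j contributes j when its value is TRUE
index_sum <- function(flags) sum(which(flags))

combo_a <- c(TRUE, TRUE, FALSE)    # att1 and att2 set: 1 + 2
combo_b <- c(FALSE, FALSE, TRUE)   # only att3 set: 3

# Two different combinations end up with the same index
collision <- index_sum(combo_a) == index_sum(combo_b)

# A pasted key records the value at every position instead
row_key <- function(values) paste(values, collapse = "_")
key_a <- row_key(combo_a)
key_b <- row_key(combo_b)
keys_differ <- key_a != key_b
```

For string columns the separator should be a character that cannot occur inside the values, otherwise two different tuples could still paste to the same key.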
What am I missing? :)
Thanks for your help!
As asked for in the comments, here is a short script to create a similar data frame. Keep in mind, though, that there are far more columns to compare.
id <- 1:50
names <- paste("creator", rep(1:10, each = 5))
bools1 <- rnorm(n=50, mean = 5, sd = 3)
bools1 <- ifelse(bools1>5, TRUE, FALSE)
bools2 <- rnorm(n=50, mean = 5, sd = 3)
bools2 <- ifelse(bools2>5, TRUE, FALSE)
bools3 <- rnorm(n=50, mean = 5, sd = 3)
bools3 <- ifelse(bools3>5, TRUE, FALSE)
bools4 <- rnorm(n=50, mean = 5, sd = 3)
bools4 <- ifelse(bools4>5, TRUE, FALSE)
bools5 <- rnorm(n=50, mean = 5, sd = 3)
bools5 <- ifelse(bools5>5, TRUE, FALSE)
data <- data.frame(id, names, bools1, bools2, bools3, bools4, bools5)
r loops boolean comparison
Hi akrun, thanks! You can just cut it off there. It's only there to show that a solution like nrow(data[data$att1 == row$att1 & data$att2 == row$att2 & data$att3 == row$att3, ]) would not be practical. The issue arises from the number of different combinations across about 500 columns.
– 3bbing
Nov 20 '18 at 18:25
@akrun Above I added some code to create an exemplary data frame. Thanks!
– 3bbing
Nov 20 '18 at 18:47
Something like
m1 <- combn(names(data)[-(1:2)], 2, FUN = function(x) rowSums(data[x])); colnames(m1) <- combn(names(data)[-(1:2)], 2, FUN = paste, collapse="_")
– akrun
Nov 20 '18 at 19:00
edited Nov 20 '18 at 18:46
asked Nov 20 '18 at 18:16 by 3bbing
1 Answer
EDIT: Sorry, my first solution misread the question. Try this instead.
You can do this with data.table:
#Your set up data (with seed)
set.seed(123)
id <- 1:50
names <- paste("creator", rep(1:10, each = 5))
bools1 <- rnorm(n=50, mean = 5, sd = 3)
bools1 <- ifelse(bools1>5, TRUE, FALSE)
bools2 <- rnorm(n=50, mean = 5, sd = 3)
bools2 <- ifelse(bools2>5, TRUE, FALSE)
bools3 <- rnorm(n=50, mean = 5, sd = 3)
bools3 <- ifelse(bools3>5, TRUE, FALSE)
bools4 <- rnorm(n=50, mean = 5, sd = 3)
bools4 <- ifelse(bools4>5, TRUE, FALSE)
bools5 <- rnorm(n=50, mean = 5, sd = 3)
bools5 <- ifelse(bools5>5, TRUE, FALSE)
data <- data.frame(id, names, bools1, bools2, bools3, bools4, bools5)
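# Code to run
library(data.table)
setDT(data)  # convert the data.frame to a data.table by reference
# Long format: one row per (id, names, attribute column)
dt_m <- melt(data, id.vars = c("id","names"), variable.factor = TRUE)
# Collapse each row's attribute values into a single "drink" key
dt_m <- dt_m[,.(drink = paste0(value, collapse = "_")), by = .(id, names)]
# Count all rows per drink, then subtract the rows made by the same creator
dt_m[, times_made := .N, by = drink][, times_made_others := times_made - .N, by = .(drink, names)]
# Attach the key and the count to the original data
dt_out <- merge(data, dt_m[, .(id, drink, times_made_others)], by = "id")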
Essentially, this creates the "drinks" by collapsing the attribute columns together, counts the number of times each drink was made by others, and then merges that back onto your original data set.
dt_out
    id  names      bools1  bools2  bools3  bools4  bools5  drink                          times_made_others
1:  1   creator 1  FALSE   TRUE    FALSE   TRUE    TRUE    FALSE_TRUE_FALSE_TRUE_TRUE     3
2:  2   creator 1  FALSE   FALSE   TRUE    TRUE    TRUE    FALSE_FALSE_TRUE_TRUE_TRUE     1
3:  3   creator 1  TRUE    FALSE   FALSE   TRUE    FALSE   TRUE_FALSE_FALSE_TRUE_FALSE    2
4:  4   creator 1  TRUE    TRUE    FALSE   FALSE   TRUE    TRUE_TRUE_FALSE_FALSE_TRUE     0
5:  5   creator 1  TRUE    FALSE   FALSE   FALSE   FALSE   TRUE_FALSE_FALSE_FALSE_FALSE   3
6:  6   creator 2  TRUE    TRUE    FALSE   FALSE   FALSE   TRUE_TRUE_FALSE_FALSE_FALSE    2
7:  7   creator 2  TRUE    FALSE   FALSE   TRUE    FALSE   TRUE_FALSE_FALSE_TRUE_FALSE    2
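The same collapse-and-count idea can also be sketched in base R, for readers who prefer not to depend on data.table. This is an illustrative variant, not the code from the answer: the toy data frame `df` mirrors the five example rows from the question, and names such as `att_cols`, `n_drink` and `n_drink_self` are made up here. It assumes that every matching row by another creator counts individually, which is what the data.table code above does.

```r
# Toy data mirroring the five example rows from the question
df <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  creator = c("person1", "person2", "person1", "person1", "person2"),
  att1    = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  att3    = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

att_cols <- setdiff(names(df), c("id", "creator"))

# One key per row: paste the attribute columns together element-wise
df$drink <- do.call(paste, c(df[att_cols], sep = "_"))

# Number of rows sharing the drink, and sharing both drink and creator
n_drink      <- ave(seq_len(nrow(df)), df$drink, FUN = length)
n_drink_self <- ave(seq_len(nrow(df)), df$drink, df$creator, FUN = length)

# Rows with the same drink that were made by somebody else
df$times_made_others <- n_drink - n_drink_self
```

On the toy rows this gives 1, 0, 0, 1, 2. Row a5 gets 2 because a1 and a4 are both counted, following the rule that repeated mixes by other persons count; the question also states a count of 1 for a5, which would instead correspond to counting distinct other creators.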
Amazing. Thank you so much. I tried working it out with data.table functions and .N before too, but didn't manage for some reason. I never tried grouping by two grouping variables and somehow overwrote the old value in each new row. It's so lean and straightforward! I've never used melt() before, will read into it. In general I need some time to digest your code tbh, but I adapted it to the large dataset, checked some cases, and it looks flawless. Great idea collapsing the "recipe" that way btw! That will be very helpful not only here but down the road. Thanks again!
– 3bbing
Nov 20 '18 at 21:17
Sidenote: Now that the recipe (multiple columns) is reduced to one column, it is also easy to loop through the rows and count. Just in case someone needs it: for (row in 1:nrow(data)){ data$count[row] <- nrow(data[data$recipe == data$recipe[row], ]) } If there is more information in the rows, you can easily adapt the subsetting this way. Again, thanks Chris!
– 3bbing
Dec 2 '18 at 11:31
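As a sketch of how that subsetting might be adapted to leave out the row's own creator: the data frame `df` and its columns `creator`, `recipe` and `count` are illustrative assumptions, with `recipe` holding the collapsed key described above.

```r
# Toy data: collapsed recipe keys for the five example rows of the question
df <- data.frame(
  creator = c("person1", "person2", "person1", "person1", "person2"),
  recipe  = c("TRUE_TRUE_FALSE", "TRUE_TRUE_TRUE", "TRUE_FALSE_FALSE",
              "TRUE_TRUE_FALSE", "TRUE_TRUE_FALSE")
)

df$count <- integer(nrow(df))
for (row in seq_len(nrow(df))) {
  same_recipe   <- df$recipe == df$recipe[row]
  other_creator <- df$creator != df$creator[row]
  # Rows with the same recipe made by a different creator
  df$count[row] <- sum(same_recipe & other_creator)
}
```

Using seq_len(nrow(df)) rather than 1:nrow(df) keeps the loop from running on an empty data frame.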
edited Nov 20 '18 at 20:32
answered Nov 20 '18 at 20:23
Chris
Amazing. Thank you so much. I had tried working it out with data.table functions and .N before too, but didn't manage for some reason. I never tried grouping with two grouping variables and somehow overwrote the old value in each new row. It is so lean and straightforward! I have never used melt() before and will read into it. In general I need some time to digest your code tbh, but I adapted it to the large dataset, checked some cases and it looks flawless. Great idea collapsing the "recipe" that way btw! That will be very helpful not only here but down the road! Thanks again!
– 3bbing
Nov 20 '18 at 21:17
Sidenote: Now that the recipe (multiple columns) is reduced to one column, it is also possible to simply loop through the rows and count. Just in case someone needs it: for (row in 1:nrow(data)){ data$count[row] <- nrow(data[data$recipe == data$recipe[row], ]) } If there is more information in the rows, you can easily adapt the subsetting this way. Again, thanks Chris!
– 3bbing
Dec 2 '18 at 11:31
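As a rough sketch of how the subsetting in that loop could be adapted to skip drinks by the same creator (the toy data frame df and its columns creator and recipe are assumptions for illustration, not the asker's real data):

```r
# Hypothetical toy data with the attributes already collapsed into "recipe"
df <- data.frame(
  creator = c("person1", "person2", "person1", "person2"),
  recipe  = c("TRUE_TRUE_FALSE", "TRUE_TRUE_FALSE",
              "TRUE_FALSE_FALSE", "TRUE_TRUE_TRUE"),
  stringsAsFactors = FALSE
)

df$count <- 0L
for (row in seq_len(nrow(df))) {
  same_recipe   <- df$recipe == df$recipe[row]    # same attribute combination
  other_creator <- df$creator != df$creator[row]  # made by somebody else
  df$count[row] <- sum(same_recipe & other_creator)
}
```

seq_len(nrow(df)) is used instead of 1:nrow(df) so the loop is simply skipped for an empty data frame. On large data the grouped data.table version from the answer will likely be much faster than a row loop.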
Hi akrun, thanks! You can just cut it off there. It's only there to point out that a solution like nrow(data[data$att1 == row$att1 & data$att2 == row$att2 & data$att3 == row$att3, ]) would not be practical. The issue arises mainly from the number of different combinations across about 500 columns.
– 3bbing
Nov 20 '18 at 18:25
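For reference, the column-by-column comparison does not have to be typed out by hand. A minimal sketch, assuming a toy data frame df whose attribute columns all start with "att" (all names here are made up for illustration):

```r
# Hypothetical toy data: three rows, three attribute columns
df <- data.frame(
  creator = c("person1", "person2", "person1"),
  att1    = c(TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE),
  att3    = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

attr_cols <- grep("^att", names(df), value = TRUE)
ref <- df[1, ]  # the reference row

# Compare each attribute column with the reference row's value,
# then combine the per-column results with a logical AND
same_attrs <- Reduce(`&`, Map(`==`, df[attr_cols], ref[attr_cols]))

# Rows with identical attributes made by a different creator
n_match <- sum(same_attrs & df$creator != ref$creator)
```

Map() pairs each column with the single value from the reference row, so no equally sized data frames are needed, and Reduce() folds the per-column results into one logical vector.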
@akrun Above I added some code to create an example dataframe. Thanks!
– 3bbing
Nov 20 '18 at 18:47
Something like
m1 <- combn(names(data)[-(1:2)], 2, FUN = function(x) rowSums(data[x])); colnames(m1) <- combn(names(data)[-(1:2)], 2, FUN = paste, collapse="_")
– akrun
Nov 20 '18 at 19:00