Extract rows from dataframe that have keywords in them (Twitter data in RStudio)























I have a large dataframe (~500,000 observations) consisting of structured Twitter data (e.g. username, retweet counts, text) in RStudio. I want to run a text analysis on the tweets so I can extract observations that have one or more keywords in the tweet text.



I have stored my keywords as keywords_C <- c("climate change","climate","climatechange","global warming","globalwarming"). Tweet text is stored in my dataframe in a column labelled text.



How do I make a new dataframe containing only observations where one or more of the keywords are present in the text column? Alternatively, can I delete observations where the keywords are not present?





Added info:
My dataframe looks something like this...



 id_num,    follower_count,    text                                        ;
123 , 135 , Climate change is not science, it’s religion;
456 , 73 , Interesting article here from Reuters ;
789 , 1367 , Our warming climate is danger #1! ;
345 , 489 , New episode of blue planet! ;


Using the keywords_C value, I'm hoping to write code that will extract rows that contain the keywords and create a new dataframe. So in this example, the new dataframe would be...



 id_num,    follower_count,    text                                        ;
123 , 135 , Climate change is not science, it’s religion;
789 , 1367 , Our warming climate is danger #1! ;




My dataframe is called NewCData



dput(droplevels(head(NewCData, 10)))



structure(list(timestamp = structure(c(1L, 3L, 2L, 6L, 4L, 4L, 
5L, 8L, 7L, 9L), .Label = c("2015-10-30 21:37:58", "2015-10-30 21:38:02",
"2015-10-30 21:38:03", "2015-10-30 21:38:06", "2015-10-30 21:38:07",
"2015-10-30 21:38:10", "2015-10-30 21:38:14", "2015-10-30 21:38:32",
"2015-10-30 21:39:04"), class = "factor"), id_str = structure(c(1L,
3L, 2L, 7L, 4L, 5L, 6L, 9L, 8L, 10L), .Label = c("660209050429186048",
"660209067584016384", "660209072768212992", "660209083505504256",
"660209086143688704", "660209087628578816", "660209102790914048",
"660209119152893952", "660209195162206208", "660209325986549760"
), class = "factor"), user.id_str = structure(c(1L, 3L, 8L, 5L,
5L, 2L, 4L, 6L, 9L, 7L), .Label = c("277335277", "32380087",
"325105950", "33398863", "68956490", "808114195", "87712431",
"90280824", "949996219"), class = "factor"), user.followers_count = structure(c(7L,
2L, 8L, 4L, 4L, 3L, 6L, 9L, 5L, 1L), .Label = c("10212", "1062",
"1389", "15227", "2214", "2851", "38", "4137", "55"), class = "factor"),
ideology = structure(c(2L, 4L, 3L, 9L, 9L, 5L, 8L, 6L, 1L,
7L), .Label = c("-0.309303177803536", "-0.393703659798908",
"-0.795976086971656", "-0.811321629152632", "-0.946143178314071",
"-1.16317298915931", "0.353843466445817", "1.09919837237897",
"2.29286233202781"), class = "factor"), text = structure(c(2L,
9L, 4L, 1L, 3L, 10L, 5L, 7L, 6L, 8L), .Label = c("Better Dead than Red! Bill Gates says that only socialism can save us ",
"Expert briefing on #disarmament #SDGs @NMUN ",
"I see red people Bill Gates says that only socialism can save us from climate change ",
"RT: Oddly enough, some Republicans think climate change is real: Oddly enough,… #UniteBlue ",
"Ted Cruz: ‘Climate change is not science, it’s religion’ via @glennbeck",
"This is an amusing headline: \"Bill Gates says that only socialism can save us from climate change\"",
"Unusual Weather Kills Gulf of Maine Cod : Discovery News #globalwarming ",
"What do the remaining Republican candidates have to say about climate change? #FixGov",
"Who Uses #NASA Earth Science Data? He looks at impact of #aerosols on #climate #weather!",
"Why go for ecosystem basses conservation! #ClimateChange #Raajje #Maldives"
), class = "factor")), .Names = c("timestamp", "id_str",
"user.id_str", "user.followers_count", "ideology", "text"), row.names = c(NA,
10L), class = "data.frame")
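
The dput() output shows that every column, including text and user.followers_count, is stored as a factor. grepl() coerces a factor to character, so keyword matching still works, but numeric work on the follower counts needs an explicit conversion first. A minimal sketch, assuming a small hypothetical data frame `demo` that mimics that structure (its values are illustrative, not the real data):

```r
# Hypothetical two-column frame stored as factors, like the dput() output.
demo <- data.frame(
  user.followers_count = factor(c("38", "1062", "4137")),
  text = factor(c("Expert briefing on #disarmament",
                  "He looks at impact of #aerosols on #climate",
                  "Oddly enough, some Republicans think climate change is real"))
)

# Convert factors via as.character(); calling as.numeric() directly on a
# factor would return the internal level codes, not the printed numbers.
demo$text <- as.character(demo$text)
demo$user.followers_count <- as.numeric(as.character(demo$user.followers_count))

str(demo)
```

Reading the data with stringsAsFactors = FALSE in the first place would avoid the conversion step.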

































  • Can you please share a reproducible example? Use dput(head(twitterData,10)) and add the result to the question. Or dput(droplevels(head(twitterData, 10))) if your data frame has a factor with many levels. See How to make a great R reproducible example
    – Wiktor Stribiżew
    Nov 8 at 7:47












  • My apologies, but I'm not sure I understand your request. Perhaps you could elaborate on what information you need from me?
    – Jason B
    Nov 8 at 9:48










  • See stackoverflow.com/questions/5963269/…. To help you, a portion of your input data is necessary, together with the expected result.
    – Wiktor Stribiżew
    Nov 8 at 9:48










  • I couldn't get an understandable output from the dput function you suggested (sorry, I'm quite inexperienced), but I added more info in my question. Please let me know if this makes sense.
    – Jason B
    Nov 8 at 11:13












  • It is not usable. Add the output you got from dput as is. You do not need to understand it.
    – Wiktor Stribiżew
    Nov 8 at 11:14























twitter rstudio extract keyword text-mining














edited Nov 8 at 11:33

























asked Nov 7 at 11:27









Jason B



























1 Answer




















accepted










You may use



new_df <- NewCData[with(NewCData, grepl(paste0("\\b(?:",paste(keywords_C, collapse="|"),")\\b"), text)),]


See the R demo online



The point here is to combine the keywords into a pattern like



\b(?:climate change|climate|climatechange|global warming|globalwarming)\b


It matches the keywords as whole words: if there is a match in the text column, the row is returned; otherwise, the row is discarded.
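
A minimal, self-contained sketch of the same idea on a toy data frame follows. The data frame `demo` and the `ignore.case = TRUE` and `perl = TRUE` arguments are assumptions added for illustration, not part of the answer above. The sample tweets contain "Climate change" and "#ClimateChange" with capital letters, which a case-sensitive pattern would miss.

```r
# Toy data modelled on the question's example (hypothetical values).
demo <- data.frame(
  id_num = c(123, 456, 789, 345),
  follower_count = c(135, 73, 1367, 489),
  text = c(
    "Climate change is not science, it's religion",
    "Interesting article here from Reuters",
    "Our warming climate is danger #1!",
    "New episode of blue planet!"
  ),
  stringsAsFactors = FALSE
)

keywords_C <- c("climate change", "climate", "climatechange",
                "global warming", "globalwarming")

# Build one alternation pattern wrapped in word boundaries.
# Backslashes must be doubled inside an R string literal.
pattern <- paste0("\\b(?:", paste(keywords_C, collapse = "|"), ")\\b")

# TRUE for rows whose text contains at least one keyword.
# ignore.case = TRUE is an assumption so that "Climate" also matches.
hits <- grepl(pattern, demo$text, ignore.case = TRUE, perl = TRUE)

new_df  <- demo[hits, ]   # keep only rows that mention a keyword
dropped <- demo[!hits, ]  # rows without any keyword

print(new_df$id_num)
```

Negating the logical vector (`!hits`) gives the rows without keywords, which covers the "delete observations" variant of the question. If a keyword ever contained regex metacharacters such as `.` or `+`, it would need escaping before being pasted into the pattern.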
















        answered Nov 8 at 11:52









        Wiktor Stribiżew































             
