Extract rows from dataframe that have keywords in them (Twitter data in RStudio)

I have a large dataframe (~500,000 observations) of structured Twitter data (i.e. username, retweet counts, text) in RStudio. I want to run a text analysis on the tweets so I can extract the observations that have one or more keywords in the tweet text.

I have uploaded my keywords as keywords_C <- c("climate change","climate","climatechange","global warming","globalwarming") . Tweet text is stored in my dataframe in a column labelled text.

How do I make a new dataframe containing only observations where one or more of the keywords are present in the text column? Alternatively, can I delete observations where the keywords are not present?

Added info:
My dataframe looks something like this...

 id_num,    follower_count,    text                                        ;

 123 ,      135  ,             Climate change is not science, it’s religion;

 456 ,      73   ,             Interesting article here from Reuters       ;

 789 ,      1367 ,             Our warming climate is danger #1!           ;

 345 ,      489  ,             New episode of blue planet!                 ;

Using the keywords_C vector, I'm hoping to write code that will extract the rows that contain the keywords and put them in a new dataframe. So in this example, the new dataframe would be...

 id_num,    follower_count,    text                                        ;

 123 ,      135  ,             Climate change is not science, it’s religion;

 789 ,      1367 ,             Our warming climate is danger #1!           ;

My dataframe is called NewCData

dput(droplevels(head(NewCData, 10)))

structure(list(timestamp = structure(c(1L, 3L, 2L, 6L, 4L, 4L,
5L, 8L, 7L, 9L), .Label = c("2015-10-30 21:37:58", "2015-10-30 21:38:02",
"2015-10-30 21:38:03", "2015-10-30 21:38:06", "2015-10-30 21:38:07",
"2015-10-30 21:38:10", "2015-10-30 21:38:14", "2015-10-30 21:38:32",
"2015-10-30 21:39:04"), class = "factor"), id_str = structure(c(1L,
3L, 2L, 7L, 4L, 5L, 6L, 9L, 8L, 10L), .Label = c("660209050429186048",
"660209067584016384", "660209072768212992", "660209083505504256",
"660209086143688704", "660209087628578816", "660209102790914048",
"660209119152893952", "660209195162206208", "660209325986549760"
), class = "factor"), user.id_str = structure(c(1L, 3L, 8L, 5L,
5L, 2L, 4L, 6L, 9L, 7L), .Label = c("277335277", "32380087",
"325105950", "33398863", "68956490", "808114195", "87712431",
"90280824", "949996219"), class = "factor"), user.followers_count = structure(c(7L,
2L, 8L, 4L, 4L, 3L, 6L, 9L, 5L, 1L), .Label = c("10212", "1062",
"1389", "15227", "2214", "2851", "38", "4137", "55"), class = "factor"),
    ideology = structure(c(2L, 4L, 3L, 9L, 9L, 5L, 8L, 6L, 1L,
    7L), .Label = c("-0.309303177803536", "-0.393703659798908",
    "-0.795976086971656", "-0.811321629152632", "-0.946143178314071",
    "-1.16317298915931", "0.353843466445817", "1.09919837237897",
    "2.29286233202781"), class = "factor"), text = structure(c(2L,
    9L, 4L, 1L, 3L, 10L, 5L, 7L, 6L, 8L), .Label = c("Better Dead than Red! Bill Gates says that only socialism can save us ",
    "Expert briefing on  #disarmament #SDGs @NMUN ",
    "I see red people Bill Gates says that only socialism can save us from climate change ",
    "RT: Oddly enough, some Republicans think climate change is real: Oddly enough,…  #UniteBlue ",
    "Ted Cruz: ‘Climate change is not science, it’s religion’  via @glennbeck",
    "This is an amusing headline: \"Bill Gates says that only socialism can save us from climate change\"",
    "Unusual Weather Kills Gulf of Maine Cod : Discovery News #globalwarming  ",
    "What do the remaining Republican candidates have to say about climate change? #FixGov",
    "Who Uses #NASA Earth Science Data? He looks at impact of #aerosols on #climate #weather!",
    "Why go for ecosystem basses conservation! #ClimateChange #Raajje #Maldives"
    ), class = "factor")), .Names = c("timestamp", "id_str",
"user.id_str", "user.followers_count", "ideology", "text"), row.names = c(NA,
10L), class = "data.frame")

edited Nov 8 at 11:33

asked Nov 7 at 11:27

Jason B

245

Can you please share a reproducible example? Use dput(head(twitterData,10)) and add the result to the question. Or dput(droplevels(head(twitterData, 10))) if your data frame has a factor with many levels. See How to make a great R reproducible example
– Wiktor Stribiżew
Nov 8 at 7:47

My apologies, but I'm not sure I understand your request. Perhaps you could elaborate on what information you need from me?
– Jason B
Nov 8 at 9:48

See stackoverflow.com/questions/5963269/…. In order to help you, a portion of your input data is necessary, together with the expected result.
– Wiktor Stribiżew
Nov 8 at 9:48

I couldn't get an understandable output from the dput function you suggested (sorry, I'm quite inexperienced), but I added more info in my question. Please let me know if this makes sense.
– Jason B
Nov 8 at 11:13

It is not usable. Add the output you got from dput as is. You do not need to understand it.
– Wiktor Stribiżew
Nov 8 at 11:14

twitter rstudio extract keyword text-mining


1 Answer

accepted

You may use

new_df <- NewCData[with(NewCData, grepl(paste0("\\b(?:",paste(keywords_C, collapse="|"),")\\b"), text)),]

See the R demo online

The point here is to combine the keywords into a pattern like

\b(?:climate change|climate|climatechange|global warming|globalwarming)\b

It will match the keywords as whole words only. If there is a match in the text column, the row is returned; otherwise the row is discarded.
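As a minimal sketch of the same approach, the snippet below rebuilds the four example rows from the question as a toy data frame and applies the combined pattern. The names pattern, has_keyword and dropped are hypothetical, and ignore.case = TRUE and perl = TRUE are additions that are not part of the answer above: the first makes the capitalised "Climate change" row match the lower-case keywords, the second selects the PCRE engine for the \b and (?:...) syntax.

```r
# Toy data modelled on the example rows in the question (illustrative values only).
NewCData <- data.frame(
  id_num = c(123, 456, 789, 345),
  follower_count = c(135, 73, 1367, 489),
  text = c("Climate change is not science, it's religion",
           "Interesting article here from Reuters",
           "Our warming climate is danger #1!",
           "New episode of blue planet!"),
  stringsAsFactors = FALSE
)

keywords_C <- c("climate change", "climate", "climatechange",
                "global warming", "globalwarming")

# Join the keywords into one alternation wrapped in word boundaries.
# Assumption: the keywords contain no regex metacharacters that need escaping.
pattern <- paste0("\\b(?:", paste(keywords_C, collapse = "|"), ")\\b")

# TRUE for rows whose text contains at least one keyword, ignoring case.
has_keyword <- grepl(pattern, NewCData$text, ignore.case = TRUE, perl = TRUE)

new_df  <- NewCData[has_keyword, ]   # rows that mention a keyword
dropped <- NewCData[!has_keyword, ]  # rows that mention none of them
```

Negating the logical vector (!has_keyword) gives the rows to delete, which covers the alternative asked about in the question. Without ignore.case = TRUE, only the "Our warming climate" row would be kept here.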

answered Nov 8 at 11:52

Wiktor Stribiżew

301k16122197
