Parsing a string into a nested data.table












3














I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:



item A...1
item B...2
item C...3
item D...2
1=foo
2=bar
3=baz


My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:



foo | bar | baz
----+-----+----
1 | 2 | 1


I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:



code | reason
-----+-------
1 | foo
2 | bar
3 | baz


Here's how I tried to do it, using data.table and stringr.



encounter_alerts[, whys := lapply(
str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
)]


I am very confused by the error message I get when I try to do this:



Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code",  :
file not found: 1=foo


I am explicitly using text rather than file so I'm not sure how it's trying to interpret the line of text as a filename!



When I test this with a single row, it seems to work fine:



> fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
code reason
1: 1 foo
2: 2 bar


What am I doing wrong? Is there a better way to do this?



Thanks!










share|improve this question





























    3














    I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:



    item A...1
    item B...2
    item C...3
    item D...2
    1=foo
    2=bar
    3=baz


    My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:



    foo | bar | baz
    ----+-----+----
    1 | 2 | 1


    I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:



    code | reason
    -----+-------
    1 | foo
    2 | bar
    3 | baz


    Here's how I tried to do it, using data.table and stringr.



    encounter_alerts[, whys := lapply(
    str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
    FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
    )]


    I am very confused by the error message I get when I try to do this:



    Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code",  :
    file not found: 1=foo


    I am explicitly using text rather than file so I'm not sure how it's trying to interpret the line of text as a filename!



    When I test this with a single row, it seems to work fine:



    > fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
    code reason
    1: 1 foo
    2: 2 bar


    What am I doing wrong? Is there a better way to do this?



    Thanks!










    share|improve this question



























      3












      3








      3







      I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:



      item A...1
      item B...2
      item C...3
      item D...2
      1=foo
      2=bar
      3=baz


      My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:



      foo | bar | baz
      ----+-----+----
      1 | 2 | 1


      I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:



      code | reason
      -----+-------
      1 | foo
      2 | bar
      3 | baz


      Here's how I tried to do it, using data.table and stringr.



      encounter_alerts[, whys := lapply(
      str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
      FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
      )]


      I am very confused by the error message I get when I try to do this:



      Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code",  :
      file not found: 1=foo


      I am explicitly using text rather than file so I'm not sure how it's trying to interpret the line of text as a filename!



      When I test this with a single row, it seems to work fine:



      > fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
      code reason
      1: 1 foo
      2: 2 bar


      What am I doing wrong? Is there a better way to do this?



      Thanks!










      share|improve this question















      I have data in a table in which one cell in every row is a multiline string, which is formatted a a bit like a document with references at the end of it. For example, one of those strings looks like:



      item A...1
      item B...2
      item C...3
      item D...2
      1=foo
      2=bar
      3=baz


      My eventual goal is to extract foo/bar/baz into columns and count the matching items. So for the above, I'd end up with a row including:



      foo | bar | baz
      ----+-----+----
      1 | 2 | 1


      I tried to start by extracting the "reference" mappings, as a nested data.table looking like this:



      code | reason
      -----+-------
      1 | foo
      2 | bar
      3 | baz


      Here's how I tried to do it, using data.table and stringr.



      encounter_alerts[, whys := lapply(
      str_extract_all(text, regex('^[0-9].*$', multiline = TRUE)),
      FUN = function (s) { fread(text = s, sep = '=', header = FALSE, col.names = c('code', 'reason')) }
      )]


      I am very confused by the error message I get when I try to do this:



      Error in fread(text = s, sep = "=", header = FALSE, col.names = c("code",  :
      file not found: 1=foo


      I am explicitly using text rather than file so I'm not sure how it's trying to interpret the line of text as a filename!



      When I test this with a single row, it seems to work fine:



      > fread(text = str_extract_all(encounter_alerts[989]$text, regex('^[0-9].*$', multiline = TRUE))[[1]], sep = '=', header = FALSE, col.names = c('code', 'reason'))
      code reason
      1: 1 foo
      2: 2 bar


      What am I doing wrong? Is there a better way to do this?



      Thanks!







      r data.table






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 14 '18 at 1:53







      Nicholas Riley

















      asked Nov 13 '18 at 3:55









      Nicholas RileyNicholas Riley

      37.1k586114




      37.1k586114
























          4 Answers
          4






          active

          oldest

          votes


















          2














          Note: Edited after reading comments



          From your comment, I tried to reproduce what I understand your data might look like.



          library(tidyverse)

          df <- tibble(
          strings = c("item A...1
          item B...2
          item C...3
          item D...2
          1=foo
          2=bar
          3=baz",
          "item A...2
          item B...2
          item C...3
          item D...1
          1=toto
          2=foo
          3=lala",
          "item A...3
          item B...3
          item C...3
          item D...1
          1=tutu
          3=ttt")
          )


          Code:



          get_ref <- function(string) {
          string %>%
          str_split("n") %>%
          unlist() %>%
          str_subset("=") %>%
          str_split_fixed("=", 2) %>%
          as_tibble() %>%
          rename(code = V1, reason = V2)
          }

          list1 <- map(df$strings, get_ref)

          get_value <- function(string) {
          string %>%
          str_split("n") %>%
          unlist() %>%
          str_subset("\.\.\.") %>%
          str_replace_all(".*\.\.\.", "") %>%
          as_tibble() %>%
          rename(code = value)
          }

          list2 <- map(df$strings, get_value)

          get_result <- function(df1, df2) {
          left_join(df1, df2) %>%
          count(reason) %>%
          spread(reason, n)
          }

          result <- map2_df(list1, list2, get_result)

          result[is.na(result)] <- 0

          result


          Result



          # A tibble: 3 x 7
          bar baz foo lala toto ttt tutu
          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          1 2 1 1 0 0 0 0
          2 0 0 2 1 1 0 0
          3 0 0 0 0 0 3 1





          share|improve this answer























          • I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
            – Nicholas Riley
            Nov 13 '18 at 14:32










          • So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
            – prosoitos
            Nov 13 '18 at 16:05










          • Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
            – prosoitos
            Nov 13 '18 at 16:06












          • If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
            – prosoitos
            Nov 13 '18 at 16:09












          • I edited my answer to match what I now understand your data might look like
            – prosoitos
            Nov 13 '18 at 17:01



















          0














          using stringr and dplyr you can do it easily



          library(stringr)
          library(dplyr)
          v <- as.data.frame(c( "item A...1",
          "item B...2",
          "item C...3",
          "item D...2"))
          colnames(v)<- "items"

          matching <- c( "1",
          "2",
          "3")
          Mapping <- read.table(text="code reason
          1 foo
          2 bar
          3 baz
          ", header = T)

          ## Answer
          df1<- v %>%
          mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
          str(df1)
          str(Mapping)
          df1$code <- as.numeric(df1$code )

          df1 <- left_join(df1,Mapping)


          please have a look






          share|improve this answer





























            0














            There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).



            sample_str <- 'item A...1
            item B...2
            item C...3
            item D...2
            1=foo
            2=bar
            3=baz'

            lines <- stringr::str_split(sample_str, 'n', simplify = T)

            extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]

            dfs_list <- lapply(extracted_strs, function(x) {
            str_parts <- stringr::str_split(x, '=', simplify = T)
            df_args = list()
            df_args[[str_parts[2]]] = as.integer(str_parts[1])
            df_args[['stringsAsFactors']] = F

            do.call(data.frame, df_args)
            })


            df <- do.call(cbind, dfs)





            share|improve this answer





















            • Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
              – Nicholas Riley
              Nov 13 '18 at 14:30



















            0














            Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...



            get_code_reason_mapping <- function(alert_text) {
            alert_text %>%
            str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
            unlist() %>%
            str_split_fixed("=", 2) %>%
            as.data.table() %>%
            setnames(c('code', 'reason'))
            }

            encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

            get_why_codes <- function(alert_text) {
            alert_text %>%
            str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
            unlist() %>%
            str_sub(-1) %>%
            as.data.table() %>%
            setnames(c('code'))
            }

            encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

            get_code_counts <- function(df1, df2) {
            left_join(df1, df2) %>%
            count(reason) %>%
            spread(reason, n)
            }

            code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

            code_counts[is.na(code_counts)] <- 0

            code_counts





            share|improve this answer





















              Your Answer






              StackExchange.ifUsing("editor", function () {
              StackExchange.using("externalEditor", function () {
              StackExchange.using("snippets", function () {
              StackExchange.snippets.init();
              });
              });
              }, "code-snippets");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "1"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });














              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53273549%2fparsing-a-string-into-a-nested-data-table%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              4 Answers
              4






              active

              oldest

              votes








              4 Answers
              4






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              2














              Note: Edited after reading comments



              From your comment, I tried to reproduce what I understand your data might look like.



              library(tidyverse)

              df <- tibble(
              strings = c("item A...1
              item B...2
              item C...3
              item D...2
              1=foo
              2=bar
              3=baz",
              "item A...2
              item B...2
              item C...3
              item D...1
              1=toto
              2=foo
              3=lala",
              "item A...3
              item B...3
              item C...3
              item D...1
              1=tutu
              3=ttt")
              )


              Code:



              get_ref <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("=") %>%
              str_split_fixed("=", 2) %>%
              as_tibble() %>%
              rename(code = V1, reason = V2)
              }

              list1 <- map(df$strings, get_ref)

              get_value <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("\.\.\.") %>%
              str_replace_all(".*\.\.\.", "") %>%
              as_tibble() %>%
              rename(code = value)
              }

              list2 <- map(df$strings, get_value)

              get_result <- function(df1, df2) {
              left_join(df1, df2) %>%
              count(reason) %>%
              spread(reason, n)
              }

              result <- map2_df(list1, list2, get_result)

              result[is.na(result)] <- 0

              result


              Result



              # A tibble: 3 x 7
              bar baz foo lala toto ttt tutu
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
              1 2 1 1 0 0 0 0
              2 0 0 2 1 1 0 0
              3 0 0 0 0 0 3 1





              share|improve this answer























              • I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
                – Nicholas Riley
                Nov 13 '18 at 14:32










              • So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
                – prosoitos
                Nov 13 '18 at 16:05










              • Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
                – prosoitos
                Nov 13 '18 at 16:06












              • If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
                – prosoitos
                Nov 13 '18 at 16:09












              • I edited my answer to match what I now understand your data might look like
                – prosoitos
                Nov 13 '18 at 17:01
















              2














              Note: Edited after reading comments



              From your comment, I tried to reproduce what I understand your data might look like.



              library(tidyverse)

              df <- tibble(
              strings = c("item A...1
              item B...2
              item C...3
              item D...2
              1=foo
              2=bar
              3=baz",
              "item A...2
              item B...2
              item C...3
              item D...1
              1=toto
              2=foo
              3=lala",
              "item A...3
              item B...3
              item C...3
              item D...1
              1=tutu
              3=ttt")
              )


              Code:



              get_ref <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("=") %>%
              str_split_fixed("=", 2) %>%
              as_tibble() %>%
              rename(code = V1, reason = V2)
              }

              list1 <- map(df$strings, get_ref)

              get_value <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("\.\.\.") %>%
              str_replace_all(".*\.\.\.", "") %>%
              as_tibble() %>%
              rename(code = value)
              }

              list2 <- map(df$strings, get_value)

              get_result <- function(df1, df2) {
              left_join(df1, df2) %>%
              count(reason) %>%
              spread(reason, n)
              }

              result <- map2_df(list1, list2, get_result)

              result[is.na(result)] <- 0

              result


              Result



              # A tibble: 3 x 7
              bar baz foo lala toto ttt tutu
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
              1 2 1 1 0 0 0 0
              2 0 0 2 1 1 0 0
              3 0 0 0 0 0 3 1





              share|improve this answer























              • I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
                – Nicholas Riley
                Nov 13 '18 at 14:32










              • So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
                – prosoitos
                Nov 13 '18 at 16:05










              • Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
                – prosoitos
                Nov 13 '18 at 16:06












              • If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
                – prosoitos
                Nov 13 '18 at 16:09












              • I edited my answer to match what I now understand your data might look like
                – prosoitos
                Nov 13 '18 at 17:01














              2












              2








              2






              Note: Edited after reading comments



              From your comment, I tried to reproduce what I understand your data might look like.



              library(tidyverse)

              df <- tibble(
              strings = c("item A...1
              item B...2
              item C...3
              item D...2
              1=foo
              2=bar
              3=baz",
              "item A...2
              item B...2
              item C...3
              item D...1
              1=toto
              2=foo
              3=lala",
              "item A...3
              item B...3
              item C...3
              item D...1
              1=tutu
              3=ttt")
              )


              Code:



              get_ref <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("=") %>%
              str_split_fixed("=", 2) %>%
              as_tibble() %>%
              rename(code = V1, reason = V2)
              }

              list1 <- map(df$strings, get_ref)

              get_value <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("\.\.\.") %>%
              str_replace_all(".*\.\.\.", "") %>%
              as_tibble() %>%
              rename(code = value)
              }

              list2 <- map(df$strings, get_value)

              get_result <- function(df1, df2) {
              left_join(df1, df2) %>%
              count(reason) %>%
              spread(reason, n)
              }

              result <- map2_df(list1, list2, get_result)

              result[is.na(result)] <- 0

              result


              Result



              # A tibble: 3 x 7
              bar baz foo lala toto ttt tutu
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
              1 2 1 1 0 0 0 0
              2 0 0 2 1 1 0 0
              3 0 0 0 0 0 3 1





              share|improve this answer














              Note: Edited after reading comments



              From your comment, I tried to reproduce what I understand your data might look like.



              library(tidyverse)

              df <- tibble(
              strings = c("item A...1
              item B...2
              item C...3
              item D...2
              1=foo
              2=bar
              3=baz",
              "item A...2
              item B...2
              item C...3
              item D...1
              1=toto
              2=foo
              3=lala",
              "item A...3
              item B...3
              item C...3
              item D...1
              1=tutu
              3=ttt")
              )


              Code:



              get_ref <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("=") %>%
              str_split_fixed("=", 2) %>%
              as_tibble() %>%
              rename(code = V1, reason = V2)
              }

              list1 <- map(df$strings, get_ref)

              get_value <- function(string) {
              string %>%
              str_split("n") %>%
              unlist() %>%
              str_subset("\.\.\.") %>%
              str_replace_all(".*\.\.\.", "") %>%
              as_tibble() %>%
              rename(code = value)
              }

              list2 <- map(df$strings, get_value)

              get_result <- function(df1, df2) {
              left_join(df1, df2) %>%
              count(reason) %>%
              spread(reason, n)
              }

              result <- map2_df(list1, list2, get_result)

              result[is.na(result)] <- 0

              result


              Result



              # A tibble: 3 x 7
              bar baz foo lala toto ttt tutu
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
              1 2 1 1 0 0 0 0
              2 0 0 2 1 1 0 0
              3 0 0 0 0 0 3 1






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Nov 13 '18 at 17:00

























              answered Nov 13 '18 at 7:16









              prosoitosprosoitos

              935219




              935219












              • I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
                – Nicholas Riley
                Nov 13 '18 at 14:32










              • So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
                – prosoitos
                Nov 13 '18 at 16:05










              • Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
                – prosoitos
                Nov 13 '18 at 16:06












              • If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
                – prosoitos
                Nov 13 '18 at 16:09












              • I edited my answer to match what I now understand your data might look like
                – prosoitos
                Nov 13 '18 at 17:01


















              • I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
                – Nicholas Riley
                Nov 13 '18 at 14:32










              • So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
                – prosoitos
                Nov 13 '18 at 16:05










              • Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
                – prosoitos
                Nov 13 '18 at 16:06












              • If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
                – prosoitos
                Nov 13 '18 at 16:09












              • I edited my answer to match what I now understand your data might look like
                – prosoitos
                Nov 13 '18 at 17:01
















              I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
              – Nicholas Riley
              Nov 13 '18 at 14:32




              I'm sorry my question was confusing! You are correct in 'note 2'; there is such a multi-line string in every row of my data. I think this approach will work for me but need to correct for some assumptions you (understandably!) made that aren't the case in my real data…
              – Nicholas Riley
              Nov 13 '18 at 14:32












              So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
              – prosoitos
              Nov 13 '18 at 16:05




              So, is your data in a data frame since you talk about "row"? Sorry I am still a little confused about what your data looks like
              – prosoitos
              Nov 13 '18 at 16:05












              Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
              – prosoitos
              Nov 13 '18 at 16:06






              Would you mind giving a little more information about its structure so that I could help you adapt my code to match your data structure?
              – prosoitos
              Nov 13 '18 at 16:06














              If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
              – prosoitos
              Nov 13 '18 at 16:09






              If you have a data frame with one variable consisting of a vector of strings similar to the one you pasted in your question, you could wrap my code in a function and pass it to pmap_df() to apply it to every row and output a data frame with one result per row. I'll be happy to write this up if you give me enough info on your data frame.
              – prosoitos
              Nov 13 '18 at 16:09














              I edited my answer to match what I now understand your data might look like
              – prosoitos
              Nov 13 '18 at 17:01




              I edited my answer to match what I now understand your data might look like
              – prosoitos
              Nov 13 '18 at 17:01













              0














              using stringr and dplyr you can do it easily



              library(stringr)
              library(dplyr)
              v <- as.data.frame(c( "item A...1",
              "item B...2",
              "item C...3",
              "item D...2"))
              colnames(v)<- "items"

              matching <- c( "1",
              "2",
              "3")
              Mapping <- read.table(text="code reason
              1 foo
              2 bar
              3 baz
              ", header = T)

              ## Answer
              df1<- v %>%
              mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
              str(df1)
              str(Mapping)
              df1$code <- as.numeric(df1$code )

              df1 <- left_join(df1,Mapping)


              please have a look






              share|improve this answer


























                0














                using stringr and dplyr you can do it easily



                library(stringr)
                library(dplyr)
                v <- as.data.frame(c( "item A...1",
                "item B...2",
                "item C...3",
                "item D...2"))
                colnames(v)<- "items"

                matching <- c( "1",
                "2",
                "3")
                Mapping <- read.table(text="code reason
                1 foo
                2 bar
                3 baz
                ", header = T)

                ## Answer
                df1<- v %>%
                mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
                str(df1)
                str(Mapping)
                df1$code <- as.numeric(df1$code )

                df1 <- left_join(df1,Mapping)


                please have a look






                share|improve this answer
























                  0












                  0








                  0






                  using stringr and dplyr you can do it easily



                  library(stringr)
                  library(dplyr)
                  v <- as.data.frame(c( "item A...1",
                  "item B...2",
                  "item C...3",
                  "item D...2"))
                  colnames(v)<- "items"

                  matching <- c( "1",
                  "2",
                  "3")
                  Mapping <- read.table(text="code reason
                  1 foo
                  2 bar
                  3 baz
                  ", header = T)

                  ## Answer
                  df1<- v %>%
                  mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
                  str(df1)
                  str(Mapping)
                  df1$code <- as.numeric(df1$code )

                  df1 <- left_join(df1,Mapping)


                  please have a look






                  share|improve this answer












                  using stringr and dplyr you can do it easily



                  library(stringr)
                  library(dplyr)
                  v <- as.data.frame(c( "item A...1",
                  "item B...2",
                  "item C...3",
                  "item D...2"))
                  colnames(v)<- "items"

                  matching <- c( "1",
                  "2",
                  "3")
                  Mapping <- read.table(text="code reason
                  1 foo
                  2 bar
                  3 baz
                  ", header = T)

                  ## Answer
                  df1<- v %>%
                  mutate(code = str_extract(v$items, str_c(matching, collapse = "|")))
                  str(df1)
                  str(Mapping)
                  df1$code <- as.numeric(df1$code )

                  df1 <- left_join(df1,Mapping)


                  please have a look







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 13 '18 at 4:05









                  HunaidkhanHunaidkhan

                  804113




                  804113























                      0














                      There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).



                      sample_str <- 'item A...1
                      item B...2
                      item C...3
                      item D...2
                      1=foo
                      2=bar
                      3=baz'

                      lines <- stringr::str_split(sample_str, 'n', simplify = T)

                      extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]

                      dfs_list <- lapply(extracted_strs, function(x) {
                      str_parts <- stringr::str_split(x, '=', simplify = T)
                      df_args = list()
                      df_args[[str_parts[2]]] = as.integer(str_parts[1])
                      df_args[['stringsAsFactors']] = F

                      do.call(data.frame, df_args)
                      })


                      df <- do.call(cbind, dfs)





                      share|improve this answer





















                      • Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
                        – Nicholas Riley
                        Nov 13 '18 at 14:30
















                      0














                      There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).



                      sample_str <- 'item A...1
                      item B...2
                      item C...3
                      item D...2
                      1=foo
                      2=bar
                      3=baz'

                      lines <- stringr::str_split(sample_str, 'n', simplify = T)

                      extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]

                      dfs_list <- lapply(extracted_strs, function(x) {
                      str_parts <- stringr::str_split(x, '=', simplify = T)
                      df_args = list()
                      df_args[[str_parts[2]]] = as.integer(str_parts[1])
                      df_args[['stringsAsFactors']] = F

                      do.call(data.frame, df_args)
                      })


                      df <- do.call(cbind, dfs)





                      share|improve this answer





















                      • Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
                        – Nicholas Riley
                        Nov 13 '18 at 14:30














                      0












                      0








                      0






                      There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).



                      sample_str <- 'item A...1
                      item B...2
                      item C...3
                      item D...2
                      1=foo
                      2=bar
                      3=baz'

                      lines <- stringr::str_split(sample_str, 'n', simplify = T)

                      extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]

                      dfs_list <- lapply(extracted_strs, function(x) {
                      str_parts <- stringr::str_split(x, '=', simplify = T)
                      df_args = list()
                      df_args[[str_parts[2]]] = as.integer(str_parts[1])
                      df_args[['stringsAsFactors']] = F

                      do.call(data.frame, df_args)
                      })


                      df <- do.call(cbind, dfs)





                      share|improve this answer












                      There's probably a nicer way to do this, but here's a solution that doesn't require any additional libraries (beyond stringr, which you're already using).



                      sample_str <- 'item A...1
                      item B...2
                      item C...3
                      item D...2
                      1=foo
                      2=bar
                      3=baz'

                      lines <- stringr::str_split(sample_str, 'n', simplify = T)

                      extracted_strs <- lines[stringr::str_detect(lines, '^\d=\w+$')]

                      dfs_list <- lapply(extracted_strs, function(x) {
                      str_parts <- stringr::str_split(x, '=', simplify = T)
                      df_args = list()
                      df_args[[str_parts[2]]] = as.integer(str_parts[1])
                      df_args[['stringsAsFactors']] = F

                      do.call(data.frame, df_args)
                      })


                      df <- do.call(cbind, dfs)






                      share|improve this answer












                      share|improve this answer



                      share|improve this answer










                      answered Nov 13 '18 at 5:33









                      Eric BurdenEric Burden

                      512




                      512












                      • Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
                        – Nicholas Riley
                        Nov 13 '18 at 14:30


















                      • Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
                        – Nicholas Riley
                        Nov 13 '18 at 14:30
















                      Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
                      – Nicholas Riley
                      Nov 13 '18 at 14:30




                      Thanks. This creates a data.frame with a variable for foo, bar and bar. Would this be easier to use later on than what I created as my example?
                      – Nicholas Riley
                      Nov 13 '18 at 14:30











                      0














                      Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...



                      get_code_reason_mapping <- function(alert_text) {
                      alert_text %>%
                      str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
                      unlist() %>%
                      str_split_fixed("=", 2) %>%
                      as.data.table() %>%
                      setnames(c('code', 'reason'))
                      }

                      encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

                      get_why_codes <- function(alert_text) {
                      alert_text %>%
                      str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
                      unlist() %>%
                      str_sub(-1) %>%
                      as.data.table() %>%
                      setnames(c('code'))
                      }

                      encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

                      get_code_counts <- function(df1, df2) {
                      left_join(df1, df2) %>%
                      count(reason) %>%
                      spread(reason, n)
                      }

                      code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

                      code_counts[is.na(code_counts)] <- 0

                      code_counts





                      share|improve this answer


























                        0














                        Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...



                        get_code_reason_mapping <- function(alert_text) {
                        alert_text %>%
                        str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
                        unlist() %>%
                        str_split_fixed("=", 2) %>%
                        as.data.table() %>%
                        setnames(c('code', 'reason'))
                        }

                        encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

                        get_why_codes <- function(alert_text) {
                        alert_text %>%
                        str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
                        unlist() %>%
                        str_sub(-1) %>%
                        as.data.table() %>%
                        setnames(c('code'))
                        }

                        encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

                        get_code_counts <- function(df1, df2) {
                        left_join(df1, df2) %>%
                        count(reason) %>%
                        spread(reason, n)
                        }

                        code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

                        code_counts[is.na(code_counts)] <- 0

                        code_counts





                        share|improve this answer
























                          0












                          0








                          0






                          Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...



                          get_code_reason_mapping <- function(alert_text) {
                          alert_text %>%
                          str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
                          unlist() %>%
                          str_split_fixed("=", 2) %>%
                          as.data.table() %>%
                          setnames(c('code', 'reason'))
                          }

                          encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

                          get_why_codes <- function(alert_text) {
                          alert_text %>%
                          str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
                          unlist() %>%
                          str_sub(-1) %>%
                          as.data.table() %>%
                          setnames(c('code'))
                          }

                          encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

                          get_code_counts <- function(df1, df2) {
                          left_join(df1, df2) %>%
                          count(reason) %>%
                          spread(reason, n)
                          }

                          code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

                          code_counts[is.na(code_counts)] <- 0

                          code_counts





                          share|improve this answer












                          Thanks so much to @prosoitos for helping with this. Here's the final code I ended up using, highly based on the accepted answer — it's a mix of different packages and so forth which I hope to clean up eventually, but deadlines happen...



                          get_code_reason_mapping <- function(alert_text) {
                          alert_text %>%
                          str_extract_all(regex('^[0-9]=(.*)$', multiline = T)) %>%
                          unlist() %>%
                          str_split_fixed("=", 2) %>%
                          as.data.table() %>%
                          setnames(c('code', 'reason'))
                          }

                          encounter_alerts$code_reason_mapping <- map(encounter_alerts$alert_text, get_code_reason_mapping)

                          get_why_codes <- function(alert_text) {
                          alert_text %>%
                          str_extract_all(regex('[/n][0-9e][0-9>][0-9]$', multiline = TRUE)) %>%
                          unlist() %>%
                          str_sub(-1) %>%
                          as.data.table() %>%
                          setnames(c('code'))
                          }

                          encounter_alerts$why_codes <- map(encounter_alerts$alert_text, get_why_codes)

                          get_code_counts <- function(df1, df2) {
                          left_join(df1, df2) %>%
                          count(reason) %>%
                          spread(reason, n)
                          }

                          code_counts <- map2_df(encounter_alerts$code_reason_mapping, encounter_alerts$why_codes, get_code_counts)

                          code_counts[is.na(code_counts)] <- 0

                          code_counts






                          share|improve this answer












                          share|improve this answer



                          share|improve this answer










                          answered Nov 15 '18 at 2:10









                          Nicholas RileyNicholas Riley

                          37.1k586114




                          37.1k586114






























                              draft saved

                              draft discarded




















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.





                              Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                              Please pay close attention to the following guidance:


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53273549%2fparsing-a-string-into-a-nested-data-table%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              這個網誌中的熱門文章

                              Hercules Kyvelos

                              Tangent Lines Diagram Along Smooth Curve

                              Yusuf al-Mu'taman ibn Hud