(Note: This follows an earlier mistaken reply just to Duncan) Multiple "amens!" to Duncan's comments... However: Here is a start at my interpretation of how to do what you want. Note first that your "example" listed 4 fields in the line, but you showed only 3. I modified your example for 3 text fields, only one of which has brackets ([...]) in it I assume. Here is a little example of how to use regex's to replace the commas within the brackets by "-", which would presumably then allow you to easily convert the text into a data frame e.g. using textConnection() and read.csv. Obviously, if this is not what you meant, read no further. ##Example txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101", "Mary, [Stats, Designer, R], t14") wh <- grep("\\[.+\\]",txt) ## which records need to be modified? fixup <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ## bracketed expressions, changing "," to "-" ## Unfortunately, the "replacement" argument in sub() is not vectorized, se we need a loop: for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]]) ## replace original bracketed text with fixed up bracketed text> txt[1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 " [2] "Jan, DBA, R101" [3] "Mary, [HadoopAnalyst-DBA-Developer], t14" Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 06/04/2019 10:03 a.m., Amit Govil wrote: > > Hi, > > > > I have a bunch of csv files to read in R. I'm unable to read them > correctly > > because in some of the files, there is a column ("Role") which has comma > in > > the values. > > > > Sample data: > > > > User, Role, Rule, GAPId > > Sam, [HadoopAnalyst, DBA, Developer], R46443 > > > > I'm trying to play with the below code but it doesnt work: > > Since you didn't give a reproducible example, you should at least say > what "doesn't work" means. > > But here's some general advice: if you want to debug code, don't write > huge expressions like the chain of functions below, put things in > temporary variables and make sure you get what you were expecting at > each stage. > > Instead of > > > > files <- list.files(pattern='.*REDUNDANT(.*).csv$') > > > > tbl <- sapply(files, function(f) { > > gsub('\\[|\\]', '"', readLines(f)) %>% > > read.csv(text = ., check.names = FALSE) > > }) %>% > > bind_rows(.id = "id") %>% > > select(id, User, Rule) %>% > > distinct() > > try > > > files <- list.files(pattern='.*REDUNDANT(.*).csv$') > > tmp1 <- sapply(files, function(f) { > gsub('\\[|\\]', '"', readLines(f)) %>% > read.csv(text = ., check.names = FALSE) > }) > > tmp2 <- tmp1 %>% bind_rows(.id = "id") > > tmp3 <- tmp2 %>% select(id, User, Rule) > > tbl <- tmp3 %>% distinct() > > (You don't need pipes here, but it will make it easier to put the giant > expression back together at the end.) > > Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went > wrong. > > Duncan Murdoch > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
... and here's another perhaps simpler, perhaps more efficient (??) way of doing it using strsplit().Note that it uses the fixed field position, 2, of the bracketed roles. Adjust as needed. A better solution would be a regex that avoids the loops (here, the sapply) altogether, but I don't know how to do this. Maybe someone cleverer will offer such a solution. txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101", "Mary, [Stats, Designer, R], t14") wh <- grep("\\[.+\\]", txt) spl <- strsplit(txt[wh], "\\[|\\]") txt[wh] <- sapply(spl, function(y) paste0(y[1], gsub(" *, *","-", y[2]), y[-(1:2)]))> txt[1] "Sam, HadoopAnalyst-DBA-Developer, R46443 " [2] "Jan, DBA, R101" [3] "Mary, Stats-Designer-R, t14" Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:> (Note: This follows an earlier mistaken reply just to Duncan) > > Multiple "amens!" to Duncan's comments... > > However: > > Here is a start at my interpretation of how to do what you want. Note > first that your "example" listed 4 fields in the line, but you showed only > 3. I modified your example for 3 text fields, only one of which has > brackets ([...]) in it I assume. Here is a little example of how to use > regex's to replace the commas within the brackets by "-", which would > presumably then allow you to easily convert the text into a data frame e.g. > using textConnection() and read.csv. Obviously, if this is not what you > meant, read no further. > > ##Example > txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101", > "Mary, [Stats, Designer, R], t14") > > wh <- grep("\\[.+\\]",txt) ## which records need to be modified? > fixup <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ## > bracketed expressions, changing "," to "-" > > ## Unfortunately, the "replacement" argument in sub() is not vectorized, > se we need a loop: > > for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]]) ## replace > original bracketed text with fixed up bracketed text > > > txt > [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 " > [2] "Jan, DBA, R101" > [3] "Mary, [HadoopAnalyst-DBA-Developer], t14" > > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and > sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at gmail.com> > wrote: > >> On 06/04/2019 10:03 a.m., Amit Govil wrote: >> > Hi, >> > >> > I have a bunch of csv files to read in R. I'm unable to read them >> correctly >> > because in some of the files, there is a column ("Role") which has >> comma in >> > the values. >> > >> > Sample data: >> > >> > User, Role, Rule, GAPId >> > Sam, [HadoopAnalyst, DBA, Developer], R46443 >> > >> > I'm trying to play with the below code but it doesnt work: >> >> Since you didn't give a reproducible example, you should at least say >> what "doesn't work" means. >> >> But here's some general advice: if you want to debug code, don't write >> huge expressions like the chain of functions below, put things in >> temporary variables and make sure you get what you were expecting at >> each stage. >> >> Instead of >> > >> > files <- list.files(pattern='.*REDUNDANT(.*).csv$') >> > >> > tbl <- sapply(files, function(f) { >> > gsub('\\[|\\]', '"', readLines(f)) %>% >> > read.csv(text = ., check.names = FALSE) >> > }) %>% >> > bind_rows(.id = "id") %>% >> > select(id, User, Rule) %>% >> > distinct() >> >> try >> >> >> files <- list.files(pattern='.*REDUNDANT(.*).csv$') >> >> tmp1 <- sapply(files, function(f) { >> gsub('\\[|\\]', '"', readLines(f)) %>% >> read.csv(text = ., check.names = FALSE) >> }) >> >> tmp2 <- tmp1 %>% bind_rows(.id = "id") >> >> tmp3 <- tmp2 %>% select(id, User, Rule) >> >> tbl <- tmp3 %>% distinct() >> >> (You don't need pipes here, but it will make it easier to put the giant >> expression back together at the end.) >> >> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went >> wrong. >> >> Duncan Murdoch >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >[[alternative HTML version deleted]]
... and if anyone cares, here's a way to do it using vectorization (no loops) by working only on the subvector containing bracketed text and using the brackets to break up the strings into 3 separate pieces, replacing the commas in the middle piece with dashes, and then reassembling. Quite clumsy, so a better solution is still needed, but here it is: txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101", "Mary, [Stats, Designer, R], t14") wh <- grep("\\[.+\\]",txt) txt1 <- sub("(.+), *\\[.+","\\1",txt[wh]) ## before "[" txt2 <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ## bracketed part txt3 <- sub(".*\\], *(.+?) *$","\\1",txt[wh]) ## after "]" txt[wh]<- paste(txt1, txt2, txt3, sep = ", ")> txt[1] "Sam, [HadoopAnalyst-DBA-Developer], R46443" [2] "Jan, DBA, R101" [3] "Mary, [Stats-Designer-R], t14" Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Sun, Apr 7, 2019 at 10:35 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:> ... and here's another perhaps simpler, perhaps more efficient (??) way of > doing it using strsplit().Note that it uses the fixed field position, 2, of > the bracketed roles. Adjust as needed. > > A better solution would be a regex that avoids the loops (here, the > sapply) altogether, but I don't know how to do this. Maybe someone cleverer > will offer such a solution. > > txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101", > "Mary, [Stats, Designer, R], t14") > > wh <- grep("\\[.+\\]", txt) > spl <- strsplit(txt[wh], "\\[|\\]") > txt[wh] <- sapply(spl, function(y) > paste0(y[1], gsub(" *, *","-", y[2]), y[-(1:2)])) > > > txt > [1] "Sam, HadoopAnalyst-DBA-Developer, R46443 " > [2] "Jan, DBA, R101" > [3] "Mary, Stats-Designer-R, t14" > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and > sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 at gmail.com> wrote: > >> (Note: This follows an earlier mistaken reply just to Duncan) >> >> Multiple "amens!" to Duncan's comments... >> >> However: >> >> Here is a start at my interpretation of how to do what you want. Note >> first that your "example" listed 4 fields in the line, but you showed only >> 3. I modified your example for 3 text fields, only one of which has >> brackets ([...]) in it I assume. Here is a little example of how to use >> regex's to replace the commas within the brackets by "-", which would >> presumably then allow you to easily convert the text into a data frame e.g. >> using textConnection() and read.csv. Obviously, if this is not what you >> meant, read no further. >> >> ##Example >> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101", >> "Mary, [Stats, Designer, R], t14") >> >> wh <- grep("\\[.+\\]",txt) ## which records need to be modified? >> fixup <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ## >> bracketed expressions, changing "," to "-" >> >> ## Unfortunately, the "replacement" argument in sub() is not vectorized, >> se we need a loop: >> >> for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]]) ## replace >> original bracketed text with fixed up bracketed text >> >> > txt >> [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 " >> [2] "Jan, DBA, R101" >> [3] "Mary, [HadoopAnalyst-DBA-Developer], t14" >> >> >> Bert Gunter >> >> "The trouble with having an open mind is that people keep coming along >> and sticking things into it." >> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) >> >> >> On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at gmail.com> >> wrote: >> >>> On 06/04/2019 10:03 a.m., Amit Govil wrote: >>> > Hi, >>> > >>> > I have a bunch of csv files to read in R. I'm unable to read them >>> correctly >>> > because in some of the files, there is a column ("Role") which has >>> comma in >>> > the values. >>> > >>> > Sample data: >>> > >>> > User, Role, Rule, GAPId >>> > Sam, [HadoopAnalyst, DBA, Developer], R46443 >>> > >>> > I'm trying to play with the below code but it doesnt work: >>> >>> Since you didn't give a reproducible example, you should at least say >>> what "doesn't work" means. >>> >>> But here's some general advice: if you want to debug code, don't write >>> huge expressions like the chain of functions below, put things in >>> temporary variables and make sure you get what you were expecting at >>> each stage. >>> >>> Instead of >>> > >>> > files <- list.files(pattern='.*REDUNDANT(.*).csv$') >>> > >>> > tbl <- sapply(files, function(f) { >>> > gsub('\\[|\\]', '"', readLines(f)) %>% >>> > read.csv(text = ., check.names = FALSE) >>> > }) %>% >>> > bind_rows(.id = "id") %>% >>> > select(id, User, Rule) %>% >>> > distinct() >>> >>> try >>> >>> >>> files <- list.files(pattern='.*REDUNDANT(.*).csv$') >>> >>> tmp1 <- sapply(files, function(f) { >>> gsub('\\[|\\]', '"', readLines(f)) %>% >>> read.csv(text = ., check.names = FALSE) >>> }) >>> >>> tmp2 <- tmp1 %>% bind_rows(.id = "id") >>> >>> tmp3 <- tmp2 %>% select(id, User, Rule) >>> >>> tbl <- tmp3 %>% distinct() >>> >>> (You don't need pipes here, but it will make it easier to put the giant >>> expression back together at the end.) >>> >>> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went >>> wrong. >>> >>> Duncan Murdoch >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>[[alternative HTML version deleted]]