(Note: This follows an earlier mistaken reply just to Duncan)
Multiple "amens!" to Duncan's comments...
However:
Here is a start at my interpretation of how to do what you want. Note first
that your "example" listed 4 field names in the header line, but you showed
only 3 values. I modified your example to have 3 text fields, only one of
which, I assume, has brackets ([...]) in it. Here is a little example of how
to use regexes to replace the commas within the brackets by "-", which would
presumably then allow you to easily convert the text into a data frame, e.g.
using textConnection() and read.csv(). Obviously, if this is not what you
meant, read no further.
## Example
txt <- c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ",
         "Jan, DBA, R101",
         "Mary, [Stats, Designer, R], t14")

wh <- grep("\\[.+\\]", txt)  ## which records need to be modified?

## bracketed expressions, changing "," to "-"
fixup <- gsub(" *, *", "-", sub(".+(\\[.+\\]).+", "\\1", txt[wh]))

## Unfortunately, the "replacement" argument in sub() is not vectorized,
## so we need a loop. Loop over the positions in wh, not over its values,
## so that fixup[i] and wh[i] stay in step.
## Replace original bracketed text with fixed up bracketed text:
for (i in seq_along(wh)) txt[wh[i]] <- sub("\\[.+\\]", fixup[i], txt[wh[i]])

> txt
[1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 "
[2] "Jan, DBA, R101"
[3] "Mary, [Stats-Designer-R], t14"
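
For what it's worth, the conversion step mentioned above might look something like the sketch below. It starts from lines whose bracketed commas have already been replaced and reads them with read.csv(text = ...). The column names User, Role and Rule are only a guess based on the header in the original post, and strip.white = TRUE is there to drop the stray spaces around the fields; adjust both to the real files.

```r
## Cleaned-up lines, i.e. commas inside the brackets already replaced by "-"
txt <- c("Sam, [HadoopAnalyst-DBA-Developer], R46443 ",
         "Jan, DBA, R101",
         "Mary, [Stats-Designer-R], t14")

## Column names are an assumption taken from the header in the original post
dat <- read.csv(text = txt, header = FALSE, strip.white = TRUE,
                col.names = c("User", "Role", "Rule"),
                stringsAsFactors = FALSE)

str(dat)  ## inspect the result: one row per input line, one column per field
```

If the real files carry a header line, header = TRUE without col.names is probably the better choice.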
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 06/04/2019 10:03 a.m., Amit Govil wrote:
> > Hi,
> >
> > I have a bunch of csv files to read in R. I'm unable to read them correctly
> > because in some of the files, there is a column ("Role") which has commas in
> > the values.
> >
> > Sample data:
> >
> > User, Role, Rule, GAPId
> > Sam, [HadoopAnalyst, DBA, Developer], R46443
> >
> > I'm trying to play with the below code but it doesn't work:
>
> Since you didn't give a reproducible example, you should at least say
> what "doesn't work" means.
>
> But here's some general advice: if you want to debug code, don't write
> huge expressions like the chain of functions below, put things in
> temporary variables and make sure you get what you were expecting at
> each stage.
>
> Instead of
> >
> > files <- list.files(pattern='.*REDUNDANT(.*).csv$')
> >
> > tbl <- sapply(files, function(f) {
> >   gsub('\\[|\\]', '"', readLines(f)) %>%
> >     read.csv(text = ., check.names = FALSE)
> > }) %>%
> >   bind_rows(.id = "id") %>%
> >   select(id, User, Rule) %>%
> >   distinct()
>
> try
>
>
> files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>
> tmp1 <- sapply(files, function(f) {
>   gsub('\\[|\\]', '"', readLines(f)) %>%
>     read.csv(text = ., check.names = FALSE)
> })
>
> tmp2 <- tmp1 %>% bind_rows(.id = "id")
>
> tmp3 <- tmp2 %>% select(id, User, Rule)
>
> tbl <- tmp3 %>% distinct()
>
> (You don't need pipes here, but it will make it easier to put the giant
> expression back together at the end.)
>
> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went
> wrong.
>
> Duncan Murdoch
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]
... and here's another, perhaps simpler, perhaps more efficient (??) way of
doing it using strsplit(). Note that it uses the fixed field position, 2, of
the bracketed roles. Adjust as needed.
A better solution would be a regex that avoids the loops (here, the sapply)
altogether, but I don't know how to do this. Maybe someone cleverer will
offer such a solution.
txt <- c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ",
         "Jan, DBA, R101",
         "Mary, [Stats, Designer, R], t14")

wh <- grep("\\[.+\\]", txt)
spl <- strsplit(txt[wh], "\\[|\\]")
txt[wh] <- sapply(spl, function(y)
   paste0(y[1], gsub(" *, *", "-", y[2]), y[-(1:2)]))

> txt
[1] "Sam, HadoopAnalyst-DBA-Developer, R46443 "
[2] "Jan, DBA, R101"
[3] "Mary, Stats-Designer-R, t14"
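
On the wish for a single regex: one possible way, sketched here and not taken from any reply in this thread, is a Perl-style lookahead. The pattern matches a comma (with its surrounding spaces) only if a "]" follows before any other bracket, which assumes the brackets are balanced and never nested. Unlike the strsplit() version, it leaves the brackets in place.

```r
txt <- c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ",
         "Jan, DBA, R101",
         "Mary, [Stats, Designer, R], t14")

## A comma counts as "inside brackets" if a "]" comes next before any "[" or "]".
## The lookahead (?= ... ) needs perl = TRUE.
out <- gsub(" *, *(?=[^\\[\\]]*\\])", "-", txt, perl = TRUE)

out
```

Lines without brackets pass through untouched, so no grep() step or index vector is needed.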
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:
> [...]
... and if anyone cares, here's a way to do it using vectorization (no
loops) by working only on the subvector containing bracketed text and
using the brackets to break up the strings into 3 separate pieces,
replacing the commas in the middle piece with dashes, and then
reassembling. Quite clumsy, so a better solution is still needed, but here
it is:
txt <- c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ",
         "Jan, DBA, R101",
         "Mary, [Stats, Designer, R], t14")

wh <- grep("\\[.+\\]", txt)
txt1 <- sub("(.+), *\\[.+", "\\1", txt[wh])  ## before "["
txt2 <- gsub(" *, *", "-", sub(".+(\\[.+\\]).+", "\\1", txt[wh]))  ## bracketed part
txt3 <- sub(".*\\], *(.+?) *$", "\\1", txt[wh])  ## after "]"
txt[wh] <- paste(txt1, txt2, txt3, sep = ", ")

> txt
[1] "Sam, [HadoopAnalyst-DBA-Developer], R46443"
[2] "Jan, DBA, R101"
[3] "Mary, [Stats-Designer-R], t14"
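
A possibly tidier variant of the same split, modify and reassemble idea, offered only as a sketch, is to let base R's regmatches() and its replacement form do the bookkeeping. regexpr() records where the first bracketed chunk sits in each string (and -1 where there is none), so no separate index vector is needed. It assumes at most one bracketed field per line.

```r
txt <- c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ",
         "Jan, DBA, R101",
         "Mary, [Stats, Designer, R], t14")

m <- regexpr("\\[.+\\]", txt)  ## position of the bracketed part, -1 if absent

## Pull out the bracketed parts, swap their commas for "-", and put them back
regmatches(txt, m) <- gsub(" *, *", "-", regmatches(txt, m))

txt
```

Note that this leaves the text outside the brackets exactly as it was, including any trailing spaces.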
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sun, Apr 7, 2019 at 10:35 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:
> [...]