thr3ads.net - R help - [R] Unable to read csv files with comma in values [Apr 2019]

If this information is useful, please help other people find it:
Share via:

Bert Gunter

2019-Apr-07 16:55 UTC

[R] Unable to read csv files with comma in values

(Note: This follows an earlier mistaken reply just to Duncan)

Multiple "amens!" to Duncan's comments...

However:

Here is a start at my interpretation of how to do what you want. Note first
that your "example" listed 4 fields in the line, but you showed only
3. I
modified your example for 3 text fields, only one of which has brackets
([...]) in it I assume. Here is a little example of how to use regex's to
replace the commas within the brackets by "-", which would presumably
then
allow you to easily convert the text into a data frame e.g. using
textConnection() and read.csv. Obviously, if this is not what you meant,
read no further.

##Example
txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan,
DBA, R101",
        "Mary, [Stats, Designer, R], t14")

wh <- grep("\\[.+\\]",txt)  ## which records need to be modified?
fixup <- gsub(" *,
*","-",sub(".+(\\[.+\\]).+","\\1",txt[wh]))
## bracketed
expressions, changing "," to "-"

## Unfortunately, the "replacement" argument in sub() is not
vectorized, se
we need a loop:

for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]]) ##
replace
original bracketed text with fixed up bracketed text
> txt[1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 "
[2] "Jan, DBA, R101"
[3] "Mary, [HadoopAnalyst-DBA-Developer], t14"


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at
gmail.com>
wrote:
> On 06/04/2019 10:03 a.m., Amit Govil wrote:
> > Hi,
> >
> > I have a bunch of csv files to read in R. I'm unable to read them
> correctly
> > because in some of the files, there is a column ("Role")
which has comma
> in
> > the values.
> >
> > Sample data:
> >
> > User, Role, Rule, GAPId
> > Sam, [HadoopAnalyst, DBA, Developer], R46443
> >
> > I'm trying to play with the below code but it doesnt work:
>
> Since you didn't give a reproducible example, you should at least say
> what "doesn't work" means.
>
> But here's some general advice:  if you want to debug code, don't
write
> huge expressions like the chain of functions below, put things in
> temporary variables and make sure you get what you were expecting at
> each stage.
>
> Instead of
> >
> > files <- list.files(pattern='.*REDUNDANT(.*).csv$')
> >
> > tbl <- sapply(files, function(f) {
> >    gsub('\\[|\\]', '"', readLines(f)) %>%
> >      read.csv(text = ., check.names = FALSE)
> > }) %>%
> >    bind_rows(.id = "id") %>%
> >    select(id, User, Rule) %>%
> >    distinct()
>
> try
>
>
> files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>
> tmp1 <- sapply(files, function(f) {
>    gsub('\\[|\\]', '"', readLines(f)) %>%
>      read.csv(text = ., check.names = FALSE)
> })
>
> tmp2 <- tmp1 %>% bind_rows(.id = "id")
>
> tmp3 <- tmp2 %>% select(id, User, Rule)
>
> tbl <- tmp3 %>% distinct()
>
> (You don't need pipes here, but it will make it easier to put the giant
> expression back together at the end.)
>
> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went
> wrong.
>
> Duncan Murdoch
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Bert Gunter

2019-Apr-07 17:35 UTC

head link

[R] Unable to read csv files with comma in values

... and here's another perhaps simpler, perhaps more efficient (??) way of
doing it using strsplit().Note that it uses the fixed field position, 2, of
the bracketed roles. Adjust as needed.

A better solution would be a regex that avoids the loops (here, the sapply)
altogether, but I don't know how to do this. Maybe someone cleverer will
offer such a solution.

txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan,
DBA, R101",
        "Mary, [Stats, Designer, R], t14")

wh <-  grep("\\[.+\\]", txt)
spl <-  strsplit(txt[wh], "\\[|\\]")
txt[wh] <-  sapply(spl, function(y)
   paste0(y[1], gsub(" *, *","-", y[2]), y[-(1:2)]))
> txt[1] "Sam, HadoopAnalyst-DBA-Developer, R46443 "
[2] "Jan, DBA, R101"
[3] "Mary, Stats-Designer-R, t14"

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 at gmail.com>
wrote:
> (Note: This follows an earlier mistaken reply just to Duncan)
>
> Multiple "amens!" to Duncan's comments...
>
> However:
>
> Here is a start at my interpretation of how to do what you want. Note
> first that your "example" listed 4 fields in the line, but you
showed only
> 3. I modified your example for 3 text fields, only one of which has
> brackets ([...]) in it I assume. Here is a little example of how to use
> regex's to replace the commas within the brackets by "-",
which would
> presumably then allow you to easily convert the text into a data frame e.g.
> using textConnection() and read.csv. Obviously, if this is not what you
> meant, read no further.
>
> ##Example
> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443
","Jan, DBA, R101",
>         "Mary, [Stats, Designer, R], t14")
>
> wh <- grep("\\[.+\\]",txt)  ## which records need to be
modified?
> fixup <- gsub(" *,
*","-",sub(".+(\\[.+\\]).+","\\1",txt[wh]))
##
> bracketed expressions, changing "," to "-"
>
> ## Unfortunately, the "replacement" argument in sub() is not
vectorized,
> se we need a loop:
>
> for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]])
## replace
> original bracketed text with fixed up bracketed text
>
> > txt
> [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 "
> [2] "Jan, DBA, R101"
> [3] "Mary, [HadoopAnalyst-DBA-Developer], t14"
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip
)
>
>
> On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at
gmail.com>
> wrote:
>
>> On 06/04/2019 10:03 a.m., Amit Govil wrote:
>> > Hi,
>> >
>> > I have a bunch of csv files to read in R. I'm unable to read
them
>> correctly
>> > because in some of the files, there is a column ("Role")
which has
>> comma in
>> > the values.
>> >
>> > Sample data:
>> >
>> > User, Role, Rule, GAPId
>> > Sam, [HadoopAnalyst, DBA, Developer], R46443
>> >
>> > I'm trying to play with the below code but it doesnt work:
>>
>> Since you didn't give a reproducible example, you should at least
say
>> what "doesn't work" means.
>>
>> But here's some general advice:  if you want to debug code,
don't write
>> huge expressions like the chain of functions below, put things in
>> temporary variables and make sure you get what you were expecting at
>> each stage.
>>
>> Instead of
>> >
>> > files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>> >
>> > tbl <- sapply(files, function(f) {
>> >    gsub('\\[|\\]', '"', readLines(f)) %>%
>> >      read.csv(text = ., check.names = FALSE)
>> > }) %>%
>> >    bind_rows(.id = "id") %>%
>> >    select(id, User, Rule) %>%
>> >    distinct()
>>
>> try
>>
>>
>> files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>>
>> tmp1 <- sapply(files, function(f) {
>>    gsub('\\[|\\]', '"', readLines(f)) %>%
>>      read.csv(text = ., check.names = FALSE)
>> })
>>
>> tmp2 <- tmp1 %>% bind_rows(.id = "id")
>>
>> tmp3 <- tmp2 %>% select(id, User, Rule)
>>
>> tbl <- tmp3 %>% distinct()
>>
>> (You don't need pipes here, but it will make it easier to put the
giant
>> expression back together at the end.)
>>
>> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went
>> wrong.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
	[[alternative HTML version deleted]]

Bert Gunter

2019-Apr-07 19:30 UTC

head link

[R] Unable to read csv files with comma in values

... and if anyone cares, here's a way to do it using vectorization (no
loops) by working only on the subvector containing bracketed text  and
using the brackets to break up the strings into 3 separate pieces,
replacing the commas in the middle piece with dashes, and then
reassembling. Quite clumsy, so a better solution is still needed, but here
it is:

txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan,
DBA, R101",
        "Mary, [Stats, Designer, R], t14")
wh <- grep("\\[.+\\]",txt)
txt1 <- sub("(.+), *\\[.+","\\1",txt[wh]) ## before
"["
txt2 <- gsub(" *,
*","-",sub(".+(\\[.+\\]).+","\\1",txt[wh]))
## bracketed
part
txt3 <- sub(".*\\], *(.+?) *$","\\1",txt[wh]) ## after
"]"
txt[wh]<- paste(txt1, txt2, txt3, sep = ", ")
> txt[1] "Sam, [HadoopAnalyst-DBA-Developer], R46443"
[2] "Jan, DBA, R101"
[3] "Mary, [Stats-Designer-R], t14"


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Apr 7, 2019 at 10:35 AM Bert Gunter <bgunter.4567 at gmail.com>
wrote:
> ... and here's another perhaps simpler, perhaps more efficient (??) way
of
> doing it using strsplit().Note that it uses the fixed field position, 2, of
> the bracketed roles. Adjust as needed.
>
> A better solution would be a regex that avoids the loops (here, the
> sapply) altogether, but I don't know how to do this. Maybe someone
cleverer
> will offer such a solution.
>
> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443
","Jan, DBA, R101",
>         "Mary, [Stats, Designer, R], t14")
>
> wh <-  grep("\\[.+\\]", txt)
> spl <-  strsplit(txt[wh], "\\[|\\]")
> txt[wh] <-  sapply(spl, function(y)
>    paste0(y[1], gsub(" *, *","-", y[2]), y[-(1:2)]))
>
> > txt
> [1] "Sam, HadoopAnalyst-DBA-Developer, R46443 "
> [2] "Jan, DBA, R101"
> [3] "Mary, Stats-Designer-R, t14"
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip
)
>
>
> On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 at
gmail.com> wrote:
>
>> (Note: This follows an earlier mistaken reply just to Duncan)
>>
>> Multiple "amens!" to Duncan's comments...
>>
>> However:
>>
>> Here is a start at my interpretation of how to do what you want. Note
>> first that your "example" listed 4 fields in the line, but
you showed only
>> 3. I modified your example for 3 text fields, only one of which has
>> brackets ([...]) in it I assume. Here is a little example of how to use
>> regex's to replace the commas within the brackets by "-",
which would
>> presumably then allow you to easily convert the text into a data frame
e.g.
>> using textConnection() and read.csv. Obviously, if this is not what you
>> meant, read no further.
>>
>> ##Example
>> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443
","Jan, DBA, R101",
>>         "Mary, [Stats, Designer, R], t14")
>>
>> wh <- grep("\\[.+\\]",txt)  ## which records need to be
modified?
>> fixup <- gsub(" *,
*","-",sub(".+(\\[.+\\]).+","\\1",txt[wh]))
##
>> bracketed expressions, changing "," to "-"
>>
>> ## Unfortunately, the "replacement" argument in sub() is not
vectorized,
>> se we need a loop:
>>
>> for(i in wh) txt[wh[i]] <-
sub("\\[.+\\]",fixup[i],txt[wh[i]]) ## replace
>> original bracketed text with fixed up bracketed text
>>
>> > txt
>> [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 "
>> [2] "Jan, DBA, R101"
>> [3] "Mary, [HadoopAnalyst-DBA-Developer], t14"
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming
along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic
strip )
>>
>>
>> On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan at
gmail.com>
>> wrote:
>>
>>> On 06/04/2019 10:03 a.m., Amit Govil wrote:
>>> > Hi,
>>> >
>>> > I have a bunch of csv files to read in R. I'm unable to
read them
>>> correctly
>>> > because in some of the files, there is a column
("Role") which has
>>> comma in
>>> > the values.
>>> >
>>> > Sample data:
>>> >
>>> > User, Role, Rule, GAPId
>>> > Sam, [HadoopAnalyst, DBA, Developer], R46443
>>> >
>>> > I'm trying to play with the below code but it doesnt work:
>>>
>>> Since you didn't give a reproducible example, you should at
least say
>>> what "doesn't work" means.
>>>
>>> But here's some general advice:  if you want to debug code,
don't write
>>> huge expressions like the chain of functions below, put things in
>>> temporary variables and make sure you get what you were expecting
at
>>> each stage.
>>>
>>> Instead of
>>> >
>>> > files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>>> >
>>> > tbl <- sapply(files, function(f) {
>>> >    gsub('\\[|\\]', '"', readLines(f))
%>%
>>> >      read.csv(text = ., check.names = FALSE)
>>> > }) %>%
>>> >    bind_rows(.id = "id") %>%
>>> >    select(id, User, Rule) %>%
>>> >    distinct()
>>>
>>> try
>>>
>>>
>>> files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>>>
>>> tmp1 <- sapply(files, function(f) {
>>>    gsub('\\[|\\]', '"', readLines(f)) %>%
>>>      read.csv(text = ., check.names = FALSE)
>>> })
>>>
>>> tmp2 <- tmp1 %>% bind_rows(.id = "id")
>>>
>>> tmp3 <- tmp2 %>% select(id, User, Rule)
>>>
>>> tbl <- tmp3 %>% distinct()
>>>
>>> (You don't need pipes here, but it will make it easier to put
the giant
>>> expression back together at the end.)
>>>
>>> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things
went
>>> wrong.
>>>
>>> Duncan Murdoch
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
	[[alternative HTML version deleted]]

R help - Apr 2019 - Unable to read csv files with comma in values

[R] Unable to read csv files with comma in values

[R] Unable to read csv files with comma in values

[R] Unable to read csv files with comma in values