thr3ads.net - R help - [R] CSV format issues [Jul 2012]

If this information is useful, please help other people find it:
Share via:

Guillaume Meurice

2012-Jul-23 13:06 UTC

[R] CSV format issues

Dear all, 

I have some encoding problem which I'm not familiar with.
Here is the case : 
I'm read data files which can have been generated from a  computer either
with global settings in french or in english.

Here is an exemple ouf data file :

* English output
Time,Value
17,-0.0753953
17.05,-6.352454E-02

* French output.
Time,Value
32,-7,183246E-02
32,05,3,469364E-02

In the first case, I can totally retrieve both columns, splitting each line
using the comma as a separator.
In the second case, it's impossible, since the comma (in french) is also
used to separate decimal. Usually, the CSV french file format add some quote, to
distinguish the comma used as column separator from comma used as decimal, like
the following :

Time,Value
32,"-7,183246E-02"
"32,05","3,469364E-02"

Since I'm expecting 2 numbers, I can set that if there is 3 comma, the first
two number are to be gathered as well as the two lefting ones.
But in case of only two comma, which number is the floating one (I know that it
is the second one, but who this is a great source of bugs ...).

the unix tools "file" returns : 
==$ file P23_RD06_High\ Sensitivity\ DNA\
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv
$ P23_RD06_High Sensitivity DNA
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line
terminators
==

Unfortunately, the raw file doesn't contains the precious quote. So sorry to
bother with this question which is not totally related to R (which I'm
using). Do you know if there any facilities using R to get the data in the good
format ?


Bests,
--
Guillaume Meurice - PhD

jim holtman

2012-Jul-23 14:40 UTC

head link

[R] CSV format issues

try this; looks for strings of numbers with commas and quotes them:

> x <- readLines(textConnection("Time,Value+ 32,-7,183246E-02
+ 32,05,3,469364E-02"))> # process the data putting in quotes on scientific
> x.new1 <- gsub("(-?[0-9]+,[0-9]+E-?[0-9]+)",
'"\\1"', x)
> x.new1[1] "Time,Value"            
"32,\"-7,183246E-02\""  
"32,05,\"3,469364E-02\""> # put quotes on just numbers
> x.new2 <- gsub("(-?[0-9]+,[0-9]+)(,|$)",
'"\\1"\\2', x.new1)
> x.new2[1] "Time,Value"                
"32,\"-7,183246E-02\""
"\"32,05\",\"3,469364E-02\""> temp <- tempfile()
> writeLines(x.new2, temp)
> x.input <- read.csv(temp)
> x.input   Time         Value
1    32 -7,183246E-02
2 32,05  3,469364E-02


On Mon, Jul 23, 2012 at 9:06 AM, Guillaume Meurice
<guillaume.meurice at igr.fr> wrote:> Dear all,
>
> I have some encoding problem which I'm not familiar with.
> Here is the case :
> I'm read data files which can have been generated from a  computer
either with global settings in french or in english.
>
> Here is an exemple ouf data file :
>
> * English output
> Time,Value
> 17,-0.0753953
> 17.05,-6.352454E-02
>
> * French output.
> Time,Value
> 32,-7,183246E-02
> 32,05,3,469364E-02
>
> In the first case, I can totally retrieve both columns, splitting each line
using the comma as a separator.
> In the second case, it's impossible, since the comma (in french) is
also used to separate decimal. Usually, the CSV french file format add some
quote, to distinguish the comma used as column separator from comma used as
decimal, like the following :
>
> Time,Value
> 32,"-7,183246E-02"
> "32,05","3,469364E-02"
>
> Since I'm expecting 2 numbers, I can set that if there is 3 comma, the
first two number are to be gathered as well as the two lefting ones.
> But in case of only two comma, which number is the floating one (I know
that it is the second one, but who this is a great source of bugs ...).
>
> the unix tools "file" returns :
> ==> $ file P23_RD06_High\ Sensitivity\ DNA\
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv
> $ P23_RD06_High Sensitivity DNA
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line
terminators
> ==>
>
> Unfortunately, the raw file doesn't contains the precious quote. So
sorry to bother with this question which is not totally related to R (which
I'm using). Do you know if there any facilities using R to get the data in
the good format ?
>
>
> Bests,
> --
> Guillaume Meurice - PhD
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

Duncan Murdoch

2012-Jul-23 14:52 UTC

head link

[R] CSV format issues

On 23/07/2012 9:06 AM, Guillaume Meurice wrote:> Dear all,
>
> I have some encoding problem which I'm not familiar with.
> Here is the case :
> I'm read data files which can have been generated from a  computer
either with global settings in french or in english.
>
> Here is an exemple ouf data file :
>
> * English output
> Time,Value
> 17,-0.0753953
> 17.05,-6.352454E-02
>
> * French output.
> Time,Value
> 32,-7,183246E-02
> 32,05,3,469364E-02
>
> In the first case, I can totally retrieve both columns, splitting each line
using the comma as a separator.
> In the second case, it's impossible, since the comma (in french) is
also used to separate decimal. Usually, the CSV french file format add some
quote, to distinguish the comma used as column separator from comma used as
decimal, like the following :
You could convert columns to character to get this.  R also has the 
read.csv2 and write.csv2 files which take a different approach, i.e. 
using a semicolon as a separator.

If you want to automate this, you can use the Sys.localeconv() function 
to find the details of the current locale.

Duncan Murdoch>
> Time,Value
> 32,"-7,183246E-02"
> "32,05","3,469364E-02"
>
> Since I'm expecting 2 numbers, I can set that if there is 3 comma, the
first two number are to be gathered as well as the two lefting ones.
> But in case of only two comma, which number is the floating one (I know
that it is the second one, but who this is a great source of bugs ...).
>
> the unix tools "file" returns :
> ==> $ file P23_RD06_High\ Sensitivity\ DNA\
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv
> $ P23_RD06_High Sensitivity DNA
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line
terminators
> ==>
>
> Unfortunately, the raw file doesn't contains the precious quote. So
sorry to bother with this question which is not totally related to R (which
I'm using). Do you know if there any facilities using R to get the data in
the good format ?
>
>
> Bests,
> --
> Guillaume Meurice - PhD
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

peter dalgaard

2012-Jul-23 14:54 UTC

head link

[R] CSV format issues

On Jul 23, 2012, at 15:06 , Guillaume Meurice wrote:
> Dear all, 
> 
> I have some encoding problem which I'm not familiar with.
> Here is the case : 
> I'm read data files which can have been generated from a  computer
either with global settings in french or in english.
> 
> Here is an exemple ouf data file :
> 
> * English output
> Time,Value
> 17,-0.0753953
> 17.05,-6.352454E-02
> 
> * French output.
> Time,Value
> 32,-7,183246E-02
> 32,05,3,469364E-02
> 
> In the first case, I can totally retrieve both columns, splitting each line
using the comma as a separator.
> In the second case, it's impossible, since the comma (in french) is
also used to separate decimal. Usually, the CSV french file format add some
quote, to distinguish the comma used as column separator from comma used as
decimal, like the following :
> 
> Time,Value
> 32,"-7,183246E-02"
> "32,05","3,469364E-02"
> 
> Since I'm expecting 2 numbers, I can set that if there is 3 comma, the
first two number are to be gathered as well as the two lefting ones.
> But in case of only two comma, which number is the floating one (I know
that it is the second one, but who this is a great source of bugs ...).
> 
> the unix tools "file" returns : 
> ==> $ file P23_RD06_High\ Sensitivity\ DNA\
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv
> $ P23_RD06_High Sensitivity DNA
Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line
terminators
> ==> 
> 
> Unfortunately, the raw file doesn't contains the precious quote. So
sorry to bother with this question which is not totally related to R (which
I'm using). Do you know if there any facilities using R to get the data in
the good format ?
As you already observe, there can't be. There's just no way of seeing
whether 32,7,8 is 32.7, 8.0 or 32.0, 7.8. That's why the "usual"
CSV format in jurisdictions with comma as decimal separator has semicolon as the
field separator.

You may be able to scrape through with various heuristics, such as

- a leading 0 likely belongs to a decimal part 
- if there's an exponent, then there is likely a decimal point
- there can't be a sign in the decimal part
- times should be (roughly) equidistant and in increasing sequence

R is fairly well equipped with tools to let you create code to do this, look at
strsplit(), count.fields(), grep() etc., but it will be a fair bit of work, and
you may still end up with a handful of truly ambiguous cases.

Ultimately, however, the issue is that someone messed up the collection of data
and now try to make it your problem. In a consulting situation, that should cost
serious extra money. Other options include

- redo the data, this time with all computers set to English (if there's an
internal storage format, this could be more realistic than you think)
- return the faulty data collecting software (some time back in the 1990's,
the Paradox data base had the same stupid bug of double usage of commas, but it
is really unheard of in 2012)

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Maybe Matching Threads

Search for more seemingly similar threads

R help - Jul 2012 - CSV format issues

[R] CSV format issues

[R] CSV format issues

[R] CSV format issues

[R] CSV format issues

Maybe Matching Threads