A common process when data is obtained in an Excel spreadsheet is to save the spreadsheet as a .csv file and then read it into R. Experienced users might have learned to be wary of dates (as I have) but possibly have not experienced what just happened to me. I thought I might just share it with r-help as a cautionary tale.

I received an Excel file giving patient details. Each patient had an ID code in the form of three letters followed by four digits. (Actually a New Zealand National Health Identification.) I saved the .xls file as .csv. Then I opened up the .csv (with Excel) to look at it. In the column of ID codes I saw: Aug-99. Clicking on that entry, it showed 1/08/2699.

In a column of character data, Excel had interpreted AUG2699 as a date.

The .csv did not actually have a date in that cell, but if I had saved the .csv file it would have.

David Scott

_________________________________________________________________
David Scott     Department of Statistics, Tamaki Campus
                The University of Auckland, PB 92019
                Auckland 1142, NEW ZEALAND
Phone: +64 9 373 7599 ext 86830    Fax: +64 9 373 7000
Email: d.scott at auckland.ac.nz

Graduate Officer, Department of Statistics
Director of Consulting, Department of Statistics
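A minimal sketch of how to guard against this on the R side when the .csv is eventually read in (the file name patients.csv and the column names NHI and age below are placeholders, not from the original post):

    ## Read everything as character so nothing is guessed into a date or number,
    ## then convert the genuinely numeric columns explicitly.
    pat <- read.csv("patients.csv", colClasses = "character")
    pat$age <- as.numeric(pat$age)   # 'age' is a hypothetical numeric column

    ## Sanity check: every ID should be three letters followed by four digits.
    pat$NHI[!grepl("^[A-Za-z]{3}[0-9]{4}$", pat$NHI)]

Of course this does not undo damage already baked into the .csv by Excel; it only stops R from adding a second round of silent conversion.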
This is very consistent with Microsoft's philosophy: they know better than you what you want to do.

--- David Scott <d.scott at auckland.ac.nz> wrote:
> In a column of character data, Excel had interpreted AUG2699 as a date.
> The .csv did not actually have a date in that cell, but if I had saved
> the .csv file it would have.
If you format the column as "Text", you won't have this problem. By leaving the cells as "General", you leave it up to Excel to guess at the correct interpretation. You will note that the conversion to a date occurs immediately in Excel when you enter the value; Excel recognizes many formats for entering dates. Either pre-format the column as Text, or prefix the individual entry with an ' to indicate text.

A similar problem occurs in R's read.table() function when a factor has levels that can be interpreted as numbers.

At 10:11 PM 8/27/2007, David wrote:
> In the column of ID codes I saw: Aug-99. Clicking on that entry, it
> showed 1/08/2699. In a column of character data, Excel had interpreted
> AUG2699 as a date.

===============================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS    e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.              URL: http://lcfltd.com/
824 Timberlake Drive                       Tel: 757-467-0954
Virginia Beach, VA 23464-3239              Fax: 757-467-2947

"Vere scire est per causas scire"
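A short, self-contained illustration of that read.table()/factor gotcha (the numbers are arbitrary; before R 4.0, read.table() produced such factors automatically whenever a numeric-looking column contained a stray non-numeric entry):

    x <- factor(c(10, 20, 40, 40))   # a factor whose levels look like numbers
    as.numeric(x)                    # 1 2 3 3 -- the internal level codes, not the data
    as.numeric(as.character(x))      # 10 20 40 40 -- the values that were meant
    as.numeric(levels(x))[x]         # same result; the idiom recommended in ?factor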
On a related note, there's one other amazingly stupid thing that Excel (2002 SP3) does: it exports numbers to CSV as they are displayed, not as they were entered or imported in the first place. For example, 1.2345678 will be exported to CSV/tab-delimited as 1.23 if that column is formatted to show 2 decimals. Whoever doesn't pay attention gets what s/he deserves for trusting Excel in the first place.

> From: David Scott
> Subject: [R] Excel
>
> In a column of character data, Excel had interpreted AUG2699 as a date.
> The .csv did not actually have a date in that cell, but if I had saved
> the .csv file it would have.
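A tiny sketch of the precision loss being described (the value and the output file name are arbitrary placeholders):

    x <- 1.2345678
    formatC(x, format = "f", digits = 2)   # "1.23" -- what a 2-decimal display exports
    as.numeric("1.23") == x                # FALSE: the lost digits cannot be recovered
    ## Writing the .csv from R instead keeps up to 15 significant digits:
    write.csv(data.frame(x = x), "full_precision.csv", row.names = FALSE)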
Greg Snow wrote:
> Or do you trust all of your clients to know to use R(D)COM as well as
> how to install and use it?

Do you trust your clients to be fluent enough in R to use it? For most of my clients, that is not true. For this kind of user, the following strategy works. They have their data in Excel, and I write some macros in Excel which use RExcel to compute results and put them in the sheet. The clients press buttons, much as they would with the Analysis ToolPak in Excel, but they get the full power of R.

>> The formula
>> =RApply("var",A1:A1000) in an Excel cell for example will use
>> R to compute the variance of the data in column A in Excel.
>> If you change any of the values in the range A1:A1000 it will
>> automatically recompute the variance.
>
> And what happens when you enter a value in cell A1001?

If you use the formula

    =RApply("var",DownFrom(A1))

where DownFrom is the following function defined in VBA

    Function DownFrom(startcell As Range) As Range
        Application.Volatile
        Set DownFrom = Range(startcell, startcell.End(xlDown))
    End Function

then writing a value in A1001 will extend the range to which the function "var" is applied. Arguments of worksheet functions can be dynamic ranges, but sadly this is a relatively unknown fact.

> And what happens if you set cell B1 to =A1 and do the magic copy so that
> b2=a2, b3=a3, ... Then put
> =Rapply("var",B1:B1000) in a cell, does the answer match with
> =Rapply("var",A1:A1000)?

Of course it does; what else would you expect to happen? And if you copy the formula in the cell containing =Rapply("var",A1:A1000) into the cell to its right, that cell will automatically contain the formula =Rapply("var",B1:B1000).

> Yes, the auto-recompute could be considered a nice feature, but does it
> really save that much work compared to running a script in R after
> updating the data? (a couple of clicks in Rgui, a couple of keystrokes
> in ESS, one line of code in an R terminal)

Auto-recomputing can do more. My favorite demo is the display of a kernel density estimator. You move a slider controlling the window width. The slider controls the value in a cell. When the cell contents change, R recomputes the kernel density estimate in some Excel cells and Excel then updates the graph. So you have an animated display in Excel which allows you to instantly see the changes in the graph. This is user-controlled animation. It can be done using Tcl/Tk also, for example, but it is much more work.

>> There is one feature in Excel which is extremely convenient,
>> pivot tables. Anybody doing any work as a statistical
>> consultant really ought to know about pivot tables, and I am
>> still surprised how many statisticians do not know about it.
>> Neither Gnumeric nor OpenOffice Calc offers comparably
>> convenient ways of working with multidimensional tables.
>
> I will admit that I have not learned the power of the pivot table, but
> the main reason for that is every time someone demonstrates the power of
> the pivot table to me it is by creating a table, then showing that if
> that is not what you want you can click here, here, here, then here and
> it is a different table, then you click here and here, ... At which
> point I am completely lost as to which clicks did which.
> There are tools in R (the reshape package has been mentioned) that give
> you the power to create the tables, but with a history rather than
> having to remember which clicks are needed. My experience with pivot
> tables so far is that they are a post-hoc kludge to a poorly designed
> original table. I prefer working with a script where, if the result is
> not what I want, I can fix the source rather than admire how pretty the
> band-aid is. (Maybe there are ways to script a pivot table without the
> clicking, but everyone who has tried to market them to me has been more
> impressed with the band-aid than with preventing the original injury.)
>
> What happens when you make a nice pivot table for a client, then they
> come back 2 months later with a new spreadsheet and want a similar
> table for this data? I don't think I could remember the set of clicks
> needed, but with an R script I can pull up the set of commands I used
> before, change the variable and table names, and have the results in a
> couple of minutes.

Once you are used to the mechanics of creating pivot tables, it will be extremely easy to create a similar table just by having a quick look at the original table.

Of course it is possible to design terrible spreadsheets. But it is also quite possible to write terrible R code. What makes spreadsheets very special is autorecalculation and autoupdating of charts, but also the point-and-click method of creating formulas (using relative and absolute references). And of course, knowing only one tool is always dangerous, but this is also true when the single known tool is R.

--
Erich Neuwirth, University of Vienna
Faculty of Computer Science
Computer Supported Didactics Working Group
Visit our SunSITE at http://sunsite.univie.ac.at
Phone: +43-1-4277-39464 Fax: +43-1-4277-39459
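For readers who want to see the scripted alternative referred to in the quoted text, here is a minimal sketch in base R on a built-in data set; the reshape package mentioned above offers a more general melt/cast approach, but the base functions already cover simple pivot-style summaries:

    ## Mean breaks per combination of wool and tension -- roughly what a
    ## simple pivot table of the warpbreaks data would show.
    with(warpbreaks, tapply(breaks, list(wool, tension), mean))

    ## The same summary in long form, easy to rerun on next month's data:
    aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

    ## Cell totals:
    xtabs(breaks ~ wool + tension, data = warpbreaks)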
Ah . . . the hammer analogy. In a conversation like this it's not a question of whether somebody will drop it, it's when it will be dropped.

--Matt

Matt Austin
Statistician
Amgen, Inc.

-----Original Message-----
From: Greg Snow
Sent: Thursday, August 30, 2007 12:14 PM
To: Erich Neuwirth; r-help
Subject: Re: [R] Excel

Earlier this week I was doing some work at our house and, since my wife was at the dentist's office, our 3-year-old son was "helping" me. He really wanted to use the hammer, so I showed him where to tap and he was excited to tap (not doing much good, but also not doing any damage). He liked that so much that he started to look for other things he could use the hammer on; some were benign, others made me jump in and stop him before he did major damage.

What does this have to do with Excel and R? Well, I see Excel as being very much like my hammer, and different users as being like the different users of the hammer. There are some carpenters who can use a hammer along with other tools to make things of pure beauty. Similarly, I expect there are people who can use Excel/spreadsheets along with other tools to make useful and beautiful things (I expect that you (Erich) may be one of those). My use of the hammer is far less expert: about 9 times out of 10 I can survive the use without blood and major bruising, and fix or make something that works, though it is far from a work of art. My use of spreadsheets is similar; I do use them for some specific purposes (holding the bus schedule on my PDA, amortization tables), but I know better than to use them where I would cause more harm than good. I fear that many users of Excel and other spreadsheets are much like my toddler: they have a tool and they want to use it, ignorant of whether they will cause more damage than good. This is part of why I put in the reference to Dr. Burns' page; he does not say to get rid of spreadsheets altogether, but he points out that there are several cases where other tools work better (it is possible to embed a screw in a piece of wood using a hammer, but a screwdriver generally works better).

Erich Neuwirth wrote:
> Greg Snow wrote:
>> Or do you trust all of your clients to know to use R(D)COM as well
>> as how to install and use it?
>
> Do you trust your clients to be fluent enough in R to use it?

I admitted that I don't trust some of my clients with their own data (I think my record is 9 different sexes), so I certainly don't expect them to use R. I trust them to bring their questions and data to me, then I use R (or one of my coworkers uses their favorite stats package) to help them. That way I know that the correct things are being done (on the flip side, they trust me to leave patient care up to them and not try surgery myself; it is much better for the patients that way).

> For most of my clients, that is not true.
> For this kind of user, the following strategy works.
> They have their data in Excel, and I write some macros in
> Excel which use RExcel to compute results and put them in the sheet.
> The clients press buttons, much as they would with the Analysis
> ToolPak in Excel, but they get the full power of R.

Who installs R and sets up R (D)COM on all these computers?
If the above clients want something in addition to what you prepared, do you add additional macros for them? Do they know enough R to write their own Rapply statement? Or do they go looking in the Excel help and use the wrong set of tools? Or do you have them trained well enough that they don't ask any additional questions? I have no problem with you using this approach; I just think that if you are going to advocate that others use Excel with clients, you should also let them know of the extra work they may be committing themselves to.

>>> The formula
>>> =RApply("var",A1:A1000) in an Excel cell for example will use R to
>>> compute the variance of the data in column A in Excel.
>>> If you change any of the values in the range A1:A1000 it will
>>> automatically recompute the variance.
>>
>> And what happens when you enter a value in cell A1001?
>
> If you use the formula
>     =RApply("var",DownFrom(A1))
>
> where DownFrom is the following function defined in VBA
>
>     Function DownFrom(startcell As Range) As Range
>         Application.Volatile
>         Set DownFrom = Range(startcell, startcell.End(xlDown))
>     End Function
>
> then writing a value in A1001 will extend the range to
> which the function "var" is applied.

So to answer a known inconsistency in spreadsheets, you are now bringing in a third program/language. At what point is it just simpler to use R directly? (For me it was back a couple of steps.)

> Arguments of worksheet functions can be dynamic ranges, but
> sadly this is a relatively unknown fact.

And this is one of my main complaints about Excel as well as other MS products and MS wannabes. They may have implemented tools that allow you to do the correct thing, but in practice they make doing the wrong thing so much easier. I now encourage my clients to use at least Access rather than Excel for data entry, because it forces them to think about what the data will look like before entering any values (one of the examples on Dr. Burns' spreadsheet-addiction page came from my group; life would have been much easier if the client had not trusted Excel, but used Access or another database program).

>> And what happens if you set cell B1 to =A1 and do the magic copy so
>> that b2=a2, b3=a3, ... Then put
>> =Rapply("var",B1:B1000) in a cell, does the answer match with
>> =Rapply("var",A1:A1000)?
>
> Of course it does; what else would you expect to happen?
> And if you copy the formula in the cell containing
> =Rapply("var",A1:A1000) into the cell to its right, that cell
> will automatically contain the formula =Rapply("var",B1:B1000).

Well, the naïve expectation is that they would match, but the naïve expectation is also that =var(A1:A1000) would match =var(B1:B1000) (and it doesn't). I expected yours to suffer the same fate, since I thought that Excel would interpret the blanks vs. 0's before passing the info to R. If you have done better at this than Excel, then great job (I wanted to try this out myself, but am having problems installing the RExcel tools).

>> Yes, the auto-recompute could be considered a nice feature, but does
>> it really save that much work compared to running a script in R
>> after updating the data? (a couple of clicks in Rgui, a couple of
>> keystrokes in ESS, one line of code in an R terminal)
>
> Auto-recomputing can do more.
> My favorite demo is the display of a kernel density estimator.
> You move a slider controlling the window width. The slider
> controls the value in a cell.
> When the cell contents change,
> R recomputes the kernel density estimate in some Excel cells
> and Excel then updates the graph. So you have an animated
> display in Excel which allows you to instantly see the
> changes in the graph.
> This is user-controlled animation. It can be done using
> Tcl/Tk also, for example, but it is much more work.

I believe that which is more work depends on the person and their experience. Here is some R code to do something similar:

library(tkrplot)

my.data <- c(rnorm(20, 10, 2), rnorm(30, 15, 3))
my.adjust <- 1

tt <- tktoplevel()
# redraw the density plot with the current bandwidth adjustment
img <- tkrplot(tt, function() plot(density(my.data, adjust = my.adjust)))

f <- function(...) {
    my.a <- as.numeric(tclvalue('adjust'))
    if (my.a != my.adjust) {
        my.adjust <<- my.a
        tkrreplot(img)    # only redraw when the slider value has changed
    }
}

s <- tkscale(tt, command = f, from = 0.05, to = 3.00, variable = 'adjust',
             resolution = 0.05, orient = "horiz")
tkpack(img, s)

The above code is only a small modification of the example for tkrplot. One of these days I'm going to rewrite the Tk demos in the TeachingDemos package to use tkrplot; it makes these things quite easy (maybe my son can use the hammer to help me, maybe not). I do have to say that when I read your description of having R do the computations, but then having Excel create the plot, I shuddered at the thought (and not in a good way).

[snip]

I think there has been enough discussion on the pivot tables, and right now we should probably agree to disagree.

> Of course it is possible to design terrible spreadsheets.
> But it is also quite possible to write terrible R code.

Yes, that is true, but one major difference is that I have never known an R user who saw R as the only available tool. Personally, I recommend that stats students know at least 3 stats packages (and I ask about this when I interview), as well as other computer tools. Too many people that only know Excel think that Excel can solve all problems. A while back I was talking to my boss (a non-statistician) about something and she suggested creating a spreadsheet to track some information; I asked why not use Access instead. Her reply was that she knew how to make a spreadsheet, but did not know Access (or another db), and she asked what the difference was. As I explained some of the differences, she realized that a db was the appropriate way to do what she wanted.

> What makes spreadsheets very special is autorecalculation and
> autoupdating of charts, but also the point-and-click method
> of creating formulas (using relative and absolute references).

I agree with the above statement, but I think that I am thinking of a different definition of 'special' than what you intended.

> And of course, knowing only one tool is always dangerous, but this
> is also true when the single known tool is R.

Here we agree.

Actually, I like the R (D)COM interface idea. A few years ago I considered using it for an introductory class (reasoning that the students would already be familiar with Excel); I don't remember all the reasons why I ended up not going with it. I have some co-workers that I may recommend it to as well (once they have learned enough R). I am just very wary of recommending it to the general public, or recommending that statisticians use it with clients, without the additional warnings of probable consequences.
For me personally, I prefer typing to clicking, so many of the arguments for it don't sway me (but they may be fine for others who prefer clicking to typing).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow@intermountainmail.org
(801) 408-8111
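A small R-side illustration of the blanks-versus-zeros point raised in the exchange above; in R the distinction is explicit rather than guessed:

    x <- c(1, 2, NA, 4)    # a blank spreadsheet cell arrives in R as NA
    y <- c(1, 2, 0, 4)     # a zero is a real observation
    var(x)                 # NA: the missing value propagates
    var(x, na.rm = TRUE)   # variance of the three observed values
    var(y)                 # treats the zero as data, giving a different answer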
Flame wars are usually vituperative, often entertaining, and occasionally productive.

Excel is good for accounts and for taking notes, sometimes for back-of-the-envelope calculations. It is not so suitable for statistics, and its formulae can be incomprehensible when you try to understand what they mean later (what on earth might C2*B$4 once have meant?). R is great for statistics, but frustratingly linear (though Philippe Grosjean sees linearity as an advantage; people work in different ways). Having previously run scripts and being able to rerun them is only useful if you can remember why they were written that way in the first place, so that you can adjust them for different situations.

As with all software, it is an advantage to know what you are doing. Familiarity with the software, whatever you use, helps.

Erich Neuwirth recommended pivot tables, and they can be quite effective. I prefer the graphical alternative of interactive mosaic plots, as in iPlots. Both are needed. Erich's more important point is that you need to speak the language of the people you cooperate with, and often that language includes Excel.

Antony Unwin
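For those without iPlots installed, base R's mosaicplot() gives a static version of the kind of display Antony describes; a minimal sketch on a built-in table:

    ## Hair colour by eye colour by sex, from the datasets package.
    mosaicplot(HairEyeColor, main = "Hair and eye colour by sex", color = TRUE)

    ## The same table flattened for reading:
    ftable(HairEyeColor)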
The quickest solution is to additionally install the package rcom from CRAN. A more detailed account can be found on our wiki at
http://rcom.univie.ac.at
especially on
http://learnserver.csd.univie.ac.at/rcomwiki/doku.php?id=version_information_and_links

Greg Snow wrote:
> Erich,
>
> I just downloaded the latest version of the R dcom server etc. from CRAN
> (version 2.5; I had looked at the previous version, but wanted to try
> the latest version (I will be responding to your e-mail, just wanted to
> test some things rather than speak from ignorance (not that I always
> take that approach))).
>
> When installing, and further when running Excel after the install, I get
> an error and a Visual Basic debugger pops up. The error window says:
> "Compile error: Can't find project or library". The code window is
> titled "Rexcel.xla - MenuTools (Code)" and the line "Function
> ExcelVersionMajor() As Integer" is highlighted in yellow; a couple of
> lines below that the word "Left" is highlighted/selected in the line
> "ExcelVersionMajor = Cint(Left(Application.Version, pos - 1))"
>
> I have Excel 2000.

--
Erich Neuwirth, University of Vienna
Faculty of Computer Science
Computer Supported Didactics Working Group
Visit our SunSITE at http://sunsite.univie.ac.at
Phone: +43-1-4277-39464 Fax: +43-1-4277-39459
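The "quickest solution" above amounts to the usual package-installation step, assuming rcom is available on CRAN for your platform and R version:

    install.packages("rcom")   # install the (D)COM bridge package from CRAN
    library(rcom)              # load it in R to check that the installation worked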