A common process when data is obtained in an Excel spreadsheet is to save the spreadsheet as a .csv file and then read it into R. Experienced users might have learned to be wary of dates (as I have) but possibly have not experienced what just happened to me. I thought I might just share it with r-help as a cautionary tale.

I received an Excel file giving patient details. Each patient had an ID code in the form of three letters followed by four digits. (Actually a New Zealand National Health Identification.) I saved the .xls file as .csv. Then I opened up the .csv (with Excel) to look at it. In the column of ID codes I saw: Aug-99. Clicking on that entry, it showed 1/08/2699.

In a column of character data, Excel had interpreted AUG2699 as a date.

The .csv did not actually have a date in that cell, but if I had saved the .csv file it would have.

David Scott

_________________________________________________________________
David Scott     Department of Statistics, Tamaki Campus
                The University of Auckland, PB 92019
                Auckland 1142, NEW ZEALAND
Phone: +64 9 373 7599 ext 86830    Fax: +64 9 373 7000
Email: d.scott at auckland.ac.nz

Graduate Officer, Department of Statistics
Director of Consulting, Department of Statistics
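A minimal sketch of how to guard against this on the R side when the .csv is eventually read in (the file name patients.csv and the column names NHI and age below are placeholders, not from the original post):

    ## Read everything as character so nothing is guessed into a date or number,
    ## then convert the genuinely numeric columns explicitly.
    pat <- read.csv("patients.csv", colClasses = "character")
    pat$age <- as.numeric(pat$age)   # 'age' is a hypothetical numeric column

    ## Sanity check: every ID should be three letters followed by four digits.
    pat$NHI[!grepl("^[A-Za-z]{3}[0-9]{4}$", pat$NHI)]

Of course this does not undo damage already baked into the .csv by Excel; it only stops R from adding a second round of silent conversion.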
This is very consistent with Microsoft's philosophy: they know better than you what you want to do.

--- David Scott <d.scott at auckland.ac.nz> wrote:
> In a column of character data, Excel had interpreted AUG2699 as a date.
> The .csv did not actually have a date in that cell, but if I had saved
> the .csv file it would have.
If you format the column as "Text", you won't have this problem. By leaving the cells as "General", you leave it up to Excel to guess at the correct interpretation. You will note that the conversion to a date occurs immediately in Excel when you enter the value; Excel recognizes many formats for entering dates. Either pre-format the column as Text, or prefix the individual entry with an ' to indicate text.

A similar problem occurs in R's read.table() function when a factor has levels that can be interpreted as numbers.

At 10:11 PM 8/27/2007, David wrote:
> In the column of ID codes I saw: Aug-99. Clicking on that entry, it
> showed 1/08/2699. In a column of character data, Excel had interpreted
> AUG2699 as a date.

===============================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS    e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.              URL: http://lcfltd.com/
824 Timberlake Drive                       Tel: 757-467-0954
Virginia Beach, VA 23464-3239              Fax: 757-467-2947

"Vere scire est per causas scire"
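A short, self-contained illustration of that read.table()/factor gotcha (the numbers are arbitrary; before R 4.0, read.table() produced such factors automatically whenever a numeric-looking column contained a stray non-numeric entry):

    x <- factor(c(10, 20, 40, 40))   # a factor whose levels look like numbers
    as.numeric(x)                    # 1 2 3 3 -- the internal level codes, not the data
    as.numeric(as.character(x))      # 10 20 40 40 -- the values that were meant
    as.numeric(levels(x))[x]         # same result; the idiom recommended in ?factor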
On a related note, there's one other amazingly stupid thing that Excel (2002 SP3) does: it exports numbers to CSV as they are displayed, not as they were entered or imported in the first place. For example, 1.2345678 will be exported to CSV/tab-delimited as 1.23 if that column is formatted to show 2 decimals. Whoever doesn't pay attention gets what s/he deserves for trusting Excel in the first place.

> From: David Scott
> Subject: [R] Excel
>
> In a column of character data, Excel had interpreted AUG2699 as a date.
> The .csv did not actually have a date in that cell, but if I had saved
> the .csv file it would have.
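A tiny sketch of the precision loss being described (the value and the output file name are arbitrary placeholders):

    x <- 1.2345678
    formatC(x, format = "f", digits = 2)   # "1.23" -- what a 2-decimal display exports
    as.numeric("1.23") == x                # FALSE: the lost digits cannot be recovered
    ## Writing the .csv from R instead keeps up to 15 significant digits:
    write.csv(data.frame(x = x), "full_precision.csv", row.names = FALSE)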
Greg Snow wrote:
> Or do you trust all of your clients to know to use R(D)COM as well as
> how to install and use it?

Do you trust your clients to be fluent enough in R to use it? For most of my clients, that is not true. For this kind of user, the following strategy works. They have their data in Excel, and I write some macros in Excel which use RExcel to compute results and put them in the sheet. The clients press buttons, much as they would with the Analysis ToolPak in Excel, but they get the full power of R.

>> The formula
>> =RApply("var",A1:A1000) in an Excel cell for example will use
>> R to compute the variance of the data in column A in Excel.
>> If you change any of the values in the range A1:A1000 it will
>> automatically recompute the variance.
>
> And what happens when you enter a value in cell A1001?

If you use the formula

    =RApply("var",DownFrom(A1))

where DownFrom is the following function defined in VBA

    Function DownFrom(startcell As Range) As Range
        Application.Volatile
        Set DownFrom = Range(startcell, startcell.End(xlDown))
    End Function

then writing a value in A1001 will extend the range to which the function "var" is applied. Arguments of worksheet functions can be dynamic ranges, but sadly this is a relatively unknown fact.

> And what happens if you set cell B1 to =A1 and do the magic copy so that
> b2=a2, b3=a3, ... Then put
> =Rapply("var",B1:B1000) in a cell, does the answer match with
> =Rapply("var",A1:A1000)?

Of course it does; what else would you expect to happen? And if you copy the formula in the cell containing =Rapply("var",A1:A1000) into the cell to its right, that cell will automatically contain the formula =Rapply("var",B1:B1000).

> Yes, the auto-recompute could be considered a nice feature, but does it
> really save that much work compared to running a script in R after
> updating the data? (a couple of clicks in Rgui, a couple of keystrokes
> in ESS, one line of code in an R terminal)

Auto-recomputing can do more. My favorite demo is the display of a kernel density estimator. You move a slider controlling the window width. The slider controls the value in a cell. When the cell contents change, R recomputes the kernel density estimate in some Excel cells and Excel then updates the graph. So you have an animated display in Excel which allows you to instantly see the changes in the graph. This is user-controlled animation. It can be done using Tcl/Tk also, for example, but it is much more work.

>> There is one feature in Excel which is extremely convenient,
>> pivot tables. Anybody doing any work as a statistical
>> consultant really ought to know about pivot tables, and I am
>> still surprised how many statisticians do not know about it.
>> Neither Gnumeric nor OpenOffice Calc offers comparably
>> convenient ways of working with multidimensional tables.
>
> I will admit that I have not learned the power of the pivot table, but
> the main reason for that is every time someone demonstrates the power of
> the pivot table to me it is by creating a table, then showing that if
> that is not what you want you can click here, here, here, then here and
> it is a different table, then you click here and here, ... At which
> point I am completely lost as to which clicks did which.
> There are tools in R (the reshape package has been mentioned) that give
> you the power to create the tables, but with a history rather than
> having to remember which clicks are needed. My experience with pivot
> tables so far is that they are a post-hoc kludge to a poorly designed
> original table. I prefer working with a script where, if the result is
> not what I want, I can fix the source rather than admire how pretty the
> band-aid is. (Maybe there are ways to script a pivot table without the
> clicking, but everyone who has tried to market them to me has been more
> impressed with the band-aid than with preventing the original injury.)
>
> What happens when you make a nice pivot table for a client, then they
> come back 2 months later with a new spreadsheet and want a similar
> table for this data? I don't think I could remember the set of clicks
> needed, but with an R script I can pull up the set of commands I used
> before, change the variable and table names, and have the results in a
> couple of minutes.

Once you are used to the mechanics of creating pivot tables, it will be extremely easy to create a similar table just by having a quick look at the original table.

Of course it is possible to design terrible spreadsheets. But it is also quite possible to write terrible R code. What makes spreadsheets very special is autorecalculation and autoupdating of charts, but also the point-and-click method of creating formulas (using relative and absolute references). And of course, knowing only one tool is always dangerous, but this is also true when the single known tool is R.

--
Erich Neuwirth, University of Vienna
Faculty of Computer Science
Computer Supported Didactics Working Group
Visit our SunSITE at http://sunsite.univie.ac.at
Phone: +43-1-4277-39464 Fax: +43-1-4277-39459
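For readers who want to see the scripted alternative referred to in the quoted text, here is a minimal sketch in base R on a built-in data set; the reshape package mentioned above offers a more general melt/cast approach, but the base functions already cover simple pivot-style summaries:

    ## Mean breaks per combination of wool and tension -- roughly what a
    ## simple pivot table of the warpbreaks data would show.
    with(warpbreaks, tapply(breaks, list(wool, tension), mean))

    ## The same summary in long form, easy to rerun on next month's data:
    aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)

    ## Cell totals:
    xtabs(breaks ~ wool + tension, data = warpbreaks)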
Ah . . . the hammer analogy. In a conversation like this it's not a question of whether somebody will drop it, it's when it will be dropped.

--Matt

Matt Austin
Statistician
Amgen, Inc.

-----Original Message-----
From: Greg Snow
Sent: Thursday, August 30, 2007 12:14 PM
To: Erich Neuwirth; r-help
Subject: Re: [R] Excel

Earlier this week I was doing some work at our house and, since my wife was at the dentist's office, our 3-year-old son was "helping" me. He really wanted to use the hammer, so I showed him where to tap and he was excited to tap (not doing much good, but also not doing any damage). He liked that so much that he started to look for other things he could use the hammer on; some were benign, others made me jump in and stop him before he did major damage.

What does this have to do with Excel and R? Well, I see Excel as being very much like my hammer, and different users as being like the different users of the hammer. There are some carpenters who can use a hammer along with other tools to make things of pure beauty. Similarly, I expect there are people who can use Excel/spreadsheets along with other tools to make useful and beautiful things (I expect that you (Erich) may be one of those). My use of the hammer is far less expert: about 9 times out of 10 I can survive the use without blood and major bruising, and fix or make something that works, though it is far from a work of art. My use of spreadsheets is similar; I do use them for some specific purposes (holding the bus schedule on my PDA, amortization tables), but I know better than to use them where I would cause more harm than good. I fear that many users of Excel and other spreadsheets are much like my toddler: they have a tool and they want to use it, ignorant of whether they will cause more damage than good. This is part of why I put in the reference to Dr. Burns' page; he does not say to get rid of spreadsheets altogether, but he points out that there are several cases where other tools work better (it is possible to embed a screw in a piece of wood using a hammer, but a screwdriver generally works better).

Erich Neuwirth wrote:
> Greg Snow wrote:
>> Or do you trust all of your clients to know to use R(D)COM as well
>> as how to install and use it?
>
> Do you trust your clients to be fluent enough in R to use it?

I admitted that I don't trust some of my clients with their own data (I think my record is 9 different sexes), so I certainly don't expect them to use R. I trust them to bring their questions and data to me, then I use R (or one of my coworkers uses their favorite stats package) to help them. That way I know that the correct things are being done (on the flip side, they trust me to leave patient care up to them and not try surgery myself; it is much better for the patients that way).

> For most of my clients, that is not true.
> For this kind of user, the following strategy works.
> They have their data in Excel, and I write some macros in
> Excel which use RExcel to compute results and put them in the sheet.
> The clients press buttons, much as they would with the Analysis
> ToolPak in Excel, but they get the full power of R.

Who installs R and sets up R (D)COM on all these computers?
If the above clients want something in addition to what you prepared, do you add additional macros for them? Do they know enough R to write their own Rapply statement? Or do they go looking in the Excel help and use the wrong set of tools? Or do you have them trained well enough that they don't ask any additional questions? I have no problem with you using this approach; I just think that if you are going to advocate that others use Excel with clients, you should also let them know of the extra work they may be committing themselves to.

>>> The formula
>>> =RApply("var",A1:A1000) in an Excel cell for example will use R to
>>> compute the variance of the data in column A in Excel.
>>> If you change any of the values in the range A1:A1000 it will
>>> automatically recompute the variance.
>>
>> And what happens when you enter a value in cell A1001?
>
> If you use the formula
>     =RApply("var",DownFrom(A1))
>
> where DownFrom is the following function defined in VBA
>
>     Function DownFrom(startcell As Range) As Range
>         Application.Volatile
>         Set DownFrom = Range(startcell, startcell.End(xlDown))
>     End Function
>
> then writing a value in A1001 will extend the range to
> which the function "var" is applied.

So to answer a known inconsistency in spreadsheets, you are now bringing in a third program/language. At what point is it just simpler to use R directly? (For me it was back a couple of steps.)

> Arguments of worksheet functions can be dynamic ranges, but
> sadly this is a relatively unknown fact.

And this is one of my main complaints about Excel as well as other MS products and MS wannabes. They may have implemented tools that allow you to do the correct thing, but in practice they make doing the wrong thing so much easier. I now encourage my clients to use at least Access rather than Excel for data entry, because it forces them to think about what the data will look like before entering any values (one of the examples on Dr. Burns' spreadsheet-addiction page came from my group; life would have been much easier if the client had not trusted Excel, but used Access or another database program).

>> And what happens if you set cell B1 to =A1 and do the magic copy so
>> that b2=a2, b3=a3, ... Then put
>> =Rapply("var",B1:B1000) in a cell, does the answer match with
>> =Rapply("var",A1:A1000)?
>
> Of course it does; what else would you expect to happen?
> And if you copy the formula in the cell containing
> =Rapply("var",A1:A1000) into the cell to its right, that cell
> will automatically contain the formula =Rapply("var",B1:B1000).

Well, the naïve expectation is that they would match, but the naïve expectation is also that =var(A1:A1000) would match =var(B1:B1000) (and it doesn't). I expected yours to suffer the same fate, since I thought that Excel would interpret the blanks vs. 0's before passing the info to R. If you have done better at this than Excel, then great job (I wanted to try this out myself, but am having problems installing the RExcel tools).

>> Yes, the auto-recompute could be considered a nice feature, but does
>> it really save that much work compared to running a script in R
>> after updating the data? (a couple of clicks in Rgui, a couple of
>> keystrokes in ESS, one line of code in an R terminal)
>
> Auto-recomputing can do more.
> My favorite demo is the display of a kernel density estimator.
> You move a slider controlling the window width. The slider
> controls the value in a cell.
> When the cell contents change,
> R recomputes the kernel density estimate in some Excel cells
> and Excel then updates the graph. So you have an animated
> display in Excel which allows you to instantly see the
> changes in the graph.
> This is user-controlled animation. It can be done using
> Tcl/Tk also, for example, but it is much more work.

I believe that which is more work depends on the person and their experience. Here is some R code to do something similar:

library(tkrplot)

my.data <- c(rnorm(20, 10, 2), rnorm(30, 15, 3))
my.adjust <- 1

tt <- tktoplevel()
# redraw the density plot with the current bandwidth adjustment
img <- tkrplot(tt, function() plot(density(my.data, adjust = my.adjust)))

f <- function(...) {
    my.a <- as.numeric(tclvalue('adjust'))
    if (my.a != my.adjust) {
        my.adjust <<- my.a
        tkrreplot(img)    # only redraw when the slider value has changed
    }
}

s <- tkscale(tt, command = f, from = 0.05, to = 3.00, variable = 'adjust',
             resolution = 0.05, orient = "horiz")
tkpack(img, s)

The above code is only a small modification of the example for tkrplot. One of these days I'm going to rewrite the Tk demos in the TeachingDemos package to use tkrplot; it makes these things quite easy (maybe my son can use the hammer to help me, maybe not). I do have to say that when I read your description of having R do the computations, but then having Excel create the plot, I shuddered at the thought (and not in a good way).

[snip]

I think there has been enough discussion on the pivot tables, and right now we should probably agree to disagree.

> Of course it is possible to design terrible spreadsheets.
> But it is also quite possible to write terrible R code.

Yes, that is true, but one major difference is that I have never known an R user who saw R as the only available tool. Personally, I recommend that stats students know at least 3 stats packages (and I ask about this when I interview), as well as other computer tools. Too many people that only know Excel think that Excel can solve all problems. A while back I was talking to my boss (a non-statistician) about something and she suggested creating a spreadsheet to track some information; I asked why not use Access instead. Her reply was that she knew how to make a spreadsheet, but did not know Access (or another db), and she asked what the difference was. As I explained some of the differences, she realized that a db was the appropriate way to do what she wanted.

> What makes spreadsheets very special is autorecalculation and
> autoupdating of charts, but also the point-and-click method
> of creating formulas (using relative and absolute references).

I agree with the above statement, but I think that I am thinking of a different definition of 'special' than what you intended.

> And of course, knowing only one tool is always dangerous, but this
> is also true when the single known tool is R.

Here we agree.

Actually, I like the R (D)COM interface idea. A few years ago I considered using it for an introductory class (reasoning that the students would already be familiar with Excel); I don't remember all the reasons why I ended up not going with it. I have some co-workers that I may recommend it to as well (once they have learned enough R). I am just very wary of recommending it to the general public, or recommending that statisticians use it with clients, without the additional warnings of probable consequences.
For me personally, I prefer typing to clicking, so many of the arguments for it don't sway me (but they may be fine for others who prefer clicking to typing).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow@intermountainmail.org
(801) 408-8111
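A small R-side illustration of the blanks-versus-zeros point raised in the exchange above; in R the distinction is explicit rather than guessed:

    x <- c(1, 2, NA, 4)    # a blank spreadsheet cell arrives in R as NA
    y <- c(1, 2, 0, 4)     # a zero is a real observation
    var(x)                 # NA: the missing value propagates
    var(x, na.rm = TRUE)   # variance of the three observed values
    var(y)                 # treats the zero as data, giving a different answer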
Flame wars are usually vituperative, often entertaining, and occasionally productive.

Excel is good for accounts and for taking notes, sometimes for back-of-the-envelope calculations. It is not so suitable for statistics, and its formulae can be incomprehensible when you try to understand what they mean later (what on earth might C2*B$4 once have meant?). R is great for statistics, but frustratingly linear (though Philippe Grosjean sees linearity as an advantage; people work in different ways). Having previously run scripts and being able to rerun them is only useful if you can remember why they were written that way in the first place, so that you can adjust them for different situations.

As with all software, it is an advantage to know what you are doing. Familiarity with the software, whatever you use, helps.

Erich Neuwirth recommended pivot tables, and they can be quite effective. I prefer the graphical alternative of interactive mosaic plots, as in iPlots. Both are needed. Erich's more important point is that you need to speak the language of the people you cooperate with, and often that language includes Excel.

Antony Unwin
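For those without iPlots installed, base R's mosaicplot() gives a static version of the kind of display Antony describes; a minimal sketch on a built-in table:

    ## Hair colour by eye colour by sex, from the datasets package.
    mosaicplot(HairEyeColor, main = "Hair and eye colour by sex", color = TRUE)

    ## The same table flattened for reading:
    ftable(HairEyeColor)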
The quickest solution is to additionally install the package rcom from CRAN. A more detailed account can be found on our wiki at
http://rcom.univie.ac.at
especially on
http://learnserver.csd.univie.ac.at/rcomwiki/doku.php?id=version_information_and_links

Greg Snow wrote:
> Erich,
>
> I just downloaded the latest version of the R dcom server etc. from CRAN
> (version 2.5; I had looked at the previous version, but wanted to try
> the latest version (I will be responding to your e-mail, just wanted to
> test some things rather than speak from ignorance (not that I always
> take that approach))).
>
> When installing, and further when running Excel after the install, I get
> an error and a Visual Basic debugger pops up. The error window says:
> "Compile error: Can't find project or library". The code window is
> titled "Rexcel.xla - MenuTools (Code)" and the line "Function
> ExcelVersionMajor() As Integer" is highlighted in yellow; a couple of
> lines below that the word "Left" is highlighted/selected in the line
> "ExcelVersionMajor = Cint(Left(Application.Version, pos - 1))"
>
> I have Excel 2000.

--
Erich Neuwirth, University of Vienna
Faculty of Computer Science
Computer Supported Didactics Working Group
Visit our SunSITE at http://sunsite.univie.ac.at
Phone: +43-1-4277-39464 Fax: +43-1-4277-39459
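The "quickest solution" above amounts to the usual package-installation step, assuming rcom is available on CRAN for your platform and R version:

    install.packages("rcom")   # install the (D)COM bridge package from CRAN
    library(rcom)              # load it in R to check that the installation worked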