Dear all,

We need to perform a statistical analysis of a large database (40,000 entries with approximately 500 fields in each entry) currently held in Oracle. The data contain categorical variables only. At the current stage we propose to use classification and clustering analysis. We are planning to perform the analysis in R and would be very grateful for any recommendations, suggestions, or references regarding the packages and tools appropriate for this task.

Thank you in advance for your attention,
Vicky Landsman
Hi,

for your analysis, use the package ROracle (Oracle database interface for R):
http://microarrays.unife.it/CRAN/src/contrib/Descriptions/ROracle.html

See also:

Diego Kuonen, Introduction au data mining avec R : vers la reconquête du `knowledge discovery in databases' par les statisticiens. Bulletin of the Swiss Statistical Society, 40:3-7, 2001.
http://www.statoo.com/en/publications/2001.R.SSS.40/

Diego Kuonen and Reinhard Furrer, Data mining avec R dans un monde libre. Flash Informatique Spécial été, pages 45-50, September 2001.
http://sawww.epfl.ch/SIC/SA/publications/FI01/fi-sp-1/sp-1-page45.html

R Development Core Team, R Data Import/Export, version 1.9.0, April 2004, pp. 11-18.
http://cran.r-project.org/doc/manuals/R-data.pdf

Brian D. Ripley, Datamining: Large Databases and Methods, in Proceedings of "useR! 2004 - The R User Conference", May 2004.
http://www.ci.tuwien.ac.at/Conferences/useR-2004/Keynotes/Ripley.pdf

Brian D. Ripley, Using Databases with R, R News, January 2001, pp. 18-20.
http://cran.r-project.org/doc/Rnews/Rnews_2001-1.pdf

B. D. Ripley, R. M. Ripley, Applications of R Clients and Servers, in Proceedings of the Distributed Statistical Computing 2001 Workshop, 2001, Vienna University of Technology.
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/Ripley.pdf

Torsten Hothorn, David A. James, Brian D. Ripley, R/S Interfaces to Databases, in Proceedings of the Distributed Statistical Computing 2001 Workshop, 2001, Vienna University of Technology.
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/HothornJamesRipley.pdf

Luís Torgo, Data Mining with R. Learning by case studies, May 2003.
http://www.liacc.up.pt/~ltorgo/DataMiningWithR/

I hope this is of some help.

Best,
Vito
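As a rough illustration of the kind of analysis asked about, the following is a minimal sketch only, not a recommendation made by anyone in this thread. It uses simulated categorical data in place of the Oracle table; the column names (f1 to f4, outcome), the group structure, and the commented-out connection call are all assumptions for illustration. It clusters the rows with daisy() and pam() from the cluster package and fits a classification tree with rpart(), both of which accept factors directly.

```r
library(cluster)  # daisy(), pam(); recommended package shipped with R
library(rpart)    # rpart(); recommended package shipped with R

set.seed(1)

# Hypothetical stand-in for the Oracle table. With ROracle the real data
# would presumably be fetched along these lines (connection details and
# table name are NOT given in the thread, so this stays commented out):
#   con <- DBI::dbConnect(ROracle::Oracle(), username = "...", password = "...")
#   dat <- DBI::dbGetQuery(con, "SELECT ... FROM ...")
n <- 300
group <- sample(c("a", "b"), n, replace = TRUE)

# Draw a categorical field whose level frequencies depend on the group.
make_field <- function(levels_a, levels_b) {
  ifelse(group == "a",
         sample(levels_a, n, replace = TRUE),
         sample(levels_b, n, replace = TRUE))
}

dat <- data.frame(
  f1 = factor(make_field(c("x", "x", "y"), c("y", "z", "z"))),
  f2 = factor(make_field(c("low", "low", "mid"), c("mid", "high", "high"))),
  f3 = factor(make_field(c("yes", "yes", "no"), c("no", "no", "yes"))),
  f4 = factor(sample(c("p", "q", "r"), n, replace = TRUE)),  # pure noise
  outcome = factor(group)
)

# Clustering: Gower dissimilarity on factors, then partitioning
# around medoids on the resulting dissimilarity object.
predictors <- dat[, c("f1", "f2", "f3", "f4")]
d <- daisy(predictors, metric = "gower")
fit_pam <- pam(d, k = 2, diss = TRUE)
print(table(cluster = fit_pam$clustering, group = dat$outcome))

# Classification: a tree on the same categorical predictors.
fit_tree <- rpart(outcome ~ f1 + f2 + f3 + f4, data = dat, method = "class")
pred <- predict(fit_tree, type = "class")
print(mean(pred == dat$outcome))  # resubstitution accuracy, optimistic
```

One design point: a full dissimilarity matrix for 40,000 rows has roughly 8e8 pairwise entries (about 6.4 GB at 8 bytes each), so with data of the size described it would probably be necessary to cluster a random sample of rows rather than the whole table.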
I thought that maybe authors of books on R should be allowed (encouraged?) to announce the availability and revisions of their books via the R-packages list. For example, I'd be very interested to have another look at Dr. Torgo's book when it becomes more complete, and I'd appreciate a revision notice via the list. Just a suggestion.

Thanks,
Vadim

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch On Behalf Of Luis Torgo
> Sent: Wednesday, October 13, 2004 12:03 PM
> To: Prof Brian Ripley
> Cc: Vito Ricci; r-help at stat.math.ethz.ch
> Subject: Re: [R] Statistical analysis of a large database
>
> On Tue, 2004-10-12 at 08:36, Prof Brian Ripley wrote:
> > > Luís Torgo, Data Mining with R. Learning by case studies, May 2003
> > > http://www.liacc.up.pt/~ltorgo/DataMiningWithR/
> >
> > Please note that that reference is not about large datasets, nor about
> > `data mining' in the generally used sense. It has two studies, one
> > incomplete, on linear regression (with 200 samples) and on time series.
>
> I would like to add some information to these incomplete comments on the
> book I'm writing. The book is unfinished, as mentioned on its Web page. It
> currently has two reasonably finished chapters: an introduction to R and
> MySQL, and a case study. As mentioned in the book, the first case study is
> small by data mining standards (200 observations) and has the goal of
> illustrating techniques that are shared by data mining and other
> disciplines, as well as smoothly introducing the reader to R and its power.
> It addresses data pre-processing techniques, data visualization, model
> construction (yes, linear regression, but also regression trees), and model
> evaluation, selection and combination, so I think it is somewhat inaccurate
> to say that it is about linear regression, which accounts for 5 of the 50
> pages of that chapter.
>
> The third (unfinished) chapter (the 2nd case study) is about financial
> trading. It includes topics like connections to databases as well as many
> other components of a knowledge discovery process. Among those components
> is model construction, which obviously involves time series models given
> the nature of the data. The chapter will include other steps, such as
> moving from predictions to actions, creating variables from the original
> time series, etc. It is currently being rewritten and I expect to upload a
> revised version of this chapter soon.
>
> The book will include at least two further case studies that will be
> larger. Still, I would note that the financial trading case study is
> potentially very large, as it is a problem where data is constantly
> growing. The final version of that chapter addresses the issue of having a
> system that is online, in the sense that it receives new data in real time
> (also known as mining data streams in the data mining field).
>
> I'm sorry for writing at such length, but I think it is dangerous to try to
> summarize around 200 pages of an unfinished work in two lines of text.
>
> Still, all comments on this ongoing project are very welcome, and I would
> like to take this opportunity to thank all the people who have been sending
> me encouraging comments and emails.
>
> Luis Torgo
>
> --
> Luis Torgo
> FEP/LIACC, University of Porto    Phone : (+351) 22 607 88 30
> Machine Learning Group            Fax   : (+351) 22 600 36 54
> R. Campo Alegre, 823              email : ltorgo at liacc.up.pt
> 4150 PORTO - PORTUGAL             WWW   : http://www.liacc.up.pt/~ltorgo