With those kinds of numbers, I would think a database would be appropriate
(instead of spreadsheets).
You can begin to assess how R performs on 90,000 observations with
experiments like this:
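# Build 30 columns of 90,000 random letters each
mydat <- list()
for (i in 1:30) mydat[[i]] <- sample(letters, size=90000, replace=TRUE)
mydat2 <- as.data.frame(mydat, stringsAsFactors=FALSE)
dim(mydat2)
# [1] 90000    30
# Tabulate every column to get a feel for the speed
lapply(mydat2, table)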
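
To put rough numbers on that experiment, something like the following
sketch could be used. It rebuilds the same simulated data, then reports
the approximate memory footprint with object.size() and the elapsed time
of the tabulation with system.time(). The seed and the column names
v1..v30 are assumptions added only for illustration, and the simulated
letters are a stand-in for the real cohort data, so the results are only
a loose indication.

```r
# Simulated stand-in for the real data: 30 character columns, 90,000 rows
set.seed(1)  # arbitrary seed, only so the run is reproducible
mydat <- list()
for (i in 1:30) mydat[[i]] <- sample(letters, size = 90000, replace = TRUE)
names(mydat) <- paste0("v", 1:30)  # hypothetical column names
mydat2 <- as.data.frame(mydat, stringsAsFactors = FALSE)

# Approximate memory used by the data frame
print(object.size(mydat2), units = "Mb")

# Elapsed time needed to tabulate every column
timing <- system.time(tabs <- lapply(mydat2, table))
print(timing)
```

If both figures are small compared with the RAM and patience available,
that suggests the whole data frame can be held in memory for the
analysis.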
-Don
--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
On 3/7/14 7:46 AM, "Marco Barb?ra" <jabbba at gmail.com> wrote:
>Dear UseRs,
>
>I am going to be involved in the analysis of a cohort of about 90,000
>people. I don't have the data at hand yet, but I know that right now
>they are archived into spreadsheet files. So far I only analysed data
>sets of very small size. I probably will be able to work on a
>relatively fast pc, an i7 with 8 or (i hope) 16 GB RAM. I don't know
>the number of variables but I think I shouldn't have the need to use
>other than "standard" R (i.e. holding the entire data frame in RAM)
>even if I probably will have to use some non-parametric tools which
>should be a bit more computer-intensive.
>
>Still, since I have no previous experience, it'd be of great help if
>someone could give me some advice on which ways could be most
>convenient to work in, both from the point of view of databases and of
>data access, or otherwise if there is simply no reason for me to bother
>at all.
>
>I'm not asking for prepackaged solutions, rather for help in
>documentation seeking and links to useful documentation or other
>threads (for example: is it worthwhile using parallel computing?)
>
>Thank you to anyone for reading this email.
>Marco Barb?ra.
>
>P.S.: I work on a Debian system, but this shouldn't matter.
>