Hello

Apologies if this is a simple question; I have searched the help and have not managed to work out a solution.

Does anybody know an efficient method for reading many text files of the same format into one table/data frame?

I have around 90 files that contain continuous data over 3 months, split into one file per day, and I need the whole 3 months in one object for analysis. Each day's file contains a large amount of data (approx 30MB), so I need a memory-efficient method to merge all of the files into one data frame. From what I have read, I will probably want to avoid using for loops? All files are in the same directory, none has a header row, and each contains around 180,000 rows and the same 25 columns/variables. Any suggested packages/routines would be very useful.

Thanks

Jennifer
I'd first try plyr and see if it's efficient enough:

library(plyr)

## all .txt files in the working directory
listOfFiles <- list.files(pattern = ".txt")

## read each file and stack the results into a single data frame
d <- ldply(listOfFiles, read.table)
str(d)

Alternatively, in base R:

d <- do.call(rbind, lapply(listOfFiles, read.table))

HTH,

baptiste

_____________________________

Baptiste Auguié
School of Physics
University of Exeter
Stocker Road, Exeter, Devon, EX4 4QL, UK

Phone: +44 1392 264187
http://newton.ex.ac.uk/research/emag
What types of data are in each file? All numbers, or a mix of numbers and characters? Any missing data or special NA values?

--
Mike Lawrence
Graduate Student
Department of Psychology
Dalhousie University

Looking to arrange a meeting? Check my public calendar:
http://tr.im/mikes_public_calendar

~ Certainty is folly... I think. ~
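(If the answer is "all numbers", scan() into a matrix can be considerably cheaper than read.table(), since it skips type inference and data-frame overhead. A minimal sketch under that assumption, with 25 whitespace-delimited numeric columns; readDay is a made-up helper name:

## read one day's file as a numeric matrix
readDay <- function(f) {
  matrix(scan(f, what = numeric(), quiet = TRUE), ncol = 25, byrow = TRUE)
}

files <- list.files(pattern = "\\.txt$")
m <- do.call(rbind, lapply(files, readDay))  # one matrix of ~90 x 180,000 rows

)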
Can you provide reproducible code please? Even a fake example would help.

I would:

1) set up a loop to read in each file from a directory;

2) inside the loop, chop up/aggregate the data, one file at a time, and write each new aggregated file out to a directory using write.table(). This reduces the memory needed by keeping only the information you want. Make sure each file is a data frame with the same names;

3) set up a new loop to read in each new small file and rbind() them all together to make your new "master file". (A sketch of both passes follows below.)

The R gurus may have a more parsimonious solution.

HTH

Simon.
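(A minimal sketch of this two-pass approach, assuming whitespace-delimited files in a directory called "raw"; the aggregate() call is a placeholder, so substitute whatever reduction actually applies:

in_files <- list.files("raw", pattern = "\\.txt$", full.names = TRUE)
dir.create("small", showWarnings = FALSE)

## pass 1: reduce each day's file and write the small version out
for (f in in_files) {
  day <- read.table(f)  # headerless files come in as columns V1..V25
  agg <- aggregate(day$V3, by = list(day$V1), FUN = mean)  # placeholder reduction
  names(agg) <- c("key", "mean_V3")
  write.table(agg, file.path("small", basename(f)), row.names = FALSE)
}

## pass 2: stack the small files into one master data frame
out_files <- list.files("small", full.names = TRUE)
master <- do.call(rbind, lapply(out_files, read.table, header = TRUE))

)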
A few points to consider:

- If all the data are numeric, then use matrices instead of data frames.

- With either data frames or matrices, there is no way (that I'm aware of, anyway) in R to stack them without making at least one copy in memory.

- Since none of the files has a header row, I would concatenate them into one file outside R (e.g., on *nix, cat *.txt > all.dat, writing to a name the glob won't pick up) and then read that in. You can also try it inside R with something like read.table(pipe()). Either way, you will want to use the colClasses argument of read.table() to specify the column types, to ensure that read.table() only goes through the input once.

- You're probably better off getting the data into a database (even something like SQLite) and using an R interface to that database.

- 30MB x 90 = 2.7GB. Unless you're on a 64-bit machine with lots of RAM, you're not likely to have much fun with the data even when you manage to get it into R in one piece.

Andy
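(A sketch of the pipe-based single read, assuming a *nix system, whitespace-delimited files, and 25 numeric columns; adjust colClasses to the real types:

## declare the types up front so read.table makes a single pass over the
## input; supplying nrows (an upper bound is fine) also helps memory use
big <- read.table(pipe("cat *.txt"),
                  header = FALSE,
                  colClasses = rep("numeric", 25),
                  nrows = 90 * 180000)

)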
Brad Patrick Schneid
2010-Jan-24 19:34 UTC
[R] read multiple large files into one dataframe
### The following is very helpful #########

listOfFiles <- list.files(pattern = ".txt")
d <- do.call(rbind, lapply(listOfFiles, read.table))

###############################

but what if each file contains information corresponding to a different subject and I need to be able to tell where each row came from? i.e.: I need a new row that repeats the original filename for each observation of the former respective files.

Any ideas?
Brad Patrick Schneid wrote:
> ### The following is very helpful #########
> listOfFiles <- list.files(pattern = ".txt")
> d <- do.call(rbind, lapply(listOfFiles, read.table))
> ###############################
>
> but what if each file contains information corresponding to a different
> subject and I need to be able to tell where each row came from? i.e.: I
> need a new row

A new column, I presume, not a row.

> that repeats the original filename for each observation of
> the former respective files.
>
> Any ideas?

You replace read.table with a custom function:

listOfFiles <- list.files(pattern = ".txt")
d <- do.call(rbind, lapply(listOfFiles, function(fname) {
    ## read one file, then record its name alongside the data
    dum <- read.table(fname)
    dum$which_file <- fname
    return(dum)
}))

Now d has an additional column identifying the file each row originally came from.

cheers,
Paul

--
Drs. Paul Hiemstra
Department of Physical Geography
Faculty of Geosciences
University of Utrecht
Heidelberglaan 2, P.O. Box 80.115
3508 TC Utrecht
Phone: +3130 274 3113 Mon-Tue
Phone: +3130 253 5773 Wed-Fri
http://intamap.geo.uu.nl/~paul
On Mon, Jan 25, 2010 at 4:43 AM, Paul Hiemstra <p.hiemstra at geo.uu.nl> wrote:
> listOfFiles <- list.files(pattern = ".txt")
> d <- do.call(rbind, lapply(listOfFiles, read.table))

Or use the plyr package:

listOfFiles <- list.files(pattern = ".txt")
names(listOfFiles) <- basename(listOfFiles)
d <- ldply(listOfFiles, read.table)

Because the list is named, ldply carries each file's name through into an identifier column of d.

See http://had.co.nz/plyr for more info.

Hadley

--
http://had.co.nz/
Brad Patrick Schneid
2010-Jan-26 05:51 UTC
[R] read multiple large files into one dataframe
That's it, Hadley!!! Thank you.