Hello,

I am trying to read the following Xena dataset into R for data analysis:

https://tcga.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz

I tried to run read.table(gzfile("HumanMethylation450.gz")), but R ended up crashing as a result.

Is there perhaps a way to use read.table with fread in some way to do this?

Many thanks,

Spencer
Unsubscribe

On Sat, 10 Aug 2019 at 20:30, Spencer Brackett <spbrackett20 at saintjosephhs.com> wrote:
> Hello,
>
> I am trying to read the following Xena dataset into R for data analysis:
>
> https://tcga.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz
>
> I tried to run the following read.table(gzfile("HumanMethylation450.gz")),
> but R ended up crashing as a result.
>
> Is there perhaps a way to use read.table with fread in some way to do this?
>
> Many thanks,
>
> Spencer
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Have you tried using readLines in the manner illustrated on the ?gzfile help page?

David.
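As a rough illustration of that suggestion, the sketch below reads only the first few lines of a gzipped text file through a gzfile connection. The temporary file and its contents are made up here so the example runs without downloading anything; they are not the Xena data.

```r
# Build a tiny gzipped, tab-separated stand-in for the real download
# (file name and values are invented for illustration).
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("sample\tS1\tS2",
             "cg0001\t0.10\t0.20",
             "cg0002\t0.30\t0.40"), con)
close(con)

# Read only the first two lines instead of the whole file.
con <- gzfile(tmp, "r")
head_lines <- readLines(con, n = 2)
close(con)

# Splitting the header on tabs shows how many columns to expect.
n_cols <- length(strsplit(head_lines[1], "\t", fixed = TRUE)[[1]])
print(head_lines)
print(n_cols)
```

Looking at a handful of lines first is cheap and reveals the separator and the number of columns before committing to a full import.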
Well, let's see about "rules" ... you posted in HTML when this is a plain-text mailing list, and then you replied only to me when you are supposed to reply to the list (so I'm putting the list address back in my reply).

When I copied your code and attempted a bit of debugging, I got:

> z <- readLines(gzcon(url(“https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz”)), n = 100)
Error: unexpected input in "z <- readLines(gzcon(url(“"

# that was because you had "smart quotes" rather than ASCII quotes:

> z <- readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz')), n = 100)
> z[1:10]
 [1] "sample\tTCGA-E1-5319-01\tTCGA-HT-7693-01\tTCGA-CS-6665-01\tTCGA-S9-A7J2-01\tTCGA-FG-A6J3-01\tTCGA-FG-6688-01\tTCGA-S9-A6TX-01\tTCGA-VM-A8C8-01\tTCGA-74-6577-01\tTCGA-06-AABW-11\tTCGA-06-0125-02\tTCGA-HT-A74L-01\tTCGA-26-A7UX-01\tTCGA-DU-A5TS-01\tTCGA-06-6388-01\tTCGA-DB-A4XA-01\tTCGA-06-A7TL-01\tTCGA-HT-A4DV-01\tTCGA-TQ-A7RP-01\tTCGA-E1-5311-01\tTCGA-28-5213-01\tTCGA-E1-A7YI-01\tTCGA-E1-5305-01\tTCGA-F6-A8O4-01\tTCGA-HT-8113-01\tTCGA-DH-A66G-01\tTCGA-76-4932-01\t

Snipped hundreds of lines. So this seems to indicate that this is a tab-separated file. Don't you have some documentation to refer to?

This seems possibly useful:

> z <- read.table(text = readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz')), n = 100), header = TRUE, sep = "\t")
> str(z)
'data.frame':   99 obs. of  686 variables:
 $ sample         : Factor w/ 99 levels "cg00036732","cg00651829",..: 53 2 60 41 16 13 37 20 70 21 ...
 $ TCGA.E1.5319.01: num  0.4019 0.0215 0.053 0.0453 0.515 ...
 $ TCGA.HT.7693.01: num  0.9364 0.0216 0.0547 0.0819 0.6129 ...
 $ TCGA.CS.6665.01: num  0.0345 0.0164 0.0719 0.0497 0.6648 ...
 $ TCGA.S9.A7J2.01: num  0.0295 0.0168 0.0421 0.0867 0.1657 ...
 $ TCGA.FG.A6J3.01: num  0.0248 0.0161 0.0556 0.0902 0.5042 ...
 $ TCGA.FG.6688.01: num  0.0203 0.0179 0.0321 0.0513 0.1075 ...
 $ TCGA.S9.A6TX.01: num  0.0378 0.0199 0.0623 0.0992 0.7662 ...
 $ TCGA.VM.A8C8.01: num  0.0271 0.0172 0.0466 0.0564 0.3478 ...
 $ TCGA.74.6577.01: num  0.0237 0.0193 0.0196 0.0961 0.1242 ...
 $ TCGA.06.AABW.11: num  0.0323 0.0156 0.0395 0.0708 0.1136 ...
 $ TCGA.06.0125.02: num  0.0238 0.0181 0.039 0.068 0.0796 ...
 $ TCGA.HT.A74L.01: num  0.7409 0.0221 0.0596 0.0765 0.8157 ...

# snipped the output
# there seemed to be 686 columns

--
David.

On 8/10/19 3:07 PM, Spencer Brackett wrote:
> I've tried
>
> z <- readLines(gzcon(url(“https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz”)), n = 100)
>
> which prints out the indicated 10 rows, but I cannot seem to run the
> same code excluding the n = 100 without R stalling and me being forced
> to close the program. All I am trying to do is ensure that the whole
> file is imported into R so that I can proceed with a survival analysis.
>
> Also, what particular rule of the mailing list did I break? I
> apologize in advance, as I thought that code-specific queries like the
> one I asked were acceptable.
>
> Many thanks,
>
> Spencer
>
> On Sat, Aug 10, 2019 at 5:51 PM David Winsemius <dwinsemius at comcast.net> wrote:
> > Have you tried using readLines in the manner illustrated on the
> > ?gzfile help page?
> >
> > David.
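One way to build on the 100-line preview above is to use it to learn the column types and then pass them to read.table for the full read, which the ?read.table help page suggests for large files. The sketch below is only an illustration under assumptions: the temporary file, its sample names, and its values are invented, and the approach has not been timed on the real Xena file.

```r
# Invented stand-in for the gzipped, tab-separated methylation file.
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c("sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01",
             "cg0001\t0.10\t0.20",
             "cg0002\t0.30\tNA",
             "cg0003\t0.50\t0.60"), con)
close(con)

# Step 1: read a short preview to learn the column types.
preview <- read.table(gzfile(tmp), header = TRUE, sep = "\t",
                      nrows = 2, stringsAsFactors = FALSE)
classes <- vapply(preview, function(col) class(col)[1], character(1),
                  USE.NAMES = FALSE)

# Step 2: read everything with the types declared up front, so that
# read.table does not have to guess them for every column.
full <- read.table(gzfile(tmp), header = TRUE, sep = "\t",
                   colClasses = classes, comment.char = "", quote = "",
                   check.names = FALSE)
str(full)
```

A caveat with this design: a preview column that happens to contain only whole numbers or only NA would be typed as integer or logical, and the full read would then fail or coerce, so the preview should be long enough to be representative. Setting check.names = FALSE keeps the hyphens in the sample identifiers instead of converting them to dots.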
Further note: after three minutes of waiting (not a particularly long wait, in my opinion), I get this:

> z <- read.table(text = readLines(gzcon(url('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz'))), header = TRUE, sep = "\t")
> dim(z)
[1] 485577    686

So almost half a million lines of data in a rather wide dataset, for an incompletely described file. I'd say R seems to be "working" properly.

data.table::fread was more informative about the process but achieved basically the same result in 1/6th the time:

??fread
system.time( z <- data.table::fread('https://TCGA.xenahubs.net/download/TCGA.GBMLGG.sampleMap/HumanMethylation450.gz', sep = "\t") )
#-----------
 [100%] Downloaded 597770433 bytes...
   user  system elapsed
 20.682   3.322  29.292
> dim(z)
[1] 485577    686

--
David.

On 8/10/19 5:32 PM, David Winsemius wrote:
> Well, let's see about "rules" ... you posted in HTML when this is a
> plain-text mailing list, and then you replied only to me when you are
> supposed to reply to the list (so I'm putting the list address back in
> my reply).
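The dimensions reported by dim(z) also give a back-of-the-envelope idea of why the full import feels heavy. The sketch below assumes one identifier column plus 685 numeric columns stored as 8-byte doubles, and it ignores the identifier strings and any object overhead, so the figure is a lower bound rather than a measurement.

```r
# Dimensions reported by dim(z) in the thread.
n_rows <- 485577
n_cols <- 686

# Assumption: the first column holds probe IDs and the remaining
# columns are doubles (8 bytes per value); overhead is ignored.
numeric_cells <- n_rows * (n_cols - 1)
bytes <- numeric_cells * 8
gib <- bytes / 1024^3

cat(sprintf("about %.1f GiB for the numeric values alone\n", gib))
```

With read.table(text = readLines(...)), the character vector of raw lines is held in memory alongside the data frame being built, so peak usage is likely to be noticeably higher than this estimate. That may explain a stall on a machine with limited RAM.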