Fix Ace
2017-Aug-29 16:22 UTC
[R] help with read.csv() for files with different number of columns
Thank you very much! Looks like I have to know the length of each record ahead of time.

Ace

On Monday, August 28, 2017 12:56 AM, Jim Lemon <drjimlemon at gmail.com> wrote:

> Hi Ace,
> With tabs as separators:
>
>   testdf<-read.table("test.txt",header=FALSE,fill=TRUE,sep="\t",
>    col.names=paste("V",1:19,sep=""),stringsAsFactors=FALSE)
>
> Also note that I got the number of columns wrong the first time.
>
> Jim
>
> On Mon, Aug 28, 2017 at 12:56 PM, Fix Ace <acefix at rocketmail.com> wrote:
>> Hi, Jim,
>>
>> Thank you very much for pointing out the format issue. Here is the
>> original text:
>>
>> I have a text file (test.txt) whose rows have different numbers of columns:
>>
>> 0610007P14Rik%%% Tcf19 Gtf2i
>> 0610010O12Rik%%% Ivns1abp Etv6
>> 1100001G20Rik%%% Nmi
>> 1500015O10Rik%%% Foxi1 Ascl3 Sirt3
>> 1700003E16Rik%%% Ascl2 Ifnar2
>> 1700028J19Rik%%% Musk Nfe2l3
>> 1810011O10Rik%%% Ppp1r13b Bpnt1 Cdkn2c Foxc1 Sox10 Smarca2
>> 1810019D21Rik%%% Asb8
>> 1810037I17Rik%%% Zfp612
>> 1810055G02Rik%%% Nkx2-3 Maged1 Runx1 Ugp2 Elk4 Spdef Tcf19 Isl2 Gtf2i Ctnnbl1 Tcea3 Ank2 Zfp612 Creb3l1 Nupr1 3632451O06Rik Creb3l4 Lass6
>>
>> I would like to read it into R using
>>
>> > test=read.csv("test.txt",sep="\t",header=FALSE)
>>
>> However, when I check the R object "test", I find that all the rows have
>> 5 columns:
>>
>> > test
>>                  V1            V2      V3     V4      V5
>> 1  0610007P14Rik%%%         Tcf19   Gtf2i
>> 2  0610010O12Rik%%%      Ivns1abp    Etv6
>> 3  1100001G20Rik%%%           Nmi
>> 4  1500015O10Rik%%%         Foxi1   Ascl3  Sirt3
>> 5  1700003E16Rik%%%         Ascl2  Ifnar2
>> 6  1700028J19Rik%%%          Musk  Nfe2l3
>> 7  1810011O10Rik%%%      Ppp1r13b   Bpnt1 Cdkn2c   Foxc1
>> 8             Sox10       Smarca2
>> 9  1810019D21Rik%%%          Asb8
>> 10 1810037I17Rik%%%        Zfp612
>> 11 1810055G02Rik%%%        Nkx2-3  Maged1  Runx1    Ugp2
>> 12             Elk4         Spdef   Tcf19   Isl2   Gtf2i
>> 13          Ctnnbl1         Tcea3    Ank2 Zfp612 Creb3l1
>> 14            Nupr1 3632451O06Rik Creb3l4  Lass6
>>
>> Basically it breaks some records into more than one row. For example,
>> record 7 in the original file becomes two rows. It looks like "test"
>> always has 5 columns.
>>
>> How does this happen? How should I fix it so that each record becomes
>> one row in the R object?
>>
>> Please let me know if it is readable now. Thank you very much for your
>> time!
>>
>> Kind regards,
>>
>> Ace
>>
>> On Sunday, August 27, 2017 7:25 PM, Jim Lemon <drjimlemon at gmail.com> wrote:
>>
>>> Hi Ace,
>>> As your example seems to have spaces as separators,
>>>
>>>   testdf<-read.table("test.txt",header=FALSE,fill=TRUE,
>>>    col.names=paste("V",1:14,sep=""),stringsAsFactors=FALSE)
>>>
>>> By specifying the number of columns with "col.names" and using
>>> "fill=TRUE" you can get a data frame with zero length strings where
>>> values are missing in the input file.
>>>
>>> Jim
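A small reproduction may help show why the rows wrap. Per ?read.table, the number of columns is taken from the first five lines of input unless "col.names" is supplied and is longer; with fill=TRUE (the default in read.csv), extra fields on a later line spill onto a new row. The sketch below uses a made-up six-line file (the IDs and the temporary path are illustrative assumptions, not the real test.txt) and compares the default read with the col.names/fill=TRUE approach suggested above.

```r
# Hypothetical miniature of test.txt: tab-separated records of unequal length.
# The widest record (7 fields) is on line 6, after the five lines that
# read.table() inspects when it guesses the number of columns.
path <- tempfile(fileext = ".txt")
writeLines(c("a%%%\tTcf19\tGtf2i",
             "b%%%\tNmi",
             "c%%%\tFoxi1\tAscl3\tSirt3",
             "d%%%\tAsb8",
             "e%%%\tZfp612",
             "f%%%\tPpp1r13b\tBpnt1\tCdkn2c\tFoxc1\tSox10\tSmarca2"),
           path)

# Default read: the column count is guessed from the first five lines,
# so the long sixth record is split over two rows.
wrapped <- read.csv(path, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
dim(wrapped)

# Supplying enough column names keeps every record on a single row;
# missing trailing fields become zero-length strings.
fixed <- read.table(path, header = FALSE, sep = "\t", fill = TRUE,
                    col.names = paste0("V", 1:7), stringsAsFactors = FALSE)
dim(fixed)
```

Comparing dim(wrapped) with dim(fixed) shows the extra row created by the wrapped record; the real file would need 19 names rather than 7.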
Jim Lemon
2017-Aug-29 21:59 UTC
[R] help with read.csv() for files with different number of columns
Hi Ace,
You can just read the file first to find out:

  max_fields<-function(file,sep=" ") {
   # split every line on the separator and return the largest field count
   rlines<-readLines(file)
   return(max(unlist(lapply(strsplit(rlines,sep),length))))
  }
  nmax<-max_fields("test.txt","\t")

Jim

On Wed, Aug 30, 2017 at 2:22 AM, Fix Ace <acefix at rocketmail.com> wrote:
> Thank you very much! Looks like I have to know the length of each record
> ahead of time.
>
> Ace
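The two steps (find the widest record, then read with that many column names) can be combined. The following is only a sketch under stated assumptions: read_ragged() is a hypothetical helper name, the input is a made-up three-line file written to a temporary path, and quote="" and comment.char="" are set on the assumption that the data contain no quoted fields or comments.

```r
# Hypothetical stand-in for test.txt: tab-separated records of unequal length.
path <- tempfile(fileext = ".txt")
writeLines(c("g1%%%\tTcf19\tGtf2i",
             "g2%%%\tNmi",
             "g3%%%\tPpp1r13b\tBpnt1\tCdkn2c\tFoxc1\tSox10\tSmarca2"),
           path)

# read_ragged() is an illustrative name, not a base R function.
read_ragged <- function(path, sep = "\t") {
  # Pass 1: count the fields on every line and keep the maximum.
  nmax <- max(lengths(strsplit(readLines(path), sep, fixed = TRUE)))
  # Pass 2: read with one column name per possible field so that
  # no record is wrapped onto a second row.
  read.table(path, header = FALSE, sep = sep, fill = TRUE,
             quote = "", comment.char = "",
             col.names = paste0("V", seq_len(nmax)),
             stringsAsFactors = FALSE)
}

testdf <- read_ragged(path)
dim(testdf)
```

Reading the file twice costs little for files of this size and removes the need to know the record lengths ahead of time.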
David Winsemius
2017-Aug-29 22:55 UTC
[R] help with read.csv() for files with different number of columns
> On Aug 29, 2017, at 2:59 PM, Jim Lemon <drjimlemon at gmail.com> wrote:
>
> Hi Ace,
> You can just read the file first to find out:
>
> max_fields<-function(file,sep=" ") {
>  rlines<-readLines(file)
>  return(max(unlist(lapply(strsplit(rlines,sep),length))))
> }
> nmax<-max_fields("test.txt","\t")
>
> Jim

Or just:

  table( count.fields( file_name, sep="\t" ) )

Note that count.fields() takes the file name (or a connection) directly, so there is no need for readLines() here. You may need to play with the 'comment.char' and 'quote' arguments to investigate the impact of unmatched single quotes or octothorpes in the raw data. Then you can isolate the aberrant lines with `which` applied to the vector that `count.fields` returns.

-- 
David.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law
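As a rough illustration of the count.fields() approach, the sketch below uses a made-up three-line file in which one field contains an octothorpe (the file contents and the temporary path are assumptions for the example). With the default comment.char="#", everything after the "#" on that line is ignored, so the line appears to have fewer fields than it really does.

```r
# Hypothetical tab-separated file; line 2 contains a '#' inside a field.
path <- tempfile(fileext = ".txt")
writeLines(c("id1\tA\tB",
             "id2\tC#1\tD",
             "id3\tE"),
           path)

# Field counts with the defaults (quote = "\"'", comment.char = "#").
counts_default <- count.fields(path, sep = "\t")

# Field counts with quote and comment handling switched off.
counts_raw <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Distribution of record lengths.
table(counts_raw)

# Lines whose count changes when '#' is no longer treated as a comment.
affected <- which(counts_default != counts_raw)

# Lines that are shorter than the widest record.
short <- which(counts_raw < max(counts_raw))
```

If the two count vectors differ for a real file, the same quote and comment.char settings would presumably also be needed in the read.table() call.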