thr3ads.net - R help - [R] Read data with different column lengths [Mar 2007]

If this information is useful, please help other people find it:
Share via:

Hofert Marius

2007-Mar-07 21:06 UTC

[R] Read data with different column lengths

Dear r-help users,

I have the following simple problem: Reading data from a file. The  
file is a .txt file exported ("save as...") from Excel (see below for
an example). The Excel file consists of two header rows (first row  
consists of ticker symbols of stocks, the second row consists of  
column explanations ("Date","Px Last"), followed by several
rows of
data. Now forget about the first two rows, I can deal with that (read  
separately, then extract the actual ticker symbols "ADS",  
"ALV", ...). For reading the rest, I tried several things, for
example:
data=read.table(infile,quote="",fill=T,dec=",",skip=2,colClasses=rep(c
("character","numeric"),ntickers))
or
data=matrix(scan(file=infile,what=rep(c 
("character","numeric"),ntickers),dec=",",skip=2),ncol=2*ntickers,byrow=
T)
where "infile" specifies the path to the input file and
"ntickers" is
the number of ticker-columns in the data set, so in the example  
below, ntickers=2.

Both ways of reading the data work perfectly fine if all columns have  
the same length (i.e. the same number of filled rows), so if the data  
is given in a (filled) "rectangular" form. Now, as you can imagine,  
there are days when one stock is traded but not the other... so,  
there might be columns that do not have the same number of filled  
rows (see below, for the stock with ticker symbol "ADS", only 3  
trading days are shown, so this column is shorter than the data  
column for the stock "ALV"). Now, if I export such a structure to  
a .txt file, then all (by default) blank fields will be replaced by  
"\t", i.e. tabs. Both reading procedures as give above have problems  
as they either display that the number of rows/columns do not fit  
together or as they read the table, but some cells are shifted to the  
left (for the example below, the entry "07/02/05	134,7" appears in  
the empty field of the stock "ADS" which is of course not what we
want).
So the simple question is: How do I read such a structure?
Can there be a simple solution? The problem is simply that empty  
cells are replace by "\t" which are then ignored for reading. So how  
do we distinguish between the empty cells that are given between the  
columns and the empty cells that actually "fill" a column to have the
same length as other columns. Of course I could manually put in a  
certain character (e.g. a "*") to fill in the gaps, but the data set  
is simply too large. If it helps, these blank fields only appear in  
the end of each column, not in the middle.

As I work on a Mac (OS X 10.4), it was not possible (at least to me)  
to read the data directly from the Excel file vial the library RODBC  
or read.xls.

Note, that the same problem arises, when I export the Excel file as  
a .csv, then all blank fields are separated by ";" instead of
"\t"
and the reading procedure can also not decide if the field  
corresponds to an empty separating column or actually to a column  
with given entries, but which is simply not as long as another column  
in the file.

Hope, you can help. I would really appreciate it.

Best regards.

Marius

Excel example (I hope it's displayed correctly, the entry in the last  
row should be aligned with the last column):

ADS GY Equity			ALV GY Equity		
Date		Px Last		Date		Px Last
07/02/04	41,395		07/01/31	130,234
07/02/05	42,134		07/02/01	133,353
07/02/06	41,875		07/02/04	133,824
						07/02/05	134,734

jim holtman

2007-Mar-07 23:53 UTC

head link

[R] Read data with different column lengths

I would have thought that a csv file written out by Excel would have looked
like this for your data:

ADS GY Equity,                 ,ALV GY Equity,
Date        ,   Px Last        ,Date          , Px Last
07/02/04    ,"   41,395          ",07/01/31      ," 130,234"
07/02/05    ,"   42,134          ",07/02/01      ," 133,353"
07/02/06    ,"   41,875          ",07/02/04      ," 133,824"
             ,                 , 07/02/05," 134,734"

Here we have commas separating the fields and this will read in the data:
> x <- read.csv("/tempxx.csv", skip=2, header=FALSE, as.is=TRUE)
> str(x)'data.frame':   4 obs. of  4 variables:
 $ V1: chr  "07/02/04    " "07/02/05    " "07/02/06   
" "             "
 $ V2: chr  "   41,395          " "   42,134          "
"   41,875
 " "                 "
 $ V3: chr  "07/01/31      " "07/02/01      " "07/02/04
" " 07/02/05"
 $ V4: chr  " 130,234" " 133,353" " 133,824"
" 134,734"> x             V1                  V2             V3       V4
1  07/02/04        41,395           07/01/31        130,234
2  07/02/05        42,134           07/02/01        133,353
3  07/02/06        41,875           07/02/04        133,824
4                                         07/02/05 
134,734>
You will have to delete the ',' from the numeric fields, The
'dec=","' on
your command would have read '133,353' in as '133.353' which I
don't think
is what you want.  You can delete the commas and convert to numeric with the
following:
> x$V4 <- as.numeric(gsub(',', '', x$V4))
> x             V1                  V2             V3     V4
1  07/02/04        41,395           07/01/31       130234
2  07/02/05        42,134           07/02/01       133353
3  07/02/06        41,875           07/02/04       133824
4                                         07/02/05
134734>
So take a look at the options that you have with writing out the Excel data;
you should be able to get either tab or comma delimited.  If not, you can
try reading in each line as a character string and then using 'substr'
to
split it apart assuming that you have constant width columns.


On 3/7/07, Hofert Marius <m_hofert@web.de> wrote:>
> Dear r-help users,
>
> I have the following simple problem: Reading data from a file. The
> file is a .txt file exported ("save as...") from Excel (see below
for
> an example). The Excel file consists of two header rows (first row
> consists of ticker symbols of stocks, the second row consists of
> column explanations ("Date","Px Last"), followed by
several rows of
> data. Now forget about the first two rows, I can deal with that (read
> separately, then extract the actual ticker symbols "ADS",
> "ALV", ...). For reading the rest, I tried several things, for
example:
>
data=read.table(infile,quote="",fill=T,dec=",",skip=2,colClasses=rep(c
> ("character","numeric"),ntickers))
> or
> data=matrix(scan(file=infile,what=rep(c
>
("character","numeric"),ntickers),dec=",",skip=2),ncol=2*ntickers,byrow>
T)
> where "infile" specifies the path to the input file and
"ntickers" is
> the number of ticker-columns in the data set, so in the example
> below, ntickers=2.
>
> Both ways of reading the data work perfectly fine if all columns have
> the same length (i.e. the same number of filled rows), so if the data
> is given in a (filled) "rectangular" form. Now, as you can
imagine,
> there are days when one stock is traded but not the other... so,
> there might be columns that do not have the same number of filled
> rows (see below, for the stock with ticker symbol "ADS", only 3
> trading days are shown, so this column is shorter than the data
> column for the stock "ALV"). Now, if I export such a structure to
> a .txt file, then all (by default) blank fields will be replaced by
> "\t", i.e. tabs. Both reading procedures as give above have
problems
> as they either display that the number of rows/columns do not fit
> together or as they read the table, but some cells are shifted to the
> left (for the example below, the entry "07/02/05        134,7"
appears in
> the empty field of the stock "ADS" which is of course not what we
want).
> So the simple question is: How do I read such a structure?
> Can there be a simple solution? The problem is simply that empty
> cells are replace by "\t" which are then ignored for reading. So
how
> do we distinguish between the empty cells that are given between the
> columns and the empty cells that actually "fill" a column to have
the
> same length as other columns. Of course I could manually put in a
> certain character (e.g. a "*") to fill in the gaps, but the data
set
> is simply too large. If it helps, these blank fields only appear in
> the end of each column, not in the middle.
>
> As I work on a Mac (OS X 10.4), it was not possible (at least to me)
> to read the data directly from the Excel file vial the library RODBC
> or read.xls.
>
> Note, that the same problem arises, when I export the Excel file as
> a .csv, then all blank fields are separated by ";" instead of
"\t"
> and the reading procedure can also not decide if the field
> corresponds to an empty separating column or actually to a column
> with given entries, but which is simply not as long as another column
> in the file.
>
> Hope, you can help. I would really appreciate it.
>
> Best regards.
>
> Marius
>
> Excel example (I hope it's displayed correctly, the entry in the last
> row should be aligned with the last column):
>
> ADS GY Equity                   ALV GY Equity
> Date            Px Last         Date            Px Last
> 07/02/04        41,395          07/01/31        130,234
> 07/02/05        42,134          07/02/01        133,353
> 07/02/06        41,875          07/02/04        133,824
>                                                07/02/05        134,734
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

	[[alternative HTML version deleted]]

Apparently Analagous Threads

Search for more maybe matching threads

R help - Mar 2007 - Read data with different column lengths

[R] Read data with different column lengths

[R] Read data with different column lengths

Apparently Analagous Threads