Cecilia Carmo
2009-Apr-19 11:30 UTC
[R] importing spreadsheet data - linera regression - panel data
Hi everyone and thank you for the help you could give me. My data is in a spreadsheet. The 1st column identifies the firm (with the fiscal number), the columns 2 to 11 have the variable value for 11 years. I have many variables (files like this). Each file has about 40.000 firms (rows). I transformed all the files in txt files. The data is a panel data, like this: firm revenu2007 revenue2006 revenue2005 revenue2004 revenue2003 revenue2002 revenue2001 revenue2000 revenue1999 revenue1998 500100144 504394029 4282809 3769159 3520807 3548322 3458122 503264032 502011475 2595780 2417433 2299563 2060552 1804531 1821638 1789533 1463371 947712 500400911 504615947 22801 28656 27067 26182 26356 34060 39147 502616695 1412354 1209619 1429755 1623496 1955123 2273486 2087406 2076868 2036937 500829993 1383396 1095570 805830 793809 777591 791614 779924 774560 702845 The objective of my work is to do linear regressions with the variables in the files or with other variables that I can obtain from those by doing some mathematical operations. I?ve already tried to import this information to an array in R, but I?ve seen that the linear regressions functions need the data into a dataframe. So I?m asking: How to import this information to a dataframe in R, in a manner that I can easily do the mathematical operations between the variables and then use it to do the regressions? Cec?lia Carmo
Stefan Grosse
2009-Apr-19 14:34 UTC
[R] importing spreadsheet data - linera regression - panel data
Cecilia Carmo schrieb:> The objective of my work is to do linear regressions with the > variables in the files or with other variables that I can obtain from > those by doing some mathematical operations. I?ve already tried to > import this information to an array in R, but I?ve seen that the > linear regressions functions need the data into a dataframe. So I?m > asking: How to import this information to a dataframe in R, in a > manner that I can easily do the mathematical operations between the > variables and then use it to do the regressions? >Your question is probably misspecified. Or the pure import is not your only problem. For the import of the txt you should easily find read.csv your candidate. (please use search on the r-project.org page which leads easily to this solution.) firms<-read.csv("myfilename.txt",header=T,sep="\t") see ?read.csv for help. Then your data are maybe in the wrong order. Instead of Firm Year1 Year2 A 100 101 it is sometimes better to Firm Year Rev A 1 100 A 1 101 the transformation is called reshape. See ?reshape and example(reshape) Then you can do your "mathematical operations". See eg. ?lm example(lm) hth Stefan
Millo Giovanni
2009-Apr-20 15:31 UTC
[R] importing spreadsheet data - linera regression - panel data
Dear Cecilia, just adding some examples to Stefan's post, which says everything already. I've recently gone mad with reshaping, so I assume it is a little tricky. Or maybe what I tell you is obvious, then just skip it. **import** Your files are spreadsheets, so the best way to import is to save them in a .csv or maybe in a .txt file (File>Save as>[choose 'Tab delimited' format]) and then do as Stefan said with read.csv() or, if you saved in tab delimited, read.table(..., sep="\t"). See help("read.table") or ?read.table which is the same. Specifically, looks like you already exported them as tab delimited. So something like yourdata <- read.table(file="yourfile.txt", sep="\t", header=TRUE) should work. **reshape** Now 'yourdata' is a data.frame object. Then you reshape() them, as said, from 'wide' to 'long' format. You need to do something like yourreshapeddata <- reshape(yourdata, direction="long", varying=list(paste("revenue", 2007:1998, sep=""))) i.e., 'varying' must be the list of the vectors (here: only one vector!) of column names corresponding to the time-varying variables to be stacked. (Hint: see what 'paste("revenue", 2007:1998, sep=""))' does). An example on the Grunfeld data just to clarify: data(Grunfeld, package="Ecdat") # data start in 'long' format, the one you need ## make them 'wide' pippo<-reshape(Grunfeld, direction="wide", timevar="year", idvar="firm") ## see what you got fix(pippo) ## now make them 'long' again (what you need!) pluto<-reshape(pippo, direction="long", varying=list(paste("inv",1935:1954,sep="."),paste("value",1935:1954,sep"."),paste("capital",1935:1954,sep="."))) ## see how it worked. Notice the "id" and "time" variables it produced. fix(pluto) (Nevermind the names, Pippo and Pluto are Goofy and his dog in Italian. Stands for "Foo" etc.) **merge** You say, you have one variable per spreadsheet/txt-file. Then as soon as you have as many 'long' dataframes, let's say they are called 'revenue', 'cost', etc., you might want to merge() them two by two to make your final dataframe, along the lines of yourfinaldata <- merge("revenue", "cost", by=c("id", "time")) **modeling** Now your data are ready to do, say, yourmodel <- lm(revenue~cost, data=yourfinaldata) Notice that if you want to use 'plm' for panel data models, you will want to have "id" and "time" in the first two columns, or alternatively to specify an 'index' (see ?plm in library(plm)). Most "panel" models can also be very effectively estimated with the 'nlme' or 'lme4' packages, although the syntax is slightly more complicated. Just to be sure, I am adding a small reproducible example on fake data that should resemble what you have to do: ## begin example ## ## make fake 'wide' data for variable 'revenue' pippo1<-matrix(rnorm(12),ncol=4) pippo1<-as.data.frame(pippo1) pippo1[,1]<-1:3 dimnames(pippo1)[[2]]<-c("firm",paste("revenue",2002:2000, sep="")) ## reshape pippo1<-reshape(pippo1, direction="long", drop="firm", varying=list(paste("revenue",2002:2000,sep=""))) ## ...and have a look: print(pippo1) ## make fake 'wide' data for variable 'cost' pippo2<-matrix(rnorm(12),ncol=4) pippo2<-as.data.frame(pippo2) pippo2[,1]<-1:3 dimnames(pippo2)[[2]]<-c("firm",paste("cost",2002:2000, sep="")) ## reshape pippo2<-reshape(pippo2, direction="long", drop="firm", varying=list(paste("cost",2002:2000,sep=""))) ## merge them (R is smart enough to guess the names of 'by' variables) alldata<-merge(pippo1, pippo2) ## see what happened: print(alldata) ## fix var names dimnames(alldata)[[2]][3:4]<-c("revenue","cost") ## estimate a linear model yourmod<-lm(cost~revenue, data=alldata) summary(yourmod) ## to do panel models (with 'plm'): ## e.g., a fixed effects model yourFEmod<-plm(cost~revenue, data=alldata, index=c("id","time")) ## end example ## Original message: ------------------------------ Date: Sun, 19 Apr 2009 12:30:10 +0100 From: "Cecilia Carmo" <cecilia.carmo at ua.pt> Subject: [R] importing spreadsheet data - linera regression - panel data To: r-help at r-project.org Message-ID: <web-74601530 at controller2.cgpmail.ua.pt> Content-Type: text/plain;charset=iso-8859-1;format="flowed" Hi everyone and thank you for the help you could give me. My data is in a spreadsheet. The 1st column identifies the firm (with the fiscal number), the columns 2 to 11 have the variable value for 11 years. I have many variables (files like this). Each file has about 40.000 firms (rows). I transformed all the files in txt files. The data is a panel data, like this: firm revenu2007 revenue2006 revenue2005 revenue2004 revenue2003 revenue2002 revenue2001 revenue2000 revenue1999 revenue1998 500100144 504394029 4282809 3769159 3520807 3548322 3458122 503264032 502011475 2595780 2417433 2299563 2060552 1804531 1821638 1789533 1463371 947712 500400911 504615947 22801 28656 27067 26182 26356 34060 39147 502616695 1412354 1209619 1429755 1623496 1955123 2273486 2087406 2076868 2036937 500829993 1383396 1095570 805830 793809 777591 791614 779924 774560 702845 The objective of my work is to do linear regressions with the variables in the files or with other variables that I can obtain from those by doing some mathematical operations. I?ve already tried to import this information to an array in R, but I?ve seen that the linear regressions functions need the data into a dataframe. So I?m asking: How to import this information to a dataframe in R, in a manner that I can easily do the mathematical operations between the variables and then use it to do the regressions? Cec?lia Carmo ********** I hope it helps. Please if you use 'plm' let me know how it fares with 40.000x10 data points. Sure 'nlme' and 'lme4' can handle this. Best wishes, Giovanni Giovanni Millo Research Dept., Assicurazioni Generali SpA Via Machiavelli 4, 34132 Trieste (Italy) tel. +39 040 671184 fax +39 040 671160 Ai sensi del D.Lgs. 196/2003 si precisa che le informazi...{{dropped:13}}