Vumani Dlamini
2003-Aug-20 07:13 UTC
[R] query on converting survey data from one structure to another
Dear R users, I am trying to convert a dataset from one format to several rectangular datasets. A consultant helped design the data entry program for our survey using Delphi/Pascal and for each household the information is stored in a file called "EA-HM-HH.TXT" where EA is the enumeration area number, HM is the homestead number and HH is the household number. Within this file the data is stored as follows, ###### file="5677-001-001.TXT" ###### EAnumber=5677 HMnumber=001 HHnumber=001 # Demographics section Dserial=01 #first person in household Dage=56 Dsex=1 Dserial=02 #second person in household Dage=44 Dsex=2 Dserial=03 #second person in household Dage=7 Dsex=2 # Agricultural inputs section Amaize=200 Apumpkins=50 ###### end of file ######## Note that in the demograpics section there may only be less than 3 or more people in some households. I would like to create a file for the demographics section which is as follows EAnumber|HMnumber|HHnumber|Dserial|Dage|Dsex 5677 001 001 056 1 5677 001 001 044 2 5677 001 001 007 2 and for the agricultural inputs I would like to have, EAnumber|HMnumber|HHnumber|Amaize|Apumpkins 5677 001 001 0200 5677 001 001 0050 There are several similar files where the EA number, HM number or HH number changes, thus I would also like to know whether it is possible to create a script where for all household there is only one dataset for demographics, and one for agricultural inputs. Thanking you as always. Vumani Dlamini
Jonathan Baron
2003-Aug-20 11:03 UTC
[R] query on converting survey data from one structure to another
On 08/20/03 07:13, Vumani Dlamini wrote:>Dear R users, > >I am trying to convert a dataset from one format to several rectangular >datasets. A consultant helped design the data entry program for our survey >using Delphi/Pascal and for each household the information is stored in a >file called "EA-HM-HH.TXT" where EA is the enumeration area number, HM is >the homestead number and HH is the household number. Within this file the >data is stored as follows, > > >###### file="5677-001-001.TXT" ###### >EAnumber=5677 >HMnumber=001 >HHnumber=001 ># Demographics section >Dserial=01 #first person in household >Dage=56 >Dsex=1 >Dserial=02 #second person in household >Dage=44 >Dsex=2 >Dserial=03 #second person in household >Dage=7 >Dsex=2 ># Agricultural inputs section >Amaize=200 >Apumpkins=50 >###### end of file ######## > >Note that in the demograpics section there may only be less than 3 or more >people in some households. I would like to create a file for the >demographics section which is as follows > >EAnumber|HMnumber|HHnumber|Dserial|Dage|Dsex >5677 001 001 056 1 >5677 001 001 044 2 >5677 001 001 007 2 > >and for the agricultural inputs I would like to have, > >EAnumber|HMnumber|HHnumber|Amaize|Apumpkins >5677 001 001 0200 >5677 001 001 0050 > >There are several similar files where the EA number, HM number or HH number >changes, thus I would also like to know whether it is possible to create a >script where for all household there is only one dataset for demographics, >and one for agricultural inputs.The string handling functions in R should be able to handle this about as easly as could the string functions in (say) Perl, but I'm not going to do it for you. It would take me at least an hour, probably 2 hours. If I were doing it - and I may not be doing it the best way - I would first put all the little files together into one big file f1 with the Unix command cat. Then I would read in f1 in a way that gives me a vector of strings, with one line corresponding to each element of the vector: d1 <- as.vector(as.matrix(read.table(f1,sep="\n",quote=""))) is how I've done this before, but I'm not sure this is all necessary. Then I would start a file d2 for output and write one line with the names of the variables: write.table("EAnumber HMnumber HHnumber Dserial Dage Dsex",file="d2"), quote=F,row.names=F,col.names=F,sep="") (I'm not sure this command will work. I'd try it first by itself.) Then I would write a loop to process d1 line by line, with sub-loops to deal with the family members in one of the original files. Here is part of a similar loop I wrote for another purpose. The variable "onoff" indicated whether I was still within one of my original records (like one of your little files). You would probably want two such variables, one to indicate the family, one for each of its members. d2 <- {} # set up a vector for the output, undetermined length for (i in 1:length(d1)) { s1 <- substr(d1[i],1,4) if (s1=="age=") {onoff <- TRUE; j <- j+1; d2[j] <- ""} if (onoff==1 & s1!="aqua" & s1!="comm" & s1!="apay") {d2[j] <- paste(d2[j],strsplit(d1[i],"=")[[1]][2],collapse=" ")} if (s1==f2) {onoff <- FALSE} } The following command is to replace plus signs with spaces. It just shows you how to use gsub(). d2 <- gsub("\\+"," ",d2) Finally, I would append the vector d2 to the file I had written: write.table(d2,file="d2",append=T,quote=F,row.names=F,col.names=F,sep="") I have not tested any of this recently. You will have to build it up step by step and test it at each point. But perhaps this will get you started. Also look at the functions grep() and regexpr(), which might come in hand. -- Jonathan Baron, Professor of Psychology, University of Pennsylvania Home page: http://www.sas.upenn.edu/~baron R page: http://finzi.psych.upenn.edu/