thr3ads.net - R help - [R] query on converting survey data from one structure to another [Aug 2003]

If this information is useful, please help other people find it:
Share via:

Vumani Dlamini

2003-Aug-20 07:13 UTC

[R] query on converting survey data from one structure to another

Dear R users,

I am trying to convert a dataset from one format to several rectangular 
datasets. A consultant helped design the data entry program for our survey 
using Delphi/Pascal and for each household the information is stored in a 
file called "EA-HM-HH.TXT" where EA is the enumeration area number, HM
is
the homestead number and HH is the household number. Within this file the 
data is stored as follows,


###### file="5677-001-001.TXT" ######
EAnumber=5677
HMnumber=001
HHnumber=001
# Demographics section
Dserial=01   #first person in household
Dage=56
Dsex=1
Dserial=02   #second person in household
Dage=44
Dsex=2
Dserial=03   #second person in household
Dage=7
Dsex=2
# Agricultural inputs section
Amaize=200
Apumpkins=50
###### end of file ########

Note that in the demograpics section there may only be less than 3 or more 
people in some households. I would like to create a file for the 
demographics section which is as follows

EAnumber|HMnumber|HHnumber|Dserial|Dage|Dsex
5677 001 001 056 1
5677 001 001 044 2
5677 001 001 007 2

and for the agricultural inputs I would like to have,

EAnumber|HMnumber|HHnumber|Amaize|Apumpkins
5677 001 001 0200
5677 001 001 0050

There are several similar files where the EA number, HM number or HH number 
changes, thus I would also like to know whether it is possible to create a 
script where for all household there is only one dataset for demographics, 
and one for agricultural inputs.

Thanking you as always.


Vumani Dlamini

Jonathan Baron

2003-Aug-20 11:03 UTC

head link

[R] query on converting survey data from one structure to another

On 08/20/03 07:13, Vumani Dlamini wrote:>Dear R users,
>
>I am trying to convert a dataset from one format to several rectangular 
>datasets. A consultant helped design the data entry program for our survey 
>using Delphi/Pascal and for each household the information is stored in a 
>file called "EA-HM-HH.TXT" where EA is the enumeration area
number, HM is
>the homestead number and HH is the household number. Within this file the 
>data is stored as follows,
>
>
>###### file="5677-001-001.TXT" ######
>EAnumber=5677
>HMnumber=001
>HHnumber=001
># Demographics section
>Dserial=01   #first person in household
>Dage=56
>Dsex=1
>Dserial=02   #second person in household
>Dage=44
>Dsex=2
>Dserial=03   #second person in household
>Dage=7
>Dsex=2
># Agricultural inputs section
>Amaize=200
>Apumpkins=50
>###### end of file ########
>
>Note that in the demograpics section there may only be less than 3 or more 
>people in some households. I would like to create a file for the 
>demographics section which is as follows
>
>EAnumber|HMnumber|HHnumber|Dserial|Dage|Dsex
>5677 001 001 056 1
>5677 001 001 044 2
>5677 001 001 007 2
>
>and for the agricultural inputs I would like to have,
>
>EAnumber|HMnumber|HHnumber|Amaize|Apumpkins
>5677 001 001 0200
>5677 001 001 0050
>
>There are several similar files where the EA number, HM number or HH number 
>changes, thus I would also like to know whether it is possible to create a 
>script where for all household there is only one dataset for demographics, 
>and one for agricultural inputs.
The string handling functions in R should be able to handle this
about as easly as could the string functions in (say) Perl, but
I'm not going to do it for you.  It would take me at least an
hour, probably 2 hours.  If I were doing it - and I may not be
doing it the best way - I would first put all the little files
together into one big file f1 with the Unix command cat.  Then I
would read in f1 in a way that gives me a vector of strings, with
one line corresponding to each element of the vector:

d1 <-
as.vector(as.matrix(read.table(f1,sep="\n",quote="")))

is how I've done this before, but I'm not sure this is all
necessary.

Then I would start a file d2 for output and write one line with
the names of the variables:

write.table("EAnumber HMnumber HHnumber Dserial Dage
Dsex",file="d2"),
            quote=F,row.names=F,col.names=F,sep="")

(I'm not sure this command will work.  I'd try it first by itself.)

Then I would write a loop to process d1 line by line, with
sub-loops to deal with the family members in one of the original
files.  Here is part of a similar loop I wrote for another
purpose.  The variable "onoff" indicated whether I was still
within one of my original records (like one of your little
files).  You would probably want two such variables, one to
indicate the family, one for each of its members.

d2 <- {} # set up a vector for the output, undetermined length
for (i in 1:length(d1)) {
  s1 <- substr(d1[i],1,4)
  if (s1=="age=") {onoff <- TRUE; j <- j+1; d2[j] <-
""}
  if (onoff==1 & s1!="aqua" & s1!="comm" &
s1!="apay")
    {d2[j] <-
paste(d2[j],strsplit(d1[i],"=")[[1]][2],collapse=" ")}
  if (s1==f2) {onoff <- FALSE}
}

The following command is to replace plus signs with spaces.  It
just shows you how to use gsub().

d2 <- gsub("\\+"," ",d2)

Finally, I would append the vector d2 to the file I had written:

write.table(d2,file="d2",append=T,quote=F,row.names=F,col.names=F,sep="")

I have not tested any of this recently.  You will have to build
it up step by step and test it at each point.  But perhaps this
will get you started.  Also look at the functions grep() and
regexpr(), which might come in hand.

-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page:            http://www.sas.upenn.edu/~baron
R page:               http://finzi.psych.upenn.edu/

Apparently Analagous Threads

Search for more apparently analagous threads

R help - Aug 2003 - query on converting survey data from one structure to another

[R] query on converting survey data from one structure to another

[R] query on converting survey data from one structure to another

Apparently Analagous Threads