I would like to read the following hierarchical data set. There is a family record followed by one or more personal records. If col. 7 is "1" it is a family record. If it is "2" it is a personal record. The family record is formatted as follows: col. 1-5 family id col. 7 "1" col. 9 dwelling type code The personal record is formatted as follows: col. 1-5 personal id col. 7 "2" col. 8-9 age col. 11 sex code The first six family and accompanying personal records look like this: 06470 1 1 1 232 0 2 230 1 07470 1 0 1 240 1 08470 1 0 1 227 0 09470 1 0 1 213 1 2 222 0 3 224 1 10470 1 1 1 220 0 2 211 1 11470 1 0 1 217 0 2 210 1 3 226 1 I want to create a dataset containing . family ID . dwelling code . person ID . age . sex code The dataset will contain one observation per person, and the with family information repeated for people in the same family. Can anyone help? Thanks, Richard Saba
Will this do it for you:> input <- readLines(textConnection("06470 1 1+ 1 232 0 + 2 230 1 + 07470 1 0 + 1 240 1 + 08470 1 0 + 1 227 0 + 09470 1 0 + 1 213 1 + 2 222 0 + 3 224 1 + 10470 1 1 + 1 220 0 + 2 211 1 + 11470 1 0 + 1 217 0 + 2 210 1 + 3 226 1"))> closeAllConnections() > fid <- NULL > dwell <- NULL > result <- do.call(rbind, lapply(input, function(.line){+ values <- as.integer(substring(.line, c(1, 7, 9), c(5, 7, 9))) # assume family record + if (values[2] == '1'){ + fid <<- values[1] + dwell <<- values[3] + return(NULL) + } else { + values <- as.integer(substring(.line, c(1, 7, 8, 11), c(5, 7, 9, 11))) + return(c(fid=fid, dwell=dwell, pid=values[1], age=values[3], sex=values[4])) + } + }))> > resultfid dwell pid age sex [1,] 6470 1 1 32 0 [2,] 6470 1 2 30 1 [3,] 7470 0 1 40 1 [4,] 8470 0 1 27 0 [5,] 9470 0 1 13 1 [6,] 9470 0 2 22 0 [7,] 9470 0 3 24 1 [8,] 10470 1 1 20 0 [9,] 10470 1 2 11 1 [10,] 11470 0 1 17 0 [11,] 11470 0 2 10 1 [12,] 11470 0 3 26 1 On Sun, Feb 7, 2010 at 10:57 AM, Saba(Home) <sabaric at charter.net> wrote:> > I would like to read the following hierarchical data set. There is a family > record followed by one or more personal records. > If col. 7 is "1" it is a family record. If it is "2" it is a personal > record. > The family record is formatted as follows: > col. 1-5 ? ? family id > col. 7 ? ? ? ?"1" > col. 9 ? ? ? ?dwelling type code > The personal record is formatted as follows: > col. 1-5 ? ? ? ?personal id > col. 7 ? "2" > col. 8-9 ? ? ? ?age > col. 11 sex code > > The first six family and accompanying personal records look like this: > 06470 1 1 > ? ?1 232 0 > ? ?2 230 1 > 07470 1 0 > ? ?1 240 1 > 08470 1 0 > ? ?1 227 0 > 09470 1 0 > ? ?1 213 1 > ? ?2 222 0 > ? ?3 224 1 > 10470 1 1 > ? ?1 220 0 > ? ?2 211 1 > 11470 1 0 > ? ?1 217 0 > ? ?2 210 1 > ? ?3 226 1 > > I want to create a dataset containing > . family ID > . dwelling code > . person ID > . age > . sex code > The dataset will contain one observation per person, and the with family > information repeated for people in the same family. > Can anyone help? > Thanks, > Richard Saba > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?
Try this. It uses input defined in Jim's post and defines the rectype of each row ("1" or "2"). It then reads the rectype "1" records into DF1 using read.fwf and the rectype "2" records into DF2 also using read.fwf. ix is defined to have one component per personal record giving the row number in DF1 of the corresponding family. We combine DF1 and DF2 using ix and remove the column names that start with "X". # record type ("1" or "2") rectype <- substr(input, 7, 7) # read in record type "1" input1 <- input[rectype == "1"] DF1 <- read.fwf(textConnection(input1), widths = c(5, 1, 1, 1, 1), col.names = c("familyid", "X", "X", "X", "dwelling")) # read in record type "2" input2 <- input[rectype == "2"] DF2 <- read.fwf(textConnection(input2), widths = c(5, 1, 1, 2, 1, 1), col.names = c("personalid", "X", "X", "age", "X", "sex")) # ix is the index in DF1 of family row corresponding to each personal row in DF2 ix <- cumsum(rectype == "1")[rectype == "2"] DF <- cbind(DF1[ix,], DF2) DF <- DF[substr(names(DF), 1, 1) != "X"] so DF looks like this:> DFfamilyid dwelling personalid age sex 1 6470 1 1 32 0 1.1 6470 1 2 30 1 2 7470 0 1 40 1 3 8470 0 1 27 0 4 9470 0 1 13 1 4.1 9470 0 2 22 0 4.2 9470 0 3 24 1 5 10470 1 1 20 0 5.1 10470 1 2 11 1 6 11470 0 1 17 0 6.1 11470 0 2 10 1 6.2 11470 0 3 26 1 On Sun, Feb 7, 2010 at 10:57 AM, Saba(Home) <sabaric at charter.net> wrote:> > I would like to read the following hierarchical data set. There is a family > record followed by one or more personal records. > If col. 7 is "1" it is a family record. If it is "2" it is a personal > record. > The family record is formatted as follows: > col. 1-5 ? ? family id > col. 7 ? ? ? ?"1" > col. 9 ? ? ? ?dwelling type code > The personal record is formatted as follows: > col. 1-5 ? ? ? ?personal id > col. 7 ? "2" > col. 8-9 ? ? ? ?age > col. 11 sex code > > The first six family and accompanying personal records look like this: > 06470 1 1 > ? ?1 232 0 > ? ?2 230 1 > 07470 1 0 > ? ?1 240 1 > 08470 1 0 > ? ?1 227 0 > 09470 1 0 > ? ?1 213 1 > ? ?2 222 0 > ? ?3 224 1 > 10470 1 1 > ? ?1 220 0 > ? ?2 211 1 > 11470 1 0 > ? ?1 217 0 > ? ?2 210 1 > ? ?3 226 1 > > I want to create a dataset containing > . family ID > . dwelling code > . person ID > . age > . sex code > The dataset will contain one observation per person, and the with family > information repeated for people in the same family. > Can anyone help? > Thanks, > Richard Saba > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >