Andrew.Haywood@poyry.com.au
2005-Oct-04 00:36 UTC
[R] newbie questions - looping through hierarchial datafille
Dear List,

I'm new to R and am making the transition from SAS. I have a space-delimited file with the following structure. Each line in the data file is identified by its first letter:

A = Inventory (Inventory)
X = Stratum (Stratum_no Total Ye=year established)
P = Plot (Plot_no age slope= species)
T = Tree (tree_no frequency)
L = Leader (leader diameter height)
F = Feature (start_height finish_height feature)

On each of these lines there are some 'line specific' variables (in brackets). The data is hierarchical in nature: a feature belongs to a leader, a leader belongs to a tree, a tree belongs to a plot, a plot belongs to a stratum, and a stratum belongs to an inventory. There are many features in a tree, many trees in a plot, etc.

In SAS I would read in the data in a procedural way, using first. and last. variables to work out where inventories/strata/plots/trees started and finished, so that I could create summary statistics for each of them. For example, how many plots in a stratum? How many trees in a plot? An example of the SAS code I would use is below (not checked for errors!). If anybody could give me some idea of what the right approach in R would be for a similar analysis, it would be greatly appreciated.

regards
Andrew

data datafile;
  infile 'test.txt';
  input @1 tag $1. @@;
  /* carry the parent-level values forward onto every feature row */
  retain inventory stratum_no total yearest plot_no age slope species
         tree_no frequency leader_no diameter height;
  if tag = 'A' then input @3 inventory $.;
  if tag = 'X' then input @3 stratum_no $. total $. yearest $.;
  if tag = 'P' then input @3 plot_no $. age $. slope $. species $;
  if tag = 'T' then input @3 tree_no $. frequency;
  if tag = 'L' then input @3 leader_no $ diameter height;
  if tag = 'F' then input @3 start $ finish $ feature $;
  if tag = 'F' then output;
run;

proc sort data = datafile;
  by inventory stratum_no plot_no tree_no leader_no;
run;

* calculate mean dbh in each plot;
data dbh;
  set datafile;
  by inventory stratum_no plot_no tree_no leader_no;
  if first.leader_no then output;   /* one row per leader */
run;

proc summary data = dbh;
  by inventory stratum_no plot_no;
  var diameter;
  output out = mean mean=;
run;

A BENALLA_1
X 1 10 YE=1985
P 1 20.25 slope=14 SPP:P.RAD
T 1 25
L 0 28.5 21.3528
F 0 21.3528 SFNSW_DIC:P
F 21.3528 100 SFNSW_DIC:P
T 2 25
L 0 32 23.1
F 0 6.5 SFNSW_DIC:A
F 6.5 23.1 SFNSW_DIC:C
F 23.1 100 SFNSW_DIC:C
T 3 25
L 0 39.5 22.2407
F 0 4.7 SFNSW_DIC:A
F 4.7 6.7 SFNSW_DIC:C
P 2 20.25 slope=13 SPP:P.RAD
T 1 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 1001 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 2 25
L 0 32.5 21.7386
F 0 2 SFNSW_DIC:A
F 2 3.3 SFNSW_DIC:G
F 3.3 10.4 SFNSW_DIC:C
X 2 10 YE=1985
P 1 20.25 slope=14 SPP:P.RAD
T 1 25
L 0 28.5 21.3528
F 0 21.3528 SFNSW_DIC:P
F 21.3528 100 SFNSW_DIC:P
T 2 25
L 0 32 23.1
F 0 6.5 SFNSW_DIC:A
F 6.5 23.1 SFNSW_DIC:C
F 23.1 100 SFNSW_DIC:C
T 3 25
L 0 39.5 22.2407
F 0 4.7 SFNSW_DIC:A
F 4.7 6.7 SFNSW_DIC:C
P 2 20.25 slope=13 SPP:P.RAD
T 1 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 1001 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 2 25
L 0 32.5 21.7386
F 0 2 SFNSW_DIC:A
F 2 3.3 SFNSW_DIC:G
F 3.3 10.4 SFNSW_DIC:C
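For readers coming from SAS, the 'retain' and first./last. bookkeeping described above can be approximated in base R with running counters over the tag column. The following is only a rough sketch, not something from the original post: the names 'recs', 'tag', 'stratum_id' and 'plot_id' are made up for illustration, and the records are a shortened copy of the sample data.

```r
# A shortened copy of the sample records (two strata, three plots).
recs <- c("A BENALLA_1",
          "X 1 10 YE=1985",
          "P 1 20.25 slope=14 SPP:P.RAD",
          "T 1 25", "L 0 28.5 21.3528", "F 0 21.3528 SFNSW_DIC:P",
          "T 2 25", "L 0 32 23.1",      "F 0 6.5 SFNSW_DIC:A",
          "P 2 20.25 slope=13 SPP:P.RAD",
          "T 1 25", "L 0 38 22.1474",   "F 0 1 SFNSW_DIC:G",
          "X 2 10 YE=1985",
          "P 1 20.25 slope=14 SPP:P.RAD",
          "T 1 25", "L 0 28.5 21.3528", "F 0 21.3528 SFNSW_DIC:P")

tag <- substr(recs, 1, 1)   # first letter identifies the record type

# Running counters: every record is labelled with the index of the most
# recent stratum (X) and plot (P) line seen so far. The counters are
# file-wide, so plot_id 3 is the first plot of the second stratum.
stratum_id <- cumsum(tag == "X")
plot_id    <- cumsum(tag == "P")

# How many plots in each stratum? (ignore records before the first X line)
in_stratum <- stratum_id > 0
plots_per_stratum <- tapply((tag == "P")[in_stratum], stratum_id[in_stratum], sum)

# How many trees in each plot? (ignore records before the first P line)
in_plot <- plot_id > 0
trees_per_plot <- tapply((tag == "T")[in_plot], plot_id[in_plot], sum)

print(plots_per_stratum)
print(trees_per_plot)
```

This answers the counting questions without building the full flat table; it relies on the records being in hierarchical order, as they are in the sample file.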
jim holtman
2005-Oct-04 10:54 UTC
[R] newbie questions - looping through hierarchial datafille
Here is a brute-force way based on the format of your input data. Basically, it reads each line in, 'splits' it apart on blanks, and then processes it according to the 'tag'. Information is stored in some global data, and '.result' is converted into a data frame that you can work with.

===============================
> xIN <- scan('/treedata.txt', what='', sep='\n')  # read in entire line
Read 59 items
> xIN <- strsplit(xIN, ' ')  # split out fields separated by blanks
> # initialize 'global' variables to collect the information
> Out <- list()  # individual results
> .result <- list(); r.n <- 0
> # process the data into a list '.result'
> # make use of the '<<-' to assign to a 'global' value
> invisible(lapply(xIN, function(x){
+     if (x[1] == "A") Out$inv <<- x[2]
+     else if (x[1] == "X") {
+         Out$strat <<- x[2]
+         Out$total <<- x[3]
+         Out$year <<- x[4]
+     } else if (x[1] == "P"){
+         Out$plot <<- x[2]
+         Out$age <<- x[3]
+         Out$slope <<- x[4]
+         Out$species <<- x[5]
+     } else if (x[1] == "T"){
+         Out$tree <<- x[2]
+         Out$freq <<- x[3]
+     } else if (x[1] == "L"){
+         Out$leader <<- x[2]
+         Out$diam <<- x[3]
+         Out$height <<- x[4]
+     } else if (x[1] == "F") {
+         Out$start <<- x[2]
+         Out$finish <<- x[3]
+         Out$feature <<- x[4]
+         .result[[r.n <<- r.n + 1]] <<- Out  # store the result
+     }
+ }))
> # convert the list to a dataframe for processing
> myData <- lapply(.result, function(x) do.call('cbind', x))
> myData <- as.data.frame(do.call('rbind', myData))
> myData[order(myData$inv, myData$strat, myData$plot, myData$tree, myData$leader),]
         inv strat total    year plot   age    slope   species tree freq leader diam  height   start  finish     feature
 1 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528       0 21.3528 SFNSW_DIC:P
 2 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528 21.3528     100 SFNSW_DIC:P
 3 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1       0     6.5 SFNSW_DIC:A
 4 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1     6.5    23.1 SFNSW_DIC:C
 5 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1    23.1     100 SFNSW_DIC:C
 6 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407       0     4.7 SFNSW_DIC:A
 7 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407     4.7     6.7 SFNSW_DIC:C
 8 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       0       1 SFNSW_DIC:G
 9 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
10 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       0       1 SFNSW_DIC:G
11 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
12 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       0       2 SFNSW_DIC:A
13 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       2     3.3 SFNSW_DIC:G
14 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386     3.3    10.4 SFNSW_DIC:C
15 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528       0 21.3528 SFNSW_DIC:P
16 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528 21.3528     100 SFNSW_DIC:P
17 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1       0     6.5 SFNSW_DIC:A
18 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1     6.5    23.1 SFNSW_DIC:C
19 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1    23.1     100 SFNSW_DIC:C
20 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407       0     4.7 SFNSW_DIC:A
21 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407     4.7     6.7 SFNSW_DIC:C
22 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       0       1 SFNSW_DIC:G
23 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
24 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       0       1 SFNSW_DIC:G
25 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
26 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       0       2 SFNSW_DIC:A
27 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       2     3.3 SFNSW_DIC:G
28 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386     3.3    10.4 SFNSW_DIC:C
>

--
Jim Holtman
Cincinnati, OH
+1 513 247 0281

What is the problem you are trying to solve?
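Once the records are in a flat data frame like the one printed above, the summaries asked for in the original question (mean dbh per plot, trees per plot) can be computed with base R. The sketch below is an illustration rather than part of the reply: it rebuilds a small stand-in for 'myData' by hand (stratum 1 only, with just the columns needed), and the names 'keys', 'leaders', 'mean_dbh' and 'trees_per_plot' are assumptions. Because every column produced by the cbind/rbind step is text (or a factor in older versions of R), the diameter is converted with as.numeric(as.character(...)) before averaging.

```r
# Hand-built stand-in for stratum 1 of the flat table (one row per feature).
n_feat <- c(2, 3, 2, 2, 2, 3)   # features recorded on each of the six trees
myData <- data.frame(
  inv    = "BENALLA_1",
  strat  = "1",
  plot   = rep(c("1", "2"), each = 7),
  tree   = rep(c("1", "2", "3", "1", "1001", "2"), times = n_feat),
  leader = "0",
  diam   = rep(c("28.5", "32", "39.5", "38", "38", "32.5"), times = n_feat),
  stringsAsFactors = FALSE
)

# Keep the first row of every leader: the analogue of 'if first.leader_no'.
keys    <- c("inv", "strat", "plot", "tree", "leader")
leaders <- myData[!duplicated(myData[keys]), ]

# Columns arrive as text; go via character so factors would also convert safely.
leaders$diam <- as.numeric(as.character(leaders$diam))

# Mean diameter per plot: the analogue of 'proc summary ... by ...'.
mean_dbh <- aggregate(diam ~ inv + strat + plot, data = leaders, FUN = mean)

# Number of distinct trees per plot.
trees <- unique(myData[c("inv", "strat", "plot", "tree")])
trees_per_plot <- aggregate(tree ~ inv + strat + plot, data = trees, FUN = length)

print(mean_dbh)
print(trees_per_plot)
```

Unlike the SAS BY statement, neither duplicated() nor aggregate() needs the data to be sorted first.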
Simon Blomberg
2005-Oct-06 02:09 UTC
[R] newbie questions - looping through hierarchial datafille
Well, I haven't seen any replies to this, so I have had a stab at the problem of getting the data into a data frame. The approach I took was to break up the data into a list, and then fill in a matrix, row by row, "filling down" spreadsheet-style when necessary, taking advantage of the ordering of the data. The matrix is then coerced to a data.frame. Maybe not a very portable/general solution, but it appears to work.

list.to.data.frame <- function () {

    filecon <- file(file.choose())  # choose a data file
    # read all the data into a list, 1 line per element; each element is
    # a character vector of data (variable length)
    dat <- strsplit(readLines(filecon, n=-1), split=" ")
    close(filecon)  # release the connection once the lines are read

    resultvec <- matrix(rep(NA, 16), nrow=1)  # results will be stored here

    filldown <- function (x) {
        # kludge to simulate fill-down of a vector, spreadsheet style
        if (all(is.na(x)) || all(!is.na(x))) x
        else {
            last <- min(which(is.na(x)))
            x[last:length(x)] <- x[last-1]
            x
        }
    }

    # loop through the data
    for (vec in dat) {
        f <- switch(vec[1],  # what kind of field are we dealing with?
                    "A" = c(vec[-1], rep(NA, 15)),
                    "X" = c(NA, vec[-1], rep(NA, 12)),
                    "P" = c(rep(NA, 4), vec[-1], rep(NA, 8)),
                    "T" = c(rep(NA, 8), vec[-1], rep(NA, 6)),
                    "L" = c(rep(NA, 10), vec[-1], rep(NA, 3)),
                    "F" = c(rep(NA, 13), vec[-1]))

        if (any(is.na(resultvec[nrow(resultvec), which(!is.na(f))])))
            # slot the data into the appropriate column
            resultvec[nrow(resultvec),] <- ifelse(is.na(resultvec[nrow(resultvec),]),
                                                  f, resultvec[nrow(resultvec),])
        else
            # if the row is full, start a new one
            resultvec <- rbind(resultvec, f)

        # if we are at the end of a row, fill down and start a new row
        if (vec[1] == "F")
            resultvec <- rbind(apply(resultvec, 2, filldown), rep(NA, 16))
    }

    # coerce to a data frame, and get rid of the last empty row
    res <- as.data.frame(resultvec[-nrow(resultvec),], row.names=NULL)

    # set column names
    names(res) <- c("Inventory", "Stratum_no", "Total", "Ye", "Plot_no", "age",
                    "slope", "species", "tree_no", "frequency", "leader",
                    "diameter", "height", "start_height", "finish_height",
                    "feature")

    # return the result
    res
}

Cheers,

Simon.

Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat.
Centre for Resource and Environmental Studies
The Australian National University
Canberra ACT 0200
Australia
T: +61 2 6125 7800
email: Simon.Blomberg_at_anu.edu.au
F: +61 2 6125 0757
CRICOS Provider # 00120C
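The spreadsheet-style "fill down" idea used in this reply can also be written without an explicit loop. The sketch below is not Simon's function: it is a general "last observation carried forward" helper under the hypothetical name 'fill_down', applied to a small made-up matrix ('sparse') in which parent values appear only on the row where they change.

```r
# Carry the most recent non-NA value forward over the NAs that follow it.
fill_down <- function(x) {
  # position of the latest non-NA element at each index (0 if none so far)
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA   # leading NAs have nothing to copy from, keep them NA
  x[idx]
}

# Made-up example: one row per feature, parents given only when they change.
sparse <- rbind(
  c("1", "1", "0"),
  c(NA,  NA,  "21.3528"),
  c(NA,  "2", "0"),
  c(NA,  NA,  "6.5")
)
colnames(sparse) <- c("plot", "tree", "start")

# Fill each column independently.
filled <- apply(sparse, 2, fill_down)
print(filled)
```

Filling whole columns in one pass avoids growing the result with rbind() inside the loop, which may matter for large inventory files. It also assumes that a blank really means "same as the row above", which holds here only because the records arrive in hierarchical order.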