thr3ads.net - R help - [R] newbie questions - looping through hierarchial datafille [Oct 2005]

Andrew.Haywood@poyry.com.au

2005-Oct-04 00:36 UTC

[R] newbie questions - looping through hierarchial datafille

Dear List,

I'm new to R, making the transition from SAS. I have a space-delimited file 
with the following structure. Each line in the data file is identified by 
its first letter. 

A = Inventory (Inventory)
X = Stratum (Stratum_no Total Ye=year established)
P = Plot (Plot_no age slope= species)
T = Tree (tree_no frequency)
L = Leader (leader diameter height)
F = Feature (start_height finish_height feature)

On each of these lines there are some 'line specific' variables (in 
brackets). The data is hierarchical in nature - A feature belongs to a 
leader, a leader belongs to a tree, a tree belongs to a plot, a plot 
belongs to a stratum, a stratum belongs to inventory. There are many 
features in a tree. Many trees in a plot etc. 

In SAS I would read in the data in a procedural way, using first. and last. 
variables to work out where inventories/strata/plots/trees started and 
finished, so I could create summary statistics for each of them. For 
example, how many plots in a stratum? How many trees in a plot? An example 
of the SAS code I would use is below (not checked for errors!). If anybody 
could give me some idea of what the right approach in R would be for a 
similar analysis, it would be greatly appreciated.

regards Andrew


data datafile;
  infile 'test.txt' truncover;
  length inventory species feature $ 12
         stratum_no total yearest plot_no age slope tree_no leader_no start finish $ 8;
  * carry the parent-level values forward onto every feature row;
  retain inventory stratum_no total yearest plot_no age slope species
         tree_no frequency leader_no diameter height;
  input @1 tag $1. @;  * a single trailing @ holds the line for the next input;
  if tag = 'A' then input @3 inventory $;
  else if tag = 'X' then input @3 stratum_no $ total $ yearest $;
  else if tag = 'P' then input @3 plot_no $ age $ slope $ species $;
  else if tag = 'T' then input @3 tree_no $ frequency;
  else if tag = 'L' then input @3 leader_no $ diameter height;
  else if tag = 'F' then input @3 start $ finish $ feature $;
  if tag = 'F' then output;
run;

proc sort data = datafile;
  by inventory stratum_no plot_no tree_no leader_no;
run;

* calculate mean dbh in each plot;
data dbh;
  set datafile;
  by inventory stratum_no plot_no tree_no leader_no;
  if first.leader_no then output;
run;

proc summary data = dbh;
  by inventory stratum_no plot_no;
  var diameter;
  output out = mean mean=;
run;

A BENALLA_1
X 1 10 YE=1985
P 1 20.25 slope=14 SPP:P.RAD
T 1 25
L 0 28.5 21.3528
F 0 21.3528 SFNSW_DIC:P
F 21.3528 100 SFNSW_DIC:P
T 2 25
L 0 32 23.1
F 0 6.5 SFNSW_DIC:A
F 6.5 23.1 SFNSW_DIC:C
F 23.1 100 SFNSW_DIC:C
T 3 25
L 0 39.5 22.2407
F 0 4.7 SFNSW_DIC:A
F 4.7 6.7 SFNSW_DIC:C
P 2 20.25 slope=13 SPP:P.RAD
T 1 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 1001 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 2 25
L 0 32.5 21.7386
F 0 2 SFNSW_DIC:A
F 2 3.3 SFNSW_DIC:G
F 3.3 10.4 SFNSW_DIC:C
X 2 10 YE=1985
P 1 20.25 slope=14 SPP:P.RAD
T 1 25
L 0 28.5 21.3528
F 0 21.3528 SFNSW_DIC:P
F 21.3528 100 SFNSW_DIC:P
T 2 25
L 0 32 23.1
F 0 6.5 SFNSW_DIC:A
F 6.5 23.1 SFNSW_DIC:C
F 23.1 100 SFNSW_DIC:C
T 3 25
L 0 39.5 22.2407
F 0 4.7 SFNSW_DIC:A
F 4.7 6.7 SFNSW_DIC:C
P 2 20.25 slope=13 SPP:P.RAD
T 1 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 1001 25
L 0 38 22.1474
F 0 1 SFNSW_DIC:G
F 1 2.3 SFNSW_DIC:A
T 2 25
L 0 32.5 21.7386
F 0 2 SFNSW_DIC:A
F 2 3.3 SFNSW_DIC:G
F 3.3 10.4 SFNSW_DIC:C

jim holtman

2005-Oct-04 10:54 UTC

[R] newbie questions - looping through hierarchial datafille

Here is a brute-force way based on the format of your input data. Basically
it reads each line in, splits it apart on blanks, and then processes it
based on the 'tag'. Information is stored in some global data, and the
'.result' is converted into a data frame that you can work with.
> xIN <- scan('/treedata.txt', what='', sep='\n') # read in entire line
Read 59 items
> xIN <- strsplit(xIN, ' ') # split out fields separated by blanks
> # initialize 'global' variables to collect the information
> Out <- list() # individual results
> .result <- list(); r.n <- 0
> # process the data into a list '.result'
> # make use of the '<<-' to assign to a 'global' value
> invisible(lapply(xIN, function(x){
+     if (x[1] == "A") Out$inv <<- x[2]
+     else if (x[1] == "X") {
+         Out$strat <<- x[2]
+         Out$total <<- x[3]
+         Out$year <<- x[4]
+     } else if (x[1] == "P"){
+         Out$plot <<- x[2]
+         Out$age <<- x[3]
+         Out$slope <<- x[4]
+         Out$species <<- x[5]
+     } else if (x[1] == "T"){
+         Out$tree <<- x[2]
+         Out$freq <<- x[3]
+     } else if (x[1] == "L"){
+         Out$leader <<- x[2]
+         Out$diam <<- x[3]
+         Out$height <<- x[4]
+     } else if (x[1] == "F") {
+         Out$start <<- x[2]
+         Out$finish <<- x[3]
+         Out$feature <<- x[4]
+         .result[[r.n <<- r.n + 1]] <<- Out # store the result
+     }
+ }))
> # convert the list to a dataframe for processing
> myData <- lapply(.result, function(x) do.call('cbind', x))
> myData <- as.data.frame(do.call('rbind', myData))
> myData[order(myData$inv, myData$strat, myData$plot, myData$tree, myData$leader),]
         inv strat total    year plot   age    slope   species tree freq leader diam  height   start  finish     feature
 1 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528       0 21.3528 SFNSW_DIC:P
 2 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528 21.3528     100 SFNSW_DIC:P
 3 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1       0     6.5 SFNSW_DIC:A
 4 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1     6.5    23.1 SFNSW_DIC:C
 5 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1    23.1     100 SFNSW_DIC:C
 6 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407       0     4.7 SFNSW_DIC:A
 7 BENALLA_1     1    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407     4.7     6.7 SFNSW_DIC:C
 8 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       0       1 SFNSW_DIC:G
 9 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
10 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       0       1 SFNSW_DIC:G
11 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
12 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       0       2 SFNSW_DIC:A
13 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       2     3.3 SFNSW_DIC:G
14 BENALLA_1     1    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386     3.3    10.4 SFNSW_DIC:C
15 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528       0 21.3528 SFNSW_DIC:P
16 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    1   25      0 28.5 21.3528 21.3528     100 SFNSW_DIC:P
17 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1       0     6.5 SFNSW_DIC:A
18 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1     6.5    23.1 SFNSW_DIC:C
19 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    2   25      0   32    23.1    23.1     100 SFNSW_DIC:C
20 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407       0     4.7 SFNSW_DIC:A
21 BENALLA_1     2    10 YE=1985    1 20.25 slope=14 SPP:P.RAD    3   25      0 39.5 22.2407     4.7     6.7 SFNSW_DIC:C
22 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       0       1 SFNSW_DIC:G
23 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    1   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
24 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       0       1 SFNSW_DIC:G
25 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD 1001   25      0   38 22.1474       1     2.3 SFNSW_DIC:A
26 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       0       2 SFNSW_DIC:A
27 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386       2     3.3 SFNSW_DIC:G
28 BENALLA_1     2    10 YE=1985    2 20.25 slope=13 SPP:P.RAD    2   25      0 32.5 21.7386     3.3    10.4 SFNSW_DIC:C
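
Once the data are flattened to one row per feature, the summaries asked about in the original post (plots per stratum, trees per plot, mean diameter per plot) could be computed along these lines. This is only a minimal sketch: `feat` is a small hypothetical stand-in for a data frame shaped like myData, with just the columns needed, and the diameter is converted explicitly because every field was read in as text. `!duplicated()` on the key columns plays roughly the role of the SAS `first.` variable.

```r
# Hypothetical stand-in for myData: one row per feature line, with the
# parent identifiers repeated on every row (everything read as text).
feat <- data.frame(
  inv    = "BENALLA_1",
  strat  = c("1", "1", "1", "1", "1", "2", "2"),
  plot   = c("1", "1", "1", "2", "2", "1", "1"),
  tree   = c("1", "1", "2", "1", "1001", "1", "1"),
  leader = "0",
  diam   = c("28.5", "28.5", "32", "38", "38", "28.5", "28.5"),
  stringsAsFactors = FALSE
)

# Rough analogue of SAS 'if first.leader_no': keep the first row per leader.
keys    <- c("inv", "strat", "plot", "tree", "leader")
leaders <- feat[!duplicated(feat[keys]), ]
leaders$diam <- as.numeric(leaders$diam)  # text -> number before averaging

# How many plots in each stratum?
plots <- unique(feat[c("inv", "strat", "plot")])
plots_per_stratum <- table(plots$strat)

# How many trees in each plot?
trees <- unique(feat[c("inv", "strat", "plot", "tree")])
trees_per_plot <- aggregate(tree ~ inv + strat + plot, data = trees, FUN = length)

# Mean diameter per plot, using one value per leader.
mean_diam <- aggregate(diam ~ inv + strat + plot, data = leaders, FUN = mean)
```

With the real myData the same calls should apply after replacing `feat`, provided the columns carry the names used above.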


--
Jim Holtman
Cincinnati, OH
+1 513 247 0281

What is the problem you are trying to solve?

Simon Blomberg

2005-Oct-06 02:09 UTC

[R] newbie questions - looping through hierarchial datafille

Well, I haven't seen any replies to this, so I have had a stab at the 
problem of getting the data into a data frame.

The approach I took was to break up the data into a list, and then fill in 
a matrix, row by row, "filling down" spreadsheet-style when necessary, 
taking advantage of the ordering of the data. The matrix is then coerced to 
a data.frame. Maybe not a very portable/general solution, but it appears to 
work.

list.to.data.frame <- function () {
    filecon <- file(file.choose()) # open a data file
    # read all the data into a list, 1 line per element; each element is
    # a character vector of data (variable length)
    dat <- strsplit(readLines(filecon, n=-1), split=" ")
    close(filecon) # release the connection once the lines are read
    resultvec <- matrix(rep(NA, 16), nrow=1) # results will be stored here

    filldown <- function (x) {
        # kludge to simulate fill-down of a vector, spreadsheet style
        if (all(is.na(x)) || all(!is.na(x))) x else {
            last <- min(which(is.na(x)))
            x[last:length(x)] <- x[last-1]
            x
        }
    }

    # loop through the data
    for (vec in dat) {
        f <- switch(vec[1], # what kind of field are we dealing with?
                    "A" = c(vec[-1], rep(NA, 15)),
                    "X" = c(NA, vec[-1], rep(NA, 12)),
                    "P" = c(rep(NA, 4), vec[-1], rep(NA, 8)),
                    "T" = c(rep(NA, 8), vec[-1], rep(NA, 6)),
                    "L" = c(rep(NA, 10), vec[-1], rep(NA, 3)),
                    "F" = c(rep(NA, 13), vec[-1]))
        if (any(is.na(resultvec[nrow(resultvec), which(!is.na(f))])))
            # slot the data into the appropriate column
            resultvec[nrow(resultvec),] <-
                ifelse(is.na(resultvec[nrow(resultvec),]), f,
                       resultvec[nrow(resultvec),]) else
            # if the row is full, start a new one
            resultvec <- rbind(resultvec, f)
        # if we are at the end of a row, fill down and start a new row
        if (vec[1] == "F")
            resultvec <- rbind(apply(resultvec, 2, filldown), rep(NA, 16))
    }

    # coerce to a data frame, and get rid of the last empty row
    res <- as.data.frame(resultvec[-nrow(resultvec),], row.names=NULL)
    # set column names
    names(res) <- c("Inventory", "Stratum_no", "Total", "Ye",
                    "Plot_no", "age", "slope", "species",
                    "tree_no", "frequency",
                    "leader", "diameter", "height",
                    "start_height", "finish_height", "feature")
    # return the result
    res
}
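
The fill-down step can also be written without an explicit search for the first gap. The sketch below is a more general variant, not the helper used in the function above: `fill_down` is a hypothetical name, and it fills every NA with the most recent non-NA value above it, wherever the gaps fall in the vector.

```r
# Fill each NA with the most recent non-NA value above it.
# Leading NAs (with nothing above them) are left as NA.
fill_down <- function(x) {
  seen <- !is.na(x)
  idx  <- cumsum(seen)      # position of the last non-NA value so far (0 = none yet)
  c(NA, x[seen])[idx + 1]   # the extra NA in front handles idx == 0
}

filled <- fill_down(c("a", NA, NA, "b", NA))
print(filled)
```

Applied column by column (for example with apply(m, 2, fill_down) on a character matrix m), this should give the same spreadsheet-style result in a single pass at the end, rather than after every feature line.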

Cheers,

Simon.


Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat.
Centre for Resource and Environmental Studies
The Australian National University
Canberra ACT 0200
Australia
T: +61 2 6125 7800 email: Simon.Blomberg_at_anu.edu.au
F: +61 2 6125 0757
CRICOS Provider # 00120C
