I have a dataset that is basically structureless. Its dimension varies from row to row and sep(s) are a mixture of tab and semi colon (;) and example is HEADER1 HEADER2 HEADER3 HEADER3 A1 B1 C1 X11;X12;X13 A2 B2 C2 X21;X22;X23;X24;X25 A3 B3 C3 A4 B4 C4 X41;X42;X43 A5 B5 C5 X51 etc., say. Note that a blank under HEADER3 corresponds to non occurance and all semi colon (;) delimited variables are under HEADER3. These values run into tens of thousands. I want to give some order to this queer matrix to something like: HEADER1 HEADER2 HEADER3 HEADER3 A1 B1 C1 X11 A1 B1 C1 X12 A1 B1 C1 X13 A1 B1 C1 X14 A2 B2 C2 X21 A2 B2 C2 X22 A2 B2 C2 X23 A2 B2 C2 X24 A2 B2 C2 X25 A2 B2 C2 X26 A3 B3 C3 NA A4 B4 C4 X41 A4 B4 C4 X42 A4 B4 C4 X43 Is there a brilliant R-way of doing such task? Goodday. Stephen. ----- Original Message ----- From: Prof Brian Ripley <ripley at stats.ox.ac.uk> Date: Monday, August 15, 2005 11:13 pm Subject: Re: [R] How to get a list work in RData file> On Mon, 15 Aug 2005, Xiyan Lon wrote: > > > Dear R-Helper, > (There are quite a few of us.) > > > I want to know how I get a list work which I saved in RData > file. For > > example, > > I don't understand that at all, but it looks as if you want to > save an > unevaluated call, in which case see ?quote and use something like > > xyadd <- quote(test.xy(x=2, y=3)) > > load and saving has nothing to do with this: it doesn't change the > meaning > of objects in the workspace. > > > > test.xy <- function(x,y) { > > + xy <- x+y > > + xy > > + } > > > > > > xyadd <- test.xy(x=2, y=3) > > > xyadd > > [1] 5 > > > x1 <- c(2,43,60,8) > > > y1 <- c(91,7,5,30) > > > > > > xyadd1 <- test.xy(x=x1, y=y1) > > > xyadd1 > > [1] 93 50 65 38 > > > save(list = ls(all=TRUE), file = "testxy.RData") > > > rm(list=ls(all=TRUE)) > > > load("C:/R/useR/testxy.RData") > > > ls() > > [1] "test.xy" "x1" "xyadd" "xyadd1" "y1" > > > > > > ls.str(pat="xyadd") > > xyadd : num 5 > > xyadd1 : num [1:4] 93 50 65 38 > > > > > > > When I run, I know the result like above > > > xyadd > > [1] 5 > > > xyadd1 > > [1] 93 50 65 38 > > > > > what I want to know, is there any function to make the result like: > > > > > xyadd > > > > test.xy(x=2, y=3) > > > > and > > > > > xyadd1 > > > > test.xy(x=x1, y=y1) > > -- > Brian D. Ripley, ripley at stats.ox.ac.uk > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ > University of Oxford, Tel: +44 1865 272861 (self) > 1 South Parks Road, +44 1865 272866 (PA) > Oxford OX1 3TG, UK Fax: +44 1865 272595 > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting- > guide.html
Here's one way of working with the data you gave: > x <- read.table(file("clipboard"), fill=T, header=T) > x HEADER1 HEADER2 HEADER3 HEADER3.1 1 A1 B1 C1 X11;X12;X13 2 A2 B2 C2 X21;X22;X23;X24;X25 3 A3 B3 C3 4 A4 B4 C4 X41;X42;X43 5 A5 B5 C5 X51 > apply(x, 1, function(x) strsplit(x[4], ";")[[1]]) $"1" [1] "X11" "X12" "X13" $"2" [1] "X21" "X22" "X23" "X24" "X25" $"3" character(0) $"4" [1] "X41" "X42" "X43" $"5" [1] "X51" > do.call("rbind", apply(x, 1, function(x) { + y <- strsplit(x[4], ";")[[1]] + x3 <- matrix(x[1:3], ncol=3, nrow=max(1,length(y)), byrow=T) + return(cbind(x3, if (length(y)) y else "NA")) + })) [,1] [,2] [,3] [,4] [1,] "A1" "B1" "C1" "X11" [2,] "A1" "B1" "C1" "X12" [3,] "A1" "B1" "C1" "X13" [4,] "A2" "B2" "C2" "X21" [5,] "A2" "B2" "C2" "X22" [6,] "A2" "B2" "C2" "X23" [7,] "A2" "B2" "C2" "X24" [8,] "A2" "B2" "C2" "X25" [9,] "A3" "B3" "C3" "NA" [10,] "A4" "B4" "C4" "X41" [11,] "A4" "B4" "C4" "X42" [12,] "A4" "B4" "C4" "X43" [13,] "A5" "B5" "C5" "X51" > This of course is a matrix; you can convert it back to a dataframe using as.data.frame() if you desire. Use either "NA" (with quotes) or NA (without quotes) to control whether you get just the string "NA" or an actual character NA value in column 4. If you're processing a huge amount of data, you can probably do better by rewriting the above code to avoid implicit coercions of data types. hope this helps, Tony Plate S.O. Nyangoma wrote:> I have a dataset that is basically structureless. Its dimension varies > from row to row and sep(s) are a mixture of tab and semi colon (;) and > example is > > HEADER1 HEADER2 HEADER3 HEADER3 > A1 B1 C1 X11;X12;X13 > A2 B2 C2 X21;X22;X23;X24;X25 > A3 B3 C3 > A4 B4 C4 X41;X42;X43 > A5 B5 C5 X51 > > etc., say. Note that a blank under HEADER3 corresponds to non > occurance and all semi colon (;) delimited variables are under > HEADER3. These values run into tens of thousands. I want to give some > order to this queer matrix to something like: > > HEADER1 HEADER2 HEADER3 HEADER3 > A1 B1 C1 X11 > A1 B1 C1 X12 > A1 B1 C1 X13 > A1 B1 C1 X14 > A2 B2 C2 X21 > A2 B2 C2 X22 > A2 B2 C2 X23 > A2 B2 C2 X24 > A2 B2 C2 X25 > A2 B2 C2 X26 > A3 B3 C3 NA > A4 B4 C4 X41 > A4 B4 C4 X42 > A4 B4 C4 X43 > > Is there a brilliant R-way of doing such task? > > Goodday. Stephen. > > > > > > > > > ----- Original Message ----- > From: Prof Brian Ripley <ripley at stats.ox.ac.uk> > Date: Monday, August 15, 2005 11:13 pm > Subject: Re: [R] How to get a list work in RData file > > >>On Mon, 15 Aug 2005, Xiyan Lon wrote: >> >> >>>Dear R-Helper, >> >>(There are quite a few of us.) >> >> >>>I want to know how I get a list work which I saved in RData >> >>file. For >> >>>example, >> >>I don't understand that at all, but it looks as if you want to >>save an >>unevaluated call, in which case see ?quote and use something like >> >>xyadd <- quote(test.xy(x=2, y=3)) >> >>load and saving has nothing to do with this: it doesn't change the >>meaning >>of objects in the workspace. >> >> >>>>test.xy <- function(x,y) { >>> >>>+ xy <- x+y >>>+ xy >>>+ } >>> >>>>xyadd <- test.xy(x=2, y=3) >>>>xyadd >>> >>>[1] 5 >>> >>>>x1 <- c(2,43,60,8) >>>>y1 <- c(91,7,5,30) >>>> >>>>xyadd1 <- test.xy(x=x1, y=y1) >>>>xyadd1 >>> >>>[1] 93 50 65 38 >>> >>>>save(list = ls(all=TRUE), file = "testxy.RData") >>>>rm(list=ls(all=TRUE)) >>>>load("C:/R/useR/testxy.RData") >>>>ls() >>> >>>[1] "test.xy" "x1" "xyadd" "xyadd1" "y1" >>> >>>>ls.str(pat="xyadd") >>> >>>xyadd : num 5 >>>xyadd1 : num [1:4] 93 50 65 38 >>> >>>When I run, I know the result like above >>> >>>>xyadd >>> >>>[1] 5 >>> >>>>xyadd1 >>> >>>[1] 93 50 65 38 >>> >>>what I want to know, is there any function to make the result like: >>> >>> >>>>xyadd >>> >>> test.xy(x=2, y=3) >>> >>>and >>> >>> >>>>xyadd1 >>> >>> test.xy(x=x1, y=y1) >> >>-- >>Brian D. Ripley, ripley at stats.ox.ac.uk >>Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ >>University of Oxford, Tel: +44 1865 272861 (self) >>1 South Parks Road, +44 1865 272866 (PA) >>Oxford OX1 3TG, UK Fax: +44 1865 272595 >> >>______________________________________________ >>R-help at stat.math.ethz.ch mailing list >>https://stat.ethz.ch/mailman/listinfo/r-help >>PLEASE do read the posting guide! http://www.R-project.org/posting- >>guide.html > > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >
On 15-Aug-05 S.O. Nyangoma wrote:> I have a dataset that is basically structureless. Its dimension > varies from row to row and sep(s) are a mixture of tab and semi > colon (;) an example is >HEADER1 HEADER2 HEADER3 HEADER3 A1 B1 C1 X11;X12;X13 A2 B2 C2 X21;X22;X23;X24;X25 A3 B3 C3 A4 B4 C4 X41;X42;X43 A5 B5 C5 X51> > etc., say. Note that a blank under HEADER3 corresponds to non > occurance and all semi colon (;) delimited variables are under > HEADER3. These values run into tens of thousands. I want to give some > order to this queer matrix to something like: > > HEADER1 HEADER2 HEADER3 HEADER3 > A1 B1 C1 X11 > A1 B1 C1 X12 > A1 B1 C1 X13 > A1 B1 C1 X14 > A2 B2 C2 X21 > A2 B2 C2 X22 > A2 B2 C2 X23 > A2 B2 C2 X24 > A2 B2 C2 X25 > A2 B2 C2 X26 > A3 B3 C3 NA > A4 B4 C4 X41 > A4 B4 C4 X42 > A4 B4 C4 X43 > > Is there a brilliant R-way of doing such task? > > Goodday. Stephen.I don't know about a brilliant R trick (though I'm sure others do). But (on my usual hobby-horse) if you have 'awk' available (and don't mind using it) then it will do the job: First create an 'awk' program file as follows: {for(i in A) delete A[i]} {if($4==""){A[1]="NA"} else {split($4,A,";")}} {B = $1 "\t" $2 "\t" $3} {for(i in A) print B "\t" A[i]} and call this say split.awk Then run awk -f split.awk and feed it the lines of your primary dataset as above. Here's a cut&paste from my Linux session, where the first block of lines after "awk -f split.awk" are the lines being input to the program, starting with the header, followed by the output of the program starting with the header again: awk -f split.awk HEADER1 HEADER2 HEADER3 HEADER3 A1 B1 C1 X11;X12;X13 A2 B2 C2 X21;X22;X23;X24;X25 A3 B3 C3 A4 B4 C4 X41;X42;X43 A5 B5 C5 X51 HEADER1 HEADER2 HEADER3 HEADER3 A1 B1 C1 X11 A1 B1 C1 X12 A1 B1 C1 X13 A2 B2 C2 X24 A2 B2 C2 X25 A2 B2 C2 X21 A2 B2 C2 X22 A2 B2 C2 X23 A3 B3 C3 NA A4 B4 C4 X41 A4 B4 C4 X42 A4 B4 C4 X43 A5 B5 C5 X51 In unixoid systems, with a large file of such lines, the command would be cat yourdatafile | awk -f split.awk and then you would only see the output, not the input as shown above, and you can of course redirect it into a new file with cat yourdatafile | awk -f split.awk > newdatafile Note, however, that the order of the lines output for the third line of input (the one with X21;X22;X23;X24;X25) is not the same as the order of the X21;X22;X23;X24;X25 though they are all there. This is a "feature" of the way 'awk' handles arrays (which are "associative arrays" indexed by values, not by position). This may not matter for your application; but if it does matter then I'm not sure how to force the correct order. Hoping this helps, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 16-Aug-05 Time: 00:45:49 ------------------------------ XFMail ------------------------------