Hi everyone:

Suppose I have a matrix in which some column names are identical. For example, TEMP:

"AAA", "BBB", "CCC", "DDD", "AAA", "BBB"
0      2      1      2      0      0
2      3      7      6      0      1
1.5    4      9      9      6      0
1.0    6      10     11     3      3

I haven't checked yet whether identical column names are allowed in a matrix, but I hope they are.

Assuming that they are, I would like to take the matrix and make a new matrix with the following requirements:

1) Whenever a column name is unique, just take that column for the new matrix.

2) Whenever a column name is not unique, take the column that has the most non-zero elements (in the case of ties, I don't care which one is picked).

So, in this case, the resulting matrix would just be the first 4 columns.

I realize (or at least I think) that
sum(TEMP[(TEMP[, columnname] != 0), columnname])
will give me the number of non-zero elements in the column named columnname, but how to use this to deal with the non-uniqueness and solve my particular problem is beyond me. Plus, I think the command will bomb because columnname will not always be unique?

Thanks for any help. I realize this is not a trivial problem, so I really appreciate it.

Mark

Try this:

# test data
# read in header separately so R does not make column names unique
Lines <- "AAA BBB CCC DDD AAA BBB
0 2 1 2 0 0
2 3 7 6 0 1
1.5 4 9 9 6 0
1.0 6 10 11 3 3
"
DF <- read.table(textConnection(Lines), skip = 1)
names(DF) <- scan(textConnection(Lines), what = "", nlines = 1)

# for each group of same-named columns, return the index of the
# column with the most non-zero entries
f <- function(x) x[which.max(colSums(DF[x]!=0))]
tapply(seq(DF), names(DF), f)

On 7/3/06, markleeds at verizon.net <markleeds at verizon.net> wrote:
> [...]
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
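Since the original question was about a matrix rather than a data frame, here is a minimal sketch of the same idea applied directly to a matrix. It assumes TEMP is a numeric matrix built as below (R matrices do accept duplicate column names); the names nonzero, keep and RESULT are made up for illustration and are not from the thread.

```r
# Assumption: TEMP is a numeric matrix with duplicated column names.
TEMP <- matrix(c(0,   2, 1,  2,  0, 0,
                 2,   3, 7,  6,  0, 1,
                 1.5, 4, 9,  9,  6, 0,
                 1.0, 6, 10, 11, 3, 3),
               nrow = 4, byrow = TRUE,
               dimnames = list(NULL, c("AAA", "BBB", "CCC", "DDD", "AAA", "BBB")))

# Count the non-zero entries of every column. This works by position,
# so the duplicated names cause no trouble.
nonzero <- colSums(TEMP != 0)

# Group the column positions by name and, within each group, keep the
# position with the most non-zeros. which.max() returns the first
# maximum, so ties go to the leftmost column.
keep <- vapply(split(seq_len(ncol(TEMP)), colnames(TEMP)),
               function(idx) idx[which.max(nonzero[idx])],
               integer(1))

# sort() restores the original left-to-right column order.
RESULT <- TEMP[, sort(keep), drop = FALSE]
```

Indexing by position rather than by name is the design choice that sidesteps the worry about TEMP[, columnname] being ambiguous when names repeat.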

Here is a modification of Gabor's solution that will return the data frame with just the maximum columns:

# test data
# read in header separately so R does not make column names unique
Lines <- "AAA BBB CCC DDD AAA BBB
0 2 1 2 0 0
2 3 7 6 0 1
1.5 4 9 9 6 0
1.0 6 10 11 3 3
"
DF <- read.table(textConnection(Lines), skip = 1)
names(DF) <- scan(textConnection(Lines), what = "", nlines = 1)

# for each group of same-named columns, return the index of the
# column with the most non-zero entries
f <- function(x) x[which.max(colSums(DF[x]!=0))]
tapply(seq(DF), names(DF), f)

#================added code================#
# compute the number of non-zeros in each column
# (despite its name, MostZeros holds counts of non-zero entries)
MostZeros <- colSums(DF != 0)
# determine which column is the maximum
x.max <- lapply(unique(names(DF)), function(.name){
    .col <- which(names(DF) == .name)   # find columns of matching names
    .max <- which.max(MostZeros[.col])  # determine max
    .col[.max]                          # return the column number of the max
})
DF[unlist(x.max)]  # select only the unique maximums

On 7/3/06, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
> [...]

--
Jim Holtman
Cincinnati, OH

What is the problem you are trying to solve?

>>>>> "Gabor" == Gabor Grothendieck <ggrothendieck at gmail.com>
>>>>>     on Mon, 3 Jul 2006 16:58:14 -0400 writes:

    Gabor> Try this:
    Gabor> # test data
    Gabor> # read in header separately so R does not make column names unique
    Gabor> Lines <- "AAA BBB CCC DDD AAA BBB
    Gabor> 0 2 1 2 0 0
    Gabor> 2 3 7 6 0 1
    Gabor> 1.5 4 9 9 6 0
    Gabor> 1.0 6 10 11 3 3
    Gabor> "
    Gabor> DF <- read.table(textConnection(Lines), skip = 1)
    Gabor> names(DF) <- scan(textConnection(Lines), what = "", nlines = 1)

Hmm, this is slightly more complicated than necessary.
Instead, make use of read.table()'s own capabilities:

DF <- read.table(textConnection(Lines), check.names=FALSE, header=TRUE)
##                                      ^^^^^^^^^^^^^^^^^

Martin Maechler, ETH Zurich
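To make the effect of check.names concrete, here is a small sketch comparing the default behaviour with check.names=FALSE on the same test data. It assumes a version of R whose read.table() accepts the text= argument (used here in place of textConnection()); DF_checked and DF_raw are illustrative names only.

```r
Lines <- "AAA BBB CCC DDD AAA BBB
0 2 1 2 0 0
2 3 7 6 0 1
1.5 4 9 9 6 0
1.0 6 10 11 3 3
"

# Default (check.names = TRUE): duplicated header entries are made
# unique, so the second AAA and BBB become AAA.1 and BBB.1.
DF_checked <- read.table(text = Lines, header = TRUE)
names(DF_checked)

# check.names = FALSE keeps the header exactly as written,
# duplicates included.
DF_raw <- read.table(text = Lines, header = TRUE, check.names = FALSE)
names(DF_raw)
```

With the duplicated names preserved in DF_raw, the column-selection code from earlier in the thread can be applied to it unchanged.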