Asis Hallab
2012-Dec-17 19:22 UTC
[R] Why does matrix selection behave differently when using which?
Dear R community,

I have a medium-sized matrix stored in variable "t" and a simple function "countRows" (see below) to count the number of rows in which a selected column "C" matches a given value. If I count all rows matching all pairwise distinct values in the column "C" and sum these counts up, I get the number of rows of "t". If I delete the "which" calls from function "countRows", the resulting sum of matching row numbers is much greater than the number of rows in "t".

The table "t" I use can be downloaded from here:
https://github.com/groupschoof/PhyloFun/archive/test_selector.zip
Unzip the file and read in the table "t" using

    t <- read.table("test.tbl")

The above function "sumRows" is defined as follows:

    sumRows <- function( tbl, ps ) {
      sum(
        sapply(ps,
          function(x) {
            t <- if ( is.na(x) ) {
              tbl[ which( is.na(tbl[ , "Domain.Architecture.Distance" ]) ), , drop=F]
            } else {
              tbl[ which( tbl[ , "Domain.Architecture.Distance" ] == x ), , drop=F]
            }
            nrow(t)
          }
        )
      )
    }

What causes the different behavior of sumRows when the which calls are deleted?
What does which do that I seem not to grasp?
Or is there an error in my test.tbl?

Any help on this subject will be greatly appreciated.
Kind regards and *merry christmas*!

[[alternative HTML version deleted]]
Berend Hasselman
2012-Dec-17 19:39 UTC
[R] Why does matrix selection behave differently when using which?
On 17-12-2012, at 20:22, Asis Hallab wrote:

> Dear R community,
>
> I have a medium sized matrix stored in variable "t" and a simple function
> "countRows" (see below) to count the number of rows in which a selected
> column "C" matches a given value. If I count all rows matching all pairwise
> distinct values in the column "C" and sum these counts up, I get the number
> of rows of "t". If I delete the "which" calls from function "countRows" the
> resulting sum of matching row numbers is much greater than the number of
> rows in "t".
>
> The table "t" I use can be downloaded from here:
> https://github.com/groupschoof/PhyloFun/archive/test_selector.zip
> Unzip the file and read in the table "t" using t <- read.table("test.tbl")
>
> The above function "sumRows" is defined as follows:
> sumRows <- function( tbl, ps ) {
>   sum(
>     sapply(ps,
>       function(x) {
>         t <- if ( is.na(x) ) {
>           tbl[ which( is.na(tbl[ , "Domain.Architecture.Distance" ]) ), ,
> drop=F]
>         } else {
>           tbl[ which( tbl[ , "Domain.Architecture.Distance" ] == x ), ,
> drop=F]
>         }
>         nrow(t)
>       }
>     )
>   )
> }
>

And how are we supposed to call sumRows()?

    sumRows(???, ???)

Berend
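For readers following the thread: the original post never shows the call, so the sketch below is only a guess at what it might have looked like. It assumes `ps` holds the distinct values of the column (the post speaks of "all pairwise distinct values"), and it uses a small made-up data frame in place of test.tbl. The name `total` and the toy values are illustrative and do not come from the post.

```r
# Toy stand-in for test.tbl (hypothetical data, not from the original post)
tbl <- data.frame(
  Domain.Architecture.Distance = c(0.1, 0.1, NA, 0.5, NA),
  id = 1:5
)

# sumRows as posted, with drop = FALSE spelled out
sumRows <- function(tbl, ps) {
  sum(sapply(ps, function(x) {
    t <- if (is.na(x)) {
      tbl[which(is.na(tbl[, "Domain.Architecture.Distance"])), , drop = FALSE]
    } else {
      tbl[which(tbl[, "Domain.Architecture.Distance"] == x), , drop = FALSE]
    }
    nrow(t)
  }))
}

# Assumed call: one entry in ps per distinct column value, NA included
ps <- unique(tbl[, "Domain.Architecture.Distance"])
total <- sumRows(tbl, ps)
```

With the `which` calls in place, `total` equals `nrow(tbl)` for this toy input, which matches the behavior described in the question.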
David Winsemius
2012-Dec-17 20:00 UTC
[R] Why does matrix selection behave differently when using which?
On Dec 17, 2012, at 11:22 AM, Asis Hallab wrote:

> Dear R community,
>
> I have a medium sized matrix stored in variable "t" and a simple function
> "countRows" (see below) to count the number of rows in which a selected
> column "C" matches a given value. If I count all rows matching all pairwise
> distinct values in the column "C" and sum these counts up, I get the number
> of rows of "t". If I delete the "which" calls from function "countRows" the
> resulting sum of matching row numbers is much greater than the number of
> rows in "t".
>
> The table "t" I use can be downloaded from here:
> https://github.com/groupschoof/PhyloFun/archive/test_selector.zip

What part of "minimal" example are you having difficulty understanding? That zip file expands to a 1.8 MB file!

> Unzip the file and read in the table "t" using t <- read.table("test.tbl")

Since it has a header line, you will be creating all factors and it's doubtful you are getting what you want. Instead:

    t <- read.table("test.tbl", header=TRUE)

>
> The above function "sumRows" is defined as follows:
> sumRows <- function( tbl, ps ) {
>   sum(
>     sapply(ps,

'ps'? What is ps????

>       function(x) {
>         t <- if ( is.na(x) ) {

I suspect that it is not `which` that is the problem, but rather your understanding of how `if` processes vectors. (This also should be simplified greatly to avoid stepping through vectors one element at a time.)

>           tbl[ which( is.na(tbl[ , "Domain.Architecture.Distance" ]) ), ,
> drop=F]

You didn't do anything with that result!

>         } else {
>           tbl[ which( tbl[ , "Domain.Architecture.Distance" ] == x ), ,
> drop=F]
>         }
>         nrow(t)

That value will not depend in any manner on what preceded it. ???? It will simply be the number of rows in the local copy of "t".

Your goal is _only_ to get a count? Why not just this:

    sum( tbl[!is.na(tbl$Domain.Architecture.Distance), "Domain.Architecture.Distance" ] == x )

E.g.:

    > sum( tbl[!is.na(tbl$Domain.Architecture.Distance), "Domain.Architecture.Distance" ] == 0.99)
    [1] 3440

You should probably be creating a factor variable with `cut` to create reasonable intervals for grouping, and if you do not know this it suggests you need to do more study of the text or introductory materials.

To get a quick look at the distribution this is useful:

    plot( density(tbl[!is.na(tbl$Domain.Architecture.Distance), "Domain.Architecture.Distance" ] ))

(125 KB file so not attached)

    > table( cut(tbl$Domain.Architecture.Distance, breaks=(0:10)/10) )

      (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]
          616      1864       328       103       923      1763      1151      2490      3709     38563

>       }
>     )
>   )
> }
>
> What does cause the different behavior of sumRows, when the which calls are
> deleted?
> What does which do, I seem not to grasp?

The question, as yet unanswered, is _how_ exactly you are calling that function. You posted a link to data "t" but there is no code that calls that function with the data. I do not see anything that would resemble a "ps"-object.

> Or is there an error in my test.tbl?

(See above.)

> Any help on this subject will be greatly appreciated.
> Kind regards and *merry christmas*!
>
> [[alternative HTML version deleted]]

Please read the Posting Guide and learn to post in plain text.

--
David Winsemius
Alameda, CA, USA
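The thread leaves the question about `which` open. One documented R behavior that would likely produce the reported symptom: a logical row index containing NA returns an all-NA row for every NA in the index, whereas `which()` keeps only the positions that are TRUE and drops NA. The sketch below illustrates this on made-up values. The data frame `df` and the variable names are illustrative, and whether this is the cause in test.tbl is an assumption, since the actual call was never posted.

```r
# Made-up column with missing values (not taken from test.tbl)
df <- data.frame(d = c(0.1, 0.1, NA, 0.5, NA), id = 1:5)

# Comparing against NA yields NA, so the logical index is TRUE TRUE NA FALSE NA
cond <- df$d == 0.1

# Logical indexing: each NA in the index contributes an all-NA row
n_logical <- nrow(df[cond, , drop = FALSE])

# which() returns only the TRUE positions, so the NA entries are dropped
n_which <- nrow(df[which(cond), , drop = FALSE])

# Counting every distinct value in one pass, NA included
counts <- table(df$d, useNA = "ifany")
```

Here `n_logical` is 4 (two real matches plus two all-NA rows) and `n_which` is 2. Without `which`, every non-NA value in `ps` would pick up the NA rows again, so the summed counts would exceed the number of rows. If only the counts are needed, `table(..., useNA = "ifany")` avoids the loop entirely.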