Hello, This is not exactly an R question, but I suspect that there is an R procedure that does what I am calling (for lack of a better name) "histogram ranking". I'm trying to evaluate a set of regression features by segregating by target class and comparing the feature histograms. My idea is that if the histograms are the same for two different classes then there is no predictive power in those features. Conversely, if the histograms are different then there is probably some predictive "juice" that we can squeeze out of the features with regression. The histograms are computing by partitioning the features into equally spaced bins over their spans and counting the sample values in each bin that corresponds to that partition of feature space. This is done for each target class, so the resulting histograms are the features distributions conditioned by target class. Since the histograms are numeric vectors, we can measure the "goodness" of a feature set by evaluating the "distance" between histograms. The bigger the better etc. Now I'm no statistics expert. Have I re-invented some "wheel" here? What is the canonical name for this kind of analysis? Is this kind of analysis routinely done in R? [Is there a "better" way to do all this?] Thanks, John Day I -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Dear all, I received no response to this, which indicates that perhaps I didn't express myself clearly. Let me be a little more formal and try again: Let C be a classification problem where we are trying to classify a set of N features F=(f1,f2,...fN) into a small set of M classes T=(t1,t2,...,tM). We want to learn the classification rules from the distributions of the values of F. Suppose we have a large number of labelled samples SF=(sf1,sf2,...sfN,t), where the sf's are feature values and t is the "ground truth" for that sample. Then we can construct a sample histograms by class: hk = (h1k, h2k, ... hNk) for t=1..M where hjk is count of feature j for class k If the samples are random and unbiased then they constitute an approximation of the conditional probabliity distributions for each class. My idea is that these histograms can be viewed as points in a metric space and ranked by computing their euclidean distance from each other, after normalizing them so that each histogram has a probability of 1. For example: let h1 and h2 be two histograms of feature counts for classes 1 and 2 for a feature set with 6 features: > h1<-c(1,3,5,6,7,8) > h2<-c(4,6,4,6,7,9) Normalize and compute the metric: > n1=h1/sum(h1) > n2=h2/sum(h2) > sqrt(sum((n1-n2)^2)) By adjusting the bin sizes we can probe how well the class concepts can be generalized. That is, if the metric is non-zero only when the binsize is 1, then the problem cannot be generalized. The size of the metric is roughly the "predictive juice" contained in the conditional distributions. Yes, it's extremely simplistic, but I have found that this metric is useful for measuring how effectively features support a classification scheme. Is this a well-known procedure? Is it sound assuming good samples? Are there procedures in R to support this kind of analysis? Or is there a much better way to rank the "goodness" of a feature set? Thanks, John Day At 02:30 PM 9/6/2002 -0400, I wrote:>Hello, > >This is not exactly an R question, but I suspect that there is an R >procedure that does what I am calling (for lack of a better name) >"histogram ranking". > >I'm trying to evaluate a set of regression features by segregating by >target class and comparing the feature histograms. My idea is that if the >histograms are the same for two different classes then there is no >predictive power in those features. Conversely, if the histograms are >different then there is probably some predictive "juice" that we can >squeeze out of the features with regression. > >The histograms are computing by partitioning the features into equally >spaced bins over their spans and counting the sample values in each bin >that corresponds to that partition of feature space. This is done for each >target class, so the resulting histograms are the features distributions >conditioned by target class. > >Since the histograms are numeric vectors, we can measure the "goodness" of >a feature set by evaluating the "distance" between histograms. The bigger >the better etc. > >Now I'm no statistics expert. Have I re-invented some "wheel" here? What >is the canonical name for this kind of analysis? Is this kind of analysis >routinely done in R? [Is there a "better" way to do all this?] > >Thanks, >John Day-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Dear all, I received no response to this. Not sure if it was a glitch in the mailing list or maybe I didn't express myself clearly. Let me be a little more formal and try again: I want to know if the algorithm described below is a well-known procedure? Is it sound assuming unbiased samples? Are there procedures in R to support this kind of analysis? Or is there a much better way to rank the "goodness" of a feature set, which is really what I'm trying to do here? Here's the "historanking" algorithm: Let C be a classification problem where we are trying to classify a set of N features F=(f1,f2,...fN) into a small set of M classes T=(t1,t2,...,tM). We want to learn the classification rules from the distributions of the values of F. Suppose we have a large number of labelled samples SF=(sf1,sf2,...sfN,tk), where the sf's are feature values and tk is the kth "ground truth" (out of 1..M) for that sample. Then we can construct a sample histograms by class: hk = (h1k, h2k, ... hNk) for t=1..M where hjk is count of feature j for class k If the samples are random and unbiased then they constitute an approximation of the conditional probability distributions for each class. My idea is that these histograms can be viewed as points in a metric space and ranked by computing their euclidean distance from each other, after normalizing them so that each histogram has a probability of 1. For example: let h1 and h2 be two histograms of feature counts for classes 1 and 2 for a feature set with 6 features: > h1<-c(1,3,5,6,7,8) > h2<-c(4,6,4,6,7,9) Normalize and compute the metric: > n1=h1/sum(h1) > n2=h2/sum(h2) > sqrt(sum((n1-n2)^2)) By adjusting the bin sizes we can probe how well the class concepts can be generalized. That is, if the metric is non-zero only when the binsize is 1, then the problem cannot be generalized. The size of the metric is roughly the "predictive juice" contained in the conditional distributions. Yes, "historanking" is somewhat simplistic and perhaps statistically impaired, but I have found that this metric is useful for measuring how effectively features support a classification scheme. Thanks, John Day At 02:30 PM 9/6/2002 -0400, I wrote:>Hello, > >This is not exactly an R question, but I suspect that there is an R >procedure that does what I am calling (for lack of a better name) >"histogram ranking". > >I'm trying to evaluate a set of regression features by segregating by >target class and comparing the feature histograms. My idea is that if the >histograms are the same for two different classes then there is no >predictive power in those features. Conversely, if the histograms are >different then there is probably some predictive "juice" that we can >squeeze out of the features with regression. > >The histograms are computing by partitioning the features into equally >spaced bins over their spans and counting the sample values in each bin >that corresponds to that partition of feature space. This is done for each >target class, so the resulting histograms are the features distributions >conditioned by target class. > >Since the histograms are numeric vectors, we can measure the "goodness" of >a feature set by evaluating the "distance" between histograms. The bigger >the better etc. > >Now I'm no statistics expert. Have I re-invented some "wheel" here? What >is the canonical name for this kind of analysis? Is this kind of analysis >routinely done in R? [Is there a "better" way to do all this?] > >Thanks, >John Day-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
[Note: I posted this Tuesday and re-sent it Wednesday and today, but it didn't show up in the list archive, so apparently got dropped on the floor somehow. I'll try again] Dear All, I received no response to this. Not sure if it was a glitch in the mailing list or maybe I didn't express myself clearly. Let me be a little more formal and try again: I want to know if the algorithm described below is a well-known procedure? Is it sound assuming unbiased samples? Are there procedures in R to support this kind of analysis? Or is there a much better way to rank the "goodness" of a feature set, which is really what I'm trying to do here? Here's a description of the problem I'm trying to solve and the "historanking" algorithm: Let C=(F,T) be a classification problem where we are trying to classify a set of N features F=(f1,f2,...fN) into a small set of M classes T=(t1,t2,...,tM). We want to learn the classification rules from the distributions of the values of F. Suppose we have a large number of labelled samples SF=(sf1,sf2,...sfN,tk), where the sf's are feature values and tk is the kth "ground truth" (out of 1..M) for that sample. Then we can construct a sample histograms by class: hk = (h1k, h2k, ... hNk) for t=1..M where hjk is count of feature j for class k If the samples are random and unbiased then they constitute an approximation of the conditional probability distributions for each class. My idea is that these histograms can be viewed as points in a metric space and ranked by computing their euclidean distance from each other, after normalizing them so that each histogram has a probability of 1. For example: let h1 and h2 be two histograms of feature counts for classes 1 and 2 for a feature set with 6 features: > h1<-c(1,3,5,6,7,8) > h2<-c(4,6,4,6,7,9) Normalize and compute the metric: > n1=h1/sum(h1) > n2=h2/sum(h2) > sqrt(sum((n1-n2)^2)) By adjusting the bin sizes we can probe how well the class concepts can be generalized. That is, if the metric is non-zero only when the binsize is 1, then the problem cannot be generalized. The size of the metric is roughly the "predictive juice" contained in the conditional distributions. Yes, "historanking" is somewhat simplistic and perhaps statistically impaired, but I have found that this metric is useful for measuring how effectively features support a classification scheme. Thanks, John Day At 02:30 PM 9/6/2002 -0400, I wrote:>Hello, > >This is not exactly an R question, but I suspect that there is an R >procedure that does what I am calling (for lack of a better name) >"histogram ranking". > >I'm trying to evaluate a set of regression features by segregating by >target class and comparing the feature histograms. My idea is that if the >histograms are the same for two different classes then there is no >predictive power in those features. Conversely, if the histograms are >different then there is probably some predictive "juice" that we can >squeeze out of the features with regression. > >The histograms are computing by partitioning the features into equally >spaced bins over their spans and counting the sample values in each bin >that corresponds to that partition of feature space. This is done for each >target class, so the resulting histograms are the features distributions >conditioned by target class. > >Since the histograms are numeric vectors, we can measure the "goodness" of >a feature set by evaluating the "distance" between histograms. The bigger >the better etc. > >Now I'm no statistics expert. Have I re-invented some "wheel" here? What >is the canonical name for this kind of analysis? Is this kind of analysis >routinely done in R? [Is there a "better" way to do all this?] > >Thanks, >John Day-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._