greatest.possible.newbie
2012-May-11 12:28 UTC
[R] identify() doesn't return "true" numbers
Dear R community.
I am using the identify() function to identify outliers in my dataset.
This is the code I am using:
####################################################################
# Function to allow identifying points in the QQ plot (by mouseclicking)
qqInteractive <- function(..., IDENTIFY = TRUE)
{
  X <- qqplot(...)
  abline(a=0, b=1)
  if(IDENTIFY) return(identify(X))
  invisible(X)
}
qqplot.mv.interactive <- function (data, xlim=NULL, ylim=NULL)
{
  x <- as.matrix(data)              # n x p numeric matrix
  center <- colMeans(x)             # centroid
  n <- nrow(x); p <- ncol(x); cov <- cov(x)
  d <- mahalanobis(x, center, cov)  # squared distances
  # ppoints(n) gives n evenly spaced probabilities strictly between 0 and 1;
  # qchisq() turns them into chi-squared quantiles with df degrees of freedom.
  qqInteractive(qchisq(ppoints(n), df=p), d,
                main="QQ Plot Assessing Multivariate Normality",
                ylab="Mahalanobis D2", xlim=xlim, ylim=ylim)
  #abline(a=0,b=1)
}
y <- c((1:100)+rnorm(100, sd=100))
x <- c(1:100)
windows();qqInteractive(x,y)
####################################################################
When I click points in the graph, identify() only returns their positions in
the order they lie along the X-axis. Let's say I mark the point in the upper
right corner: identify() will return 100. But what I want is the index in the
original dataset y. Let's say the point was y[87]. Without that index I won't
be able to remove this point from my original dataset.
I hope you understand my problem. I appreciate any help.
Regards, Daniel Hoop
Daniel,
There are a few ways to deal with this.
You could sort your data by y before you apply these functions.
Then the point labelled 100 will be the 100th row in the data frame.
df <- data.frame(x=1:100, y=(1:100)+rnorm(100, sd=100))
df2 <- df[order(df$y), ]
windows()
qqInteractive(df2$x, df2$y)
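As a rough check of this first approach, the sketch below (not from the original post) calls qqplot() with plot.it = FALSE, so nothing is drawn, and confirms that once the data frame is sorted by y, position i in the plot is row i of df2. The seed is arbitrary, and it assumes the row names of df2 still carry the row numbers of the original df.

```r
set.seed(1)  # arbitrary seed, only so the check is reproducible
df  <- data.frame(x = 1:100, y = (1:100) + rnorm(100, sd = 100))
df2 <- df[order(df$y), ]          # sort rows by y

# qqplot() sorts x and y separately and returns the sorted values
qq <- qqplot(df2$x, df2$y, plot.it = FALSE)

# After sorting by y, the i-th plotted y value is the y in row i of df2
stopifnot(all(qq$y == df2$y))

# Row names of df2 are the row numbers in the unsorted df, so the point
# identify() would label 100 can still be traced back to df
orig.row <- as.integer(rownames(df2)[100])
stopifnot(orig.row == which.max(df$y))
```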
You could modify the code to keep track of the original row numbers.
(No example given.)
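A minimal sketch of how this could look, assuming x and y have the same length and contain no NAs. The names sortedToOriginal and qqInteractiveOrig are made up for illustration; they are not part of the thread or of base R.

```r
# Hypothetical helper: map positions in the sorted y (what identify()
# reports for a QQ plot) back to indices in the original, unsorted y.
sortedToOriginal <- function(y, picked) {
  order(y)[picked]   # order(y)[i] is the original index of the i-th smallest y
}

# Hypothetical variant of qqInteractive() that labels and returns
# original indices instead of sorted positions.
qqInteractiveOrig <- function(x, y, ..., IDENTIFY = interactive()) {
  X <- qqplot(x, y, ...)            # list with sorted x and sorted y
  abline(a = 0, b = 1)
  if (!IDENTIFY) return(invisible(X))
  # Show the original index next to each clicked point
  picked <- identify(X, labels = order(y))
  sortedToOriginal(y, picked)
}

# Non-interactive illustration of the index mapping
y.small <- c(5, 1, 9, 3)
sortedToOriginal(y.small, 4)        # largest value sits at index 3
```

In an interactive session, y[-qqInteractiveOrig(x, y)] would then give y without the clicked points, provided at least one point was clicked.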
You could use your code pretty much as you have it, then convert the
numbers you see on the plot back to the original rows in the data frame.
This is a bit easier if you store the points that the function returns:
qqInteractive.v2 <- function(..., IDENTIFY = TRUE) {
X <- qqplot(...)
abline(a=0, b=1)
if(IDENTIFY) identify(X)
}
y <- 1:100 + rnorm(100, sd=100)
x <- 1:100
id.pts <- qqInteractive.v2(x, y)
seq(y)[is.element(rank(y), id.pts)]
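To see what that last line does without clicking anything, here is a small sketch with a made-up id.pts standing in for the result of identify(). It also shows order(y)[id.pts] as an alternative that I believe is equivalent when y has no ties; with ties, rank() returns averaged ranks by default, so the rank-based lookup could miss points.

```r
y <- c(5, 1, 9, 3)
id.pts <- c(4, 1)   # pretend the largest and the smallest point were clicked

# Conversion from the reply: original indices, in ascending order
by.rank <- seq(y)[is.element(rank(y), id.pts)]    # 2 3

# Alternative: original indices, in the order the points were clicked
by.order <- order(y)[id.pts]                      # 3 2

# Both pick the same points, so either can be used to drop them
y.clean <- y[-by.order]                           # 5 3
```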
Jean