thr3ads.net - R help - [R] Help with matching rows [Apr 2011]


gary engstrom

2011-Apr-21 02:09 UTC

[R] Help with matching rows

Dear Sir,

Please excuse my awkwardness, as I am new to R and computers, but I would
kindly appreciate help.
{
a <- sample(1:10, 100, replace = TRUE)
b <- sample(10:20, 100, replace = TRUE)
c <- sample(20:30, 100, replace = TRUE)
d <- sample(30:40, 100, replace = TRUE)
e <- sample(40:50, 100, replace = TRUE)
}
d1 <- a
d2 <- b
d3 <- c
d4 <- d
d5 <- e

data.frame(d1,d2,d3,d4,d5)
dd <- data.frame(d1,d2,d3,d4,d5)
dd
sd(d1)
summary(d1)
sd(d2)
summary(d2)
sd(d3)
summary(d3)
sd(d4)
summary(d4)
sd(d5)
summary(d5)
I am a beginner in R and am trying to learn statistical
probability. I have started Dr. Levine's and Dr. Kerns's books.
So far, from the usual sources, I haven't found the answers
to the following questions and would greatly appreciate
any assistance that anyone might kindly share.
If I run this code, how do I look for duplicate rows, and how can
I adjust the SD of the sample function to make duplicate rows
occur more often?
How do I export the dd data frame to Excel?
Deepest Gratitude
Gary


Petr Savicky

2011-Apr-21 07:34 UTC


[R] Help with matching rows

On Wed, Apr 20, 2011 at 10:09:26PM -0400, gary engstrom wrote:
> Dear Sir,
> If I run this code, how do I look for duplicate rows and how can
See ?duplicated, which marks every row that repeats an earlier row.
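
As a minimal sketch of what duplicated() reports, assuming a tiny made-up
data frame (the values below are illustrative only and are not taken from
your code):

```r
# Small illustrative data frame; row 3 repeats row 1
dd_demo <- data.frame(d1 = c(1, 2, 1, 3),
                      d2 = c(10, 11, 10, 12))

dup <- duplicated(dd_demo)  # TRUE for each row that repeats an earlier row
dup                         # FALSE FALSE  TRUE FALSE
dd_demo[dup, ]              # the repeated rows themselves
any(dup)                    # TRUE if at least one row is duplicated
unique(dd_demo)             # the data frame with repeated rows dropped
```

The same calls work unchanged on a larger data frame such as dd.
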
>  I adjust the SD of the sample function to make the chances
> of a duplicate row occur more often ?
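Note that sample() draws uniformly from the given values unless prob
is supplied, and it has no SD argument. A simple way to increase the
number of duplicated rows is to reduce the space from which the rows
are drawn.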

The following estimates the probability of having at least one
duplicated row using your original code.

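  m <- 10000      # number of simulated data frames
  count <- 0      # how many of them contain a duplicated row
  for (i in seq_len(m)) {
      d1 <- sample(1:10, 100, replace = TRUE)
      d2 <- sample(10:20, 100, replace = TRUE)
      d3 <- sample(20:30, 100, replace = TRUE)
      d4 <- sample(30:40, 100, replace = TRUE)
      d5 <- sample(40:50, 100, replace = TRUE)
      dd <- data.frame(d1, d2, d3, d4, d5)
      if (any(duplicated(dd))) {
          count <- count + 1
      }
  }
  count / m       # estimated probability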

I obtained

  [1] 0.035

This probability may also be computed exactly as follows.
The number of all possible rows from which we sample is the
product of the sizes of the sets from which each component
is chosen. This is 10*11^4. The probability that 100 rows drawn
uniformly are all distinct is the product of (1 - k/N) for
k = 0, ..., 99, so the probability of having at least one
duplicated row is

  N <- 10*11^4 # the number of all possible rows
  1 - prod(1 - (0:99)/N)
  [1] 0.03325143
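
The same calculation can be wrapped in a small function, which makes it
easy to try other sample spaces. This is only a sketch: the names p_dup
and p_dup_approx are hypothetical helpers, not functions from any package,
and the second one uses the usual birthday-problem approximation
1 - exp(-n(n-1)/(2N)), which is reasonable only when n is much smaller
than N.

```r
# Hypothetical helper (illustrative name): exact probability that n rows
# drawn uniformly from N possible rows contain at least one duplicate.
p_dup <- function(N, n = 100) {
  1 - prod(1 - (seq_len(n) - 1) / N)
}

# Hypothetical helper: birthday-problem approximation of the same quantity.
p_dup_approx <- function(N, n = 100) {
  1 - exp(-n * (n - 1) / (2 * N))
}

N <- 10 * 11^4     # the number of all possible rows in the original code
p_dup(N)           # same value as the direct calculation above
p_dup_approx(N)    # slightly smaller, but close
```
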

If the sample space is reduced to 8^5 using

    d1 <- sample(1:8, 100, replace = TRUE)
    d2 <- sample(11:18, 100, replace = TRUE)
    d3 <- sample(21:28, 100, replace = TRUE)
    d4 <- sample(31:38, 100, replace = TRUE)
    d5 <- sample(41:48, 100, replace = TRUE)

then the probability of having at least one duplicated row
increases to

  N <- 8^5
  1 - prod(1 - (0:99)/N)
  [1] 0.1403373
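
The exact value for the reduced space can be checked by simulation in the
same way as before. The following is a rough sketch, not a definitive
implementation: sim_dup is a hypothetical helper name, the seed and the
number of repetitions (2000, chosen to keep the run short) are arbitrary
assumptions, and the estimate will vary from run to run.

```r
# Hypothetical helper (illustrative name): estimate by simulation the
# probability that n rows contain at least one duplicate.
# 'sets' is a named list giving the values each column is sampled from.
sim_dup <- function(sets, n = 100, m = 2000) {
  hits <- 0
  for (i in seq_len(m)) {
    cols <- lapply(sets, function(s) sample(s, n, replace = TRUE))
    dd <- as.data.frame(cols)
    # anyDuplicated() returns 0 when no row repeats an earlier one
    if (anyDuplicated(dd) > 0) {
      hits <- hits + 1
    }
  }
  hits / m
}

sets8 <- list(d1 = 1:8, d2 = 11:18, d3 = 21:28, d4 = 31:38, d5 = 41:48)

set.seed(42)           # arbitrary seed, only for reproducibility
est <- sim_dup(sets8)
est                    # should be roughly 0.14, up to simulation error
```

With only 2000 repetitions the standard error is a little under 0.01, so
agreement with the exact value to about two decimal places is all that
should be expected.
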

Hope this helps.

Petr Savicky.
