thr3ads.net - R help - [R] loop for a large database [Feb 2012]

mari681

2012-Feb-26 12:13 UTC

[R] loop for a large database

Yes, I am a newbie.

I have a data.frame (MyTable) of  1445846  rows and  15  columns with
character data.
And a character vector (MyVector) of 473491 elements.

I simply want to get a data.frame with the count of how many times each
element of MyVector appears in MyTable.

I've tried a loop with : for (i in 1 : length (myvector))  sum (MyTable== i)

but it crashes my computer.

I've also tried something like   

x <- 1 : length (MyVector)
apply (MyTable , 1 , function(x) {sum (MyTable ==x)}

but it doesn't work.
Any ideas?

Thank you. Any suggestion is super welcome.

Marianna



--
View this message in context:
http://r.789695.n4.nabble.com/loop-for-a-large-database-tp4422052p4422052.html
Sent from the R help mailing list archive at Nabble.com.

David Winsemius

2012-Feb-26 16:54 UTC

On Feb 26, 2012, at 7:13 AM, mari681 wrote:
> Yes, I am a newbie.
>
> I have a data.frame (MyTable) of  1445846  rows and  15  columns with
> character data.
> And a character vector (MyVector) of 473491 elements.
>
> I want simply to get a data.frame with the count of how many times each
> element of MyVector appears in MyTable.
>
> I've tried a loop with : for (i in 1 : length (myvector))  sum (MyTable== i)

In that instance "i" is a number, so it probably would not match
anything in a character vector.
>
> but it crashes my computer.
>
> I've also tried something like
>
> x <- 1 : length (MyVector)
> apply (MyTable , 1 , function(x) {sum (MyTable ==x)}
>
> but doesn't work.
> Any idea?
>
> Thank you. AAAAAny suggestion is super welcome.

Since you never offered the requested information about your objects,
this is guesswork. If MyVector is one of the 15 columns in MyTable,
then this has a good chance of working:

table(MyTable$MyVector)

If on the other hand they are separate and you want to ignore the  
elements not in MyVector, then assign the value of a table operation  
and then use match() to pick out the tabulated values
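
A minimal sketch of that table()-then-match() idea, assuming MyTable holds character columns and MyVector is a separate character vector. The toy data and the names tab, idx, counts and result are made up for illustration and are not from the original post.

```r
# Toy stand-ins for the real objects (assumed shapes, not the poster's data)
MyTable <- data.frame(V1 = c("green", "ny", "green"),
                      V2 = c("saul", "green", "pink"),
                      stringsAsFactors = FALSE)
MyVector <- c("life", "pink", "green")

# Tabulate every cell of the data frame in a single pass
tab <- table(unlist(MyTable, use.names = FALSE))

# Find the position of each element of MyVector among the tabulated names
idx <- match(MyVector, names(tab))
counts <- as.vector(tab)[idx]
counts[is.na(counts)] <- 0L   # elements that never occur get a count of zero

result <- data.frame(tag = MyVector, count = counts)
result
```

The table is built once, so the cost of the lookup no longer depends on scanning MyTable again for every element of MyVector.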

In the future, please at least offer the results of str(MyTable).

-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

chuck.01

2012-Feb-26 17:56 UTC

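Untested due to lack of data, but this should work with a loop:

out <- vector("list", length = length(MyVector))

for (i in seq_along(MyVector)) {
  # count the cells of MyTable equal to the i-th element of MyVector
  out[[i]] <- data.frame(tag = MyVector[i], count = sum(MyTable == MyVector[i]))
}
# bind the one-row results into a single data.frame, one row per element
do.call(rbind, out)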

Petr Savicky

2012-Feb-26 19:31 UTC

On Sun, Feb 26, 2012 at 04:13:49AM -0800, mari681 wrote:
> Yes, I am a newbie.
> 
> I have a data.frame (MyTable) of  1445846  rows and  15  columns with
> character data.
> And a character vector (MyVector) of 473491 elements.
> 
> I want simply to get a data.frame with the count of how many times each
> element of MyVector appears in MyTable.
> 
> I've tried a loop with : for (i in 1 : length (myvector))  sum (MyTable== i)
> 
> but it crashes my computer.

Hi.

As David pointed out, you probably want to compute 

  sum (MyTable== myvector[i])

and not sum (MyTable== i).

Also, I would expect you to store the results somewhere, for example

  numOccur <- rep(NA, times=length(myvector))
  for (i in 1:length(myvector)) numOccur[i] <- sum(MyTable == myvector[i])

What do you see on the crashing computer? I would expect it to run for
a long time, but not to crash.

Try to run your code on a smaller part of the data to test efficiency
of different approaches.
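
As a rough sketch of such a test, one could time two approaches on the first rows only. The toy data and the names tags, small, t_loop and t_table are assumptions for illustration; the real MyTable and myvector would take their place.

```r
# Hypothetical toy data standing in for the real MyTable and myvector
set.seed(1)
tags <- c("green", "ny", "pink", "wood", "life")
MyTable <- as.data.frame(matrix(sample(tags, 3000, replace = TRUE), ncol = 15),
                         stringsAsFactors = FALSE)
myvector <- c("life", "wood", "house")

# Work on the first rows only while experimenting
small <- head(MyTable, 100)

# Approach 1: one full pass over the table per element of myvector
t_loop <- system.time({
  n_loop <- sapply(myvector, function(tag) sum(small == tag))
})

# Approach 2: tabulate all cells once, restricted to the elements of myvector
t_table <- system.time({
  n_table <- table(factor(unlist(small, use.names = FALSE), levels = myvector))
})

# Compare the elapsed times, then grow the subset step by step
rbind(loop = t_loop, table = t_table)
```

On data this small both timings will be close to zero; the point is to increase the number of rows gradually and see how each approach scales before running it on the full table.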

How many different strings are in your data? If there are many
repeated strings, then it may be better to first compute their
frequency table, look up the strings from "myvector" in this table,
and read off the frequencies.

Does your data frame consist of character vectors or of factors?
You can check this with class(MyTable[[1]]).
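
A small sketch of that check applied to every column at once, followed by a conversion of any factor columns to character. The two-column MyTable below is a made-up example, and col_classes and is_fac are illustrative names.

```r
# Hypothetical example: one character column and one factor column
MyTable <- data.frame(V1 = c("green", "ny"), stringsAsFactors = FALSE)
MyTable$V2 <- factor(c("saul", "green"))

# Class of every column, not just the first
col_classes <- sapply(MyTable, class)
col_classes

# Convert factor columns to character so that == compares plain strings
is_fac <- sapply(MyTable, is.factor)
MyTable[is_fac] <- lapply(MyTable[is_fac], as.character)
```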

Petr Savicky.

mari681

2012-Feb-26 19:39 UTC

SORRY!

The data in MyTable are tagsets of photos,  like this:

      V1         V2       V3      V4      V5       V6        V7   V8
230    green nailpolish   barrym       0       0        0         0    0
231       ny      green brooklyn cleanup   clean  gowanus volunteer  gcc
232    green       saul  lecture       0       0        0         0    0
233    green     colors    cores  market colores marakesh   mercado malu
234       ny      green brooklyn cleanup   clean  gowanus volunteer  gcc
235    green       saul  lecture       0       0        0         0    0
236 portrait        pet    white   green     cat    canon    square  eos

                         V9   V10  V11      V12 V13 V14 V15
230                       0     0    0        0   0   0   0
231 gowanuscanalconservancy     0    0        0   0   0   0
232                       0     0    0        0   0   0   0
233               malugreen maroc souk marrocos   0   0   0
234 gowanuscanalconservancy     0    0        0   0   0   0
235                       0     0    0        0   0   0   0
236                      is  eyes mark   taiwan  ii mk2  5d


while MyVector is a list of tags (not any one of the columns in particular)
whose frequency in MyTable has to be computed. Like this:

[1] "life"  "wood"  "pink"  "house" "green" "fall"



Thanks!!

Marianna


Petr Savicky

2012-Feb-26 20:15 UTC

On Sun, Feb 26, 2012 at 04:13:49AM -0800, mari681 wrote:
> Yes, I am a newbie.
> 
> I have a data.frame (MyTable) of  1445846  rows and  15  columns with
> character data.
> And a character vector (MyVector) of 473491 elements.
> 
> I want simply to get a data.frame with the count of how many times each
> element of MyVector appears in MyTable.
> 
> I've tried a loop with : for (i in 1 : length (myvector))  sum (MyTable== i)
> 
> but it crashes my computer.

Hi.

First try the following.

  out <- unclass(table(factor(MyTable[[1]], levels=myvector)))

The output should be a table of frequencies of the components
of "myvector" in the first column of "MyTable".

If this works for data of the size you have, then there are
several ways to compute the frequencies across all columns.
For example, concatenate all columns into a single vector and
apply the above to this concatenation as follows.

  x <- c(as.matrix(MyTable))
  out <- unclass(table(factor(x, levels=myvector))) 

Here, "out" is a vector of the same length as "myvector"
and out[i] is the frequency of myvector[i] in "MyTable".
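
For illustration, here is the same computation on a tiny made-up table modelled on the sample rows posted earlier in the thread, with the counts wrapped in a data.frame as originally requested. The toy values and the name res are assumptions.

```r
# Toy data modelled on the sample rows; "0" marks an empty tag slot
MyTable <- data.frame(V1 = c("green", "ny", "portrait"),
                      V2 = c("nailpolish", "green", "pet"),
                      V3 = c("0", "brooklyn", "green"),
                      stringsAsFactors = FALSE)
myvector <- c("life", "green", "pet")

# Flatten all columns and count only the values listed in myvector
x <- c(as.matrix(MyTable))
out <- unclass(table(factor(x, levels = myvector)))

# One row per element of myvector, in the same order
res <- data.frame(tag = names(out), count = as.vector(out))
res
```

Because the factor levels are taken from myvector, values that are not listed there (including the "0" padding) are ignored, and tags that never occur still appear with a count of zero.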

Hope this helps.

Petr Savicky.

Petr PIKAL

2012-Feb-27 09:09 UTC

Hi

> SORRY!
> 
> The data in MyTable are tagsets of photos,  like this:
> 
>       V1         V2       V3      V4      V5       V6        V7   V8
> 230    green nailpolish   barrym       0       0        0         0    0
> 231       ny      green brooklyn cleanup   clean  gowanus volunteer  gcc
> 232    green       saul  lecture       0       0        0         0    0
> 233    green     colors    cores  market colores marakesh   mercado malu
> 234       ny      green brooklyn cleanup   clean  gowanus volunteer  gcc
> 235    green       saul  lecture       0       0        0         0    0
> 236 portrait        pet    white   green     cat    canon    square  eos
> 
>                          V9   V10  V11      V12 V13 V14 V15
> 230                       0     0    0        0   0   0   0
> 231 gowanuscanalconservancy     0    0        0   0   0   0
> 232                       0     0    0        0   0   0   0
> 233               malugreen maroc souk marrocos   0   0   0
> 234 gowanuscanalconservancy     0    0        0   0   0   0
> 235                       0     0    0        0   0   0   0
> 236                      is  eyes mark   taiwan  ii mk2  5d
> 
> 
> while data of MyVector is a list of tags (none of the columns in particular)
> whose frequency in MyTable has to be computed. Like this:
> 
> [1] "life"  "wood"  "pink"  "house" "green" "fall"

What about converting your data frame to a matrix and using table()?

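set.seed(111)
x <- sample(letters, 200, replace = TRUE)  # 200 random lowercase letters
y <- letters[3:6]                          # values to count: "c" "d" "e" "f"
dim(x) <- c(20, 10)                        # reshape into a 20 x 10 matrix
dd <- data.frame(x)
tt <- table(as.matrix(dd))                 # tabulate every cell at once
tt[names(tt) %in% y]                       # keep only the counts for values in y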

 
 c  d  e  f 
13  5  8  3 

Regards
Petr 
