Assume that I have the dataframe "data1", which is listed at the end of this message. I want to count the number of lines that each person has for each year. For example, the person with ID=213 has 15 entries (NinYear) for 1953. The following bit of code calculates NinYear:

    for (i in 1:length(data1$ID)) {
      data1$NinYear[i] <- length(data1[data1$Year==data1$Year[i] &
                                       data1$ID==data1$ID[i],1])
    }

This seems to work but is horribly slow (some files I am working with have over 500,000 lines). Can anyone suggest a faster way of doing this, perhaps a way that does not use a for loop? Thanks.

Tom

    ID   Year  NinYear
    209  1971  0
    209  1971  0
    213  1951  0
    213  1951  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1953  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1954  0
    213  1955  0
    213  1955  0
    234  1953  0
    234  1953  0
    234  1953  0
    234  1953  0
    234  1953  0
    234  1958  0
    234  1958  0
    234  1965  0
    234  1965  0
    234  1965  0
    249  1952  0
    249  1952  0

--
View this message in context: http://www.nabble.com/Doing-a-Task-Without-Using-a-For-Loop-tp19974078p19974078.html
Sent from the R help mailing list archive at Nabble.com.
Try this:

    with(data1, table(ID, Year))

On Tue, Oct 14, 2008 at 10:58 AM, Tom La Bone <booboo at gforcecable.com> wrote:
> This seems to work but is horribly slow (some files I am working with have
> over 500,000 lines). Can anyone suggest a faster way of doing this, perhaps
> a way that does not use a for loop? Thanks.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
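The table() call in the reply above returns one count per ID/Year pair. If the goal is to fill NinYear on every row of data1, as the original loop does, one loop-free option is ave(). The following is a minimal sketch, not something proposed in the thread; it rebuilds the posted data with rep() (the helper name n_rows is made up for illustration) so that it runs on its own.

```r
# Rebuild the example data from the original post (44 rows).
# n_rows is an illustrative helper: number of rows per ID/Year pair.
n_rows <- c(2, 2, 15, 11, 2, 5, 2, 3, 2)
data1 <- data.frame(
  ID   = rep(c(209, 213, 213, 213, 213, 234, 234, 234, 249), times = n_rows),
  Year = rep(c(1971, 1951, 1953, 1954, 1955, 1953, 1958, 1965, 1952), times = n_rows)
)

# ave() applies FUN within each ID/Year group and returns a vector as long
# as its first argument, so every row receives the size of its own group.
data1$NinYear <- ave(seq_along(data1$ID), data1$ID, data1$Year, FUN = length)

head(data1)
```

Because ave() works on whole groups at once, its cost should grow with the number of rows rather than with the square of it, which is what makes the original loop slow.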
Try the following:

    out <- tapply(data1$ID, list(data1$ID, data1$Year), length)
    out[is.na(out)] <- 0
    out

I hope it helps.

Best,
Dimitris

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
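The tapply() result above is an ID-by-Year matrix with dimnames, so the counts can be copied back to the rows of data1 without a loop by indexing the matrix with a two-column character matrix. This is only a sketch of that idea, assuming the IDs and years convert cleanly to the same strings that tapply() uses as dimnames; the names n_rows, counts and idx are illustrative.

```r
# Rebuild the example data from the original post (44 rows).
n_rows <- c(2, 2, 15, 11, 2, 5, 2, 3, 2)
data1 <- data.frame(
  ID   = rep(c(209, 213, 213, 213, 213, 234, 234, 234, 249), times = n_rows),
  Year = rep(c(1971, 1951, 1953, 1954, 1955, 1953, 1958, 1965, 1952), times = n_rows)
)

# ID-by-Year matrix of counts, as in the reply; absent pairs become 0.
counts <- tapply(data1$ID, list(data1$ID, data1$Year), length)
counts[is.na(counts)] <- 0

# One (ID, Year) pair of dimnames per row of data1.
idx <- cbind(as.character(data1$ID), as.character(data1$Year))

# Matrix indexing picks one cell per row of idx in a single vectorised step.
data1$NinYear <- counts[idx]
```

The same indexing should work on the result of table(data1$ID, data1$Year), since it carries the same dimnames.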
    table(data1$ID, data1$Year)

See ?table and other functions referenced in ?table.
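One of the functions referenced from ?table is as.data.frame(), which turns the two-way table into one row per ID/Year pair. The sketch below assumes that a long listing of the non-empty pairs is what is wanted; the column name NinYear is taken from the original post, and n_rows, tab and long are illustrative names.

```r
# Rebuild the example data from the original post (44 rows).
n_rows <- c(2, 2, 15, 11, 2, 5, 2, 3, 2)
data1 <- data.frame(
  ID   = rep(c(209, 213, 213, 213, 213, 234, 234, 234, 249), times = n_rows),
  Year = rep(c(1971, 1951, 1953, 1954, 1955, 1953, 1958, 1965, 1952), times = n_rows)
)

# Naming the arguments gives the table dimensions readable names.
tab <- table(ID = data1$ID, Year = data1$Year)

# One row per cell of the table; ID and Year come back as character strings.
long <- as.data.frame(tab, responseName = "NinYear", stringsAsFactors = FALSE)

# Drop the ID/Year combinations that never occur in the data.
long <- long[long$NinYear > 0, ]
long
```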
> This seems to work but is horribly slow (some files I am working with have
> over 500,000 lines). Can anyone suggest a faster way of doing this, perhaps
> a way that does not use a for loop? Thanks.

If the table solutions don't work or take forever with your real data, have a look at the wiki:
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-frames:count_and_extract_unique_rows

Claudia

--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 (0 40) 5 58-34 47
email: cbeleites at units.it
I want to thank everyone for the help. I ended up having to use a loop to assign values from the table to NinYear. However, as I have played with the full datasets I have noticed that R is MUCH faster if I use vectors in the loop rather than columns of a dataframe. In the specific case of 43,000 lines of data, assigning values from the table to the 43,000 elements of a vector took 6 seconds whereas assigning values from the table to 43,000 elements of a dataframe took 21 minutes. Why is there such a huge difference?

Tom
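The vector-versus-dataframe difference described above can be reproduced on a small scale. The sketch below uses made-up data (values stands in for the counts looked up from the table) and a smaller n than in the message. Absolute timings depend heavily on the R version and the machine, so no particular numbers should be expected.

```r
n <- 20000
values <- sqrt(seq_len(n))   # stand-in for counts looked up from the table

# Version 1: fill a plain vector in the loop, attach it to the data frame once.
df_a <- data.frame(id = seq_len(n))
t_vec <- system.time({
  res <- numeric(n)
  for (i in seq_len(n)) res[i] <- values[i]
  df_a$NinYear <- res
})

# Version 2: assign into the data frame column on every iteration.
df_b <- data.frame(id = seq_len(n), NinYear = numeric(n))
t_df <- system.time({
  for (i in seq_len(n)) df_b$NinYear[i] <- values[i]
})

# Elapsed seconds for each version.
print(c(vector = t_vec[["elapsed"]], data.frame = t_df[["elapsed"]]))
```

Both versions produce the same column. The second goes through the data frame replacement method on every iteration and may copy the column each time, which is a plausible source of the gap.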
Run Rprof on your script that is updating the dataframe. A dataframe is a list, and every time you access something in the list it can be expensive. Rprof will probably show that a lot of time is spent in the function "[[", which is accessing portions of the dataframe. Vectors are much faster because they are typically stored contiguously in memory and can be accessed easily.

Rprof is always helpful in answering the question of "why is something taking so long". It helps you to find where the potential bottlenecks are.

On Wed, Oct 15, 2008 at 7:33 AM, Tom La Bone <booboo at gforcecable.com> wrote:
> In the specific case of 43,000 lines of data, assigning values from the
> table to the 43,000 elements of a vector took 6 seconds whereas assigning
> values from the table to 43,000 elements of a dataframe took 21 minutes.
> Why is there such a huge difference?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
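A minimal sketch of the Rprof workflow suggested above, using a made-up data frame and a temporary file for the profiler output (both are assumptions for illustration, not taken from the thread). Which functions show up at the top of the summary will vary with the R version.

```r
n <- 50000
df <- data.frame(x = numeric(n))

prof_file <- tempfile(fileext = ".out")   # illustrative output location
Rprof(prof_file, interval = 0.01)         # start sampling the call stack
for (i in seq_len(n)) df$x[i] <- i        # element-wise assignment into a data frame
Rprof(NULL)                               # stop profiling

# Summarise the samples; by.self lists time spent in each function itself.
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

If the loop finishes too quickly for the profiler to collect any samples, increasing n or lowering interval should help.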