Hi

I have a data.frame with 371,718 obs. of 12 variables (see below for an str). My problem is with V1, a Factor w/ 93144 levels, there should actually be 93994 levels. Each entry looks like:

comp[number]_c[number]_seq[number]

for example

comp215489_c0_seq40

R is grouping as though the last number is a decimal for some reason, in other words comp215489_c0_seq40 and comp215489_c0_seq4 are considered to be the same. My problem is that they are not the same so when I group by this factor I am losing 800 levels.

Here is an str

'data.frame': 371718 obs. of 12 variables:
 $ V1 : Factor w/ 93144 levels "comp100000_c0_seq1",..: 92271 91685 29 30 1564 1564 1623 91700 91701 91848 ...
 $ V2 : Factor w/ 17162 levels "gi|345842331|ref|NM_001244016.1|",..: 10119 10779 13210 13210 11522 8115 13079 14493 14493 15858 ...
 $ V3 : num 95.5 90.2 98.7 99.2 81.4 ...
 $ V4 : int 335 153 237 122 258 127 306 258 120 177 ...
 $ V5 : int 15 15 3 1 38 19 20 23 5 9 ...
 $ V6 : int 0 0 0 0 4 2 0 0 0 0 ...
 $ V7 : int 1 45 1 43 1 129 1 54 1 70 ...
 $ V8 : int 335 197 237 164 254 254 306 311 120 246 ...
 $ V9 : int 6866 18 3172 3438 67 122 3927 42 346 195 ...
 $ V10: int 7200 170 3408 3559 318 247 4232 299 465 19 ...
 $ V11: num 7e-155 2e-46 4e-125 2e-61 3e-24 ...
 $ V12: num 545 184 446 234 111 69.9 448 329 198 280 ...

--
View this message in context: http://r.789695.n4.nabble.com/problem-with-factor-levels-tp4652006.html
Sent from the R help mailing list archive at Nabble.com.
On Tuesday 04 December 2012 at 00:34 -0800, Jeremy.Shearman wrote:
> Hi
> I have a data.frame with 371,718 obs. of 12 variables (see below for
> an str). My problem is with V1, a Factor w/ 93144 levels, there should
> actually be 93994 levels. Each entry looks like:
> comp[number]_c[number]_seq[number]
> for example
> comp215489_c0_seq40
> R is grouping as though the last number is a decimal for some reason, in
> other words comp215489_c0_seq40 and comp215489_c0_seq4 are considered to be
> the same. My problem is that they are not the same so when I group by this
> factor I am losing 800 levels.

What format is your original data using? How do you import it? Please provide us with an excerpt of your original file showing at least two different values of V1 that are considered the same once imported in R (which sounds very unlikely to me...).

Regards
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
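One way to produce the excerpt asked for above is to check, inside R, whether the two suspect IDs really share a level. The following is only a sketch under assumptions: the data frame name `dat` and the toy values are hypothetical stand-ins for the poster's real data, which is not available here.

```r
# Toy data standing in for the real V1 column (hypothetical values).
ids <- c("comp215489_c0_seq4", "comp215489_c0_seq40",
         "comp215489_c0_seq4", "comp100000_c0_seq1")
dat <- data.frame(V1 = factor(ids), V4 = c(335L, 153L, 237L, 122L))

# Number of distinct strings versus number of factor levels actually in use.
n_strings <- length(unique(as.character(dat$V1)))
n_levels  <- nlevels(droplevels(dat$V1))

# Are both suspect IDs present as separate levels?
suspects <- c("comp215489_c0_seq4", "comp215489_c0_seq40")
present  <- suspects %in% levels(dat$V1)

# Rows per suspect ID; matching is exact, so seq4 does not absorb seq40.
counts <- table(as.character(dat$V1))[suspects]
print(counts)
```

If `n_strings` and `n_levels` agree and both entries of `present` are TRUE on the real data, the factor itself is not merging the IDs, and the missing levels would have to be sought elsewhere, for example in the import step or in how the expected count was obtained.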
Hi

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jeremy.Shearman
> Sent: Tuesday, December 04, 2012 9:35 AM
> To: r-help at r-project.org
> Subject: [R] problem with factor levels
>
> Hi
> I have a data.frame with 371,718 obs. of 12 variables (see below
> for an str). My problem is with V1, a Factor w/ 93144 levels, there
> should actually be 93994 levels. Each entry looks like:
> comp[number]_c[number]_seq[number]
> for example
> comp215489_c0_seq40
> R is grouping as though the last number is a decimal for some reason,
> in other words comp215489_c0_seq40 and comp215489_c0_seq4 are
> considered to be the same. My problem is that they are not the same so
> when I group by this factor I am losing 800 levels.
>

Hm. How did you construct those factors?

    > factor(c("comp215489_c0_seq40", "comp215489_c0_seq4"))
    [1] comp215489_c0_seq40 comp215489_c0_seq4
    Levels: comp215489_c0_seq4 comp215489_c0_seq40

gives me 2 levels, as expected. I also doubt that R would do such stripping while reading from a file.

Regards
Petr
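To test the doubt expressed above, that R strips anything while reading a file, the distinct IDs can be counted in the raw text and compared with the levels after import. This is a minimal sketch under assumptions: the file is taken to be tab-separated with the ID in the first column, and the temporary file and its three lines are hypothetical, since the poster's file and import call were not shown.

```r
# Write a tiny tab-separated file standing in for the real input (hypothetical).
path <- tempfile(fileext = ".txt")
writeLines(c("comp215489_c0_seq4\t95.5",
             "comp215489_c0_seq40\t90.2",
             "comp215489_c0_seq4\t98.7"), path)

# Count distinct IDs directly from the raw text, without read.table().
raw_lines <- readLines(path)
raw_ids   <- sub("\t.*$", "", raw_lines)
n_raw     <- length(unique(raw_ids))

# Import with quoting and comment handling disabled, so that stray quotes
# or '#' characters in the data cannot merge or truncate lines.
dat <- read.table(path, sep = "\t", header = FALSE, quote = "",
                  comment.char = "", stringsAsFactors = TRUE)

n_rows   <- nrow(dat)
n_levels <- nlevels(dat$V1)

# IDs present in the raw file but missing after import (should be empty).
lost <- setdiff(raw_ids, levels(dat$V1))
```

On the real file, a non-empty `lost` or a mismatch between `n_rows` and `length(raw_lines)` would point at the import step. If everything matches, the expected count of 93994 probably came from a different source than this file.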