Dear all

I have a problem with splitting up a data frame called ReVerb:

> str(ReVerb)
`data.frame':   92713 obs. of  16 variables:
 $ CHILD    : Factor w/ 7 levels "ABE","ADA","EVE",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ AGE      : Factor w/ 484 levels "1;06.00","1;06.16",..: 43 43 43 99 99 99 99 99 99 99 ...
 $ AGE_Q    : num  2.0 2.0 2.0 2.4 2.4 ...
 $ INTERVALS: num  2 2 2 2.25 2.25 2.25 2.25 2.25 2.25 2.25 ...
 $ RND      : int  34368 38311 14949 20586 72516 27186 88019 10767 114448 86146 ...
 $ SYNTAX   : Factor w/ 17 levels "Acmp","Amats",..: 15 12 8 15 7 16 7 7 16 7 ...
 $ LEXICAL  : Factor w/ 1643 levels "$ACHE","$ACT",..: 194 803 803 294 299 803 1562 299 679 1562 ...
 $ MORPH    : Factor w/ 337 levels "$","$ =inf","$ =prs",..: 9 20 9 39 184 231 57 67 231 39 ...
 $ COMPLEM  : Factor w/ 1989 levels "$","$ V PR=Lp [1.2]",..: 203 547 220 203 1101 368 1834 1667 368 1834 ...
 $ MATRIX   : Factor w/ 906 levels "$ ???","$ be PR=Aen",..: 5 5 5 308 5 856 5 5 856 308 ...
 $ SITUATION: Factor w/ 9 levels "[imitation of Mom: you know what I said]",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ V_ANN    : int  1 1 1 4 4 4 4 3 3 3 ...
 $ QUEST    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ EXCL     : int  0 0 0 1 1 1 1 0 0 0 ...
 $ U_LEN    : int  3 4 5 13 13 13 13 8 8 8 ...
 $ UTTERANCE: Factor w/ 55113 levels "","# (be)cause he wanted to .",..: 5696 39091 52180 2262 2262 2262 2262 3593 3593 3593 ...

The variable causing the problem is SYNTAX:

> as.data.frame(sort(table(SYNTAX)))
              sort(table(SYNTAX))
Particles                     100
PR=N1                         144
Amats                         271
Trans_PR=A2                   787
Ditrans                      1181
Intrans_PR=A1                1399
Acmp                         2402
Trans_PR=V2                  2433
CPcmps                       2769
Vpreps                       4896
Intrans_V0                   5182
Trans_PR=L2                  7653
Trans_V02                    8117
Intrans_PR=L1                8457
Intrans_V1                   9643
Intrans_PR=V1               14987
Trans_V12                   22288

I would like to extract all cases where SYNTAX=="Ditrans" from ReVerb, store that in a file, and then generate ReVerb again without these cases and factor levels. My problem is probably obvious from the following lines of code:

> ditrans<-which(SYNTAX=="Ditrans")
> ReVerb1<-ReVerb[-c(ditrans),]; dim(ReVerb1)
[1] 91532 16
>
> # ok, so the 92713-91532=1181 cases where SYNTAX=="Ditrans" have been removed, but ...
>
> ReVerb1<-subset(ReVerb, SYNTAX!="Ditrans"); dim(ReVerb1)
[1] 91528 16
>
> # ... so why don't I get 91532 again as the number of rows?

Any ideas?

> R.version   # on Windows XP with Service Pack 2
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    1.1
year     2005
month    06
day      20
language R

Thanks a lot,
STG
--
Stefan Th. Gries
----------------------------------------
Max Planck Inst. for Evol. Anthropology
http://people.freenet.de/Stefan_Th_Gries
----------------------------------------


"Stefan Th. Gries" <stgries_lists at arcor.de> writes:

> I have a problem with splitting up a data frame called ReVerb:
> [...]
> I would like to extract all cases where SYNTAX=="Ditrans" from ReVerb, store that in a file, and then generate ReVerb again without these cases and factor levels. My problem is probably obvious from the following lines of code:
>
> > ditrans<-which(SYNTAX=="Ditrans")
> > ReVerb1<-ReVerb[-c(ditrans),]; dim(ReVerb1)
> [1] 91532 16
> >
> > # ok, so the 92713-91532=1181 cases where SYNTAX=="Ditrans" have been removed, but ...
> >
> > ReVerb1<-subset(ReVerb, SYNTAX!="Ditrans"); dim(ReVerb1)
> [1] 91528 16
> >
> > # ... so why don't I get 91532 again as the number of rows?
>
> Any ideas?

The SYNTAX variable is not necessarily the same. A bare SYNTAX is looked up on the search path (for example in an attached copy of the data), whereas subset() evaluates SYNTAX inside ReVerb. Could you retry the first case with

ditrans <- which(ReVerb$SYNTAX=="Ditrans")

?

Otherwise, try doing a setdiff() on the rownames of the two discrepant results and see which are the four cases that differ.

--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
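
For readers following along, the setdiff() check suggested above might look roughly like the sketch below. It runs on a made-up miniature of ReVerb, since the real data are not available; the toy values and the names by_index, by_subset and missing_rows are assumptions for illustration only, not something from the thread.

```r
# Toy stand-in for ReVerb (assumption): 10 rows, two with a missing SYNTAX.
ReVerb <- data.frame(
  SYNTAX = factor(c("Ditrans", "Acmp", NA, "Ditrans", "Vpreps",
                    "Acmp", NA, "Vpreps", "Acmp", "Ditrans")),
  U_LEN  = c(3, 4, 5, 13, 13, 13, 13, 8, 8, 8)
)

# Approach 1: negative indexing. which() returns only the positions that
# are TRUE, so rows with NA in SYNTAX are not removed.
ditrans  <- which(ReVerb$SYNTAX == "Ditrans")
by_index <- ReVerb[-ditrans, ]

# Approach 2: subset() keeps only rows where the condition is TRUE,
# so rows where the condition is NA are dropped.
by_subset <- subset(ReVerb, SYNTAX != "Ditrans")

# Row names that are in the first result but not in the second.
missing_rows <- setdiff(rownames(by_index), rownames(by_subset))
print(missing_rows)

# Look at the rows the two approaches disagree on.
print(ReVerb[missing_rows, ])
```

Because both results keep the row names of the original data frame, the character vector returned by setdiff() can be used directly to index ReVerb and inspect the rows in question.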


Are there NAs in the variable? SYNTAX=="Ditrans" and SYNTAX!="Ditrans" are mutually exclusive but not exhaustive: where SYNTAX is NA, both comparisons give NA rather than TRUE or FALSE.

On Fri, 26 Aug 2005, Stefan Th. Gries wrote:

> I have a problem with splitting up a data frame called ReVerb:
> [...]
> I would like to extract all cases where SYNTAX=="Ditrans" from ReVerb, store that in a file, and then generate ReVerb again without these cases and factor levels. My problem is probably obvious from the following lines of code:
>
> > ditrans<-which(SYNTAX=="Ditrans")
> > ReVerb1<-ReVerb[-c(ditrans),]; dim(ReVerb1)
> [1] 91532 16
> >
> > # ok, so the 92713-91532=1181 cases where SYNTAX=="Ditrans" have been removed, but ...
> >
> > ReVerb1<-subset(ReVerb, SYNTAX!="Ditrans"); dim(ReVerb1)
> [1] 91528 16
> >
> > # ... so why don't I get 91532 again as the number of rows?
>
> Any ideas?

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
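
The point about NA can be seen on a four-element example. This is only an illustrative sketch: the short SYNTAX vector and the helper names (is_ditrans, not_ditrans, keep and the n_* counters) are made up here. It may also be worth noting that the counts in the posted table add up to 92709, four fewer than the 92713 rows, which would be consistent with four NA values, since table() leaves NA out by default.

```r
# Hypothetical miniature of the SYNTAX column, with one missing value.
SYNTAX <- factor(c("Ditrans", "Acmp", NA, "Vpreps"))

is_ditrans  <- SYNTAX == "Ditrans"   # TRUE  FALSE NA FALSE
not_ditrans <- SYNTAX != "Ditrans"   # FALSE TRUE  NA TRUE

# Neither comparison is TRUE for the NA element, so the two tests do not
# partition the data: one element falls into neither group.
n_total   <- length(SYNTAX)
n_is      <- sum(is_ditrans,  na.rm = TRUE)
n_not     <- sum(not_ditrans, na.rm = TRUE)
n_neither <- sum(is.na(SYNTAX))
print(c(total = n_total, is = n_is, not = n_not, neither = n_neither))

# A complement of "is Ditrans" that keeps the NA elements:
# TRUE | NA evaluates to TRUE, so the NA position is retained.
keep <- is.na(SYNTAX) | SYNTAX != "Ditrans"
print(keep)
```

Used as the condition in subset(), an expression like keep would retain the NA rows and so give the same row count as the negative-index approach.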


>> From: "Stefan Th. Gries" <stgries_lists at arcor.de> writes:
>> I have a problem with splitting up a data frame called ReVerb: I would like to extract all cases where SYNTAX=="Ditrans" from ReVerb, store that in a file, and then generate ReVerb again without these cases and factor levels. My problem is probably obvious from the following lines of code:
>>
>> > ditrans<-which(SYNTAX=="Ditrans")
>> > ReVerb1<-ReVerb[-c(ditrans),]; dim(ReVerb1)
>> [1] 91532 16
>> # ok, so the 92713-91532=1181 cases where SYNTAX=="Ditrans" have been removed, but ...
>> > ReVerb1<-subset(ReVerb, SYNTAX!="Ditrans"); dim(ReVerb1)
>> [1] 91528 16
>> # ... so why don't I get 91532 again as the number of rows?
>> # Any ideas?

> From: Peter Dalgaard <p.dalgaard at biostat.ku.dk>
> The SYNTAX variable is not necessarily the same. Could you retry the first case with
> ditrans <- which(ReVerb$SYNTAX=="Ditrans")
> ?

The results were the same as with 'ditrans<-which(SYNTAX=="Ditrans")'.

> Otherwise, try doing a setdiff() on the rownames of the two discrepant results and see which are the four cases that differ.

This solved the issue: using setdiff(), I found that the four cases which the second approach, with subset(), fails to include are the ones where SYNTAX is NA. I was not aware of how subset() treats NA, sorry.

Thanks a lot,
STG
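
Putting the thread together, the original goal (extract the "Ditrans" cases, store them in a file, and rebuild the data frame without those cases and without the unused factor level) could be sketched as follows. This is not code from any of the posters: the toy data, the output file name and the names is_ditrans, ditrans_rows and out_file are assumptions, and write.table() is just one way to store the rows.

```r
# Toy stand-in for ReVerb (assumption; the real data frame has 16 columns).
ReVerb <- data.frame(
  SYNTAX = factor(c("Ditrans", "Acmp", NA, "Ditrans", "Vpreps", "Acmp")),
  U_LEN  = c(3, 4, 5, 13, 8, 8)
)

# Logical index that is TRUE only for definite "Ditrans" rows.
# FALSE & NA is FALSE, so rows with a missing SYNTAX count as "not Ditrans".
is_ditrans <- !is.na(ReVerb$SYNTAX) & ReVerb$SYNTAX == "Ditrans"

ditrans_rows <- ReVerb[is_ditrans, ]    # the extracted cases
ReVerb1      <- ReVerb[!is_ditrans, ]   # everything else, NA rows included

# Store the extracted cases; the file name is a placeholder.
out_file <- file.path(tempdir(), "ditrans.txt")
write.table(ditrans_rows, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Drop the now-empty "Ditrans" level by rebuilding the factor.
ReVerb1$SYNTAX <- factor(ReVerb1$SYNTAX)
print(levels(ReVerb1$SYNTAX))
```

A logical index avoids one pitfall of the negative-index form: if no row matches, which() returns an empty vector and ReVerb[-ditrans, ] then selects no rows at all. Other factor columns may also carry unused levels after the split; applying factor() to each of them handles that. More recent versions of R also provide droplevels() for a whole data frame, but the R 2.1.1 mentioned in the original post predates it.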