Hi all,

I want to remove rows from a data frame based on a condition on one of its variables. When split, the string should follow a 3-2-5 format: area code, region, numeric part. The area code is numeric with a max length of 3, the region is made of characters with a max length of 2, and it is followed by a numeric part with a max length of 5 digits. The area code can be 1, 2 or 3 digits, but not more than three. So the max length of this variable is 10. Anything outside of this pattern should be excluded.

As an example:

dat <- read.table(text=" rown varx
 1 9F209
 2 FL250
 3 2F250
 4 102250
 5 102FL
 6 102
 7 1212FL250
 8 121FL50", header=TRUE, stringsAsFactors=FALSE)

 1 9F209      # keep
 2 FL250      # remove, no area code
 3 2F250      # keep
 4 102250     # remove, no region code
 5 102FL      # remove, no numeric after region code
 6 102        # remove, no region code and no numeric
 7 1212FL250  # remove, area code is more than three digits
 8 121FL50    # keep

The desired output should be

 1 9F209
 3 2F250
 8 121FL50

How do I do this in an efficient way?

Thank you in advance
Use regular expressions.

See ?regexp and ?grep

Using your example:

> grep("^[[:digit:]]{1,3}[[:alpha:]]{1,2}[[:digit:]]{1,5}$", dat$varx, value = TRUE)
[1] "9F209"   "2F250"   "121FL50"

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Thu, Nov 28, 2019 at 3:17 PM Ashta <sewashm at gmail.com> wrote:
> How do I do this in an efficient way?
>
> Thank you in advance
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
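A note on the reply above: grep(..., value = TRUE) returns the matching strings, not the rows of the data frame. If the goal is to keep the whole rows, one possible sketch is to build a logical index with grepl() and subset with it. The names pat, keep and dat_ok are only illustrative and do not appear in the thread; the pattern is the one from the reply.

```r
# Rebuild the example data from the original post.
dat <- read.table(text = "rown varx
1 9F209
2 FL250
3 2F250
4 102250
5 102FL
6 102
7 1212FL250
8 121FL50", header = TRUE, stringsAsFactors = FALSE)

# Same pattern as in the reply: 1-3 digits, 1-2 letters, 1-5 digits,
# anchored at both ends so nothing else may precede or follow.
pat <- "^[[:digit:]]{1,3}[[:alpha:]]{1,2}[[:digit:]]{1,5}$"

# grepl() returns one TRUE/FALSE per element, usable as a row index.
keep <- grepl(pat, dat$varx)

# Keep only the rows whose varx matches the pattern (rown 1, 3 and 8).
dat_ok <- dat[keep, ]
print(dat_ok)
```

Because the index is logical, dat[!keep, ] would list the rejected rows instead, which can help when checking why a value was dropped.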
Thank you so much, Bert.

Is it possible to split varx into three parts (area code, region and the numeric part), each as a separate variable?

On Thu, Nov 28, 2019 at 7:31 PM Bert Gunter <bgunter.4567 at gmail.com> wrote:
> Use regular expressions.
>
> See ?regexp and ?grep
>
> Using your example:
>
> > grep("^[[:digit:]]{1,3}[[:alpha:]]{1,2}[[:digit:]]{1,5}$", dat$varx, value = TRUE)
> [1] "9F209"   "2F250"   "121FL50"
>
> Cheers,
> Bert
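The thread as captured here has no answer to the follow-up question. One possible sketch, assuming the values have already been filtered to the valid pattern, is to put each of the three parts in a capture group and pull it out with sub() and a backreference. The column names area, region and num are hypothetical, chosen only for illustration.

```r
# Values that passed the filter in the earlier reply.
varx <- c("9F209", "2F250", "121FL50")

# Same pattern as before, with each part wrapped in a capture group.
pat <- "^([[:digit:]]{1,3})([[:alpha:]]{1,2})([[:digit:]]{1,5})$"

# sub() replaces the whole match with the content of one group.
# Parts are kept as character so any leading zeros are preserved.
parts <- data.frame(
  varx   = varx,
  area   = sub(pat, "\\1", varx),
  region = sub(pat, "\\2", varx),
  num    = sub(pat, "\\3", varx),
  stringsAsFactors = FALSE
)
print(parts)
```

Note that sub() returns a string unchanged when it does not match, so this sketch should only be applied after the invalid rows have been removed. Wrap area or num in as.integer() if numeric columns are wanted.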