Bill Poling
2019-Dec-16 13:24 UTC
[R] Help with Identify the number (Count) of values that are less than 5 char and replace with 99999
#RStudio Version 1.2.5019
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 17134)

Good morning. I have a factor that contains 1,418,303 Clinical Procedure Code (CPT).

A CPT Code is 5 char. However, among my data there are many values that are less, 2, 3, 4, as well as NA's
I get the count of NA's from the str() function = 58,481

Using the nchar function (I converted the Factor to a character column first) I get the first 1K values.
(Perhaps this is not necessary with an alternative function?)
# edt1a$ProcedureCode1 <- levels(edt1a$ProcedureCode)[edt1a$ProcedureCode]
#https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/nchar

[989] 5 5 5 5 5 5 5 5 5 5 5 5
[ reached getOption("max.print") -- omitted 1417303 entries ]

What I would like to do is:

1. Identify the number (Count) of values that are less than 5 char (i.e. 2 char = 150, 3 char = 925, 4 char = 1002)
Probably look something like this:

|Var1 | Freq|
|:----|----:|
|2    |  150|
|3    |  925|
|4    | 1002|

2. Replace with 99999 as well as replace the NA's with 99999

head(edt1a$ProcedureCode1, n= 50) #Not apparent in top 50 but they are there
 [1] "44207" "99478" "99478" "99479" "98927" "01610" "99396" "81025" "64645" "99478" "99479" "99479" "99479" "99479" "99479" "97110" "J1885" "19081" "99479"
[20] "99478" "99479" "99479" "99479" "99213" "99213" "98927" "96372" "92507" "99479" "99478" "99478" "99478" "99479" "77065" "19083" "95874" "99244" "A7034"
[39] "A7046" "71275" "J1170" "90471" "87591" "80053" "98926" "A4649" "A7033" "43644" "85025" "73080"

str(edt1a$ProcedureCode) #Factor w/ 6244
 Factor w/ 6244 levels "0003M","00100",..: 1775 4732 4732 4733 4586 147 4708 3108 2400 4732 ...
str(edt1a$ProcedureCode1)
 chr [1:1418303] "44207" "99478" "99478" "99479" "98927" "01610" "99396" "81025" "64645" "99478" "99479" "99479" "99479" "99479" "99479" "97110" "J1885" ...
#Some examples from using sink and knitr

sink("ProcCodeV2.txt")
knitr::kable(table(edt1a$ProcedureCode1))
closeAllConnections()

|Var1   | Freq|
|:------|----:|
|0003M  |    1|
|0110   |    4|<--
|0111   |    5|<--
|01112  |   11|
|0112   |   14|<--
|01120  |    3|
|0113   |    2|<--
|01130  |    1|
|0114   |    1|<--
|01160  |    3|
|01170  |    4|
|0120   |    7|<--
|01200  |    8|
|01202  |   26|
|0121   |    7|<--
|01210  |   19|
|01214  |  125|
|01215  |    5|
|0122   |    2|<--
|01220  |    2|
|01230  |   11|
|0124   |    5|<--
|171    |    1|<--
|17106  |    6|

Thank you for any help.

WHP
Ivan Krylov
2019-Dec-16 13:55 UTC
On Mon, 16 Dec 2019 13:24:36 +0000
Bill Poling <Bill.Poling at zelis.com> wrote:

> Using the nchar function (I converted the Factor to a character
> column first) I get the first 1K values.

<...>

> 1. Identify the number (Count) of values that are less than 5 char
> (i.e. 2 char = 150, 3 char = 925, 4 char = 1002)

Use the table() function to get frequency counts for discrete-valued
data (like nchar()).

> 2. Replace with 99999 as well as replace the NA's with 99999

An expression like `nchar(x) < 5` returns a boolean vector with TRUE
where the condition is, well, true, and FALSE otherwise. Use this
vector together with the subset operator (square brackets []) and
assignment operator (<-) to perform a subassignment of "99999" to the
elements of your dataset where the condition is true:
https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Index-vectors

See also: ?`[` and ?table

--
Best regards,
Ivan
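The advice above can be sketched on a small toy vector. This is only an illustration: `codes` and its values are made up to stand in for edt1a$ProcedureCode1, and the other variable names are hypothetical. One detail worth noting is that nchar() returns NA for a missing string, so `nchar(x) < 5` is NA (not TRUE) for the NA entries, and they have to be caught separately with is.na().

```r
# Toy stand-in for edt1a$ProcedureCode1 (made-up values, not the real data)
codes <- c("44207", "0110", "171", NA, "J1885", "12", "0111", NA)

# String length of every element; missing strings give NA
len <- nchar(codes)

# Frequency of each length; useNA = "ifany" adds a column for the NAs
all_counts <- table(len, useNA = "ifany")
print(all_counts)

# Counts for the short codes only (lengths below 5)
short_counts <- table(len[!is.na(len) & len < 5])
print(short_counts)

# Logical index: TRUE for NA entries and for codes shorter than 5 characters
bad <- is.na(codes) | len < 5

# Subassignment of the placeholder to every flagged element
codes[bad] <- "99999"
print(codes)
```

On the real data, the same lines would apply with `codes` replaced by the character column.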
Rui Barradas
2019-Dec-16 17:04 UTC
Hello,

To count the number of values with fewer than 5 characters, use nchar and table or aggregate.
Since nchar needs a character vector and you have a factor, first convert with as.character.

edt1a$ProcedureCode <- as.character(edt1a$ProcedureCode)

1. Now any of the next 3 instructions will table the vector by number of characters.

table(nchar(edt1a$ProcedureCode))
aggregate(ProcedureCode ~ nchar(ProcedureCode), edt1a, length)
tapply(edt1a$ProcedureCode, nchar(edt1a$ProcedureCode), length)

2. If you want to change the values with fewer than 5 chars or all NA's to "99999", a vectorized logical operation is a good way of doing it.

n <- nchar(edt1a$ProcedureCode) < 5
na <- is.na(edt1a$ProcedureCode)
edt1a$ProcedureCode[n | na] <- "99999"

Now back to factor, with the new level "99999".

edt1a$ProcedureCode <- factor(edt1a$ProcedureCode)

Hope this helps,

Rui Barradas
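The original question also asks whether converting the factor to character is necessary. One possible alternative, sketched here under assumptions and not taken from any reply in the thread, is to work on the factor's levels: with 6244 levels and about 1.4 million rows, nchar() then runs once per level instead of once per row. The toy factor `f` and the variable names are hypothetical. Note that any genuine code equal to "99999" would be merged with the placeholder.

```r
# Toy factor standing in for edt1a$ProcedureCode (made-up values)
f <- factor(c("44207", "0110", "171", NA, "J1885", "0110", "12", NA))

# Length of each level label (one value per level, not per row)
lev_len <- nchar(levels(f))

# Number of rows per level; tabulate() ignores the NA rows
per_level <- tabulate(f, nbins = nlevels(f))

# Sum the row counts by label length to get the count per length
len_counts <- tapply(per_level, lev_len, sum)
print(len_counts)

# Giving several levels the same label merges them into one level
levels(f)[lev_len < 5] <- "99999"

# Make sure "99999" exists as a level, then fill the NA rows with it
levels(f) <- union(levels(f), "99999")
f[is.na(f)] <- "99999"
print(f)
```

The lengths are computed before the levels are merged, because the logical index `lev_len < 5` must line up with the original set of levels.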