thr3ads.net - R help - [R] Help with factor column replacement value issue [Nov 2018]

Bill Poling

2018-Nov-16 15:38 UTC

[R] Help with factor column replacement value issue

Hello:

I am running windows 10 -- R3.5.1 -- RStudio Version 1.1.456

I would like to know why when I replace a column value it still appears in
subsequent routines:

My example:

r1$B1 is a Factor: It is created from the first character of a list of CPT
codes, r1$CPT.

head(r1$CPT, N= 25)
[1] A4649 A4649 C9359 C1713 A0394 A0398
903 Levels: 00000 00001 00140 00160 00670 00810 00940 01400 01470 01961 01968
10160 11000 11012 11042 11043 11044 11045 11100 11101 11200 11201 11401 11402
... l8699

str(r1$CPT)
 Factor w/ 903 levels "00000","00001",..: 773 773 816 783
739 741 743 739 739 741 ...


And I want only those CPT's with leading alpha char in this column so I set
the numeric leading char to Z

r1$B1 <- str_sub(r1$CPT,1,1)

r1$B1 <- as.factor(r1$B1) #Redundant
levels(r1$B1)[levels(r1$B1) %in%
  c('1','2','3','4','5','6','7','8','9','0')] <- 'Z'

When I check what I have done I find l & L

unique(r1$B1)
#[1] A C Z L G Q U J V E S l D P
#Levels: Z A C D E G J l L P Q S U V

So I change l to L
r1$B1[r1$B1 == 'l'] <- 'L'

When I check again I have l & L but l = 0
table(r1$B1)
#     Z     A     C     D     E     G     J     l     L     P     Q     S     U     V
# 19639  1673   546     2     8   147   281     0   664     1    64    36   114    14

When I go to find those rows as if they existed, they are not accounted for?

tmp <- subset(r1, B1 == "l")
print(tmp)
Empty data.table (0 rows) of 9 cols:
SavingsReversed,productID,ProviderID,PatientGender,ModCnt,Editnumber2...

And I have actually visually inspected the whole darn column, sheesh!

So I ignore it temporarily.

Now later on it resurfaces in a tutorial I am following for caret pkg.

preProcess(r1b, method = c("center", "scale"),
           thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5,
           knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3,
           verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9,
           rangeBounds = c(0, 1))
# Warning in preProcess.default(r1b, method = c("center", "scale"), thresh = 0.95,  :
#   These variables have zero variances: B1l   <------------- yes, this is a remnant of the r1$B1 clean-up
# Created from 23141 samples and 22 variables
#
# Pre-processing:
#   - centered (22)
#   - ignored (0)
#   - scaled (22)


So my questions are, in consideration of regression modelling accuracy:

Why is this happening?
How do I remove it?
Or is it irrelevant and leave it be?

As always, thank you for your support.

WHP

Confidentiality Notice This message is sent from Zelis. ...{{dropped:13}}

Bert Gunter

2018-Nov-16 16:09 UTC

As usual, careful reading of the relevant Help page would resolve the confusion.

from ?factor:

"factor(x, exclude = NULL) applied to a factor without NAs is a
no-operation unless there are unused levels: in that case, a factor
with the reduced level set is returned. If exclude is used, since R
version 3.4.0, excluding non-existing character levels is equivalent
to excluding nothing, and when exclude is a character vector, that is
applied to the levels of x. Alternatively, exclude can be factor with
the same level set as x and will exclude the levels present in
exclude."

In short, subsetting a factor does not change the levels attribute, even if
some levels are no longer present. One must explicitly remove them, e.g.:
> f <- factor(letters[1:3])  ## 3 levels, all present
> f[1:2]
[1] a b
Levels: a b c
## 3 levels, but one empty
> factor(f[1:2], exclude = NULL)
[1] a b
Levels: a b
## Now only two levels
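
The same applies to element-wise assignment, which is what the original post did. Here is a minimal sketch with made-up data (the vector b1 is a hypothetical stand-in for r1$B1). It also suggests where a zero-variance column such as B1l can come from: when a factor is expanded into dummy columns, an empty level still gets a column. Whether caret builds its columns exactly this way in the poster's pipeline is an assumption.

```r
# Hypothetical stand-in for r1$B1; the values are made up for illustration.
b1 <- factor(c("A", "l", "L", "Z", "L"))

# Reassigning the elements moves the observations but keeps the level.
b1[b1 == "l"] <- "L"
table(b1)      # the level "l" is still listed, with a count of 0
levels(b1)

# Dummy coding creates one column per level, so the empty level
# becomes a column that is zero in every row.
mm <- model.matrix(~ b1 - 1)
colSums(mm)

# Dropping the unused level removes that column.
b1_clean <- droplevels(b1)
colnames(model.matrix(~ b1_clean - 1))
```

If this is what happens in the real data, dropping the unused level before pre-processing should make the warning go away.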


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

Michael Dewey

2018-Nov-16 16:16 UTC

Dear Bill

When you replace lower case l with upper case L, the level l still
stays in the factor even though it is now empty. If that is a
nuisance, x <- factor(x) will drop the unused levels. There are other
ways of doing this.
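
A minimal sketch of two such ways in base R, using hypothetical example vectors x and y (not the poster's data): droplevels() removes unused levels, and renaming a level to an existing name merges the two levels so that no empty level is left behind.

```r
# Hypothetical example factor with an empty level "l".
x <- factor(c("A", "L", "L"), levels = c("A", "l", "L"))

x1 <- factor(x)      # re-create the factor; unused levels are dropped
x2 <- droplevels(x)  # same result for a single factor

# Renaming a level to an existing name merges the two levels in one step,
# so no empty level is left behind in the first place.
y <- factor(c("A", "l", "L", "L"))
levels(y)[levels(y) == "l"] <- "L"
levels(y)
```

droplevels() also accepts a data frame, in which case it drops unused levels from every factor column.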

Michael

-- 
Michael
http://www.dewey.myzen.co.uk/home.html

Jeff Newmiller

2018-Nov-16 16:26 UTC

My suggestion is to avoid converting the column to a factor until it is cleaned
up the way you want it. There is also the forcats package, but I still prefer to
work with character data for cleaning. The stringsAsFactors=FALSE argument to
read.table and friends helps with this.
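
A minimal sketch of that workflow in base R, assuming the codes are available as a character vector (the vector cpt and its values are made up for illustration). From R 4.0.0 onward, stringsAsFactors = FALSE is already the default.

```r
# Hypothetical CPT codes, kept as character (not factor) while cleaning.
cpt <- c("A4649", "C9359", "11042", "l8699", "L8699", "00140")

b1 <- substr(cpt, 1, 1)          # first character of each code
b1 <- toupper(b1)                # "l" becomes "L"
b1[grepl("^[0-9]$", b1)] <- "Z"  # leading digits are recoded to "Z"

b1 <- factor(b1)                 # convert once, after cleaning
levels(b1)
```

Because the factor is created only after the values are final, it contains no empty levels.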

-- 
Sent from my phone. Please excuse my brevity.

Bill Poling

2018-Nov-16 16:47 UTC

Thank you Bert.

WHP

