thr3ads.net - R help - [R] Error with text analysis data [Apr 2022]

Neha gupta

2022-Apr-13 18:48 UTC

[R] Error with text analysis data

Someone just told me that the data needs to be pre-processed before the
model is built. For instance, convert the text to lower case, remove
punctuation, symbols, etc., and tokenize the text (split it into words and
give a number to each word). Then create a bag-of-words model (I am not
sure about this step), and then fit a model.

Is it really necessary to perform all of these steps?

Best regards
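The steps listed above (lower-casing, stripping punctuation, tokenizing, counting words) can be sketched in base R roughly as follows. This is only a minimal illustration on made-up sentences, not the poster's data; the object names `docs`, `clean`, `tokens`, `vocab` and `bow` are hypothetical, and dedicated text-mining packages offer the same steps with more options.

```r
# Toy documents standing in for a free-text column (hypothetical data).
docs <- c("Branches should have sufficient coverage.",
          "Lines should have sufficient coverage!",
          "Avoid unused imports; unused code is a smell.")

# 1. Lower-case the text.
# 2. Replace every run of characters that is not a letter, digit or space
#    with a single space (this removes punctuation and symbols).
clean <- gsub("[^a-z0-9 ]+", " ", tolower(docs))

# 3. Tokenize on whitespace: one character vector of words per document.
tokens <- strsplit(trimws(clean), " +")

# 4. Bag of words: one row per document, one column per vocabulary word,
#    each cell holding how often that word occurs in that document.
vocab <- sort(unique(unlist(tokens)))
bow <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
colnames(bow) <- vocab

print(bow)
```

Each row of `bow` is one document and each column one word, so the matrix (or a data frame built from it) could be combined with the class label and passed to a modelling function. Whether every step is needed depends on the goal of the analysis.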

On Wednesday, April 13, 2022, Bill Dunlap <williamwdunlap at gmail.com>
wrote:
> >  I would always suggest working until the model works, no errors and no
> >  NA values
>
> We agree on that.  However, the error gives you no hint about which
> variables are causing the problem.  If it did, then it could only tell
> about the first variable with the problem.  I think you would get to your
> working model faster if you got NA's for the constant columns and then
> could drop them all at once (or otherwise deal with them).
>
> -Bill
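The idea of finding all constant columns and dropping them at once could be sketched like this. It is a minimal example on a made-up data frame, not the poster's file; the column names and the helper objects `n_distinct`, `constant_cols` and `d_clean` are hypothetical, and "constant" is assumed here to mean fewer than two distinct non-missing values.

```r
# Hypothetical data: two constant predictors (character and integer),
# one all-NA logical column, and one predictor that actually varies.
d <- data.frame(y = 1:5,
                fmt = rep("HTML", 5),
                is_template = rep(0L, 5),
                gap = rep(NA, 5),
                severity = c("MAJOR", "MINOR", "MAJOR", "MINOR", "MAJOR"),
                stringsAsFactors = FALSE)

# Number of distinct non-missing values in every column.
n_distinct <- vapply(d, function(col) length(unique(col[!is.na(col)])),
                     integer(1))

# Columns with fewer than two distinct values carry no information.
constant_cols <- names(n_distinct)[n_distinct < 2]
print(constant_cols)

# Drop them all in one step, then fit the model on what is left.
d_clean <- d[, setdiff(names(d), constant_cols), drop = FALSE]
fit <- lm(y ~ ., data = d_clean)
print(coef(fit))
```

Listing `constant_cols` before fitting names every problematic variable at once, which the contrasts error message by itself does not do.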
>
> On Wed, Apr 13, 2022 at 9:40 AM Ebert,Timothy Aaron <tebert at ufl.edu>
> wrote:
>
>> I suspect that it is because you are looking at two types of error, both
>> telling you that the model was not appropriate. In the "error in
>> contrasts" there is nothing to contrast in the model. For a numerical
>> constant the program calculates the standard deviation and ends with a
>> division by zero. Division by zero is undefined, or NA.
>>
>>
>>
>> I would always suggest working until the model works, no errors and no NA
>> values. The reason is that I can get NA in several ways and I need to
>> understand why. If I just ignore the NA in my model I may be assuming the
>> wrong thing.
>>
>>
>>
>> Tim
>>
>>
>>
>> *From:* Bill Dunlap <williamwdunlap at gmail.com>
>> *Sent:* Wednesday, April 13, 2022 12:23 PM
>> *To:* Ebert,Timothy Aaron <tebert at ufl.edu>
>> *Cc:* Neha gupta <neha.bologna90 at gmail.com>; r-help mailing list <r-help at r-project.org>
>> *Subject:* Re: [R] Error with text analysis data
>>
>> Constant columns can end up in the model when you do some subsetting or
>> are exploring a new dataset.  My objection is that constant columns of
>> numbers and logicals are fine but those of characters and factors are not.
>>
>>
>>
>> -Bill
>>
>>
>>
>> On Wed, Apr 13, 2022 at 9:15 AM Ebert,Timothy Aaron <tebert at ufl.edu>
>> wrote:
>>
>> What is the goal of having a constant in the model? To me that seems
>> pointless. Also there is no variability in sexCode regardless of whether
>> you call it integer or factor. So the model y ~ sexCode is just a strange
>> way to look at the variability in y and it would be better to do something
>> like summary(y) or mean(y) if that was the goal.
>>
>> Tim
>>
>> -----Original Message-----
>> From: R-help <r-help-bounces at r-project.org> On Behalf Of Bill Dunlap
>> Sent: Wednesday, April 13, 2022 9:56 AM
>> To: Neha gupta <neha.bologna90 at gmail.com>
>> Cc: r-help mailing list <r-help at r-project.org>
>> Subject: Re: [R] Error with text analysis data
>>
>> This sounds like what I think is a bug in stats::model.matrix.default():
>> a numeric column with all identical entries is fine but a constant
>> character or factor column is not.
>>
>> > d <- data.frame(y=1:5, sex=rep("Female",5))
>> > d$sexFactor <- factor(d$sex, levels=c("Male","Female"))
>> > d$sexCode <- as.integer(d$sexFactor)
>> > d
>>   y    sex sexFactor sexCode
>> 1 1 Female    Female       2
>> 2 2 Female    Female       2
>> 3 3 Female    Female       2
>> 4 4 Female    Female       2
>> 5 5 Female    Female       2
>> > lm(y~sex, data=d)
>> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>>   contrasts can be applied only to factors with 2 or more levels
>> > lm(y~sexFactor, data=d)
>> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>>   contrasts can be applied only to factors with 2 or more levels
>> > lm(y~sexCode, data=d)
>>
>> Call:
>> lm(formula = y ~ sexCode, data = d)
>>
>> Coefficients:
>> (Intercept)      sexCode
>>           3           NA
>>
>> Calling traceback() after the error would clarify this.
>>
>> -Bill
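One way to look at the difference between the two cases in the example above is to inspect the design matrix directly. This is a hedged sketch that reuses the `d` data frame from the message; the names `X`, `rank_X`, `msg` and `levels_present` are introduced here only for illustration.

```r
d <- data.frame(y = 1:5, sex = rep("Female", 5))
d$sexFactor <- factor(d$sex, levels = c("Male", "Female"))
d$sexCode <- as.integer(d$sexFactor)

# Numeric constant: the design matrix can be built. It has two columns
# (intercept and sexCode), but sexCode is a multiple of the intercept
# column, so the matrix has rank 1 and lm() reports NA for sexCode.
X <- model.matrix(y ~ sexCode, data = d)
rank_X <- qr(X)$rank
print(c(columns = ncol(X), rank = rank_X))

# Factor constant: the failure happens earlier, while the contrasts for
# the design matrix are being set up, so lm() stops with an error.
msg <- tryCatch({
  lm(y ~ sexFactor, data = d)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Distinct values actually present in the character and factor columns.
levels_present <- vapply(d[c("sex", "sexFactor")],
                         function(col) length(unique(col)), integer(1))
print(levels_present)
```

Checking `levels_present` for every character or factor predictor before fitting points to the offending columns, which the error message itself does not name.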
>>
>>
>> On Tue, Apr 12, 2022 at 3:12 PM Neha gupta <neha.bologna90 at gmail.com>
>> wrote:
>>
>> > Hello everyone, I have text data whose output variable has three
>> > subgroups. I am using the following code but I get an error message
>> > (see the error after the code).
>> >
>> > d=read.csv("SONAR_RULES.csv", stringsAsFactors = FALSE)
>> > d$REMEDIATION_FUNCTION=NULL
>> > d$DEF_REMEDIATION_GAP_MULT=NULL
>> > d$REMEDIATION_BASE_EFFORT=NULL
>> >
>> > index <- createDataPartition(d$TYPE, p = .70,list = FALSE)
>> > tr <- d[index, ]
>> > ts <- d[-index, ]
>> >
>> > ctrl <- trainControl(method = "cv",number=3, index = index,
>> >                      classProbs = TRUE,
>> >                      summaryFunction = multiClassSummary)
>> >
>> > ran <- train(TYPE ~ ., data = tr,
>> >                     method = "rpart",
>> >                     ## Will create 48 parameter combinations
>> >                     tuneLength = 3,
>> >                     na.action= na.pass,
>> >                     metric = "Accuracy",
>> >                     preProc = c("center", "scale", "nzv"),
>> >                     trControl = ctrl)
>> > getTrainPerf(ran)
>> >
>> > *It gives me this error:*
>> >
>> > *Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
>> > contrasts can be applied only to factors with 2 or more levels*
>> >
>> >
>> > *My data is as follows:*
>> >
>> > Rows: 1,819
>> > Columns: 14
>> > $ PLUGIN_RULE_KEY             <chr> "InsufficientBranchCoverage", "InsufficientLin~
>> > $ PLUGIN_CONFIG_KEY           <chr> "", "", "", "", "", "", "", "", "", "", "S1120~
>> > $ PLUGIN_NAME                 <chr> "common-java", "common-java", "common-java", "~
>> > $ DESCRIPTION                 <chr> "An issue is created on a file as soon as the ~
>> > $ SEVERITY                    <chr> "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
>> > $ NAME                        <chr> "Branches should have sufficient coverage by t~
>> > $ DEF_REMEDIATION_FUNCTION    <chr> "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
>> > $ REMEDIATION_GAP_MULT        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
>> > $ DEF_REMEDIATION_BASE_EFFORT <chr> "", "", "", "10min", "", "", "5min", "5min", "~
>> > $ GAP_DESCRIPTION             <chr> "number of uncovered conditions", "number of l~
>> > $ SYSTEM_TAGS                 <chr> "bad-practice", "bad-practice", "convention", ~
>> > $ IS_TEMPLATE                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
>> > $ DESCRIPTION_FORMAT          <chr> "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
>> > $ TYPE                        <chr> "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >

Ebert,Timothy Aaron

2022-Apr-13 19:53 UTC

[R] Error with text analysis data

Is this a different question from the original post? It would be better to keep
threads separate.

Always pre-process the data. Clean the data of obvious mistakes. These can be
simple typographical errors or something more complicated, like an author who
wrote "too" when they intended "two" or "to". In old English texts spelling was
not standardized and the same word could have multiple spellings within one book
or chapter. Removing punctuation is probably a part of this, though a program
like Grammarly would not work very well if it removed punctuation.

After that it depends on what you are trying to accomplish. Are you interested
in the number of times an author used the word "a" or "the", and is "The"
different from "the"? Are you modeling word use frequency or comparing
vocabulary between texts?

Too many choices.

Tim
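To illustrate why the goal matters, here is a small base R sketch of how the count for "the" changes depending on whether the text is lower-cased first. The sentence and the object names `txt`, `words`, `counts_case_sensitive` and `counts_lower` are made up for this example.

```r
# Hypothetical sentence; punctuation is stripped before splitting on spaces.
txt <- "The cat saw the dog. The dog ignored a cat and a bird."
words <- strsplit(gsub("[[:punct:]]", "", txt), " +")[[1]]

# Counting as written treats "The" and "the" as two different words.
counts_case_sensitive <- table(words)
print(counts_case_sensitive[c("The", "the")])

# Lower-casing first merges them into a single entry.
counts_lower <- table(tolower(words))
print(counts_lower[c("the", "a")])
```

If sentence-initial capitalization is meaningful for the analysis, the first table is the relevant one; for plain word-frequency modelling the second is usually what is wanted.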

