Without a representative sample of data, it is very hard to understand your
question or to be specific about suggestions. See [1] for some ideas about how
to communicate questions online.
Note that "clearing" data would usually mean deleting it, as in
rm(data). From context I assume you mean "cleaning", where invalid
characters need to be removed.
Also assuming that you have a data frame with some columns that are categorical
data:
1) If the values are contaminated or incomplete (the data don't have rows
representing every possible category) then it is almost always better to delay
converting to factor until after the data are cleaned. The read.table family of
functions includes a "stringsAsFactors=FALSE" option that will prevent
automatic conversion of character columns into factors. This is also
useful for contaminated numeric columns. Only after the vector of character data
is clean and as complete as it can be should you convert to factor.
Note that most data sets have a variety of column types, and even after
resolving the issues discussed here your function is not necessarily going to
work with every input data file that you encounter. Specifically, not every
column of data should be converted to factor. With this in mind, it can be
helpful to look for ways to confirm that the data you are processing are what
you expect them to be. Often this is implemented by confirming that specific
columns have specific kinds of data in them. That is, a loop over every column
may be TOO flexible... apply this cleaning loop cautiously.
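As a rough sketch of point 1, assuming a small comma-separated input (the
column names "size" and "weight" and the inline text are made up for
illustration, not taken from your data), you could read everything as
character, clean it, and confirm the column types before converting anything
to factor:

```r
# Hypothetical input; in practice this would come from your file
raw <- "size,weight\n small ,1.5\nLarge,2.0\nmedium,n/a\n"

# Keep text columns as character instead of converting them to factor
dta <- read.csv(text = raw, stringsAsFactors = FALSE)

# Clean the character data first
dta$size <- tolower(trimws(dta$size))

# A contaminated numeric column arrives as character; fix it, then convert
dta$weight[dta$weight == "n/a"] <- NA
dta$weight <- as.numeric(dta$weight)

# Confirm that specific columns hold the kinds of data you expect
stopifnot(is.character(dta$size), is.numeric(dta$weight))

# Only now convert the categorical column to factor
dta$size <- factor(dta$size)
```

The stopifnot() call halts with an error if a column is not of the expected
type, which is one simple way to avoid cleaning a file that is not what you
think it is.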
2) Most functions in R can process whole vectors of data at once, so your inner
loop should not be necessary. Specifically, the line
data[[i]] <- gsub( " +", " ", data[[i]] )
would replace all sequences of one or more spaces in every element of the vector
with a single space.
(Your j loop also runs too many times... str_replace_all(data[[i]], "  ", " ")
already affects the whole column, but you repeat it unnecessarily.)
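To illustrate point 2 with a made-up character vector (the names x, once and
collapsed are only for this example): a single call works on every element,
and the " +" pattern collapses a run of any length in one pass, whereas a
fixed two-space pattern applied once can leave longer runs behind.

```r
x <- c("a  b", "c    d", "e f")   # invented example values

# Fixed two-space pattern, applied once: four spaces shrink to two, not one
once <- gsub("  ", " ", x, fixed = TRUE)

# Regular expression " +": each run of spaces becomes a single space
collapsed <- gsub(" +", " ", x)
```

Here collapsed contains no repeated spaces, so no j loop is required.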
3) I don't know what a "depurate" value is.
4) You should be able to convert your cleaned character column to factor with
the "factor" function... like
data[[i]] <- factor( data[[i]] )
Note that if you know certain levels should be possible but not all of them are
actually present (e.g. "Small", "Medium", and
"Large" but no data with "Small" are present) then you will
need to specify the levels as a parameter to the factor function. See the help
file ?factor.
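A small sketch of the difference, using invented values in which no "small"
observation is present:

```r
sizes <- c("medium", "large", "large", "medium")   # made-up data

# Default: only the values actually present become levels
f_default <- factor(sizes)
levels(f_default)

# Explicit levels: "small" is a valid level even though it never occurs
f_full <- factor(sizes, levels = c("small", "medium", "large"))
table(f_full)
```

With explicit levels, table() reports a zero count for "small" instead of
omitting it, and the levels keep the order you specified.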
5) You have several lines of code at the end that appear to execute regardless
of whether the column is a factor or not. They should be within the braces of
the if statement.
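Putting points 2, 4 and 5 together, the loop might look something like the
sketch below. The data frame dta is invented for illustration, and base R
functions stand in for the stringr calls in your script; treat it as one
possible arrangement rather than a drop-in replacement.

```r
dta <- data.frame(
  grp = factor(c(" A  x", "a x ", "B")),   # made-up categorical column
  val = c(1, 2, 3)                         # numeric column, left untouched
)

for (i in seq_along(dta)) {
  if (is.factor(dta[[i]])) {
    x <- as.character(dta[[i]])   # work on the text, not the factor codes
    x <- gsub(" +", " ", x)       # collapse runs of spaces
    x <- trimws(x)                # drop leading and trailing whitespace
    x <- tolower(x)
    dta[[i]] <- factor(x)         # rebuild the factor from the clean values
  }
}
```

Because every step sits inside the braces, non-factor columns such as val are
not coerced to character, and the final factor() call restores levels for the
cleaned column.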
6) Please read the Posting Guide mentioned at the end of this and every post on
this list, specifically regarding posting in plain text. Your code was partially
damaged by the HTML email format.
[1]
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On August 12, 2014 5:42:13 AM PDT, "Maicel Monzón Pérez" <maicel at
infomed.sld.cu> wrote:
>Hello List,
>
>I did this script to clear data after import (I don't know is ok ).
>After
>its execution levels and label values got lost. Could some explain me
>to
>reassign levels again in the script (new depurate value)?
>
>Best regard
>
>Maicel Monzon MD, PHD
>
>Center of Cybernetic Apply to Medicine
>
># data cleaning script
>
>library(stringr)
>
>for(i in 1:length(data)) {
>
> if (is.factor(data[[i]])==T)
>
> {for(j in 1:sum(str_detect(data[,i], " ")))
>
> {data[[i]]<-str_replace_all(data[[i]], " ", " ")}}
>
> data[[i]]<-str_trim (data[[i]],side = "both")
>
> data[[i]]<-tolower(data[[i]])
>
>}
>
>Note: "  " is 2 blank space and " " only one
>
> [[alternative HTML version deleted]]
>
>
>
>------------------------------------------------------------------------
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.