Below.

On Tue, Jun 13, 2023 at 2:18 PM <avi.e.gross at gmail.com> wrote:
>
> Javad,
>
> There may be nothing wrong with the methods people are showing you, and if it satisfied you, great.
>
> But I note you have lots of data, in over a quarter million rows. If much of the text data is redundant, and you want to simplify some operations such as changing some of the values to others in multiple ways, have you done any learning about an R feature very useful for dealing with categorical data called "factors"?
>
> If you have a vector or a column in a data.frame that contains text, then it can be replaced by a factor that often takes way less space, as it stores a sort of dictionary of all the unique values and just records numbers like 1, 2, 3 to tell which one each item is.

-- This is false. It used to be true a **long time ago**, but R has for quite a while used hashing/global string tables to avoid this problem. See here <https://stackoverflow.com/questions/50310092/why-does-r-use-factors-to-store-characters> for details/references.

As a result, I think many would argue that working with strings *as strings*, not factors, is often a better default, though of course there are still situations where factors are useful (e.g. in ordering results by factor levels where the desired level order is not alphabetical).

**I would appreciate correction/clarification if my claims are wrong or misleading!**

In any case, please do check such claims before making them on this list.
Cheers,
Bert

> You can access the values using levels(whatever) and also change them. There are packages that make this straightforward, such as forcats, which is one of the tidyverse packages that also includes many other tools some find useful but are beyond the usual scope of this mailing list.
>
> As an example, if you have a vector in mydata$col1 then code like:
>
> mydata$col1 <- factor(mydata$col1)
>
> No matter which way you do it, you can now access the levels, make whatever changes, and save the changes. One example could be to apply some variant of grep to make the substitution. There is a family of built-in functions such as sub() that match a regular expression and replace it with what you want.
>
> This has a similar result to changing all entries without doing all the work. I mean, if item 5 used to be "OLD" and is now "NEW", then any of your quarter million entries that have a 5 will now be seen as having a value of "NEW".
>
> I will stop here and suggest you may want to read some book that explains R as a unified set of features, with some emphasis on using it for the features it is intended to have that can make life easier, rather than using just features it shares with most languages. Some of your questions indicate you have less grounding and are mainly following recipes you stumble across.
>
> Otherwise, you will have a collection of what you call "codes" and others like me call programming, and that don't necessarily fit well together.
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of javad bayat
> Sent: Tuesday, June 13, 2023 3:47 PM
> To: Eric Berger <ericjberger at gmail.com>
> Cc: R-help at r-project.org
> Subject: Re: [R] Problem with filling dataframe's column
>
> Dear all;
> I used these codes and I get what I wanted.
> Sincerely
>
> pat = c("Level 12","Level 22","0")
> data3 = data2[-which(data2$Layer == pat),]
> dim(data2)
> [1] 281549 9
> dim(data3)
> [1] 244075 9
>
> On Tue, Jun 13, 2023 at 11:36 AM Eric Berger <ericjberger at gmail.com> wrote:
> >
> > Hi Javad,
> > grep returns the positions of the matches. See an example below.
> >
> > > v <- c("abc", "bcd", "def")
> > > v
> > [1] "abc" "bcd" "def"
> > > grep("cd",v)
> > [1] 2
> > > w <- v[-grep("cd",v)]
> > > w
> > [1] "abc" "def"
> >
> > On Tue, Jun 13, 2023 at 8:50 AM javad bayat <j.bayat194 at gmail.com> wrote:
> > >
> > > Dear Rui;
> > > Hi. I used your codes, but it seems it didn't work for me.
> > >
> > > > pat <- c("_esmdes|_Des Section|0")
> > > > dim(data2)
> > > [1] 281549 9
> > > > grep(pat, data2$Layer)
> > > > dim(data2)
> > > [1] 281549 9
> > >
> > > What does the grep function do? I expected the function to remove 3 rows of the dataframe.
> > > I do not know the reason.
> > >
> > > On Mon, Jun 12, 2023 at 5:16 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > >
> > > > At 23:13 on 12/06/2023, javad bayat wrote:
> > > > > Dear Rui;
> > > > > Many thanks for the email. I tried your codes and found that the lengths of the "Values" and "Names" vectors must be equal, otherwise the results will not be useful.
> > > > > For some of the characters in the Layer column that I do not need to be filled in the LU column, I used "NA".
> > > > > But I need to delete some of the rows from the table as they are useless for me. I tried this code to delete entire rows of the dataframe which contained these three values in the Layer column. It gave me the following error:
> > > > >
> > > > > > data3 = data2[-grep(c("_esmdes","_Des Section","0"),data2$Layer),]
> > > > > Warning message:
> > > > > In grep(c("_esmdes", "_Des Section", "0"), data2$Layer) :
> > > > >   argument 'pattern' has length > 1 and only the first element will be used
> > > > >
> > > > > > data3 = data2[!grepl(c("_esmdes","_Des Section","0"),data2$Layer),]
> > > > > Warning message:
> > > > > In grepl(c("_esmdes", "_Des Section", "0"), data2$Layer) :
> > > > >   argument 'pattern' has length > 1 and only the first element will be used
> > > > >
> > > > > How can I do this?
> > > > > Sincerely
> > > > >
> > > > > On Sun, Jun 11, 2023 at 5:03 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > > >
> > > > >> At 13:18 on 11/06/2023, Rui Barradas wrote:
> > > > >>> At 22:54 on 11/06/2023, javad bayat wrote:
> > > > >>>> Dear Rui;
> > > > >>>> Many thanks for your email. I used one of your codes,
> > > > >>>> "data2$LU[which(data2$Layer == "Level 12")] <- "Park"", and it works correctly for me.
> > > > >>>> Actually I need to expand the codes so as to consider all "Levels" in the "Layer" column. There are more than a hundred levels in the Layer column.
> > > > >>>> If I use your provided code, I have to write it a hundred times as below:
> > > > >>>> data2$LU[which(data2$Layer == "Level 1")] <- "Park";
> > > > >>>> data2$LU[which(data2$Layer == "Level 2")] <- "Agri";
> > > > >>>> ...
> > > > >>>> ...
> > > > >>>> ...
> > > > >>>> .
> > > > >>>> Is there any other way to expand the code in order to consider all of the levels simultaneously? Like the below code:
> > > > >>>> data2$LU[which(data2$Layer == c("Level 1","Level 2", "Level 3", ...))] <- c("Park", "Agri", "GS", ...)
> > > > >>>>
> > > > >>>> Sincerely
> > > > >>>>
> > > > >>>> On Sun, Jun 11, 2023 at 1:43 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > > >>>>
> > > > >>>>> At 21:05 on 11/06/2023, javad bayat wrote:
> > > > >>>>>> Dear R users;
> > > > >>>>>> I am trying to fill a column based on a specific value in another column of a dataframe, but it seems there is a problem with the codes!
> > > > >>>>>> The "Layer" and the "LU" are two different columns of the dataframe.
> > > > >>>>>> How can I fix this?
> > > > >>>>>> Sincerely
> > > > >>>>>>
> > > > >>>>>> for (i in 1:nrow(data2$Layer)){
> > > > >>>>>>   if (data2$Layer == "Level 12") {
> > > > >>>>>>     data2$LU == "Park"
> > > > >>>>>>   }
> > > > >>>>>> }
> > > > >>>>>>
> > > > >>>>> Hello,
> > > > >>>>>
> > > > >>>>> There are two bugs in your code,
> > > > >>>>>
> > > > >>>>> 1) the index i is not used in the loop
> > > > >>>>> 2) the assignment operator is `<-`, not `==`
> > > > >>>>>
> > > > >>>>> Here is the loop corrected.
> > > > >>>>>
> > > > >>>>> # iterate over the rows of data2; nrow() of a single column is NULL
> > > > >>>>> for (i in seq_len(nrow(data2))){
> > > > >>>>>   if (data2$Layer[i] == "Level 12") {
> > > > >>>>>     data2$LU[i] <- "Park"
> > > > >>>>>   }
> > > > >>>>> }
> > > > >>>>>
> > > > >>>>> But R is a vectorized language; the following two ways are the idiomatic ways of doing what you want to do.
> > > > >>>>>
> > > > >>>>> i <- data2$Layer == "Level 12"
> > > > >>>>> data2$LU[i] <- "Park"
> > > > >>>>>
> > > > >>>>> # equivalent one-liner
> > > > >>>>> data2$LU[data2$Layer == "Level 12"] <- "Park"
> > > > >>>>>
> > > > >>>>> If there are NA's in data2$Layer it's probably safer to use which() in the logical index, to have a numeric one.
> > > > >>>>>
> > > > >>>>> i <- which(data2$Layer == "Level 12")
> > > > >>>>> data2$LU[i] <- "Park"
> > > > >>>>>
> > > > >>>>> # equivalent one-liner
> > > > >>>>> data2$LU[which(data2$Layer == "Level 12")] <- "Park"
> > > > >>>>>
> > > > >>>>> Hope this helps,
> > > > >>>>>
> > > > >>>>> Rui Barradas
> > > > >>>>>
> > > > >>> Hello,
> > > > >>>
> > > > >>> You don't need to repeat the same instruction 100+ times; there is a way of assigning all new LU values at the same time with match().
> > > > >>> This assumes that you have the new values in a vector.
> > > > >>
> > > > >> Sorry, this is not clear. I mean
> > > > >>
> > > > >> This assumes that you have the new values in a vector, the vector Names below.
> > > > >> The vector of values to be matched is created from the data.
> > > > >>
> > > > >> Rui Barradas
> > > > >>
> > > > >>> Values <- sort(unique(data2$Layer))
> > > > >>> Names <- c("Park", "Agri", "GS")
> > > > >>>
> > > > >>> i <- match(data2$Layer, Values)
> > > > >>> data2$LU <- Names[i]
> > > > >>>
> > > > >>> Hope this helps,
> > > > >>>
> > > > >>> Rui Barradas
> > > > >>>
> > > > >>> ______________________________________________
> > > > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > > > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > >>> and provide commented, minimal, self-contained, reproducible code.
> > > >
> > > > Hello,
> > > >
> > > > Please cc the r-help list; R-Help is threaded and this can in the future be helpful to others.
> > > >
> > > > You can combine several patterns like this:
> > > >
> > > > pat <- c("_esmdes|_Des Section|0")
> > > > grep(pat, data2$Layer)
> > > >
> > > > or, programmatically,
> > > >
> > > > pat <- paste(c("_esmdes","_Des Section","0"), collapse = "|")
> > > >
> > > > Hope this helps,
> > > >
> > > > Rui Barradas
> > >
> > > --
> > > Best Regards
> > > Javad Bayat
> > > M.Sc.
> > > Environment Engineering
> > > Alternative Mail: bayat194 at yahoo.com
>
> --
> Best Regards
> Javad Bayat
> M.Sc. Environment Engineering
> Alternative Mail: bayat194 at yahoo.com
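The two tasks discussed in the thread above (recoding Layer into LU, and dropping rows by several Layer values) can be sketched on a toy data frame. This is a minimal illustration under stated assumptions, not anyone's posted solution: the six-row `data2` is made up, the Layer-to-LU pairs in `lu_map` are hypothetical, and the names `lu_map`, `drop_vals`, `kept_eq`, `kept_in` and `kept_re` are introduced here only for the example. It also shows why comparing a column to a vector with `==` is risky: the shorter vector is recycled, so each row is tested against just one of the values.

```r
# Toy stand-in for data2 (the real 281,549-row data are not shown in the thread).
data2 <- data.frame(
  Layer = c("Level 12", "0", "Level 22", "Level 10", "0", "Level 12"),
  stringsAsFactors = FALSE
)

# Recode Layer into LU with a named lookup vector.
# The Layer -> LU pairs are hypothetical; values not listed become NA.
lu_map <- c("Level 12" = "Park", "Level 22" = "Agri", "Level 10" = "GS")
data2$LU <- unname(lu_map[data2$Layer])

drop_vals <- c("Level 12", "Level 22", "0")

# `==` recycles drop_vals along the column, so each row is compared with
# only one of the three values and matching rows can be missed.
kept_eq <- data2[!(data2$Layer == drop_vals), ]

# `%in%` compares every row with the whole set (exact matches only).
kept_in <- data2[!(data2$Layer %in% drop_vals), ]

# grepl() with an alternation pattern matches substrings, so the "0"
# alternative also removes "Level 10".
pat <- paste(c("_esmdes", "_Des Section", "0"), collapse = "|")
kept_re <- data2[!grepl(pat, data2$Layer), ]

nrow(kept_eq)
nrow(kept_in)
nrow(kept_re)
```

Indexing with a negated logical vector, as here, also sidesteps a known pitfall of `-which(...)` and `-grep(...)`: when nothing matches, the negative index is empty and the result has zero rows rather than all of them.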
avi.e.gross at gmail.com
2023-Jun-13 23:24 UTC
[R] Problem with filling dataframe's column
Bert,

I stand corrected. What I said may have once been true, but apparently the implementation has changed at some level.

I did not factor that in.

Nevertheless, whether you use an index as a key or as an offset into an attached vector of labels, it seems to work the same, and I think my comment applies well enough that changing a few labels instead of scanning lots of entries can sometimes be a good thing. As far as I can tell, the external interface seems the same for now.

One issue with R for a long time was how they did not do something more like a Python dictionary and it looks like ...

ABOVE

From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: Tuesday, June 13, 2023 6:15 PM
To: avi.e.gross at gmail.com
Cc: javad bayat <j.bayat194 at gmail.com>; R-help at r-project.org
Subject: Re: [R] Problem with filling dataframe's column
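For readers weighing the two positions, here is a small sketch of what "changing a few labels" looks like next to the plain character-vector route. The vector `x` and the names `f`, `y` and `big` are made up for illustration. The last two lines only measure storage on whatever R build runs them; no particular numbers are claimed here.

```r
# Made-up values standing in for a text column.
x <- c("OLD", "keep", "OLD", "other", "keep")

# Factor route: edit the level table once; every element coded with that
# level now reads as the new label.
f <- factor(x)
levels(f)[levels(f) == "OLD"] <- "NEW"

# Character route: a vectorized replacement over the elements themselves.
y <- x
y[y == "OLD"] <- "NEW"

# Both routes yield the same values.
identical(as.character(f), y)

# A regular expression can also be applied to the levels only.
levels(f) <- sub("^keep$", "KEEP", levels(f))
as.character(f)

# Measure storage rather than assume it.
big <- sample(c("Level 1", "Level 2", "Level 3"), 250000, replace = TRUE)
print(object.size(big))
print(object.size(factor(big)))
```

Either route is a one-liner for a single value; the factor route mainly pays off when the same column is relabeled repeatedly or when a non-alphabetical level order matters.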