thr3ads.net - R help - [R] Problem with filling dataframe's column [Jun 2023]


Bert Gunter

2023-Jun-13 22:14 UTC

[R] Problem with filling dataframe's column

Below.


On Tue, Jun 13, 2023 at 2:18 PM <avi.e.gross at gmail.com> wrote:
>
> Javad,
>
> There may be nothing wrong with the methods people are showing you and if
> it satisfied you, great.
>
> But I note you have lots of data in over a quarter million rows. If much
> of the text data is redundant, and you want to simplify some operations
> such as changing some of the values to others in multiple ways, have you
> done any learning about an R feature very useful for dealing with
> categorical data called "factors"?
>
> If you have a vector or a column in a data.frame that contains text, then
> it can be replaced by a factor that often takes way less space as it stores
> a sort of dictionary of all the unique values and just records numbers like
> 1,2,3 to tell which one each item is.

-- This is false. It used to be true a **long time ago**, but R has for
quite a while used hashing/global string tables to avoid this problem. See
here
<https://stackoverflow.com/questions/50310092/why-does-r-use-factors-to-store-characters>
for details/references.
As a result, I think many would argue that working with strings *as
strings,* not factors, is often a better default, though of course there
are still situations where factors are useful (e.g. in ordering results by
factor levels where the desired level order is not alphabetical).
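
As a small illustration of the ordering case just mentioned, here is a minimal sketch. The vector `size` and its values are made up for the example and do not come from the thread.

```r
# Hypothetical size labels; the values are illustrative only.
size <- c("medium", "small", "large", "small")

# A plain character vector sorts alphabetically.
sorted_chr <- sort(size)

# A factor with explicit levels sorts in the order of its levels.
size_f <- factor(size, levels = c("small", "medium", "large"))
sorted_fct <- as.character(sort(size_f))
```

Here the factor is used only to carry the desired order; the data could otherwise stay as plain strings.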

**I would appreciate correction/ clarification if my claims are wrong or
misleading! **

In any case, please do check such claims before making them on this list.

Cheers,
Bert

>
> You can access the values using levels(whatever) and also change them.
> There are packages that make this straightforward such as forcats which is
> one of the tidyverse packages that also includes many other tools some find
> useful but are beyond the usual scope of this mailing list.
>
> As an example, if you have a vector in mydata$col1 then code like:
>
> mydata$col1 <- factor(mydata$col1)
>
> No matter which way you do it, you can now access the levels and make
> whatever changes, and save the changes. One example could be to apply some
> variant of grep to make the substitution. There is a family of functions
> built in such as sub() that matches a Regular Expression and replaces it
> with what you want.
>
> This has a similar result to changing all entries without doing all the
> work. I mean if item 5 used to be "OLD" and is now "NEW" then any of your
> quarter million entries that have a 5 will now be seen as having a value of
> "NEW".
>
> I will stop here and suggest you may want to read some book that explains
> R as a unified set of features with some emphasis on using it for the
> features it is intended to have that can make life easier, rather than
> using just features it shares with most languages. Some of your questions
> indicate you have less grounding and are mainly following recipes you
> stumble across.
>
> Otherwise, you will have a collection of what you call "codes" and others
> like me call programming and that don't necessarily fit well together.
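
The relabelling idea in the quoted message can be sketched roughly as follows. This is only an illustration on a made-up vector `x`; the labels other than "OLD" and "NEW" are assumptions for the example.

```r
# Toy stand-in for a large column; values are illustrative only.
x <- factor(c("OLD", "keep", "OLD", "other", "keep"))

# Rename one level: every element carrying that level changes at once.
levels(x)[levels(x) == "OLD"] <- "NEW"

# sub() can be applied to the few levels rather than to every element.
levels(x) <- sub("^keep$", "kept", levels(x))

relabelled <- as.character(x)
```

Only the levels are edited, so the work is proportional to the number of distinct values rather than to the number of rows.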
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of javad bayat
> Sent: Tuesday, June 13, 2023 3:47 PM
> To: Eric Berger <ericjberger at gmail.com>
> Cc: R-help at r-project.org
> Subject: Re: [R] Problem with filling dataframe's column
>
> Dear all;
> I used these codes and I get what I wanted.
> Sincerely
>
> pat = c("Level 12","Level 22","0")
> data3 = data2[-which(data2$Layer == pat),]
> dim(data2)
> [1] 281549      9
> dim(data3)
> [1] 244075      9
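
A caution on the `==` comparison against a vector, sketched on toy data (the values below are made up; only the column name `Layer` and the vector `pat` follow the thread). `==` recycles `pat` along the column and compares position by position, so it can miss rows, whereas `%in%` tests each element against the whole set.

```r
# Toy data standing in for data2; the values are illustrative only.
data2 <- data.frame(
  Layer = c("0", "Level 12", "Level 22", "Level 3", "Level 22", "0"),
  id = 1:6
)
pat <- c("Level 12", "Level 22", "0")

# `==` recycles pat along Layer and compares element by element,
# so only values that happen to line up with pat are flagged.
flag_eq <- data2$Layer == pat

# %in% tests each element of Layer against the whole set.
flag_in <- data2$Layer %in% pat

# Keep the rows whose Layer is not in pat.
data3 <- data2[!flag_in, ]
```

In this toy case `==` flags 2 rows while `%in%` flags all 5 rows that should be dropped. Negating a logical index also avoids the case where `-which(...)` receives no matches and drops every row.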
>
> On Tue, Jun 13, 2023 at 11:36 AM Eric Berger <ericjberger at gmail.com> wrote:
>
> > Hi Javed,
> > grep returns the positions of the matches. See an example below.
> >
> > > v <- c("abc", "bcd", "def")
> > > v
> > [1] "abc" "bcd" "def"
> > > grep("cd",v)
> > [1] 2
> > > w <- v[-grep("cd",v)]
> > > w
> > [1] "abc" "def"
> > >
> >
> >
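
One edge case worth keeping in mind with the negative-index form, sketched on the same toy vector `v` (the pattern "zz" is an assumption chosen so that nothing matches): when grep() finds no match it returns integer(0), and a negative empty index selects nothing.

```r
v <- c("abc", "bcd", "def")

# When nothing matches, grep() returns integer(0), and v[-integer(0)]
# selects nothing rather than everything.
no_match_neg <- v[-grep("zz", v)]

# grepl() returns a logical vector, so negating it keeps all elements here.
no_match_lgl <- v[!grepl("zz", v)]

# With a real match both forms give the same result.
w <- v[!grepl("cd", v)]
```

For that reason `!grepl(...)` is often the safer way to drop matching elements or rows.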
> > On Tue, Jun 13, 2023 at 8:50 AM javad bayat <j.bayat194 at gmail.com> wrote:
> > >
> > > Dear Rui;
> > > Hi. I used your codes, but it seems it didn't work for me.
> > >
> > > > pat <- c("_esmdes|_Des Section|0")
> > > > dim(data2)
> > >     [1]  281549      9
> > > > grep(pat, data2$Layer)
> > > > dim(data2)
> > >     [1]  281549      9
> > >
> > > What does grep function do? I expected the function to remove 3 rows
> > > of the dataframe.
> > > I do not know the reason.
> > >
> > > On Mon, Jun 12, 2023 at 5:16 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > >
> > > > At 23:13 on 12/06/2023, javad bayat wrote:
> > > > > Dear Rui;
> > > > > Many thanks for the email. I tried your codes and found that the length
> > > > > of the "Values" and "Names" vectors must be equal, otherwise the results
> > > > > will not be useful.
> > > > > For some of the characters in the Layer column that I do not need to be
> > > > > filled in the LU column, I used "NA".
> > > > > But I need to delete some of the rows from the table as they are useless
> > > > > for me. I tried this code to delete entire rows of the dataframe which
> > > > > contained these three values in the Layer column. It gave me the
> > > > > following error.
> > > > >
> > > > >> data3 = data2[-grep(c("_esmdes","_Des Section","0"), data2$Layer),]
> > > > >       Warning message:
> > > > >        In grep(c("_esmdes", "_Des Section", "0"), data2$Layer) :
> > > > >        argument 'pattern' has length > 1 and only the first element will be used
> > > > >
> > > > >> data3 = data2[!grepl(c("_esmdes","_Des Section","0"), data2$Layer),]
> > > > >      Warning message:
> > > > >      In grepl(c("_esmdes", "_Des Section", "0"), data2$Layer) :
> > > > >      argument 'pattern' has length > 1 and only the first element will be used
> > > > >
> > > > > How can I do this?
> > > > > Sincerely
> > > > >
> > > > > > On Sun, Jun 11, 2023 at 5:03 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > > > >
> > > > > >> At 13:18 on 11/06/2023, Rui Barradas wrote:
> > > > > >>> At 22:54 on 11/06/2023, javad bayat wrote:
> > > > > >>>> Dear Rui;
> > > > > >>>> Many thanks for your email. I used one of your codes,
> > > > > >>>> "data2$LU[which(data2$Layer == "Level 12")] <- "Park"", and it works
> > > > > >>>> correctly for me.
> > > > > >>>> Actually I need to expand the codes so as to consider all "Levels" in
> > > > > >>>> the "Layer" column. There are more than a hundred levels in the Layer
> > > > > >>>> column.
> > > > > >>>> If I use your provided code, I have to write it hundreds of times as
> > > > > >>>> below:
> > > > > >>>> data2$LU[which(data2$Layer == "Level 1")] <- "Park";
> > > > > >>>> data2$LU[which(data2$Layer == "Level 2")] <- "Agri";
> > > > > >>>> ...
> > > > > >>>> ...
> > > > > >>>> ...
> > > > > >>>> .
> > > > > >>>> Is there any other way to expand the code in order to consider all of
> > > > > >>>> the levels simultaneously? Like the below code:
> > > > > >>>> data2$LU[which(data2$Layer == c("Level 1","Level 2", "Level 3", ...))] <-
> > > > > >>>> c("Park", "Agri", "GS", ...)
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Sincerely
> > > > > >>>>
> > > > > >>>> On Sun, Jun 11, 2023 at 1:43 PM Rui Barradas <ruipbarradas at sapo.pt>
> > > > > >>>> wrote:
> > > > >>>>
> > > > > >>>>> At 21:05 on 11/06/2023, javad bayat wrote:
> > > > > >>>>>> Dear R users;
> > > > > >>>>>> I am trying to fill a column based on a specific value in another
> > > > > >>>>>> column of a dataframe, but it seems there is a problem with the codes!
> > > > > >>>>>> The "Layer" and the "LU" are two different columns of the dataframe.
> > > > > >>>>>> How can I fix this?
> > > > > >>>>>> Sincerely
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>>>> for (i in 1:nrow(data2$Layer)){
> > > > > >>>>>>              if (data2$Layer == "Level 12") {
> > > > > >>>>>>                  data2$LU == "Park"
> > > > > >>>>>>                  }
> > > > > >>>>>>              }
> > > > > >>>>>>
> > > > >>>>>>
> > > > >>>>> Hello,
> > > > >>>>>
> > > > >>>>> There are two bugs in your code,
> > > > >>>>>
> > > > >>>>> 1) the index i is not used in the loop
> > > > > >>>>> 2) the assignment operator is `<-`, not `==`
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> Here is the loop corrected.
> > > > >>>>>
> > > > >>>>> for (i in 1:nrow(data2$Layer)){
> > > > > >>>>>      if (data2$Layer[i] == "Level 12") {
> > > > > >>>>>        data2$LU[i] <- "Park"
> > > > >>>>>      }
> > > > >>>>> }
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> But R is a vectorized language, the following two ways are the
> > > > > >>>>> idiomatic ways of doing what you want to do.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> i <- data2$Layer == "Level 12"
> > > > > >>>>> data2$LU[i] <- "Park"
> > > > > >>>>>
> > > > > >>>>> # equivalent one-liner
> > > > > >>>>> data2$LU[data2$Layer == "Level 12"] <- "Park"
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> If there are NA's in data2$Layer it's probably safer to use ?which() in
> > > > > >>>>> the logical index, to have a numeric one.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> i <- which(data2$Layer == "Level 12")
> > > > > >>>>> data2$LU[i] <- "Park"
> > > > > >>>>>
> > > > > >>>>> # equivalent one-liner
> > > > > >>>>> data2$LU[which(data2$Layer == "Level 12")] <- "Park"
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> Hope this helps,
> > > > >>>>>
> > > > >>>>> Rui Barradas
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>> Hello,
> > > > >>>
> > > > > >>> You don't need to repeat the same instruction 100+ times, there is a
> > > > > >>> way of assigning all new LU values at the same time with match().
> > > > > >>> This assumes that you have the new values in a vector.
> > > > > >>
> > > > > >> Sorry, this is not clear. I mean
> > > > > >>
> > > > > >>
> > > > > >> This assumes that you have the new values in a vector, the vector Names
> > > > > >> below. The vector of values to be matched is created from the data.
> > > > > >>
> > > > > >>
> > > > > >> Rui Barradas
> > > > > >>
> > > > > >>>
> > > > > >>>
> > > > > >>> Values <- sort(unique(data2$Layer))
> > > > > >>> Names <- c("Park", "Agri", "GS")
> > > > >>>
> > > > >>> i <- match(data2$Layer, Values)
> > > > >>> data2$LU <- Names[i]
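
A variant of the match() approach with the lookup written out explicitly, as a sketch on toy data. The pairing of levels to labels below is made up for illustration; only the names Layer, LU, Values and Names follow the thread. Spelling out Values avoids relying on the sort order of the data, since character sorting puts "Level 10" before "Level 2".

```r
# Toy data standing in for data2; the values are illustrative only.
data2 <- data.frame(
  Layer = c("Level 2", "Level 1", "Level 10", "Level 3", "Level 1")
)

# Explicit lookup: Values[k] maps to Names[k]; both must have equal length.
Values <- c("Level 1", "Level 2", "Level 3")
Names  <- c("Park", "Agri", "GS")

# match() gives, for each Layer, its position in Values (NA if absent).
data2$LU <- Names[match(data2$Layer, Values)]

# Character sorting is not numeric: "Level 10" comes before "Level 2".
sorted_levels <- sort(c("Level 2", "Level 10"))
```

Layers that are absent from Values, such as "Level 10" here, get NA in LU and can be filtered out afterwards if they are not needed.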
> > > > >>>
> > > > >>>
> > > > >>> Hope this helps,
> > > > >>>
> > > > >>> Rui Barradas
> > > > >>>
> > > > > >>> ______________________________________________
> > > > > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > >>> PLEASE do read the posting guide
> > > > > >>> http://www.R-project.org/posting-guide.html
> > > > > >>> and provide commented, minimal, self-contained, reproducible code.
> > > > > >>
> > > > >>
> > > > >
> > > > Hello,
> > > >
> > > > Please cc the r-help list, R-Help is threaded and this can in the
> > > > future be helpful to others.
> > > >
> > > > You can combine several patterns like this:
> > > >
> > > >
> > > > pat <- c("_esmdes|_Des Section|0")
> > > > grep(pat, data2$Layer)
> > > >
> > > > or, programmatically,
> > > >
> > > >
> > > > pat <- paste(c("_esmdes","_Des Section","0"), collapse = "|")
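
Putting the combined pattern to use for dropping rows might look like the sketch below. The toy Layer values are assumptions for the example. Two points are illustrated: the result has to be assigned, because grep() and grepl() do not modify the data frame, and an unanchored "0" also matches values such as "Level 10", so anchoring it as "^0$" may be closer to the intent.

```r
# Toy data standing in for data2; the values are illustrative only.
data2 <- data.frame(
  Layer = c("a_esmdes", "x_Des Section", "0", "Level 10", "Level 3")
)

# Unanchored "0" also matches "Level 10".
pat_loose <- paste(c("_esmdes", "_Des Section", "0"), collapse = "|")
drop_loose <- grepl(pat_loose, data2$Layer)

# Anchoring with ^ and $ restricts the last alternative to exactly "0".
pat_strict <- paste(c("_esmdes", "_Des Section", "^0$"), collapse = "|")
drop_strict <- grepl(pat_strict, data2$Layer)

# The filtered result must be assigned; data2 itself is left unchanged.
data3 <- data2[!drop_strict, , drop = FALSE]
```

Whether "Level 10" should be kept depends on the data, so the anchored pattern is a suggestion rather than a correction of the quoted code.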
> > > >
> > > >
> > > > Hope this helps,
> > > >
> > > > Rui Barradas
> > > >
> > > >
> > >
> > > --
> > > Best Regards
> > > Javad Bayat
> > > M.Sc. Environment Engineering
> > > Alternative Mail: bayat194 at yahoo.com
> > >
> >
>
>
	[[alternative HTML version deleted]]

avi.e.gross at gmail.com

2023-Jun-13 23:24 UTC


[R] Problem with filling dataframe's column

Bert,

I stand corrected. What I said may have once been true but apparently the
implementation seems to have changed at some level.

I did not factor that in.

Nevertheless, whether you use an index as a key or as an offset into an attached
vector of labels, it seems to work the same and I think my comment applies well
enough that changing a few labels instead of scanning lots of entries can
sometimes be a good thing. As far as I can tell, the external interfaces seem the
same for now.

One issue with R for a long time was how they did not do something more like a
Python dictionary and it looks like ?

ABOVE

