I'm sorry if this has come across as a homework assignment! I was trying to
provide a simple example.
There are actually 38323 rows of data, each row is an observation of the
percent that each of those veg types occupies in a spatial unit - where
each line adds to 90 - and values are different every line.
I need a way to categorize the data, so I can reduce the number of unique
observations.

So instead of 38323 unique observations - I can reduce this to
X number of High/Med/Low
X number of Med/Low/High
X number of Low/High/Med
etc... for all combinations

I hope this makes it more clear......
thank you all for your responses,
JC

On Sun, May 29, 2022 at 1:16 PM Avi Gross via R-help <r-help at r-project.org>
wrote:

> Tom,
> You may have a very different impression of what was asked! LOL!
> Unless Janet clarifies what seems a bit like a homework assignment, it
> seems to be a fairly simple and straightforward assignment with exactly
> three rows/columns and asking how to replace the variables, in a sense, by
> finding the high and low and perhaps thus identifying the medium, but to do
> this for each row without changing the order of the resulting data.frame.
> I note most techniques people have used focus on columns, not rows, but an
> all-numeric data.frame can be transposed, or converted to a matrix and
> later converted back.
> If this is HW, the question becomes what has been taught so far and is
> supposed to be used in solving it. Can they make their own functions
> perhaps to be called three times, once per row or column, to replace that
> row/column, or can they use some form of loop to iterate over the columns?
> Does it need to sort of be done in place or can they create gradually a
> second data.frame and then move the pointer to it and lots of other similar
> ideas.
> I am not sure, other than as a HW assignment, why this transformation
> would need to be done but of course, there may well be a reason.
> I note that the particular example shown just happens to create almost a
> magic square as the sum of rows and columns and the major diagonal happen
> to be 90, albeit the reverse diagonal is all 50's.
> Again, there are many solutions imaginable but the goal may be more
> specific and I shudder to supply one given that too often questions here
> are not detailed enough and are misunderstood. In this case, I thought I
> understood until I saw what Tom wrote! LOL!
> I will add this. Is it guaranteed that no two items in the same row are
> ever equal or is there some requirement for how to handle a tie? And note
> there are base R functions called min() and max() and you can ask for
> things like:
>
> if ( current == min(mydata[1,])) ...
>
>
> -----Original Message-----
> From: Tom Woolman <twoolman at ontargettek.com>
> To: Janet Choate <jsc.eco at gmail.com>
> Cc: r-help at r-project.org
> Sent: Sun, May 29, 2022 3:42 pm
> Subject: Re: [R] categorizing data
>
>
> Some ideas:
>
> You could create a cluster model with k=3 for each of the 3 variables,
> to determine what constitutes high/medium/low centroid values for each
> of the 3 types of plant types. Centroid values could then be used as the
> upper/lower boundary ranges for high/med/low.
>
> Or utilize a histogram for each variable, and use quantiles or
> densities, etc. to determine the natural breaks for the high/med/low
> ranges for each of the IVs.
>
>
> On 2022-05-29 15:28, Janet Choate wrote:
> > Hi R community,
> > I have a data frame with three variables, where each row adds up to 90.
> > I want to assign a category of low, medium, or high to the values in
> > each row - where the lowest value per row will be set to 10, the medium
> > value set to 30, and the high value set to 50 - so each row still adds
> > up to 90.
> >
> > For example:
> > Data: Orig
> > tree  shrub  grass
> > 32    11     47
> > 23    41     26
> > 49    23     18
> >
> > Data: New
> > tree  shrub  grass
> > 30    10     50
> > 10    50     30
> > 50    30     10
> >
> > I am not attaching any code here as I have not been able to write
> > anything effective! appreciate help with this!
> > thank you,
> > JC
> >
> > --
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.

--
Tague Team Lab Manager
1005 Bren Hall
UCSB, Santa Barbara, CA.
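The stated goal of counting how many rows fall into each High/Med/Low ordering could be sketched roughly as below. This is only an illustration on the three-row toy data from the original post, not anyone's posted solution: the names `labels`, `pattern` and `pattern_counts` are made up for the example, and ties within a row are broken by column position (`ties.method = "first"`), which is an assumption the real data may or may not tolerate.

```r
# Toy data from the original post ("Data: Orig").
df <- data.frame(tree  = c(32, 23, 49),
                 shrub = c(11, 41, 23),
                 grass = c(47, 26, 18))

# Hypothetical category labels, indexed by within-row rank (1 = lowest).
labels <- c("Low", "Med", "High")

# For each row, rank the three values and paste the labels into one string,
# e.g. "Med/Low/High" means tree is medium, shrub is low, grass is high.
pattern <- apply(as.matrix(df), 1, function(x) {
  paste(labels[rank(x, ties.method = "first")], collapse = "/")
})

# Count how many rows share each pattern.
pattern_counts <- table(pattern)
pattern_counts
```

With three columns there are at most six distinct patterns, so the 38323 rows would collapse to at most six counts under this scheme.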
Janet,

Thanks for the clarification.

What obfuscated your example a bit was specifying replacement by numbers adding up to 90. Nothing wrong with that, but R is also often used with categorical variables like the strings "low", "medium" and "high". You can then calculate other statistics using sum() and other methods, but using 10, 30 and 50 may be what you want or need.

Since you have lots of rows, you may not necessarily want to use some techniques such as conversions to/from a matrix or transpositions. Others have made suggestions on ways to go, and a fairly straightforward and simple way to go would be a loop from 1 to nrow(df) in which you evaluate that row. In a sense, it is cleaner to leave the data alone and make your changes to a second data.frame, or to three new columns added to the current one, and one way is judicious use of ifelse() combined with min() and max().

Assuming you looped on variable "index", then in your loop you can refer to the three column entries as df[index,] or df[index, 1:3] to get a horizontal slice (a one-row data.frame). You can feed that to min() and max() and have code in your explicit loop like:

    threesome <- df[index, 1:3] ; low <- min(threesome) ; high <- max(threesome)

Now to populate the three items with a replacement of 10, 30 or 50 you could use a nested if statement (or ifelse() if doing it vectorized) with logic like:

    if (df[index, 1] == low) { df[index, 1] <- 10 } else if (df[index, 1] == high) { df[index, 1] <- 50 } else { df[index, 1] <- 30 }

My mailer formats my code horribly, hence the one-line codes as EXAMPLES best written on many lines. The point is you can repeat similar code for the second through last items, changing them in place, since low and high were computed from a copy of the original contents.

Would I do it this way? Nope. I can see dozens of valid ways, including some much more concise or faster and some using functionality in packages like dplyr. I would likely use a helper function or two and even replace entire rows at once.
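Written out on many lines, the explicit loop described above might look like the following minimal sketch. It assumes the three columns are named tree, shrub and grass as in the original post, and it writes into a second data.frame (called `recoded` here purely for illustration) rather than changing `df` in place. It also assumes no ties within a row; with a tie, both tied values would get the same replacement and the row would no longer add up to 90.

```r
# Toy data from the original post ("Data: Orig").
df <- data.frame(tree  = c(32, 23, 49),
                 shrub = c(11, 41, 23),
                 grass = c(47, 26, 18))

# Work on a copy so the original percentages are left alone.
recoded <- df

for (index in seq_len(nrow(df))) {
  # Horizontal slice of the original row, flattened to a plain numeric vector.
  threesome <- unlist(df[index, 1:3])
  low  <- min(threesome)
  high <- max(threesome)

  # Same nested-if logic for each of the three columns.
  for (j in 1:3) {
    recoded[index, j] <- if (threesome[j] == low) {
      10
    } else if (threesome[j] == high) {
      50
    } else {
      30
    }
  }
}

recoded
```

On the toy data this reproduces the "Data: New" table from the original post. A row-by-row loop over 38323 rows is not the fastest option, but it keeps the logic easy to follow.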
It sounds like no matter which way you do this, you might want to do something like the table() command to tabulate each column and count the occurrences.

Again, others have provided ways that are worth considering and the above is just a way to get you started with actual code.

-----Original Message-----
From: Janet Choate <jsc.eco at gmail.com>
To: Avi Gross <avigross at verizon.net>
Cc: r-help at r-project.org <r-help at r-project.org>
Sent: Sun, May 29, 2022 4:31 pm
Subject: Re: [R] categorizing data
Hi Janet:

Here is a start to give you the idea; you still need a loop over the rows, using either a "for" loop or one of the apply functions.

1. Preallocate the new data (I am lazy, so it is an array; in this example each row has size three).
2. Order the data and set the values.

    junk <- array(0, dim = c(2,3))
    values <- c(10, 30, 50)
    junk[1, order(c(32, 11, 17))] <- values
    junk[1, ]
    [1] 50 10 30

This works because order() returns the index of the ordering, not the values.

HTH,

-Roy

> On May 29, 2022, at 1:31 PM, Janet Choate <jsc.eco at gmail.com> wrote:

**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
***Note new street address***
110 McAllister Way
Santa Cruz, CA 95060
Phone: (831)-420-3666
Fax: (831) 420-3980
e-mail: Roy.Mendelssohn at noaa.gov
www: https://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected"
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.
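The order() idea above can be carried across every row with apply(). The sketch below is one possible way to do that, not the posted solution: `recode_row` and `recoded` are names invented for the example, and the data are the toy rows from the original post. Because order() breaks ties by position, each row always receives 10, 30 and 50 exactly once, so the row total stays at 90 even when two values are equal; whether that tie handling is acceptable is a separate question.

```r
# Toy data from the original post ("Data: Orig").
df <- data.frame(tree  = c(32, 23, 49),
                 shrub = c(11, 41, 23),
                 grass = c(47, 26, 18))

values <- c(10, 30, 50)

# Hypothetical helper: put 10 at the position of the smallest value,
# 30 at the middle one, and 50 at the largest.
recode_row <- function(x) {
  out <- numeric(length(x))
  out[order(x)] <- values
  out
}

# apply() over rows returns one column per input row, so transpose back.
recoded <- as.data.frame(t(apply(as.matrix(df), 1, recode_row)))
names(recoded) <- names(df)

recoded
```

Converting to a matrix once and back is usually much cheaper than indexing a data.frame inside a loop, which may matter with tens of thousands of rows.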