Stephen,
Languages have their own philosophies and are often focused initially on doing
specific things well. Later, they tend to accumulate additional functionality
both in the base language and extensions.
I am wondering if you have explained your need precisely enough to get the
answers you want.
SQL and Python have their own ways and both have advantages but also huge
deficiencies relative to just base R.
But there are rules you live with and if you choose, say, a data.frame to store
things in, the columns must all be the same length. The unique members of the
columns of a data.frame are unlikely to be the same in number, so storing them
in a data.frame does not work. They can be stored quite a few other ways, such
as a list of vectors.
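A minimal sketch of that point, using made-up data (the name example_df and its
values are only for illustration): the unique values per column differ in
number, so a list holds them comfortably while data.frame() refuses.

```r
# Hypothetical example data: two columns with different numbers of unique values
example_df <- data.frame(ints  = c(5, 4, 3, 3, 4, 5),
                         chars = c("z", "i", "t", "s", "t", "i"))

# A list has no equal-length requirement, so this works
uniq <- lapply(example_df, unique)
lengths(uniq)   # 3 unique ints, 4 unique chars

# Forcing the same pieces into a data.frame fails because the lengths differ
result <- try(data.frame(uniq), silent = TRUE)
inherits(result, "try-error")   # TRUE
```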
And what is your definition of ease? I can program in Python and SQL and way
over a hundred other languages and I know I need to adapt my thinking to the
flow of the language and not the other way around. Base R was not designed to be
like either SQL or Python. But it can be extended quite a few ways to do just
about anything.
What you ran into for example is the fact that some functionality is more
selective in what it works on. A data.frame with one column is logically the
same as a matrix with one column and as a vector but in reality, they are not
the same thing. Yes, they can be converted into each other fairly trivially.
sort() seems to care what you feed it. If you did not worry about efficiency,
you could have a version of sort that accepts a wide variety of inputs, converts
any it can to some possibly common internal form, then converts the output back
into the form it was received in, or uses an argument to specify the
output format. It is not hard in R to make such a function as it has the
primitives needed to examine an arbitrary object and see what dimensions it has
for some number of types and so on, and has utilities to do the conversion.
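Here is a rough sketch of what such a forgiving wrapper might look like. The
name flexible_sort is my own invention, it only handles plain vectors plus
one-column matrices and data.frames, and it is an illustration rather than
anything shipped with R:

```r
# Hypothetical helper: sort a vector, a one-column matrix, or a one-column
# data.frame, and hand the result back in the shape it arrived in.
flexible_sort <- function(x, ...) {
  if (is.data.frame(x)) {
    stopifnot(ncol(x) == 1)
    out <- data.frame(sort(x[[1]], ...))    # sort the underlying column
    names(out) <- names(x)                  # keep the original column name
    out
  } else if (is.matrix(x)) {
    stopifnot(ncol(x) == 1)
    matrix(sort(as.vector(x), ...), ncol = 1,
           dimnames = list(NULL, colnames(x)))
  } else {
    sort(x, ...)                            # plain vectors go straight through
  }
}

flexible_sort(c(3, 1, 2))
flexible_sort(data.frame(v = c("b", "a")))
flexible_sort(matrix(c(2, 1), ncol = 1))
```

The extra arguments in ... are simply passed along to sort(), so something like
decreasing = TRUE still works.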
If you want a language that has calculated every possible combination of ways to
combine functions and already made tens of thousands available, good luck. What
languages (including Python and R) expect is for you to compose such
combinations yourself in one of many ways. The annoying discussions here between
purists and those wanting to use pre-made packages aside, your question can be
handled in many of the ways we already discussed. They include making your own
(often very small) function that implements consolidating the many steps into
one logical step. It can mean using pipelines like the new "|>"
operator recently added to base R or the older versions often used in the
tidyverse packages like "%>%".
You want to take a data.frame and select a column at a time and ask for it to be
made into unique values, then ordered and shown. So you want a VECTOR, and your
initial use of the "[" operator does not take the underlying list
structure of a data.frame apart the way you might have thought; it returns a
narrow, one-column data.frame. So you MAY need to either extract the column
using "[[" or use various routines R supplies like unlist() or as.vector().
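A quick way to see the difference for yourself, sketched on a throwaway
one-column data.frame (example_df is just an illustrative name):

```r
example_df <- data.frame(ints = c(5, 4, 3, 3, 4, 5))

class(example_df[1])      # "data.frame": still a (narrow) data.frame
class(example_df[[1]])    # "numeric": the column itself, as a vector
class(example_df$ints)    # "numeric": same thing, asked for by name

# unlist() flattens the one-column data.frame into an atomic vector
flat <- unlist(example_df[1])
is.data.frame(flat)       # FALSE
is.numeric(flat)          # TRUE
```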
Here is a pipeline using this as my data:
mydf <- data.frame(ints=c(5,4,3,3,4,5),
                   chars=c("z","i","t","s","t","i"))
Note the number of unique items differs, as does the data type:
mydf
  ints chars
1    5     z
2    4     i
3    3     t
4    3     s
5    4     t
6    5     i
To handle the columns one at a time can be done using a pipeline like:
> mydf[2] |> unlist() |> unique() |> sort()
[1] "i" "s" "t" "z"
> mydf[1] |> unlist() |> unique() |> sort()
[1] 3 4 5
The above takes a two-column data.frame and restricts it to a one-column
data.frame, then passes that temporary object as the first argument of the
unlist() function, which returns another temporary object that is a vector (in
one case numeric and in the other character). That result is passed as the
first argument of unique(), which returns a shorter vector in the same order,
and then you pass it on to sort(), which reorders it.
Note the first steps can be shortened if using the "[[" notation or by
using the named way of asking for a column:
> mydf[[1]] |> unique() |> sort()
[1] 3 4 5
> mydf$ints |> unique() |> sort()
[1] 3 4 5
But pipelines are simply syntactic sugar mostly so you also can just nest
function calls as in sort(unique(unlist(mydf[1]))) or do what I showed earlier
of creating a function that does the work invisibly and call that.
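As a reminder of how small such a function can be, here is a sketch. The name
unisort is only illustrative, and the data is a stand-in for yours:

```r
# Hypothetical helper: flatten, de-duplicate, then sort, all in one call.
# Extra arguments (e.g. decreasing = TRUE) are passed through to sort().
unisort <- function(x, ...) sort(unique(unlist(x, use.names = FALSE)), ...)

example_df <- data.frame(ints  = c(5, 4, 3, 3, 4, 5),
                         chars = c("z", "i", "t", "s", "t", "i"))

unisort(example_df[1])                        # one-column data.frame is fine
unisort(example_df$chars)                     # so is a plain vector
unisort(example_df$ints, decreasing = TRUE)   # sort() options still work
```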
Python often does its own version of pipelines by adding a dot at the end and
calling a method and, if needed, another dot and then calling a method on the
resulting object and so on. But that is arguably more limiting in some ways and
more powerful in others. Different paradigms. In R, you do not do
object.method1().method2(args).method3(args) so a pipeline method is used to
sort of do something related.
Now if your need was to do your operation on an entire data.frame at once, then
sometimes you will find a way to do it easily and sometimes use things like
functional programming techniques. It is so common to calculate the sums or
means of columns in a data.frame (or matrix) that functions like rowSums() and
colSums() and colMeans() are available in R. But R also allows fairly
arbitrary things to be done, as with the lapply() family of functions, which
apply an arbitrary function, perhaps including arguments, like:
lapply(mydf, max)
sapply(mydf, `[`, 2)
The latter takes the second value in each and every column of the data.frame
and, when possible, consolidates the results. Of course, when the uniqueness
criterion produces uneven numbers of results, the output cannot be simplified
and stays a list. Below I show how you can do many things, including nested
calls:
> lapply(mydf, sort)
$ints
[1] 3 3 4 4 5 5
$chars
[1] "i" "i" "s" "t" "t" "z"
> lapply(lapply(mydf, sort), unique)
$ints
[1] 3 4 5
$chars
[1] "i" "s" "t" "z"
> lapply(lapply(mydf, unique), sort)
$ints
[1] 3 4 5
$chars
[1] "i" "s" "t" "z"
> lapply(lapply(lapply(mydf, unique), sort), toupper)
$ints
[1] "3" "4" "5"
$chars
[1] "I" "S" "T" "Z"
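The nesting above can also be folded into a single lapply() call with a small
anonymous function. This is only a sketch on stand-in data (example_df is an
illustrative name), and it also shows sapply() giving up on simplifying when
the pieces are uneven:

```r
example_df <- data.frame(ints  = c(5, 4, 3, 3, 4, 5),
                         chars = c("z", "i", "t", "s", "t", "i"))

# One pass per column: de-duplicate, then sort
res <- lapply(example_df, function(x) sort(unique(x)))
res$ints    # 3 4 5
res$chars   # "i" "s" "t" "z"

# sapply() tries to simplify to a vector or matrix, but with 3 values in one
# column and 4 in the other it has to return a list as well
res2 <- sapply(example_df, function(x) sort(unique(x)))
is.list(res2)   # TRUE
```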
R has plenty of other such primitives that allow you to compose things many
ways, including other variants like Filter() and Reduce() and Map(), with way
more in various packages, such as pmap() in purrr.
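For a taste of the base R ones, here is a small sketch on stand-in data
(example_df and the particular functions applied are just my choices for
illustration):

```r
example_df <- data.frame(ints  = c(5, 4, 3, 3, 4, 5),
                         chars = c("z", "i", "t", "s", "t", "i"))

# Filter(): keep only the columns for which the predicate is TRUE
numeric_only <- Filter(is.numeric, example_df)
names(numeric_only)   # "ints"

# Map(): walk over the columns and their names in parallel
counts <- Map(function(col, name) paste(name, length(unique(col))),
              example_df, names(example_df))
counts$ints           # "ints 3"

# Reduce(): fold a list of vectors into one, here the union of all values
all_values <- Reduce(union,
                     lapply(example_df, function(x) as.character(unique(x))))
all_values            # "5" "4" "3" "z" "i" "t" "s"
```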
It is simply wrong to insist that a language you are not very familiar with is
not able to (often fairly easily) do all kinds of things.
Back to your question, if I may, I think one of my earlier posts on this topic
suggested another. Use factors which are part of base-R to perform the unique()
for you and then extract the unique levels and re-order them by sorting.
> sort(levels(factor(mydf[[1]])))
[1] "3" "4" "5"
> sort(levels(factor(mydf[[2]])))
[1] "i" "s" "t" "z"
But note this converts everything to characters so a numeric may need to be
converted back, and yes, the sorting is not done numerically.
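To make that caveat concrete, here is a small sketch with numbers chosen so
that string order and numeric order disagree (the vector x is made up for the
purpose):

```r
x <- c(10, 9, 2, 9)

# factor() orders its levels by the original values, then stores them as text
levels(factor(x))               # "2" "9" "10"

# Sorting those strings compares them as text, so "10" lands before "2"
sort(levels(factor(x)))         # "10" "2" "9"

# Converting the levels back to numbers restores a numeric, ordered result
as.numeric(levels(factor(x)))   # 2 9 10
```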
Generally, there are oodles of ways to do anything. If this were Python, you
might create an object that maintains a sorted set, for example, but that just
hides the complexity, as the various methods of the underlying object have to
carefully deal with keeping track of the current order, with how things are
added into the right place, and with tightening up the data structure whenever
something is removed. Others simply supply a sorted() method to use only when
you actually need that. R can be used in similar ways and you can create objects
of quite a few kinds to implement some things but it does not often seem
necessary, at least to me.
I can imagine writing a function that makes a data.frame even from vectors of
unequal length by calculating the length of the longest vector and then setting
each shorter vector to be longer with code like:
length(a) <- longest
You can then patch together all the results into a data.frame with trailing NA
values on some columns.
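In case the length trick is unfamiliar, here is a tiny sketch of it on made-up
vectors, both in the usual assignment form and in the function-call form that
lapply() can use:

```r
a <- c("x", "y")
longest <- 4

# Assigning a larger length pads the vector with NA on the right
length(a) <- longest
a                               # "x" "y" NA NA

# The same operation written as an ordinary function call
padded <- `length<-`(c(1, 2), 3)
padded                          # 1 2 NA
```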
I quickly cobbled together a few lines that can do that and can be placed inside
a function to return this:
lapply(lapply(lapply(mydf, unique), sort), toupper) -> uneven
longest <- max(unlist(lapply(uneven, length)))
answer <- data.frame(lapply(uneven, `length<-`, longest))
print(answer)
  ints chars
1    3     I
2    4     S
3    5     T
4 <NA>     Z
Now this has a single NA but I suggest it generalizes well to a more complex
example:
   ints lower upper
1    10     k     Z
2     9     j     A
3     8     i     Z
4     7     h     A
5     6     g     Z
6     5     f     A
7     4     h     Z
8     3     i     A
9     2     j     Z
10    1     k     A
11    2     l     Z
12    3     m     A
These are uneven and three columns so I tried a function version:
mydf2 <- data.frame(ints = c(10:1, 2:3),
                    lower = c(letters[11:6], letters[8:13]),
                    upper = rep(c("Z", "A"), 6))
unisortuneven <- function(anydf) {
  uneven <- lapply(lapply(lapply(anydf, unique), sort), toupper)
  longest <- max(unlist(lapply(uneven, length)))
  data.frame(lapply(uneven, `length<-`, longest))
}
unisortuneven(mydf2)
   ints lower upper
1     1     F     A
2     2     G     Z
3     3     H  <NA>
4     4     I  <NA>
5     5     J  <NA>
6     6     K  <NA>
7     7     L  <NA>
8     8     M  <NA>
9     9  <NA>  <NA>
10   10  <NA>  <NA>
The above does not format great for text, sadly, so is better shown as the
transpose for display purposes:
> t(unisortuneven(mydf2))
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
ints  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
lower "F"  "G"  "H"  "I"  "J"  "K"  "L"  "M"  NA   NA
upper "A"  "Z"  NA   NA   NA   NA   NA   NA   NA   NA
But hopefully it makes my point that a little thinking and KNOWING about
features of R like how to use a functionalized version of length() that sets a
changed value using the odd notation of `length<-` can let you solve all
kinds of problems in a somewhat abstract manner. Of course the above function is
not refined and will not handle some useful transformations or deal with errors.
That can make it quite a bit harder and in some cases, make it a good idea to
find someone sharing a package where they did the hard work and documented
exactly what their function does.
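As a hint of what a slightly more refined version might look like, here is a
sketch that checks its input and makes the transformation optional, so numeric
columns can stay numeric. The name unisort_padded and the transform argument
are hypothetical, my own choices, and it is still far from a hardened package
function:

```r
# Hypothetical variant: validate the input, and only transform if asked to
unisort_padded <- function(anydf, transform = identity) {
  if (!is.data.frame(anydf)) stop("expected a data.frame")
  if (ncol(anydf) == 0) return(data.frame())   # nothing to do

  # Per column: de-duplicate, sort, then apply the optional transformation
  pieces <- lapply(anydf, function(col) transform(sort(unique(col))))

  # Pad every piece with trailing NA up to the longest one
  longest <- max(lengths(pieces))
  data.frame(lapply(pieces, `length<-`, longest))
}

example_df <- data.frame(ints  = c(10:1, 2:3),
                         upper = rep(c("Z", "A"), 6))

res <- unisort_padded(example_df)
res$ints      # 1 to 10, still integer
res$upper     # "A" "Z" followed by eight NA

bad <- try(unisort_padded(1:3), silent = TRUE)
inherits(bad, "try-error")   # TRUE: a non-data.frame is rejected
```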
I am eclectic and happy to switch tools at a moment's notice if they offer
an interesting way to do something. But, within a language, I learn the darn
rules and also the idioms often used and then choose from among many ways I can
see to solve something and use what is available. You had a trivial solution
available to you to simply do one step at a time and save intermediate values,
transforming at times. Some of us have sent you more general solutions. Do you
still think what you want is so much harder to do in R, or that perhaps you are
not thinking in R and thus want it to do it some other way other languages do?
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Stephen H.
Dawson, DSL via R-help
Sent: Tuesday, December 21, 2021 10:16 AM
To: Rui Barradas <ruipbarradas at sapo.pt>; Stephen H. Dawson, DSL via
R-help <r-help at r-project.org>
Subject: Re: [R] Adding SORT to UNIQUE
Thanks everyone for the replies.
It is clear one either needs to write a function or put the unique entries into
another dataframe.
It seems odd R cannot sort a list of unique column entries with ease.
Python and SQL can do it with ease.
QUESTION
Is there a simpler means than other than the unique function to capture distinct
column entries, then sort that list?
*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com <http://www.shdawson.com>
On 12/20/21 5:53 PM, Rui Barradas wrote:
> Hello,
>
> Inline.
>
> At 21:18 on 20/12/21, Stephen H. Dawson, DSL via R-help wrote:
>> Thanks.
>>
>> sort(unique(Data[[1]]))
>>
>> This syntax provides row numbers, not column values.
>
> This is not right.
> The syntax Data[1] extracts a sub-data.frame, the syntax Data[[1]]
> extracts the column vector.
>
> As for my previous answer, it was not addressing the question, I
> misinterpreted it as being a question on how to sort by numeric order
> when the data is not numeric. Here is a, hopefully, complete answer.
> Still with package stringr.
>
>
> cols_to_sort <- 1:4
>
> Data2 <- lapply(Data[cols_to_sort], \(x){
>   stringr::str_sort(unique(x), numeric = TRUE)
> })
>
>
> Or using Avi's suggestion of writing a function to do all the work and
> simplify the lapply loop later,
>
>
> unisort2 <- function(vec, ...) stringr::str_sort(unique(vec), ...)
> Data2 <- lapply(Data[cols_to_sort], unisort2, numeric = TRUE)
>
>
> Hope this helps,
>
> Rui Barradas
>
>
>>
>> *Stephen Dawson, DSL*
>> /Executive Strategy Consultant/
>> Business & Technology
>> +1 (865) 804-3454
>> http://www.shdawson.com <http://www.shdawson.com>
>>
>>
>> On 12/20/21 11:58 AM, Stephen H. Dawson, DSL via R-help wrote:
>>> Hi,
>>>
>>>
>>> Running a simple syntax set to review entries in dataframe columns.
>>> Here is the working code.
>>>
>>> Data <- read.csv("./input/Source.csv", header=T)
>>> describe(Data)
>>> summary(Data)
>>> unique(Data[1])
>>> unique(Data[2])
>>> unique(Data[3])
>>> unique(Data[4])
>>>
>>> I would like to add sort the unique entries. The data in the various
>>> columns are not defined as numbers, but also text. I realize 1 and
>>> 10 will not sort properly, as the column is not defined as a number,
>>> but want to see what I have in the columns viewed as sorted.
>>>
>>> QUESTION
>>> What is the best process to sort unique output, please?
>>>
>>>
>>> Thanks.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>