K Purna Prakash
2020-Aug-04 11:54 UTC
[R] Mathematical working procedure of duplicated() function in r
Dear Sir(s),

I request that you provide the detailed internal mathematical working mechanism of the following expression, for better understanding:

x[duplicated(x) | duplicated(x, fromLast=TRUE), ]

I am confused about how duplicates are identified when there are thousands of records. I look forward to a positive response.

Thank you,
K. Purna Prakash

[[alternative HTML version deleted]]
Rui Barradas
2020-Aug-04 16:35 UTC
[R] Mathematical working procedure of duplicated() function in r
Hello,

R is open source, so you can see exactly how any function works internally. To view a function's code, type its name without parentheses at the R prompt.

> duplicated
function (x, incomparables = FALSE, ...)
UseMethod("duplicated")
<bytecode: 0x55e5ef683040>
<environment: namespace:base>

This shows that duplicated is a generic function, with methods written to handle the different S3 classes of the object x. A generic like this normally has a default method, here duplicated.default:

> duplicated.default
function (x, incomparables = FALSE, fromLast = FALSE, nmax = NA, ...)
.Internal(duplicated(x, incomparables, fromLast, if (is.factor(x)) min(length(x), nlevels(x) + 1L) else nmax))
<bytecode: 0x55e5ef6826a0>
<environment: namespace:base>

The default method calls .Internal(duplicated(...)), which means the real work is done in compiled C code. To read it you will have to download the R sources, if you haven't done so yet, and search for the file that implements that function. The file is src/main/unique.c (look for do_duplicated).

Good reading.

Also, as the posting guide asks R-help users to do, please post in plain text, not in HTML.

Hope this helps,
Rui Barradas
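As a small illustration of the dispatch described above, the following sketch lists the S3 methods registered for duplicated and applies the expression from the question to a toy data frame. The data frame x and the variable names from_first, from_last and dups are made up for this example; they do not come from the original post.

```r
# Toy data frame (hypothetical data, for illustration only)
x <- data.frame(id  = c(1, 2, 1, 3, 2),
                grp = c("a", "b", "a", "c", "b"))

# List the S3 methods available for the generic
print(methods("duplicated"))

# On a data frame the generic dispatches to duplicated.data.frame
from_first <- duplicated(x)                   # flags the later copies of a row
from_last  <- duplicated(x, fromLast = TRUE)  # flags the earlier copies of a row

# Combining both directions keeps every copy of any repeated row
dups <- x[from_first | from_last, ]
print(dups)
```

Scanning from the front alone would miss the first occurrence of each repeated row, which is why the expression in the question ORs the two scans together.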
Greg Snow
2020-Aug-04 19:22 UTC
[R] Mathematical working procedure of duplicated() function in r
Rui pointed out that you can examine the source yourself. FAQ 7.40 has a link to an article with details on finding and examining the source code.

A general algorithm for checking for duplicates follows (I have not examined the R source code to see whether it uses something more clever).

1. Create an empty object (I will call it seen). This could be a simple vector, but for efficiency it is better to use an object type that has fast lookup, e.g. a binary tree or an associative array/hash/dictionary.
2. Create a vector of logicals the same length as x (I will call it result).
3. Loop from 1 to the length of x (or from the length down to 1 if fromLast=TRUE). On each iteration, check whether the value of x[i] is in seen.
   - If it is: set result[i] to TRUE.
   - If it is not: add the current value to seen and set result[i] to FALSE.
4. After the loop finishes, throw away seen and reclaim the memory, then return result.

Since it looks like you are using this on a matrix or data frame, there is probably a preprocessing step that combines all the values in each row into a single character string.

--
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com
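The steps above can be sketched in plain R. This is only an illustration of the general "seen set" idea, not R's actual implementation (which is compiled C code). The function name dup_sketch is hypothetical, a hashed environment stands in for the fast-lookup object, and values are compared through their character representation, which is a simplification (for example, it does not separate a missing value from the string "NA", and it compares doubles only to the precision that as.character keeps).

```r
# Sketch of the generic "seen set" algorithm; NOT R's internal C code.
# dup_sketch is a hypothetical name used only for this illustration.
dup_sketch <- function(x, fromLast = FALSE) {
  seen   <- new.env(hash = TRUE)   # hashed environment used as a lookup table
  result <- logical(length(x))     # starts out all FALSE
  idx <- if (fromLast) rev(seq_along(x)) else seq_along(x)
  for (i in idx) {
    # Prefix the key so that an empty string is still a valid variable name
    key <- if (is.na(x[i])) "NA" else paste0("k", x[i])
    if (exists(key, envir = seen, inherits = FALSE)) {
      result[i] <- TRUE            # value already seen: mark as duplicate
    } else {
      assign(key, TRUE, envir = seen)  # first sighting: remember it
    }
  }
  result                           # seen is garbage-collected after return
}

# Vector example, compared against the built-in function
v <- c("a", "b", "a", "c", "b", "a")
print(dup_sketch(v))
print(duplicated(v))

# Row-wise example: collapse each row into one string first
# (assumed preprocessing step; the toy data frame is made up)
df <- data.frame(id = c(1, 2, 1, 3, 2), grp = c("a", "b", "a", "c", "b"))
row_keys <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
print(df[dup_sketch(row_keys) | dup_sketch(row_keys, fromLast = TRUE), ])
```

With a hashed lookup each element is checked in roughly constant time on average, so the whole scan grows about linearly with the number of records, which is why thousands of rows are not a problem.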