Bert Gunter
2024-Sep-04 06:32 UTC
[R] fixed set.seed + kmeans output disagree on distinct platforms
I have no clue, but I did note that you are using different versions of
BLAS/LAPACK on the different platforms. Could that be (part) of the issue?

Cheers,
Bert

On Tue, Sep 3, 2024 at 10:24 PM Iago Giné Vázquez <iago.gine at sjd.es> wrote:

> Hi all,
>
> I built a dataset by processing the same data in the same way on Windows
> and on Linux.
>
> The output of the Windows processing is:
> https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads
> The output of the Linux processing is:
> https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads
>
> exdata=as.matrix(read.csv(
>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads",
>   header=FALSE))
> exdata2=as.matrix(read.csv(
>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads",
>   header=FALSE))
>
> They are not identical (`identical(exdata,exdata2)` is FALSE), but they
> are essentially equal (`all.equal(exdata,exdata2)` is TRUE). If I run
>
> set.seed(20232260)
> exkmns <- kmeans(exdata, centers = 7, iter.max = 2000, nstart = 750)
>
> I get
>
> exkmns$centers
>           V1         V2          V3          V4          V5           V6
> 1 -0.4910731 -0.2662055  0.57928758  0.14267293 -0.03013791  0.106472717
> 2  0.5301237  0.2815620 -0.23898532  1.00979412 -0.26123328  0.068099931
> 3  0.2255298 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.107538855
> 4 -0.2616257  0.5680582  0.55387437 -0.09562789 -0.01706577 -0.028248679
> 5 -0.4820078 -0.1667370 -0.46533618 -0.05271446  0.05477352  0.005236259
> 6  0.6455994 -0.1396674  0.05988547 -0.15557399  0.62766365  0.031051986
> 7  0.1072127  0.5538876 -0.33117098 -0.43209203 -0.18646403 -0.081273130
>
> both in Windows (1) and in Linux (2, 3), up to row order. If I run in
> Linux on my computer (2)
>
> set.seed(20232260)
> exkmns2 <- kmeans(exdata2, centers = 7, iter.max = 2000, nstart = 750)
>
> then I get
>
> exkmns2$centers
>            V1         V2          V3          V4          V5          V6
> 1  0.64559941 -0.1396674  0.05988547 -0.15557399  0.62766365  0.03105199
> 2 -0.26162573  0.5680582  0.55387437 -0.09562789 -0.01706577 -0.02824868
> 3  0.53012369  0.2815620 -0.23898532  1.00979412 -0.26123328  0.06809993
> 4  0.03409765  0.3492520 -0.36910409 -0.40721418 -0.21482793  0.03073180
> 5 -0.58527394 -0.1790337 -0.46778956  0.03573883  0.15473589 -0.07980379
> 6 -0.49107314 -0.2662055  0.57928758  0.14267293 -0.03013791  0.10647272
> 7  0.22552984 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.10753886
>
> Therefore, all rows are essentially equal except for rows 5 and 7 of the
> first output (5 and 4 of the second output). With a bit more detail:
>
> * Row 0.2255298 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.107538855
>   belongs to exdata (and exdata2) and is a center of both outputs.
> * Row 0.1072127 0.5538876 -0.33117098 -0.43209203 -0.18646403 -0.081273130
>   belongs to the dataset and is a center only of the exdata output.
> * Row -0.4820078 -0.1667370 -0.46533618 -0.05271446 0.05477352 0.005236259
>   does not belong to the dataset and is a center only of the exdata output.
> * Row -0.58527394 -0.1790337 -0.46778956 0.03573883 0.15473589 -0.07980379
>   belongs to the dataset and is a center only for exdata2 on Linux on my
>   computer.
> * Row 0.03409765 0.3492520 -0.36910409 -0.40721418 -0.21482793 0.03073180
>   does not belong to the dataset and is a center only for exdata2 on Linux
>   on my computer.
> * The other 4 rows (1, 2, 4 and 6 of the first output) do not belong to the
>   dataset and are common centers.
>
> Furthermore, if I run
>
> set.seed(20232260)
> exkmns <- kmeans(exdata, centers = 7, iter.max = 2000, nstart = 750)
>
> in posit.cloud (3), I get the same result as above. However, if I run
> (either in posit.cloud or in Windows)
>
> set.seed(20232260)
> exkmns2 <- kmeans(exdata2, centers = 7, iter.max = 2000, nstart = 750)
>
> then I get
>
> exkmns2$centers
>           V1         V2          V3         V4          V5          V6
> 1  0.6426035 -0.1449498  0.05843435 -0.1527968  0.62943077  0.02984948
> 2 -0.4092382 -0.3740695  0.69597037  0.1956896 -0.05026200 -0.01453132
> 3  0.1072127  0.5538876 -0.33117098 -0.4320920 -0.18646403 -0.08127313
> 4  0.2255298 -0.5165964 -0.02498471 -0.2043827 -0.41224195 -0.10753886
> 5  0.5301237  0.2815620 -0.23898532  1.0097941 -0.26123328  0.06809993
> 6 -0.5223387 -0.1484517 -0.38982567 -0.0341488  0.06446446  0.03622056
> 7 -0.2701703  0.5263218  0.52942311 -0.1112202 -0.03460591  0.03577287
>
> So only its rows 4 and 5 are centers common to both previous outputs,
> and row 3 is shared with the exdata centers.
>
> Does all this make any sense?
>
> Thanks!
>
> Iago
>
> (1)
> R version 4.4.1 (2024-06-14 ucrt)
> Platform: x86_64-w64-mingw32/x64
> Running under: Windows 10 x64 (build 19045)
>
> Matrix products: default
>
> (2)
> R version 4.4.1 (2024-06-14)
> Platform: x86_64-pc-linux-gnu
> Running under: Debian GNU/Linux 12 (bookworm)
>
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so;
> LAPACK version 3.11.0
>
> (3)
> R version 4.4.1 (2024-06-14)
> Platform: x86_64-pc-linux-gnu
> Running under: Ubuntu 20.04.6 LTS
>
> Matrix products: default
> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;
> LAPACK version 3.9.0
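
As a side note on the `identical()` versus `all.equal()` observation in the quoted message, here is a minimal sketch in base R of how two results can agree up to a tolerance yet differ in their last bits. The numbers are illustrative and are not taken from exdata or exdata2; whether BLAS/LAPACK is actually what produced the differences between the two files is not established in this thread.

```r
# Floating-point addition is not associative, so code that sums the same
# numbers in a different order can produce results that differ in the
# last bits.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

identical(a, b)          # FALSE: exact, bitwise comparison
isTRUE(all.equal(a, b))  # TRUE: comparison up to a small tolerance
abs(a - b)               # size of the discrepancy

# The same checks on two small matrices (illustrative values only)
m1 <- matrix(c(a, 1, 2, 3), nrow = 2)
m2 <- matrix(c(b, 1, 2, 3), nrow = 2)
max(abs(m1 - m2))        # largest element-wise difference

# Report which LAPACK this R session is using
La_version()
La_library()
```

Running `max(abs(exdata - exdata2))` in the same way would show how far apart the two input matrices really are.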
Martin Maechler
2024-Sep-04 08:41 UTC
[R] fixed set.seed + kmeans output disagree on distinct platforms
>>>>> Bert Gunter
>>>>>     on Tue, 3 Sep 2024 23:32:25 -0700 writes:

    > I have no clue, but I did note that you are using different versions of
    > BLAS/LAPACK on the different platforms. Could that be (part) of the issue?

Good catch! My gut feeling would say "yes!": that is almost surely part
of the issue.

    > Cheers,
    > Bert

Additionally, careful reading of the help page (*before* any post ..)
would have shown

  Note:

     The clusters are numbered in the returned object, but they are a
     _set_ and no ordering is implied.  (Their apparent ordering may
     differ by platform.)

Martin

    > On Tue, Sep 3, 2024 at 10:24 PM Iago Giné Vázquez <iago.gine at sjd.es> wrote:

    >> Hi all,
    >>
    >> I built a dataset by processing the same data in the same way on Windows
    >> and on Linux.
    >>
    >> The output of the Windows processing is:
    >> https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads
    >> The output of the Linux processing is:
    >> https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads
    >>
    >> exdata=as.matrix(read.csv(
    >>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads",
    >>   header=FALSE))
    >> exdata2=as.matrix(read.csv(
    >>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads",
    >>   header=FALSE))
    >>
    >> They are not identical (`identical(exdata,exdata2)` is FALSE), but they
    >> are essentially equal (`all.equal(exdata,exdata2)` is TRUE).
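
Since the help page says the clusters form a set with no implied ordering, two fits are best compared while ignoring row order. The following is a minimal sketch, not anything from the stats package: `same_centers()` is a hypothetical helper, the tolerance `tol` is an arbitrary choice, and the data are synthetic rather than exdata.

```r
# Hypothetical helper: TRUE if every row of `a` has its own distinct,
# close match among the rows of `b`, whatever the row order.
same_centers <- function(a, b, tol = 1e-6) {
  if (!identical(dim(a), dim(b))) return(FALSE)
  k <- nrow(a)
  # Euclidean distance between each row of a and each row of b
  d <- as.matrix(dist(rbind(a, b)))[seq_len(k), k + seq_len(k), drop = FALSE]
  nearest <- apply(d, 1, which.min)
  anyDuplicated(nearest) == 0L && all(d[cbind(seq_len(k), nearest)] < tol)
}

a <- matrix(c(0, 0,
              1, 1,
              5, 5), ncol = 2, byrow = TRUE)
b <- a[c(3, 1, 2), ] + 1e-9   # same centers, permuted, with tiny noise
other <- a
other[3, ] <- c(4, 4)         # one center genuinely different

same_centers(a, b)            # TRUE
same_centers(a, other)        # FALSE

# With real fits, also compare the objective each run reached: a lower
# tot.withinss means that run found a better local optimum.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit1 <- kmeans(x, centers = 2, nstart = 10)
fit2 <- kmeans(x, centers = 2, nstart = 10)
same_centers(fit1$centers, fit2$centers)
c(fit1$tot.withinss, fit2$tot.withinss)
```

If two runs return genuinely different center sets, as in the outputs quoted in this thread, comparing `tot.withinss` shows whether one run simply found a better local optimum than the other.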