Poling, William
2020-May-11 15:58 UTC
[R] Help with Kmeans output and using broom to tidy etc..
#RStudio Version 1.2.1335 need this one--> 1.2.5019
sessionInfo()
# R version 4.0.0 Patched (2020-05-03 r78349)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows 10 x64 (build 17763)

Hello:

I have data that I am trying to manipulate for Kmeans clustering.

The original data looks like this:

str(geo1)
# 'data.frame': 2352 obs. of 5 variables:
# $ ID       : Factor w/ 2352 levels "101040199600",..: 590 908 976 509 1674 690 1336 86 726 1702 ...
# $ state    : Factor w/ 41 levels "AL","AR","AZ",..: 32 10 25 11 9 32 13 31 12 12 ...
# $ city     : Factor w/ 1337 levels "ABBOTTSTOWN",..: 932 156 230 698 965 1330 515 727 1127 1304 ...
# $ latitude : num 40.4 31.2 40.8 42.1 26.8 ...
# $ longitude: num -79.9 -81.5 -74 -91.6 -82.1 ...

I created a subset, adding the column prop_of_total:

str(trnd1_tbl)
# tibble [1,457 x 5] (S3: tbl_df/tbl/data.frame)
#  $ city         : Factor w/ 1337 levels "ABBOTTSTOWN",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ state        : Factor w/ 41 levels "AL","AR","AZ",..: 32 36 10 28 12 36 10 11 26 38 ...
#  $ Basecountsum : num [1:1457] 2352 2352 2352 2352 2352 ...
#  $ Basecount2   : num [1:1457] 1 1 1 1 1 2 1 1 2 1 ...
#  $ prop_of_total: num [1:1457] 0.000425 0.000425 0.000425 0.000425 0.000425 ...

Then I spread it:

trnd2_tbl <- trnd1_tbl %>%
  dplyr::select(city, state, prop_of_total) %>%
  spread(key = city, value = prop_of_total, fill = 0) #remove the NA's with fill

str(trnd2_tbl) #tibble [41 x 1,338] (S3: tbl_df/tbl/data.frame)

Then I run kmeans:

kmeans_obj1 <- trnd2_tbl %>%
  dplyr::select(- state) %>%
  kmeans(centers = 20, nstart = 100)

str(kmeans_obj1)
# List of 9
#  $ cluster     : int [1:41] 11 11 9 11 11 4 11 11 16 2 ...
#  $ centers     : num [1:20, 1:1337] 0 0 0 0 0 0 0 0 0 0 ...
#   ..- attr(*, "dimnames")=List of 2
#   .. ..$ : chr [1:20] "1" "2" "3" "4" ...
#   .. ..$ : chr [1:1337] "ABBOTTSTOWN" "ABILENE" "ACWORTH" "ADAMS" ...
#  $ totss       : num 0.00158
#  $ withinss    : num [1:20] 0 0 0 0 0 0 0 0 0 0 ...
#  $ tot.withinss: num 0.0000848
#  $ betweenss   : num 0.0015
#  $ size        : int [1:20] 1 1 1 1 1 1 1 1 1 1 ...
#  $ iter        : int 3
#  $ ifault      : int 0
#  - attr(*, "class")= chr "kmeans"

Then I go and try to tidy:

#Tidy, glance, augment
#Just makes it easier to use or view the obj's in the obj list

broom::tidy(kmeans_obj1) %>% glimpse()

broom::glance(kmeans_obj1)
# A tibble: 1 x 4
#     totss tot.withinss betweenss  iter
#     <dbl>        <dbl>     <dbl> <int>
# 1 0.00158    0.0000848   0.00150     3

However, when I run this piece I get an error:

broom::augment(kmeans_obj1, trnd2_tbl) %>%
  dplyr::select(city, .cluster)

#Error: Must subset columns with a valid subscript vector.
# The subscript has the wrong type `data.frame<
#   u: double
#   x: double
# >`.
# i It must be numeric or character.

Here is the back trace:

rlang::last_error()

# Backtrace:
#  1. broom::augment(kmeans_obj1, trnd2_tbl)
#  9. dplyr::select(., city, .cluster)
# 11. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
# 12. tidyselect:::eval_select_impl(...)
# 20. tidyselect:::vars_select_eval(...)
# 21. tidyselect:::walk_data_tree(expr, data_mask, context_mask)
# 22. tidyselect:::eval_c(expr, data_mask, context_mask)
# 23. tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
# 24. tidyselect:::walk_data_tree(new, data_mask, context_mask)
# 25. tidyselect:::as_indices_sel_impl(...)
# 26. tidyselect:::as_indices_impl(x, vars, strict = strict)
# 27. vctrs::vec_as_subscript(x, logical = "error")

I am not sure what I am supposed to fix.

Maybe someone has had a similar error and can advise me, please?

Thank you.

WHP
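A note on the error above, offered as a sketch rather than a confirmed diagnosis. After the spread() step, the wide table has one row per state and one column per city name, so it appears to contain no column called `city` any more. The type in the error message, `data.frame<u: double, x: double>`, suggests that the bare name `city` was instead looked up outside the data and matched a two-column data frame visible in the session. The `city` dataset in the boot package has columns `u` and `x`, so that is one candidate, but the post does not show which packages were attached. The toy table and the object names `long_tbl`, `wide_tbl`, `km` and `clusters` below are hypothetical stand-ins for the poster's objects.

```r
library(dplyr)
library(tidyr)
library(broom)

set.seed(1)

# Toy stand-in for trnd1_tbl (made-up subset): one row per city/state pair
long_tbl <- tibble(
  city  = c("ATLANTA", "CUMMING", "DALLAS", "HOUSTON", "ERIE", "BROOKLYN"),
  state = c("GA", "GA", "TX", "TX", "PA", "NY"),
  prop_of_total = c(0.0051, 0.0034, 0.0034, 0.0051, 0.0051, 0.0128)
)

# Same reshape as in the post: one row per state, one column per city
wide_tbl <- long_tbl %>%
  spread(key = city, value = prop_of_total, fill = 0)

# The reshape consumed the city column; only state identifies a row now
"city" %in% names(wide_tbl)   # FALSE
"state" %in% names(wide_tbl)  # TRUE

# Optional check: is some other object called `city` visible in this session?
if (exists("city")) str(get("city"))

km <- wide_tbl %>%
  select(-state) %>%
  kmeans(centers = 2, nstart = 10)

# Select the column that actually identifies the clustered rows
clusters <- augment(km, wide_tbl) %>%
  select(state, .cluster)
print(clusters)
```

Under these assumptions, replacing `city` with `state` in the final select() avoids the error, because each clustered row is a state.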
Eric Berger
2020-May-12 13:39 UTC
[R] Help with Kmeans output and using broom to tidy etc..
Can you create a reproducible example?
Your question involves objects that are unknown to us. (geo1, trnd1_tbl)

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
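One common way to answer a request like this is to post the output of dput() on a small slice of the data, so that others can paste it and rebuild the same object. The sketch below assumes a small hypothetical stand-in for trnd1_tbl; the values are illustrative only.

```r
# Hypothetical stand-in for the poster's trnd1_tbl
trnd1_tbl <- data.frame(
  city  = factor(c("ATLANTA", "BROOKLYN", "ERIE", "HOUSTON")),
  state = factor(c("GA", "NY", "PA", "TX")),
  prop_of_total = c(0.0051, 0.0128, 0.0051, 0.0051)
)

# Take a small slice and drop unused factor levels,
# otherwise dput() would list every level of each factor
small <- droplevels(head(trnd1_tbl, 3))

# dput() prints R code that recreates the object when pasted into a session
dput(small)
```

With factors that have over a thousand levels, as in the original data, the droplevels() call is what keeps the pasted output readable.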
Poling, William
2020-May-12 16:10 UTC
[R] Help with Kmeans output and using broom to tidy etc..
Hello Eric, thank you so much for your consideration.

Here are snippets of data that I hope will be helpful.

WHP

geo1a <- geo1[, c(2:5)] # eliminating ID, which is not useful for my purposes anyway

#This is for R-Help use
geo1a <- geo1a %>% top_n(25)

   state           city latitude longitude
1     ME      FAIRFIELD 44.64485 -69.65948
2     ME      JONESPORT 44.57935 -67.56743
3     ME        CASWELL 46.97529 -67.83023
4     ME      ELLSWORTH 44.52916 -68.38717
5     ME     VASSALBORO 44.45095 -69.60629
6     ME          UNION 44.20059 -69.26123
7     ME        PALERMO 44.45142 -69.41115
8     ME          ORONO 44.87426 -68.68327
9     ME    SANGERVILLE 45.10138 -69.33580
10    ME      ISLESBORO 44.29015 -68.90812
11    ME        TOPSHAM 43.93600 -69.96565
12    ME       FREEPORT 43.84089 -70.11160
13    ME      SKOWHEGAN 44.76687 -69.71644
14    ME    MILLINOCKET 45.65501 -68.70261
15    ME      ORRINGTON 44.72417 -68.74026
16    ME     ST. GEORGE 43.96726 -69.20827
17    ME FORT FAIRFIELD 46.80911 -67.88079
18    ME      MARS HILL 46.56580 -67.89006
19    ME       FREEPORT 43.85302 -70.03726
20    ME         EASTON 46.64143 -67.91203
21    ME     WATERVILLE 44.53621 -69.65913
22    ME      BRUNSWICK 43.87771 -69.96297
23    ME      BRUNSWICK 43.91719 -69.89905
24    ME      BUCKSPORT 44.60665 -68.81892
25    ME        FAYETTE 44.46380 -70.12047

trnd1_tbla <- trnd1_tbl %>% top_n(25)
print(trnd1_tbla)
head(trnd1_tbla, n = 25)

# A tibble: 25 x 5
   city      state Basecountsum Basecount2 prop_of_total
   <fct>     <fct>        <dbl>      <dbl>         <dbl>
 1 ATLANTA   GA            2352         12       0.00510
 2 BRADENTON FL            2352          8       0.00340
 3 BROOKLYN  NY            2352         30       0.0128
 4 CHARLOTTE NC            2352          8       0.00340
 5 CHICAGO   IL            2352         17       0.00723
 6 COLUMBUS  OH            2352         11       0.00468
 7 CUMMING   GA            2352          8       0.00340
 8 DALLAS    TX            2352          8       0.00340
 9 ERIE      PA            2352         12       0.00510
10 HOUSTON   TX            2352         12       0.00510
# ... with 15 more rows

WHP
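If the goal behind `select(city, .cluster)` in the first message was a cluster label for every city, one possible route is to cluster the states as before and then join each state's label back onto the long table. This is only a sketch under that assumption. It uses the ten trnd1_tbla rows printed above; `centers = 3` and the names `trnd1_small`, `wide`, `state_clusters` and `city_clusters` are illustrative choices, not taken from the thread.

```r
library(dplyr)
library(tidyr)
library(broom)

set.seed(42)

# The ten rows printed in the message above (city, state, prop_of_total)
trnd1_small <- tibble(
  city  = c("ATLANTA", "BRADENTON", "BROOKLYN", "CHARLOTTE", "CHICAGO",
            "COLUMBUS", "CUMMING", "DALLAS", "ERIE", "HOUSTON"),
  state = c("GA", "FL", "NY", "NC", "IL", "OH", "GA", "TX", "PA", "TX"),
  prop_of_total = c(0.00510, 0.00340, 0.0128, 0.00340, 0.00723,
                    0.00468, 0.00340, 0.00340, 0.00510, 0.00510)
)

# One row per state, one column per city, zeros where a city is absent
wide <- trnd1_small %>%
  spread(key = city, value = prop_of_total, fill = 0)

# Cluster the state rows; centers = 3 is an arbitrary choice for 8 states
km <- wide %>%
  select(-state) %>%
  kmeans(centers = 3, nstart = 25)

# One cluster label per state
state_clusters <- augment(km, wide) %>%
  select(state, .cluster)

# Carry each state's label back to its cities
city_clusters <- trnd1_small %>%
  left_join(state_clusters, by = "state") %>%
  select(city, state, .cluster)
print(city_clusters)
```

Every city inherits the cluster of its state here, so two cities in the same state always share a label. Clustering the cities themselves would need a table with one row per city instead.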