sessionInfo at end of message. I have data that I was given as an Excel .xlsx file. It contains 96266 lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv format, using the pipe character as a field separator. This produced a file with 96266 lines. When I read it into R thusly:> skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",sep = "|", header = TRUE, comment.char = "", skip = 0) the resulting skip0.dd dataframe has only 58208 lines:> dim(skip0.dd)[1] 58208 24 I've tried a variety of things to troubleshoot. Using head() and tail(), the expected first and last lines (comparing to the .csv file) do indeed exist in skip0.dd. Several arbitrary lines from the "middle" of the csv file are also present in the skip0.dd dataframe. I tried reading only the first column, which is integer, but still it appears that not all lines are read in:> classes <- c(NA, rep("NULL", 23)) > skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv",sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses = classes)> dim(skip01.dd)[1] 58208 1 Skipping the first 50000 lines nominally should give me a dataframe of 46266 lines, or at least one of 50000 fewer lines than skip0.dd (i.e. 8208 lines), but it does neither:> skip50000.dd <-read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|", header = TRUE, comment.char = "", skip = 50000)> dim(skip50000.dd)[1] 22170 24 Any thoughts on what might be going wrong? Some funky characters from Excel or OpenOffice.org lurking in the .csv file? Perhaps I'd have more success with one of the packages that enables reading directly from an .xlsx file. Thanks. --Chris Ryan SUNY Upstate Medical University Binghamton Clinical Campus Broome County Health Department Binghamton University ####################################> sessionInfo()R version 3.5.3 (2019-03-11) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17763) Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_0.8.3 stringr_1.4.0 Hmisc_4.2-0 ggplot2_3.2.1 [5] Formula_1.2-3 survival_2.44-1.1 lattice_0.20-38 loaded via a namespace (and not attached): [1] Rcpp_1.0.1 pillar_1.4.0 compiler_3.5.3 [4] RColorBrewer_1.1-2 tools_3.5.3 base64enc_0.1-3 [7] digest_0.6.18 zeallot_0.1.0 rpart_4.1-13 [10] checkmate_1.9.3 tibble_2.1.1 gtable_0.3.0 [13] htmlTable_1.13.1 pkgconfig_2.0.2 rlang_0.4.0 [16] Matrix_1.2-15 rstudioapi_0.10 xfun_0.7 [19] gridExtra_2.3 knitr_1.23 withr_2.1.2 [22] cluster_2.0.7-1 htmlwidgets_1.3 vctrs_0.2.0 [25] grid_3.5.3 nnet_7.3-12 tidyselect_0.2.5 [28] data.table_1.12.2 glue_1.3.1 R6_2.4.0 [31] foreign_0.8-71 latticeExtra_0.6-28 purrr_0.3.2 [34] magrittr_1.5 htmltools_0.3.6 backports_1.1.4 [37] scales_1.0.0 splines_3.5.3 assertthat_0.2.1 [40] colorspace_1.4-1 stringi_1.4.3 acepack_1.4.1 [43] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4
On 21.02.2020 20:10, Christopher W. Ryan wrote:> sessionInfo at end of message. > > I have data that I was given as an Excel .xlsx file. It contains 96266 > lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv > format, using the pipe character as a field separator. This produced a > file with 96266 lines. > > When I read it into R thusly: > >> skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", > sep = "|", header = TRUE, comment.char = "", skip = 0) > > the resulting skip0.dd dataframe has only 58208 lines: > >> dim(skip0.dd) > [1] 58208 24 > > > I've tried a variety of things to troubleshoot. Using head() and tail(), > the expected first and last lines (comparing to the .csv file) do indeed > exist in skip0.dd. Several arbitrary lines from the "middle" of the csv > file are also present in the skip0.dd dataframe. > > I tried reading only the first column, which is integer, but still it > appears that not all lines are read in: > >> classes <- c(NA, rep("NULL", 23)) >> skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", > sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses = classes) >> dim(skip01.dd) > [1] 58208 1 > > Skipping the first 50000 lines nominally should give me a dataframe of > 46266 lines, or at least one of 50000 fewer lines than skip0.dd (i.e. > 8208 lines), but it does neither: > >> skip50000.dd <- > read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|", > header = TRUE, comment.char = "", skip = 50000) >> dim(skip50000.dd) > [1] 22170 24 > > Any thoughts on what might be going wrong? Some funky characters from > Excel or OpenOffice.org lurking in the .csv file?quotes are a typical proiblem, what if you try with arg quote=""?> > Perhaps I'd have more success with one of the packages that enables > reading directly from an .xlsx file. > > Thanks. > > --Chris Ryan > SUNY Upstate Medical University Binghamton Clinical Campus > Broome County Health Department > Binghamton University > > > #################################### >> sessionInfo() > R version 3.5.3 (2019-03-11) > Platform: x86_64-w64-mingw32/x64 (64-bit) > Running under: Windows 10 x64 (build 17763) > > Matrix products: default > > locale: > [1] LC_COLLATE=English_United States.1252 > [2] LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] dplyr_0.8.3 stringr_1.4.0 Hmisc_4.2-0 ggplot2_3.2.1 > [5] Formula_1.2-3 survival_2.44-1.1 lattice_0.20-38 > > loaded via a namespace (and not attached): > [1] Rcpp_1.0.1 pillar_1.4.0 compiler_3.5.3 > [4] RColorBrewer_1.1-2 tools_3.5.3 base64enc_0.1-3 > [7] digest_0.6.18 zeallot_0.1.0 rpart_4.1-13 > [10] checkmate_1.9.3 tibble_2.1.1 gtable_0.3.0 > [13] htmlTable_1.13.1 pkgconfig_2.0.2 rlang_0.4.0 > [16] Matrix_1.2-15 rstudioapi_0.10 xfun_0.7 > [19] gridExtra_2.3 knitr_1.23 withr_2.1.2 > [22] cluster_2.0.7-1 htmlwidgets_1.3 vctrs_0.2.0 > [25] grid_3.5.3 nnet_7.3-12 tidyselect_0.2.5 > [28] data.table_1.12.2 glue_1.3.1 R6_2.4.0 > [31] foreign_0.8-71 latticeExtra_0.6-28 purrr_0.3.2 > [34] magrittr_1.5 htmltools_0.3.6 backports_1.1.4 > [37] scales_1.0.0 splines_3.5.3 assertthat_0.2.1 > [40] colorspace_1.4-1 stringi_1.4.3 acepack_1.4.1 > [43] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4 > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Christopher W. Ryan
2020-Feb-21 19:25 UTC
[R] [External Email] Re: incomplete reading of a large csv file
Ah, Uwe, you are a lifesaver. Although there should not have been, there were some lines with entries like this in the 6th field: medical alarm - unk problem "B" I would have thought that my effort to read just the first field of each line, uniformly an integer, would have neutralized any issues arising from problematic characters in other fields. Apparently not. So the entire file is processed in some manner, even when using the colClasses argument to restrict the result to the first column? Thanks. --Chris Ryan Uwe Ligges wrote:> > > On 21.02.2020 20:10, Christopher W. Ryan wrote: >> sessionInfo at end of message. >> >> I have data that I was given as an Excel .xlsx file. It contains 96266 >> lines and 24 columns. I opened it in OpenOffice.org and saved it in .csv >> format, using the pipe character as a field separator. This produced a >> file with 96266 lines. >> >> When I read it into R thusly: >> >>> skip0.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", >> sep = "|", header = TRUE, comment.char = "", skip = 0) >> >> the resulting skip0.dd dataframe has only 58208 lines: >> >>> dim(skip0.dd) >> [1] 58208??? 24 >> >> >> I've tried a variety of things to troubleshoot. Using head() and tail(), >> the expected first and last lines (comparing to the .csv file) do indeed >> exist in skip0.dd.? Several arbitrary lines from the "middle" of the csv >> file are also present in the skip0.dd dataframe. >> >> I tried reading only the first column, which is integer, but still it >> appears that not all lines are read in: >> >>> classes <- c(NA, rep("NULL", 23)) >>> skip01.dd <- read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", >> sep = "|", header = TRUE, comment.char = "", skip = 0, colClasses >> classes) >>> dim(skip01.dd) >> [1] 58208??? 1 >> >> Skipping the first 50000 lines nominally should give me a dataframe of >> 46266 lines, or at least one of 50000 fewer lines than skip0.dd (i.e. >> 8208 lines), but it does neither: >> >>> skip50000.dd <- >> read.csv("AmbulanceDispatches2017-2019-02-18-2020.csv", sep = "|", >> header = TRUE, comment.char = "", skip = 50000) >>> dim(skip50000.dd) >> [1] 22170??? 24 >> >> Any thoughts on what might be going wrong? Some funky characters from >> Excel or OpenOffice.org lurking in the .csv file? > > quotes are a typical proiblem, what if you try with arg quote=""? > > > > >> >> Perhaps I'd have more success with one of the packages that enables >> reading directly from an .xlsx file. >> >> Thanks. >> >> --Chris Ryan >> SUNY Upstate Medical University Binghamton Clinical Campus >> Broome County Health Department >> Binghamton University >> >> >> #################################### >>> sessionInfo() >> R version 3.5.3 (2019-03-11) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> Running under: Windows 10 x64 (build 17763) >> >> Matrix products: default >> >> locale: >> [1] LC_COLLATE=English_United States.1252 >> [2] LC_CTYPE=English_United States.1252 >> [3] LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] stats???? graphics? grDevices utils???? datasets? methods?? base >> >> other attached packages: >> [1] dplyr_0.8.3?????? stringr_1.4.0???? Hmisc_4.2-0?????? ggplot2_3.2.1 >> [5] Formula_1.2-3???? survival_2.44-1.1 lattice_0.20-38 >> >> loaded via a namespace (and not attached): >> ? [1] Rcpp_1.0.1????????? pillar_1.4.0??????? compiler_3.5.3 >> ? [4] RColorBrewer_1.1-2? tools_3.5.3???????? base64enc_0.1-3 >> ? [7] digest_0.6.18?????? zeallot_0.1.0?????? rpart_4.1-13 >> [10] checkmate_1.9.3???? tibble_2.1.1??????? gtable_0.3.0 >> [13] htmlTable_1.13.1??? pkgconfig_2.0.2???? rlang_0.4.0 >> [16] Matrix_1.2-15?????? rstudioapi_0.10???? xfun_0.7 >> [19] gridExtra_2.3?????? knitr_1.23????????? withr_2.1.2 >> [22] cluster_2.0.7-1???? htmlwidgets_1.3???? vctrs_0.2.0 >> [25] grid_3.5.3????????? nnet_7.3-12???????? tidyselect_0.2.5 >> [28] data.table_1.12.2?? glue_1.3.1????????? R6_2.4.0 >> [31] foreign_0.8-71????? latticeExtra_0.6-28 purrr_0.3.2 >> [34] magrittr_1.5??????? htmltools_0.3.6???? backports_1.1.4 >> [37] scales_1.0.0??????? splines_3.5.3?????? assertthat_0.2.1 >> [40] colorspace_1.4-1??? stringi_1.4.3?????? acepack_1.4.1 >> [43] lazyeval_0.2.2????? munsell_0.5.0?????? crayon_1.3.4 >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >>