Hi

You already got answer from Avi. I often use dim(data) to inspect how many
rows/columns I have.
After that I check if some columns contain all or many NA values.

colSums(is.na(data))
keep <- which(colSums(is.na(data))<nnn)
cleaned.data <- data[, keep]

Cheers
Petr

> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Luigi Marongiu
> Sent: Friday, August 6, 2021 7:34 AM
> To: Duncan Murdoch <murdoch.duncan at gmail.com>
> Cc: r-help <r-help at r-project.org>
> Subject: Re: [R] Sanity check in loading large dataframe
>
> Ok, so nothing to worry about. Yet, are there other checks I can implement?
> Thank you
>
> On Thu, 5 Aug 2021, 15:40 Duncan Murdoch, <murdoch.duncan at gmail.com>
> wrote:
>
> > On 05/08/2021 9:16 a.m., Luigi Marongiu wrote:
> > > Hello,
> > > I am using a large spreadsheet (over 600 variables).
> > > I tried `str` to check the dimensions of the spreadsheet and I got
> > > ```
> > > > (str(df))
> > > 'data.frame': 302 obs. of 626 variables:
> > > $ record_id : int 1 1 1 1 1 1 1 1 1 1 ...
> > > ....
> > > $ v1_medicamento___aceta : int 1 NA NA NA NA NA NA NA NA NA ...
> > > [list output truncated]
> > > NULL
> > > ```
> > > I understand that `[list output truncated]` means that there are more
> > > variables than those allowed by str to be displayed as rows. Thus I
> > > increased the row's output with:
> > > ```
> > >
> > > > (str(df, list.len=1000))
> > > 'data.frame': 302 obs. of 626 variables:
> > > $ record_id : int 1 1 1 1 1 1 1 1 1 1 ...
> > > ...
> > > NULL
> > > ```
> > >
> > > Does `NULL` mean that some of the variables are not closed? (perhaps a
> > > missing comma somewhere)
> > > Is there a way to check the sanity of the data and avoid that some
> > > separator is not in the right place?
> > > Thank you
> >
> > The NULL is the value returned by str(). Normally it is not printed,
> > but when you wrap str in parens as (str(df, list.len=1000)), that
> > forces the value to print.
> >
> > str() is unusual in R functions in that it prints to the console as it
> > runs and returns nothing. Many other functions construct a value
> > which is only displayed if you print it, but something like
> >
> > x <- str(df, list.len=1000)
> >
> > will print the same as if there was no assignment, and then assign
> > NULL to x.
> >
> > Duncan Murdoch
> >
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

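
Duncan's point about str() is easy to check on a small example. The sketch below is only an illustration: the three-row data frame `df` is a made-up stand-in for the real 302 x 626 one. It shows that the value assigned from str() is NULL, that the printed text can be captured separately, and that dim() returns the dimensions as an ordinary value.

```r
# Hypothetical toy data frame standing in for the real spreadsheet
df <- data.frame(record_id = c(1L, 1L, 2L),
                 value     = c(1L, NA, NA))

# str() prints its summary as a side effect; the value it returns is NULL
x <- str(df)
is.null(x)   # TRUE

# The printed summary can be captured as text if it is needed later
out <- capture.output(str(df))

# For the dimensions themselves, dim() returns a real value
dims <- dim(df)
dims         # 3 2
```

So a trailing NULL after `(str(df))` only reflects the forced printing of that return value and says nothing about the state of the data.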
OK, thank you.

On Fri, Aug 6, 2021 at 9:56 AM PIKAL Petr <petr.pikal at precheza.cz> wrote:
>
> Hi
>
> You already got answer from Avi. I often use dim(data) to inspect how many
> rows/columns I have.
> After that I check if some columns contain all or many NA values.
>
> colSums(is.na(data))
> keep <- which(colSums(is.na(data))<nnn)
> cleaned.data <- data[, keep]
>
> Cheers
> Petr

--
Best regards,
Luigi

... but remove the which() and use logical indexing ... ;-)

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Aug 6, 2021 at 12:57 AM PIKAL Petr <petr.pikal at precheza.cz> wrote:
> Hi
>
> You already got answer from Avi. I often use dim(data) to inspect how many
> rows/columns I have.
> After that I check if some columns contain all or many NA values.
>
> colSums(is.na(data))
> keep <- which(colSums(is.na(data))<nnn)
> cleaned.data <- data[, keep]
>
> Cheers
> Petr
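
Bert's suggestion can be sketched as follows. This is a minimal illustration, not a definitive recipe: the toy data frame, its column names and the threshold `max_na` are all assumptions (`max_na` stands in for the `nnn` placeholder in Petr's code), and `drop = FALSE` is an addition of mine so that the result stays a data frame even if only one column survives.

```r
# Hypothetical toy data; column names are invented for illustration
data <- data.frame(
  id     = 1:5,
  mostly = c(1, NA, NA, NA, NA),
  empty  = c(NA, NA, NA, NA, NA),
  full   = c(2.5, 3.1, 4.8, 1.2, 0.7)
)

# Number of NA values in each column
na_count <- colSums(is.na(data))

# Assumed threshold, standing in for 'nnn' in the original post
max_na <- 3

# Logical vector with one TRUE/FALSE per column; no which() needed
keep <- na_count < max_na

# drop = FALSE keeps a data frame even when a single column is selected
cleaned.data <- data[, keep, drop = FALSE]

names(cleaned.data)   # "id" "full"
```

The comparison already yields one TRUE or FALSE per column, so it can be used as the column index directly.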