Hi Bert
Yes, in this case which() is not necessary. But when NAs are involved,
logical indexing is sometimes not the best choice, because NA propagates to
the result, which may not be wanted.
x <- 1:10
x[c(2,5)] <- NA
y<- letters[1:10]
y[x<5]
[1] "a" NA "c" "d" NA
y[which(x<5)]
[1] "a" "c" "d"
dat <- data.frame(x,y)
dat[x<5,]
      x    y
1     1    a
NA   NA <NA>
3     3    c
4     4    d
NA.1 NA <NA>
dat[which(x<5),]
  x y
1 1 a
3 3 c
4 4 d
Both results are valid, but one has to keep this NA propagation in mind.
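If you prefer to stay with logical indexing, here is a minimal sketch of two alternatives that also drop the NA rows. It rebuilds the same toy x, y and dat as above so it runs on its own; the names idx, res_logical, res_subset and res_which are only illustrative.

```r
# Same toy data as in the example above
x <- 1:10
x[c(2, 5)] <- NA
y <- letters[1:10]
dat <- data.frame(x, y)

# Option 1: make the logical index NA-free explicitly
idx <- !is.na(x) & x < 5
res_logical <- dat[idx, ]

# Option 2: subset() treats NA in the condition as FALSE
res_subset <- subset(dat, x < 5)

# Reference: the which() version from above
res_which <- dat[which(x < 5), ]

# Compare the alternatives with the which() result
all.equal(res_logical, res_which)
all.equal(res_subset, res_which)
```

Which form to use is mostly a matter of taste; the point is only that the NA handling has to be stated somewhere.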
Cheers
Petr
From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: Friday, August 6, 2021 1:29 PM
To: PIKAL Petr <petr.pikal at precheza.cz>
Cc: Luigi Marongiu <marongiu.luigi at gmail.com>; r-help <r-help at r-project.org>
Subject: Re: [R] Sanity check in loading large dataframe
... but remove the which() and use logical indexing ... ;-)
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Fri, Aug 6, 2021 at 12:57 AM PIKAL Petr <petr.pikal at precheza.cz> wrote:
Hi
You already got an answer from Avi. I often use dim(data) to inspect how many
rows/columns I have.
After that I check whether some columns contain all or many NA values.
colSums(is.na(data))
keep <- which(colSums(is.na(data))<nnn)
cleaned.data <- data[, keep]
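A minimal sketch of these steps on made-up data, in case it helps. The toy data frame, its column names and the cutoff max_na (standing in for nnn) are assumptions for illustration only, not taken from the real spreadsheet.

```r
# Hypothetical toy data standing in for the real data frame
data <- data.frame(
  id = 1:6,
  v1 = c(1, NA, 3, NA, 5, 6),      # a few NA values
  v2 = rep(NA_real_, 6),           # all NA
  v3 = c(NA, NA, NA, NA, 5, 6)     # mostly NA
)

dim(data)                          # number of rows and columns

# Number of NA values in each column
na_count <- colSums(is.na(data))
na_count

# Keep only columns with fewer than max_na missing values
max_na <- 3                        # assumed cutoff, plays the role of nnn
keep <- which(na_count < max_na)

# drop = FALSE keeps a data frame even if only one column survives
cleaned.data <- data[, keep, drop = FALSE]
names(cleaned.data)
```

The drop = FALSE is not in the lines above; it only guards against the result collapsing to a plain vector when a single column is kept.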
Cheers
Petr
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Luigi Marongiu
> Sent: Friday, August 6, 2021 7:34 AM
> To: Duncan Murdoch <murdoch.duncan at gmail.com>
> Cc: r-help <r-help at r-project.org>
> Subject: Re: [R] Sanity check in loading large dataframe
>
> Ok, so nothing to worry about. Yet, are there other checks I can implement?
> Thank you
>
> On Thu, 5 Aug 2021, 15:40 Duncan Murdoch, <murdoch.duncan at gmail.com> wrote:
>
> > On 05/08/2021 9:16 a.m., Luigi Marongiu wrote:
> > > Hello,
> > > I am using a large spreadsheet (over 600 variables).
> > > I tried `str` to check the dimensions of the spreadsheet and I got
> > > ```
> > >> (str(df))
> > > 'data.frame': 302 obs. of 626 variables:
> > > $ record_id : int 1 1 1 1 1 1 1 1 1 1 ...
> > > ....
> > > $ v1_medicamento___aceta : int 1 NA NA NA NA NA NA NA NA NA ...
> > > [list output truncated]
> > > NULL
> > > ```
> > > I understand that `[list output truncated]` means that there are more
> > > variables than those allowed by str to be displayed as rows. Thus I
> > > increased the row's output with:
> > > ```
> > >
> > >> (str(df, list.len=1000))
> > > 'data.frame': 302 obs. of 626 variables:
> > > $ record_id : int 1 1 1 1 1 1 1 1 1 1 ...
> > > ...
> > > NULL
> > > ```
> > >
> > > Does `NULL` mean that some of the variables are not closed? (perhaps a
> > > missing comma somewhere)
> > > Is there a way to check the sanity of the data and avoid that some
> > > separator is not in the right place?
> > > Thank you
> >
> > The NULL is the value returned by str(). Normally it is not printed,
> > but when you wrap str in parens as (str(df, list.len=1000)), that
> > forces the value to print.
> >
> > str() is unusual in R functions in that it prints to the console as it
> > runs and returns nothing. Many other functions construct a value
> > which is only displayed if you print it, but something like
> >
> > x <- str(df, list.len=1000)
> >
> > will print the same as if there was no assignment, and then assign
> > NULL to x.
> >
> > Duncan Murdoch
> >
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.