thr3ads.net - R help - [R] Sanity check in loading large dataframe [Aug 2021]

Luigi Marongiu

2021-Aug-05 13:16 UTC

[R] Sanity check in loading large dataframe

Hello,
I am using a large spreadsheet (over 600 variables).
I tried `str` to check the dimensions of the spreadsheet and I got
```
> (str(df))
'data.frame': 302 obs. of  626 variables:
 $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
....
 $ v1_medicamento___aceta    : int  1 NA NA NA NA NA NA NA NA NA ...
  [list output truncated]
NULL
```
I understand that `[list output truncated]` means that there are more
variables than `str` displays by default. Thus I raised the limit on the
number of list elements shown with:
```
> (str(df, list.len=1000))
'data.frame': 302 obs. of  626 variables:
 $ record_id                 : int  1 1 1 1 1 1 1 1 1 1 ...
...
NULL
```

Does `NULL` mean that some of the variables are not closed? (perhaps a
missing comma somewhere)
Is there a way to check the sanity of the data and make sure that no
separator is in the wrong place?
Thank you



-- 
Best regards,
Luigi

Duncan Murdoch

2021-Aug-05 13:40 UTC

The NULL is the value returned by str().  Normally it is not printed, 
but when you wrap str in parens as (str(df, list.len=1000)), that forces 
the value to print.

str() is unusual among R functions in that it prints to the console as it 
runs and returns nothing useful (just an invisible NULL).  Many other 
functions construct a value which is only displayed if you print it, but 
something like

x <- str(df, list.len=1000)

will print the same as if there was no assignment, and then assign NULL 
to x.
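
The behaviour described above can be checked on a toy example. This is only a minimal sketch: the three-row `df` is a made-up stand-in for the 626-column data frame in the thread, and `withVisible()` and `capture.output()` are base R helpers that were not part of the original exchange.

```r
# Small stand-in data frame (hypothetical; the real df has 626 columns).
df <- data.frame(record_id = 1:3, dose = c(1.5, NA, 2.0))

# str() prints as a side effect; the assignment only stores its return value.
x <- str(df)
is.null(x)                      # TRUE

# withVisible() reports whether a value would be auto-printed at top level.
withVisible(str(df))$visible    # FALSE

# To keep the printed text itself, capture it instead of assigning it.
out <- capture.output(str(df))
out[1]                          # first line of the str() summary
```

Wrapping a call in parentheses, as in `(str(df))`, forces the otherwise invisible `NULL` to print, which is why it showed up at the end of the output.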

Duncan Murdoch

Avi Gross

2021-Aug-05 16:01 UTC

Luigi,

Duncan answered part of your question. My feedback is to consider looking at
your data using other tools besides str(). 

There are ways in base R to get lists of row or column names or count them
or ask what types they are and so forth.

Printing an entire large object is hard but printing many subsets can give
you a handle on it.

You may also want to use packages in the tidyverse such as dplyr and work
with tibbles as a mild variation on a data.frame.

I am not sure what you are hoping to do with str() besides getting the
number of rows and columns but consider:

	dim(df)
	nrow(df)
	ncol(df)

To get names: 
	names(df)
	colnames(df)
	rownames(df)
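
As a rough sketch of how these calls can serve as a first sanity check, the expected shape can be asserted directly. The small `df` and the expected counts below are made up for illustration; for the data in this thread the values would be 302 rows and 626 columns.

```r
# Hypothetical stand-in for the real data frame from the thread.
df <- data.frame(record_id = c(1L, 1L, 2L),
                 v1_medicamento___aceta = c(1L, NA, NA))

# Assumed expectations; replace with the shape you expect from the spreadsheet.
expected_rows <- 3
expected_cols <- 2

# stopifnot() raises an error if any condition is not TRUE.
stopifnot(nrow(df) == expected_rows, ncol(df) == expected_cols)

# Duplicated or empty column names can hint at a parsing problem.
any(duplicated(names(df)))   # FALSE
any(names(df) == "")         # FALSE
```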

To get many kinds of info about the columns in your data.frame, you can
apply a function to each column, for example:
	sapply(df, typeof)

The above will tell you, for each column, whether it is stored as an integer,
a double, a character string and so on.
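
A short sketch of how the per-column types might be summarised on a wide data frame. The columns here are invented for illustration. A column that comes back as character when numbers were expected is often worth a closer look, since stray text or a shifted field can cause that.

```r
# Hypothetical data frame with mixed column types.
df <- data.frame(id   = 1:3,
                 dose = c(0.5, 1, NA),
                 drug = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Named character vector: one storage type per column.
types <- sapply(df, typeof)
types

# How many columns there are of each type.
table(types)

# Names of the columns stored as character.
names(df)[types == "character"]   # "drug"
```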
	
To do more interesting things there are packages. The psych package, for
example, lets you get some metrics about each column:
	psych::describe(df)

And you can use various methods of subsetting to limit what you are looking
at and only show or print a manageable amount.
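
For instance, a few ways of looking at a manageable slice of a wide data frame. This is a sketch on an invented 5 x 30 data frame; the column names `v1` to `v30` are assumptions, not the names from the thread.

```r
# Hypothetical wide data frame: 5 rows, 30 columns named v1 ... v30.
df <- as.data.frame(matrix(seq_len(150), nrow = 5))
names(df) <- paste0("v", seq_len(30))

# First three rows of the first four columns.
df[1:3, 1:4]

# Columns whose names match a pattern (here: v1 and v10 to v19).
sel <- grep("^v1", names(df), value = TRUE)
head(df[, sel], 2)

# str() on a slice keeps the output short.
str(df[, 1:5])
```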

You seem to be asking about sanity checking in your subject line, and that
depends on what you want to check. That can include making sure each column
has the expected data type, checking for NA values, or even removing
outliers. Tools exist for much of that, including the few I mention. Your
data may seem huge, but I have worked on much larger ones. One suggestion is
to trim the data before working on it IF some of it is not needed. Both base
R and the tidyverse have lots to offer for such things.
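
On the original question about misplaced separators, one possible check, assuming the data come from a delimited text file, is base R's `count.fields()`, which counts the fields on each line. The sketch below writes a small, deliberately broken CSV to a temporary file; the file contents and column names are invented for illustration.

```r
# Write a small, deliberately broken CSV to a temporary file.
# The third data line (line 4 of the file) has one field too few.
path <- tempfile(fileext = ".csv")
writeLines(c("record_id,age,dose",
             "1,34,0.5",
             "2,41,1.0",
             "3,29",
             "4,,2.0"),
           path)

# count.fields() returns the number of fields found on each line.
n_fields <- count.fields(path, sep = ",")
n_fields

# Lines whose field count differs from the header's (here: line 4).
which(n_fields != n_fields[1])

# read.csv() pads short rows with NA (fill = TRUE by default),
# so counting NAs per column is a useful follow-up.
df <- read.csv(path)
colSums(is.na(df))
```

If every line reports the same number of fields, the separators are at least consistent, although that alone does not prove the values landed in the right columns.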
