Dear R-list users,

let me ask you a very general question about the performance of big data frames.

I deal with semi-hourly meteorological data from about 70 sensors over 28 winter seasons. It means that for each sensor I have 48 data points for each day and 181 days for each winter season (182 in case of a leap year): 48 * 181 * 28 = 234,576, and 234,576 * 70 = 16420320.

From the computational point of view, is it better to deal with a single data frame of approximately 16.5 M rows and 3 columns (one for the date, one for the sensor code and one for the value), with a single data frame of approximately 235,000 rows and 141 columns, or with 70 different data frames of approximately 235,000 rows and 3 columns each? Or does it make no difference?

I personally would prefer the first choice, because it would be easier for me to deal with a single data frame and few columns.

Thank you for your usual help

Stefano

(oo)
--oOO--( )--OOo--------------------------------------
Stefano Sofia MSc, PhD
Civil Protection Department - Marche Region - Italy
Meteo Section
Snow Section
Via Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.sofia at regione.marche.it
---Oo---------oO----------------------------------------
On 2025-08-14 7:27 a.m., Stefano Sofia via R-help wrote:
> From the computational point of view, is it better to deal with a single
> data frame of approximately 16.5 M rows and 3 columns, with a single data
> frame of approximately 235,000 rows and 141 columns, or with 70 different
> data frames of approximately 235,000 rows and 3 columns each?

It really depends on what computations you're doing. As a general rule, column operations are faster than row operations. (Also as a general rule, arrays are faster than data frames, but are much more limited in what they can hold: all entries must be the same type, which probably won't work for your data.) So I'd guess your 3-column solution would likely be best.

Duncan Murdoch
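[Editor's note: a minimal sketch of the column-versus-row point above, added for illustration; the object sizes are arbitrary and the example is not part of the original exchange. Columns of a data frame are plain vectors, so column access is cheap, while extracting a row has to assemble a value from every column.]

    # 10,000 rows x 100 columns of random data
    d <- as.data.frame(matrix(rnorm(1e6), ncol = 100))

    # 100 column means: each column is already a vector, so this is fast
    system.time(for (j in seq_along(d)) mean(d[[j]]))

    # 1,000 row means: each d[i, ] touches all 100 columns, so this is
    # noticeably slower despite doing fewer means
    system.time(for (i in 1:1000) mean(unlist(d[i, ], use.names = FALSE)))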
On 8/14/2025 12:27 PM, Stefano Sofia via R-help wrote:
> From the computational point of view, is it better to deal with a single
> data frame of approximately 16.5 M rows and 3 columns, with a single data
> frame of approximately 235,000 rows and 141 columns, or with 70 different
> data frames of approximately 235,000 rows and 3 columns each?

Hello,

First of all, 48 * 181 * 28 = 243,264, not 234,576. And 243,264 * 70 = 17,028,480.

As for the question, why don't you try it with smaller data sets? In the test below I have tested with the sizes you posted, and the many-columns (wide) format is fastest, then the list of data frames, then the 4-column (long) format. Four columns because they are sensor, day, season and data; and the wide-format data frame is only 72 columns wide: one for day, one for season and one for each sensor. The test computes mean values aggregated by day and season.
When the data is in the long format it must also include the sensor, so there is an extra aggregation column. The test is very simple; real results probably depend on the functions you want to apply to the data.

    # create the test data
    makeDataLong <- function(sensor, x) {
      x[["data"]] <- rnorm(nrow(x))   # note: use the argument x, not the global df1
      cbind.data.frame(sensor, x)
    }
    makeDataWide <- function(sensor, x) {
      x[[sensor]] <- rnorm(nrow(x))
      x
    }

    set.seed(2025)
    n_per_day <- 48
    n_days    <- 181
    n_seasons <- 28
    n_sensors <- 70

    day    <- rep(1:n_days, each = n_per_day)   # 48 half-hourly values per day
    season <- 1:n_seasons
    sensor_names <- sprintf("sensor_%02d", 1:n_sensors)

    df1 <- expand.grid(day = day, season = season, KEEP.OUT.ATTRS = FALSE)

    # list of 70 data frames, one per sensor
    df_list <- lapply(sensor_names, makeDataLong, x = df1)
    names(df_list) <- sensor_names

    # single long data frame (the |> _ placeholder needs R >= 4.2)
    df_long <- lapply(sensor_names, makeDataLong, x = df1) |>
      do.call(rbind, args = _)

    # single wide data frame: day, season, then one column per sensor
    df_wide <- df1
    for (s in sensor_names) {
      df_wide <- makeDataWide(s, df_wide)
    }

    # test functions
    f <- function(x) aggregate(data ~ season + day, data = x, mean)
    g <- function(x) aggregate(data ~ sensor + season + day, data = x, mean)
    h <- function(x) aggregate(. ~ season + day, x, mean)

    # timings (requires the bench package)
    bench::mark(
      list_base = lapply(df_list, f),
      long_base = g(df_long),
      wide_base = h(df_wide),
      check = FALSE
    )

Hope this helps,
Rui Barradas
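[Editor's note: a small editorial follow-up, assuming Rui's df_long object from the code above. The aggregation function can matter as much as the layout: for simple grouped means, base R's tapply() is typically much faster than the formula interface of aggregate(), and returns an array indexed by the grouping factors.]

    # grouped means via tapply(), using the columns of Rui's df_long
    long_means <- with(df_long, tapply(data, list(sensor, season, day), mean))
    dim(long_means)   # 70 x 28 x 181 array (sensor x season x day)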
great question, and one that touches on both performance and usability in R. Here's a breakdown of the trade-offs and recommendations.

You're comparing three data structure strategies for handling ~16.5 million observations:

- Single long data frame (~16.5M rows x 3 columns): simple to manage, easy to filter/group, tidyverse-friendly; may require more memory, and row-wise operations are slower.
- Wide data frame (~235K rows x 141 columns): fast column-wise operations, good for matrix-style analysis; harder to reshape/filter, less tidy.
- List of 70 data frames (each ~235K rows x 3 columns): parallel processing possible, modular; complex to manage, harder to aggregate or compare.

Performance considerations:

- Memory efficiency: a single long data frame is generally more memory-efficient than a list of data frames, especially if column types are consistent.
- Vectorization: R is optimized for vectorized operations. A long format works well with dplyr, data.table, and tidyverse tools.
- Parallelism: if you plan to process each sensor independently, a list of data frames could allow parallel computation using future, furrr, or parallel.
- Reshaping costs: wide formats are fast for matrix-style operations but can be cumbersome when filtering by time, sensor, or value.

I'd stick with the single long-format data frame:

- It aligns with tidy data principles.
- It's easier to filter, group, and summarize.
- It integrates seamlessly with packages like ggplot2, dplyr, and data.table.

If performance becomes an issue:

- Consider converting to a data.table object (setDT(df)), which is highly optimized for large datasets; see the sketch after this message.
- Use indexing and keys for faster filtering.
- Use arrow::read_parquet() or fst::write_fst() for fast disk I/O if you need to save/load frequently.

If you're doing seasonal analysis, consider adding a season column. That way, you can easily group by sensor, season, and day without needing to split the data.
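[Editor's note: a minimal sketch of the data.table route recommended above, assuming a long-format frame with columns sensor, season, day and data, as in Rui's df_long example; requires the data.table package.]

    library(data.table)

    DT <- as.data.table(df_long)      # or setDT(df_long) to convert in place
    setkey(DT, sensor, season, day)   # sorted key enables fast binary-search subsetting

    # grouped mean per sensor/season/day
    res <- DT[, .(mean_data = mean(data)), by = .(sensor, season, day)]

    # fast filtering on the keyed columns (sensor name as in Rui's example)
    DT[sensor == "sensor_01" & season == 1]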