thr3ads.net - R help - [R] Help merging large datasets in R [May 2025]

If this information is useful, please help other people find it:
Share via:

Deelen, M. (Mirjam)

2025-May-07 06:53 UTC

[R] Help merging large datasets in R

Hi guys,
For my MSc. thesis i am using R studio. The goal is for me to merge a couple (6)
of relatively large datasets (min of 200.000 and max of 2mil rows). I have now
been able to do so, however I think something might be going wrong in my codes.
For reference, i have a dataset 1 (200.000), dataset 2 (600.000), dataset 3
(2mil) and dataset 4 (2mil) merged into one dataset of 4mil, and dataset 5
(4mil) and dataset 6 (4mil) merged into one dataset of 8mil.
What i have done so far is the following:

  *   Merged dataset 1 and dataset 2 using the following code = merged 1 <-
dataset 2[dataset 1, nomatch = NA]. This results in a dataset of 600.000 (looks
to be alright).
  *   Merged the dataset merged 1 and datasets 3/4 using the following code =
merged 2 <- dataset 3/4[merged 1, nomatch = NA, allow.cartesian = TRUE]. This
results in a dataset of 21mil (as expected). To this i have applied an
additional criteria (dates in dataset 3/4 should be within 365 days of the dates
in merged 1), which reduces merged 2 to around 170.000.
  *   Merged the dataset merged 2 and datasets 5/6 using the following code =
merged 3 <- dataset 5/6[merged 2, nomatch = NA, allow.cartesian = TRUE].
Again, this results in a dataset of 8mil (as expected). And again, to this i
have applied an additional criteria (dates in dataset 5/6 should be within 365
days of the dates in merged 2), which reduces merged 3 to around 50.000.

What I'm now thinking, is how can the merging + additional criteria lead to
such a loss of cases ?? The first merge, of dataset 1 and dataset 2, results in
an amount that I think should be the final amount of cases. I understand that by
adding an additional criteria the number of possible matches when merging
datasets 3/4 and 5/6 is reduced, but I'm not sure this should lead to SUCH a
loss. Besides this, the additional criteria was added to reduce the duplication
of information that is now happening when merging datasets 3/4 and 5/6.
All cases appear once in dataset 1, but could appear a couple more times in the
following datasets (say twice in dataset 2, four times in datasets 3/4 and 8
times in datasets 5/6). Which results in a 1 x 2 x 4 x 8 duplication of
information when merging the datasets without additional criteria.
So sum this up, my questions are
  *   Are there any tips as to not have this duplication ? (so I can drop the
additonal criteria and the final amount of cases, probably, increases).
  *   Or are there any tips as to figure out where in these steps cases are lost
?

Thanks!
Mirjam


	[[alternative HTML version deleted]]

Ebert,Timothy Aaron

2025-May-07 20:46 UTC

head link

[R] Help merging large datasets in R

Some issues:
1) Variable names cannot have spaces. "merged 1" is not valid but
"merged_1" is a valid alternative.
2) You need to tell R what to merge by. It looks like you may be using data
tables rather than a data frame.
merged <- dataset2[dataset1, on = "id", nomatch = NA]

3) Alternatively: join functions from the dplyr package, cbind(), and rbind()
can be used in different ways to combine data.
4) Make sure the process is successful. It looked like 200K + 600K + 2000K +
2000K = 4000K which is obviously wrong. If I merge 200 k and 600k and get back
600k, I am either not telling the whole story or I made an error. Maybe I have
200k observations of 10 variables and 600k observations of 4 variables and I get
an object of 600k with 13 variables (one is lost because it was present in both
and used to merge the two data sets).

Please try again. Possibly start by extracting the first ten rows from each data
set. Try merging that where it is easy to check. Then expand to using all the
data. Your approach should work with a little modification. Alternatively make a
couple of small fake data sets and play with those. Once you have the code
correct there should be no problem expanding to the real data.

Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Deelen, M.
(Mirjam) via R-help
Sent: Wednesday, May 7, 2025 2:54 AM
To: r-help at r-project.org
Subject: [R] Help merging large datasets in R

[External Email]

Hi guys,
For my MSc. thesis i am using R studio. The goal is for me to merge a couple (6)
of relatively large datasets (min of 200.000 and max of 2mil rows). I have now
been able to do so, however I think something might be going wrong in my codes.
For reference, i have a dataset 1 (200.000), dataset 2 (600.000), dataset 3
(2mil) and dataset 4 (2mil) merged into one dataset of 4mil, and dataset 5
(4mil) and dataset 6 (4mil) merged into one dataset of 8mil.
What i have done so far is the following:

  *   Merged dataset 1 and dataset 2 using the following code = merged 1 <-
dataset 2[dataset 1, nomatch = NA]. This results in a dataset of 600.000 (looks
to be alright).
  *   Merged the dataset merged 1 and datasets 3/4 using the following code =
merged 2 <- dataset 3/4[merged 1, nomatch = NA, allow.cartesian = TRUE]. This
results in a dataset of 21mil (as expected). To this i have applied an
additional criteria (dates in dataset 3/4 should be within 365 days of the dates
in merged 1), which reduces merged 2 to around 170.000.
  *   Merged the dataset merged 2 and datasets 5/6 using the following code =
merged 3 <- dataset 5/6[merged 2, nomatch = NA, allow.cartesian = TRUE].
Again, this results in a dataset of 8mil (as expected). And again, to this i
have applied an additional criteria (dates in dataset 5/6 should be within 365
days of the dates in merged 2), which reduces merged 3 to around 50.000.

What I'm now thinking, is how can the merging + additional criteria lead to
such a loss of cases ?? The first merge, of dataset 1 and dataset 2, results in
an amount that I think should be the final amount of cases. I understand that by
adding an additional criteria the number of possible matches when merging
datasets 3/4 and 5/6 is reduced, but I'm not sure this should lead to SUCH a
loss. Besides this, the additional criteria was added to reduce the duplication
of information that is now happening when merging datasets 3/4 and 5/6.
All cases appear once in dataset 1, but could appear a couple more times in the
following datasets (say twice in dataset 2, four times in datasets 3/4 and 8
times in datasets 5/6). Which results in a 1 x 2 x 4 x 8 duplication of
information when merging the datasets without additional criteria.
So sum this up, my questions are
  *   Are there any tips as to not have this duplication ? (so I can drop the
additonal criteria and the final amount of cases, probably, increases).
  *   Or are there any tips as to figure out where in these steps cases are lost
?

Thanks!
Mirjam


        [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jeff Newmiller

2025-May-08 01:03 UTC

head link

[R] Help merging large datasets in R

I suggest you study a bit more about types of join (a.k.a. merge) operations. In
particular, most of the time you will want to use inner join or left join
operations with large data sets. In order to avoid problems with duplication you
will need to pay close attention to which set of columns can be used to uniquely
identify each and every record. Then you want to confirm that there are no
duplicates where you don't want them. Cartesian joins are very specialized
and almost certainly should only be used on small data frames... I suspect you
should be stacking the data frames rather than building all combinations of
rows.

Please do read the Posting Guide... and try to make a minimal reproducible
example (Google it, and or study the vignette that comes with the reprex
package) that illustrates with a few rows from each table what is going on that
you want to fix. The dput function is another big help in constructing this. The
help list really works best when we share code snippets that don't work
right in our R sessions and comment on them. Make an effort to learn how your
email client can be configured to compose your email in plain text so the
formatting does not corrupt your code example.

On May 6, 2025 11:53:42 PM PDT, "Deelen, M. (Mirjam) via R-help"
<r-help at r-project.org> wrote:>Hi guys,
>For my MSc. thesis i am using R studio. The goal is for me to merge a couple
(6) of relatively large datasets (min of 200.000 and max of 2mil rows). I have
now been able to do so, however I think something might be going wrong in my
codes.
>For reference, i have a dataset 1 (200.000), dataset 2 (600.000), dataset 3
(2mil) and dataset 4 (2mil) merged into one dataset of 4mil, and dataset 5
(4mil) and dataset 6 (4mil) merged into one dataset of 8mil.
>What i have done so far is the following:
>
>  *   Merged dataset 1 and dataset 2 using the following code = merged 1
<- dataset 2[dataset 1, nomatch = NA]. This results in a dataset of 600.000
(looks to be alright).
>  *   Merged the dataset merged 1 and datasets 3/4 using the following code
= merged 2 <- dataset 3/4[merged 1, nomatch = NA, allow.cartesian = TRUE].
This results in a dataset of 21mil (as expected). To this i have applied an
additional criteria (dates in dataset 3/4 should be within 365 days of the dates
in merged 1), which reduces merged 2 to around 170.000.
>  *   Merged the dataset merged 2 and datasets 5/6 using the following code
= merged 3 <- dataset 5/6[merged 2, nomatch = NA, allow.cartesian = TRUE].
Again, this results in a dataset of 8mil (as expected). And again, to this i
have applied an additional criteria (dates in dataset 5/6 should be within 365
days of the dates in merged 2), which reduces merged 3 to around 50.000.
>
>What I'm now thinking, is how can the merging + additional criteria lead
to such a loss of cases ?? The first merge, of dataset 1 and dataset 2, results
in an amount that I think should be the final amount of cases. I understand that
by adding an additional criteria the number of possible matches when merging
datasets 3/4 and 5/6 is reduced, but I'm not sure this should lead to SUCH a
loss. Besides this, the additional criteria was added to reduce the duplication
of information that is now happening when merging datasets 3/4 and 5/6.
>All cases appear once in dataset 1, but could appear a couple more times in
the following datasets (say twice in dataset 2, four times in datasets 3/4 and 8
times in datasets 5/6). Which results in a 1 x 2 x 4 x 8 duplication of
information when merging the datasets without additional criteria.
>So sum this up, my questions are>
>  *   Are there any tips as to not have this duplication ? (so I can drop
the additonal criteria and the final amount of cases, probably, increases).
>  *   Or are there any tips as to figure out where in these steps cases are
lost ?
>
>Thanks!
>Mirjam
>
>
>	[[alternative HTML version deleted]]
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
https://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
-- 
Sent from my phone. Please excuse my brevity.

Maybe Matching Threads

Search for more maybe matching threads

R help - May 2025 - Help merging large datasets in R

[R] Help merging large datasets in R

[R] Help merging large datasets in R

[R] Help merging large datasets in R

Maybe Matching Threads