thr3ads.net - R help - [R] Combine two dataframe with different row number and interpolation between values [Sep 2022]

If this information is useful, please help other people find it:
Share via:

Jim Lemon

2022-Sep-01 09:31 UTC

[R] Combine two dataframe with different row number and interpolation between values

Hi Javad,
You seem to have the data frame join worked out, so here is a function
to interpolate over sequences of NAs. I have found it quite useful.

# interpolate over sequences of NAs
# NA sequences at the beginning of a file are replaced with the first
nonNA value
# NA sequences at the end are filled with the last nonNA value
# sequences of all NAs are returned unaltered
interpNA<-function(x) {
 if(any(is.na(x))) {
  xrle<-rle(is.na(x))
  begin<-end<-1
  # if the sequence begins with NA
  # set the initial run of NAs to the 1st non-NA
  if(is.na(x[1])) x[1:xrle$lengths[1]]<-x[xrle$lengths[1]+1]
  for(i in 1:length(xrle$values)) {
   # if this is a run of NA, fill it
   if(xrle$values[i]) {
    end<-begin+xrle$lengths[i]+(i>1)
    if(is.na(x[end])) x[begin:(end-1)]<-x[begin]
    else x[begin:end]<-seq(x[begin],x[end],length.out=1+end-begin)
    begin<-end+xrle$lengths[i]-1
  }
   else {
    # set begin to the end of the non-NA values
    begin<-end+xrle$lengths[i]-1
   }
  }
 }
 return(x)
}

Jim

On Thu, Sep 1, 2022 at 3:29 PM javad bayat <j.bayat194 at gmail.com>
wrote:>
>  Dear Tim;
> The dplyr did not work for me. My data frames have exactly similar columns
> but different row numbers.
> Mr Petr sent this code and the code worked but it copied the second
> dataframe into the first one and did not replace the corresponding row.
> > df3 = merge(df1, df2, all = TRUE)
> Regarding filling the "NA" data, it does not matter for me to
interpolate
> between numbers or put the mean of numbers.
> Sincerely
>
>
>
>
>
>
> On Wed, Aug 31, 2022 at 5:17 PM Ebert,Timothy Aaron <tebert at
ufl.edu> wrote:
>
> > Can I interest you in the join functions in dplyr?
> > https://www.datasciencemadesimple.com/join-in-r-merge-in-r/
> >
> > Filling in missing data is a useful practice when the fake (simulated)
> > data is a small proportion of all data. When 2/3 of the data is fake
one
> > must wonder if anything based on those numbers is real or an artifact
of
> > assumptions made to generate the numbers.
> >
> > I deal with weather data and some weather stations are set up to
average
> > measurements between recorded values and others take a single reading
at
> > regular intervals. How you interpolate might depend on which option
> > describes your data. The alternative is to use the method for
recording ws
> > and apply it to the other data that will be merged with these data. I
> > assume there is more data, otherwise I see little point in expanding
these
> > values out.
> >
> > Tim
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces at r-project.org> On Behalf Of
javad bayat
> > Sent: Wednesday, August 31, 2022 2:09 AM
> > To: r-help at r-project.org
> > Subject: [R] Combine two dataframe with different row number and
> > interpolation between values
> >
> > [External Email]
> >
> >  Dear all,
> > I am trying to combine two large dataframe in order to make a
dataframe
> > with exactly the dimension of the second dataframe.
> > The first df is as follows:
> >
> > df1 = data.frame(y = rep(c(2010,2011,2012,2013,2014), each = 2920), d
> > rep(c(1:365,1:365,1:365,1:365,1:365),each=8),
> >       h = rep(c(seq(3,24, by = 3),seq(3,24, by = 3),seq(3,24, by >
> 3),seq(3,24, by = 3),seq(3,24, by = 3)),365),
> >       ws = rnorm(1:14600, mean=20))
> > > head(df1)
> >      y       d   h        ws
> > 1  2010  1  3     20.71488
> > 2  2010  1  6     19.70125
> > 3  2010  1  9     21.00180
> > 4  2010  1 12     20.29236
> > 5  2010  1 15     20.12317
> > 6  2010  1 18     19.47782
> >
> > The data in the "ws" column were measured with 3 hours
frequency and I
> > need data with one hour frequency. I have made a second df as follows
with
> > one hour frequency for the "ws" column.
> >
> > df2 = data.frame(y = rep(c(2010,2011,2012,2013,2014), each = 8760), d
> > rep(c(1:365,1:365,1:365,1:365,1:365),each=24),
> >       h = rep(c(1:24,1:24,1:24,1:24,1:24),365), ws = "NA")
> > > head(df2)
> >       y      d    h   ws
> > 1  2010  1    1   NA
> > 2  2010  1    2   NA
> > 3  2010  1    3   NA
> > 4  2010  1    4   NA
> > 5  2010  1    5   NA
> > 6  2010  1    6   NA
> >
> > What I am trying to do is combine these two dataframes so as to the
rows in
> > df1 (based on the values of "y", "d",
"h" columns) that have values
> > exactly similar to df2's rows copied in its place in the new df
(df3).
> > For example, in the first dataframe the first row was measured at 3
> > o'clock on the first day of 2010 and this row must be placed on
the third
> > row of the second dataframe which has a similar value (2010, 1, 3).
Like
> > the below
> > table:
> >       y      d    h   ws
> > 1  2010  1    1   NA
> > 2  2010  1    2   NA
> > 3  2010  1    3   20.71488
> > 4  2010  1    4   NA
> > 5  2010  1    5   NA
> > 6  2010  1    6   19.70125
> >
> > But regarding the values of the "ws" column for df2 that do
not have value
> > (at 4 and 5 o'clock), I need to interpolate between the before and
after
> > values to fill in the missing data of the "ws".
> > I have tried the following codes but they did not work correctly.
> >
> > > df3 = merge(df1, df2, by = "y")
> > Error: cannot allocate vector of size 487.9 Mb or
> > > library(dplyr)
> > > df3<- df1%>% full_join(df2)
> >
> >
> > Is there any way to do this?
> > Sincerely
> >
> >
> >
> >
> >
> > --
> > Best Regards
> > Javad Bayat
> > M.Sc. Environment Engineering
> > Alternative Mail: bayat194 at yahoo.com
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >
> >
https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help&amp;data=05%7C01%7Ctebert%40ufl.edu%7C9e63e590cb834ddc23d908da8b1802e2%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C0%7C637975232519465332%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=WigJVAmdLn%2FK7ZtJq28%2Buv4aDmUjNXu6QPabdt5h2iQ%3D&amp;reserved=0
> > PLEASE do read the posting guide
> >
https://nam10.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html&amp;data=05%7C01%7Ctebert%40ufl.edu%7C9e63e590cb834ddc23d908da8b1802e2%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C0%7C637975232519465332%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=m1VCPBpnYy%2FwqlOuf5froUVEMBDJKwDuAWS4cFNx1wI%3D&amp;reserved=0
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
> --
> Best Regards
> Javad Bayat
> M.Sc. Environment Engineering
> Alternative Mail: bayat194 at yahoo.com
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

PIKAL Petr

2022-Sep-01 10:29 UTC

head link

[R] Combine two dataframe with different row number and interpolation between values

Hi

You could also use approxfun to fill NA values

fff <- approxfun(1:dim(df3)[1], df3$ws.x)
df3$ws2 <- fff(1:dim(df3)[1])
head(df3)
     y d h     ws.x ws.y      ws2
1 2010 1 1       NA   NA       NA
2 2010 1 2       NA   NA       NA
3 2010 1 3 20.46314   NA 20.46314
4 2010 1 4       NA   NA 20.74001
5 2010 1 5       NA   NA 21.01689
6 2010 1 6 21.29376   NA 21.29376

And fill starting or trailing NA by na.locf function from zoo package or
change starting or trailing NA to some value before calling approxfun.

Cheers
Petr
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Jim Lemon
> Sent: Thursday, September 1, 2022 11:31 AM
> To: javad bayat <j.bayat194 at gmail.com>; r-help mailing list
<r-help at r-
> project.org>
> Subject: Re: [R] Combine two dataframe with different row number and
> interpolation between values
> 
> Hi Javad,
> You seem to have the data frame join worked out, so here is a function
> to interpolate over sequences of NAs. I have found it quite useful.
> 
> # interpolate over sequences of NAs
> # NA sequences at the beginning of a file are replaced with the first
> nonNA value
> # NA sequences at the end are filled with the last nonNA value
> # sequences of all NAs are returned unaltered
> interpNA<-function(x) {
>  if(any(is.na(x))) {
>   xrle<-rle(is.na(x))
>   begin<-end<-1
>   # if the sequence begins with NA
>   # set the initial run of NAs to the 1st non-NA
>   if(is.na(x[1])) x[1:xrle$lengths[1]]<-x[xrle$lengths[1]+1]
>   for(i in 1:length(xrle$values)) {
>    # if this is a run of NA, fill it
>    if(xrle$values[i]) {
>     end<-begin+xrle$lengths[i]+(i>1)
>     if(is.na(x[end])) x[begin:(end-1)]<-x[begin]
>     else x[begin:end]<-seq(x[begin],x[end],length.out=1+end-begin)
>     begin<-end+xrle$lengths[i]-1
>   }
>    else {
>     # set begin to the end of the non-NA values
>     begin<-end+xrle$lengths[i]-1
>    }
>   }
>  }
>  return(x)
> }
> 
> Jim
> 
> On Thu, Sep 1, 2022 at 3:29 PM javad bayat <j.bayat194 at gmail.com>
wrote:
> >
> >  Dear Tim;
> > The dplyr did not work for me. My data frames have exactly similar
columns> > but different row numbers.
> > Mr Petr sent this code and the code worked but it copied the second
> > dataframe into the first one and did not replace the corresponding
row.
> > > df3 = merge(df1, df2, all = TRUE)
> > Regarding filling the "NA" data, it does not matter for me
to
interpolate> > between numbers or put the mean of numbers.
> > Sincerely
> >
> >
> >
> >
> >
> >
> > On Wed, Aug 31, 2022 at 5:17 PM Ebert,Timothy Aaron <tebert at
ufl.edu>
> wrote:
> >
> > > Can I interest you in the join functions in dplyr?
> > > https://www.datasciencemadesimple.com/join-in-r-merge-in-r/
> > >
> > > Filling in missing data is a useful practice when the fake
(simulated)
> > > data is a small proportion of all data. When 2/3 of the data is
fake
one> > > must wonder if anything based on those numbers is real or an
artifact
of> > > assumptions made to generate the numbers.
> > >
> > > I deal with weather data and some weather stations are set up to
average> > > measurements between recorded values and others take a single
reading
at> > > regular intervals. How you interpolate might depend on which
option
> > > describes your data. The alternative is to use the method for
recording ws> > > and apply it to the other data that will be merged with these
data. I
> > > assume there is more data, otherwise I see little point in
expanding
these> > > values out.
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: R-help <r-help-bounces at r-project.org> On Behalf Of
javad bayat
> > > Sent: Wednesday, August 31, 2022 2:09 AM
> > > To: r-help at r-project.org
> > > Subject: [R] Combine two dataframe with different row number and
> > > interpolation between values
> > >
> > > [External Email]
> > >
> > >  Dear all,
> > > I am trying to combine two large dataframe in order to make a
dataframe> > > with exactly the dimension of the second dataframe.
> > > The first df is as follows:
> > >
> > > df1 = data.frame(y = rep(c(2010,2011,2012,2013,2014), each =
2920), d
> > > rep(c(1:365,1:365,1:365,1:365,1:365),each=8),
> > >       h = rep(c(seq(3,24, by = 3),seq(3,24, by = 3),seq(3,24, by
> > > 3),seq(3,24, by = 3),seq(3,24, by = 3)),365),
> > >       ws = rnorm(1:14600, mean=20))
> > > > head(df1)
> > >      y       d   h        ws
> > > 1  2010  1  3     20.71488
> > > 2  2010  1  6     19.70125
> > > 3  2010  1  9     21.00180
> > > 4  2010  1 12     20.29236
> > > 5  2010  1 15     20.12317
> > > 6  2010  1 18     19.47782
> > >
> > > The data in the "ws" column were measured with 3 hours
frequency and I
> > > need data with one hour frequency. I have made a second df as
follows
with> > > one hour frequency for the "ws" column.
> > >
> > > df2 = data.frame(y = rep(c(2010,2011,2012,2013,2014), each =
8760), d
> > > rep(c(1:365,1:365,1:365,1:365,1:365),each=24),
> > >       h = rep(c(1:24,1:24,1:24,1:24,1:24),365), ws =
"NA")
> > > > head(df2)
> > >       y      d    h   ws
> > > 1  2010  1    1   NA
> > > 2  2010  1    2   NA
> > > 3  2010  1    3   NA
> > > 4  2010  1    4   NA
> > > 5  2010  1    5   NA
> > > 6  2010  1    6   NA
> > >
> > > What I am trying to do is combine these two dataframes so as to
the
rows in> > > df1 (based on the values of "y", "d",
"h" columns) that have values
> > > exactly similar to df2's rows copied in its place in the new
df (df3).
> > > For example, in the first dataframe the first row was measured at
3
> > > o'clock on the first day of 2010 and this row must be placed
on the
third> > > row of the second dataframe which has a similar value (2010, 1,
3).
Like> > > the below
> > > table:
> > >       y      d    h   ws
> > > 1  2010  1    1   NA
> > > 2  2010  1    2   NA
> > > 3  2010  1    3   20.71488
> > > 4  2010  1    4   NA
> > > 5  2010  1    5   NA
> > > 6  2010  1    6   19.70125
> > >
> > > But regarding the values of the "ws" column for df2
that do not have
value> > > (at 4 and 5 o'clock), I need to interpolate between the
before and
after> > > values to fill in the missing data of the "ws".
> > > I have tried the following codes but they did not work correctly.
> > >
> > > > df3 = merge(df1, df2, by = "y")
> > > Error: cannot allocate vector of size 487.9 Mb or
> > > > library(dplyr)
> > > > df3<- df1%>% full_join(df2)
> > >
> > >
> > > Is there any way to do this?
> > > Sincerely
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Javad Bayat
> > > M.Sc. Environment Engineering
> > > Alternative Mail: bayat194 at yahoo.com
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
> > >
> > >
> https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.et
> hz.ch%2Fmailman%2Flistinfo%2Fr-
> help&amp;data=05%7C01%7Ctebert%40ufl.edu%7C9e63e590cb834ddc23d90
> 8da8b1802e2%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C0%7C6379
> 75232519465332%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiL
> CJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&a
> mp;sdata=WigJVAmdLn%2FK7ZtJq28%2Buv4aDmUjNXu6QPabdt5h2iQ%3D&a
> mp;reserved=0
> > > PLEASE do read the posting guide
> > >
> https://nam10.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-
> project.org%2Fposting-
> guide.html&amp;data=05%7C01%7Ctebert%40ufl.edu%7C9e63e590cb834ddc
> 23d908da8b1802e2%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C0%7
> C637975232519465332%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%
> 7C&amp;sdata=m1VCPBpnYy%2FwqlOuf5froUVEMBDJKwDuAWS4cFNx1wI%3
> D&amp;reserved=0
> > > and provide commented, minimal, self-contained, reproducible
code.
> > >
> >
> >
> > --
> > Best Regards
> > Javad Bayat
> > M.Sc. Environment Engineering
> > Alternative Mail: bayat194 at yahoo.com
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.

R help - Sep 2022 - Combine two dataframe with different row number and interpolation between values

[R] Combine two dataframe with different row number and interpolation between values

[R] Combine two dataframe with different row number and interpolation between values