Hi All,

I want to extract new variables from a string and add them to the data frame.
The sample data is in CSV format.

dat<-read.csv(text="Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11",header=TRUE)

The string column holds at most five variables. Some rows have all five
and others have fewer. Any that are missing should be filled with NA.
The desired result is shown below:

Year,Sex,string, S1, S2, S3, S4,S5
2002,F,15 xc Ab, 15,xc,Ab, NA, NA
2003,F,14, 14,NA,NA,NA,NA
2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
2005,M,13 25,13, 25,NA,NA,NA
2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
2007,F,11, 11,NA,NA,NA,NA

Any help?
Thank you in advance.
I would split dat$string into its own vector, break it apart at the spaces into an array, and then place dat$Year and dat$Sex in positions 1 and 2 of that newly created array.

On Fri, Jul 19, 2024, 12:52 PM Val <valkremk at gmail.com> wrote:

> Hi All,
>
> I want to extract new variables from a string and add it to the dataframe.
> Sample data is csv file.
>
> dat<-read.csv(text="Year, Sex,string
> 2002,F,15 xc Ab
> 2003,F,14
> 2004,M,18 xb 25 35 21
> 2005,M,13 25
> 2006,M,14 ac 256 AV 35
> 2007,F,11",header=TRUE)
>
> The string column has a maximum of five variables. Some rows have all
> and others may not have all the five variables. If missing then fill
> it with NA,
> Desired result is shown below,
>
> Year,Sex,string, S1, S2, S3 S4,S5
> 2002,F,15 xc Ab, 15,xc,Ab, NA, NA
> 2003,F,14, 14,NA,NA,NA,NA
> 2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
> 2005,M,13 25,13, 25,NA,NA,NA
> 2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
> 2007,F,11, 11,NA,NA,NA,NA
>
> Any help?
> Thank you in advance.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
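
The split-and-pad suggestion in the reply above could be sketched in base R roughly as follows. This is only an illustration, not the poster's own code: the names n_max, pieces, padded and result are made up for the example, and it assumes the pieces are separated by whitespace and that unused positions are always at the end.

```r
# Sample data from the original post
dat <- read.csv(text = "Year,Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header = TRUE, stringsAsFactors = FALSE)

n_max <- 5  # assumed maximum number of pieces per row

# Split each string at runs of whitespace; this gives a list of character vectors
pieces <- strsplit(trimws(dat$string), "\\s+")

# Indexing past the end of a vector yields NA, so p[1:n_max] pads short rows.
# vapply() returns one column per row of dat, hence the transpose.
padded <- t(vapply(pieces, function(p) p[seq_len(n_max)], character(n_max)))
colnames(padded) <- paste0("S", seq_len(n_max))

# Keep Year, Sex and the original string, then append S1..S5 (all character)
result <- cbind(dat, as.data.frame(padded, stringsAsFactors = FALSE))
print(result)
```

All of S1 to S5 come back as character here; any numeric columns would need an explicit conversion afterwards.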
The desired result is odd.
1) It looks like the string is duplicated in the desired result. The first
line of data has "15, xc, Ab", and the desired result has
"15, xc, Ab, 15, xc, Ab".
2) The example has S1 through S5, but the desired result has data for eight
variables in the first line (not five).
3) The desired result has a different number of variables for each line.
4) Are you assuming that all missing data is at the end of the string? If
there are 5 variables (S1 .... S5), do you know that "15, xc, Ab" is
S1 = 15, S2 = 'xc', and S3 = 'Ab' rather than S2 = 15, S4 = 'xc' and
S5 = 'Ab'?
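This isn't exactly what you asked for, but maybe I was confused somewhere.
This approach puts the pieces of the string into variables S1 to S5 in order
and fills any unused variables on the right with NA. Because the pieces mix
text and numbers, the new columns are all character. The string column is
replaced by the new columns rather than duplicated.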
library(tidyr)
dat <- read.csv(text="Year,Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header=TRUE, stringsAsFactors=FALSE)
# split the 'string' column based on spaces
dat_separated <- dat |>
  separate(string, into = paste0("S", 1:5), sep = " ",
           fill = "right", extra = "merge")
Tim
I did not look closely at the solutions that you were offered, but note that you did not specify in your post whether the numbers in your string were to be character or numeric variables after they are broken out into their own columns. I believe that they are character in the solutions, but you should check this. If you want them as numeric, e.g., for further processing, you will need to convert them. Or vice versa.

Bert
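
One way to check and convert the column types, as suggested above, is sketched below. The small parts data frame is a made-up stand-in for whatever a splitting step returns (its values are taken from the sample data), and converted is likewise a name chosen for the example. type.convert() is base R; as.is = TRUE keeps columns that are not purely numeric as character instead of turning them into factors.

```r
# Character columns as a splitting step might return them
parts <- data.frame(
  S1 = c("15", "14", "18", "13", "14", "11"),
  S2 = c("xc", NA, "xb", "25", "ac", NA),
  S5 = c(NA, NA, "21", NA, "35", NA),
  stringsAsFactors = FALSE
)

# Inspect the current type of each column
print(sapply(parts, class))

# Convert column by column: purely numeric columns become integer or double,
# columns containing any text stay character
converted <- type.convert(parts, as.is = TRUE)
print(sapply(converted, class))
```

A column such as S2, which holds both "xc" and "25", cannot become numeric as a whole, so it stays character.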
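We can use read.table for a base R solution:

# parse the string column as whitespace-separated text;
# fill = TRUE pads rows that have fewer fields
string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
                     na.strings = "")
# name the new columns S1, S2, ...
names(string) <- paste0("S", seq_along(string))
# drop the original string column (column 3) and append the new columns
cbind(dat[-3], string)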
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
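
A possible variation on the read.table call above, not part of the original reply: with fill = TRUE, read.table works out the number of columns from the first five lines of input, so if the longest string appeared further down a larger data set the result could come out with too few columns. Passing col.names fixes the width up front. The sketch assumes at most five pieces per string (n_max is a name made up for the example) and keeps the string column so the layout matches the desired result.

```r
# Sample data from the original post
dat <- read.csv(text = "Year,Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header = TRUE, stringsAsFactors = FALSE)

n_max <- 5  # assumed maximum number of pieces per row

# col.names sets the number of columns explicitly instead of
# letting read.table infer it from the first few lines
string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
                     na.strings = "",
                     col.names = paste0("S", seq_len(n_max)))

# cbind(dat, ...) rather than cbind(dat[-3], ...) keeps the string column
result <- cbind(dat, string)
print(result)
print(sapply(result, class))
```

Here read.table picks a type for each column separately, so columns that contain only numbers come back numeric while mixed columns stay character.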