Anna Zakrisson Braeunlich
2014-Oct-12  07:24 UTC
[R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.
Hi,
I have a question about how to split a factor name into separate columns. I have
metabarcoding data and need to merge the FASTA file with the taxonomy and
count table files (data frames). To do this merge, I need to isolate
the common identifier, which unfortunately is baked in with a lot of other labels
in the factor name, e.g.:
sequence identifier:
M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2
I want to split this name at every "." to get several columns:
column1: M01271_77_000000000
column2: A8J0P_1_1101_10150_1525
column3: 1
column4: 322519
column5: sample_1
column6: sample_2
I must add that I have no influence on how these names are assigned. This is how
they are supplied by the Illumina MiSeq. I just need to be able to deal with them.
Here is some extremely simplified dummy data to further show the issue at hand:
df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
                  Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
df2 <- data.frame(cbind(B = 13:22, K = rnorm(10)),
                  Q.identifierA.B4668726 = factor(rep(LETTERS[1:2], each = 5)))
# I have metabarcoding data with one FASTA file, one count table and one
# taxonomy file. The dummy data above just shows the issue at hand. I want to
# be able to merge my three original data frames (the dummy data has only two).
# The problem is that the only identifier that is common to the data frames is
# "hidden" in the factor name, e.g. Z.identifierA.B1298712 and
# Q.identifierA.B4668726. I hence need to be able to split this name up into
# different columns to get "identifierA" alone as one column name.
# Then I can merge the data frames.
# How can I do this in R? I know that it can be done in Excel, but I would like
# to produce a complete R script to get a fast pipeline and avoid copy-and-paste
# errors.
# This is what I want it to look like:
df1.goal <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
                  Z = factor(rep(LETTERS[1:2], each = 5)),
                  identifierA = factor(rep(LETTERS[1:2], each = 5)),
                  B1298712 = factor(rep(LETTERS[1:2], each = 5)))
# Many thanks and kind regards
Anna Zakrisson
Anna Zakrisson Braeunlich
PhD student
Department of Ecology, Environment and Plant Sciences
Stockholm University
Svante Arrheniusv. 21A
SE-106 91 Stockholm
Sweden/Sverige
Lives in Berlin.
For paper mail:
Katzbachstr. 21
D-10965, Berlin
Germany/Deutschland
E-mail: anna.zakrisson at su.se
Tel work: +49-(0)3091541281
Mobile: +49-(0)15777374888
LinkedIn: http://se.linkedin.com/pub/anna-zakrisson-braeunlich/33/5a2/51b
David.Kaethner at dlr.de
2014-Oct-13  08:42 UTC
[R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.
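I'm not sure I understood your problem, but maybe something like this:
library(stringr)  # provides str_split()
# split identifiers into columns
df1 <- data.frame(cbind(X = 1:10, Y = rnorm(10)),
                  Z.identifierA.B1298712 = factor(rep(LETTERS[1:2], each = 5)))
id <- names(df1)[3]
# split the column name at every literal "."; x is a 1 x 3 character matrix
x <- do.call(rbind, str_split(id, "\\."))
# one copy of the original factor per name part, as a list named after the parts
y <- sapply(x, function(z) df1[, id], simplify = FALSE)
# drop the original column and append the three named copies
df1.goal <- data.frame(df1[, -3], y)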
-dk
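If the identifiers are instead stored as values in a column (as in the MiSeq example from the question) rather than in a column name, the same idea can be applied row by row. The following is only a minimal base-R sketch under that assumption: the data frame `fasta`, its column `seq_id`, the second identifier, and the names `column1` to `column6` are all made up for illustration.

```r
# Hypothetical data: identifiers stored as values in an assumed column seq_id.
# The first value is the example from the question; the second is invented.
fasta <- data.frame(
  seq_id = c("M01271_77_000000000.A8J0P_1_1101_10150_1525.1.322519.sample_1.sample_2",
             "M01271_77_000000000.A8J0P_1_1101_10151_1530.1.322520.sample_1.sample_2"),
  stringsAsFactors = FALSE
)

# Split every identifier at each literal "." (fixed = TRUE avoids regex escaping)
parts <- strsplit(as.character(fasta$seq_id), split = ".", fixed = TRUE)

# Stack the pieces into a character matrix, one row per identifier.
# Assumption: every identifier has the same number of "."-separated fields.
parts <- do.call(rbind, parts)
colnames(parts) <- paste0("column", seq_len(ncol(parts)))

# Append the new columns to the original data frame
fasta <- cbind(fasta, as.data.frame(parts, stringsAsFactors = FALSE))
```

`as.character()` is there so the split also works when `seq_id` is a factor, which is what `data.frame()` produced by default in R versions before 4.0.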
Ista Zahn
2014-Oct-13  13:42 UTC
[R] seqinr ?: Splitting a factor name into several columns. Dealing with metabarcoding data.
Hi Anna,
Use strsplit to separate the components, something like
separateNames <- strsplit(names(df1)[3], split = "\\.")[[1]]
for(name in separateNames) {
  df1[[name]] <- df1[[3]]
}
df1[[3]] <- NULL
Best,
Ista
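Once the shared part of the identifier has been isolated, the merge itself could look roughly like the sketch below. It rests on several assumptions that the thread does not confirm: the tables `counts` and `taxonomy`, their columns and values, the helper `get_field()`, and the choice of the 4th field as the common identifier are all hypothetical.

```r
# Hypothetical tables; names and values are made up for illustration only
counts <- data.frame(
  seq_id = c("runA.read1.1.322519.sample_1", "runA.read2.1.322520.sample_1"),
  count  = c(12, 7),
  stringsAsFactors = FALSE
)
taxonomy <- data.frame(
  seq_id = c("runB.x.1.322520.sample_2", "runB.y.1.322519.sample_2"),
  taxon  = c("Bacillariophyta", "Dinophyceae"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the n-th "."-separated field of each identifier
get_field <- function(ids, n) {
  vapply(strsplit(as.character(ids), ".", fixed = TRUE),
         function(p) p[n], character(1))
}

# Assumption: the 4th field is the identifier shared by all tables
counts$common_id   <- get_field(counts$seq_id, 4)
taxonomy$common_id <- get_field(taxonomy$seq_id, 4)

# Join the two tables on the extracted identifier; the suffixes keep the
# two original seq_id columns apart
merged <- merge(counts, taxonomy, by = "common_id",
                suffixes = c(".counts", ".taxonomy"))
```

By default `merge()` keeps only identifiers present in both tables; `all = TRUE` would keep unmatched rows as well.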