thr3ads.net - R help - [R] Fwd: Re: transpose and split dataframe [May 2019]

If this information is useful, please help other people find it:
Share via:

Jim Lemon

2019-May-03 00:32 UTC

[R] Fwd: Re: transpose and split dataframe

Hi Matthew,
I'm not sure whether you want something like your initial request or
David's solution. The result of this can be transformed into the
latter:

mmdf<-read.table(text="Regulator hits
AT1G69490
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
AT1G29860
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
AT1G2986
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
header=TRUE,stringsAsFactors=FALSE)
# split the second column at the commas
hitsplit<-strsplit(mmdf$hits,",")
# define a function that will fill with NAs
NAfill<-function(x,n) return(x[1:n])
# get the maximum length of hits
maxlen<-max(unlist(lapply(hitsplit,length)))
# fill the list with NAs
hitsplit<-lapply(hitsplit,NAfill,maxlen)
# get all the sorted hits
allhits<-sort(unique(unlist(hitsplit)))
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
# change the names of the list
names(tmmdf)<-mmdf$Regulator
# replace all NA values in tmmdf where they appear in hitsplit
for(column in 1:length(hitsplit)) {
 hitmatches<-match(hitsplit[[column]],allhits)
 hitmatches<-hitmatches[!is.na(hitmatches)]
 tmmdf[hitmatches,column]<-allhits[hitmatches]
}

Jim

On Fri, May 3, 2019 at 12:43 AM David L Carlson <dcarlson at tamu.edu>
wrote:>
> We still have only the toy version of your data from your first email. The
second email used dput() as I suggested, but you truncated the results so it is
useless for testing purposes.
>
> Use the following code after creating DataList (up to mx <- ... ) in my
earlier answer:
>
> n <- sapply(DataList, length)
> hits <- unname(unlist(DataList))
> Regulator <- unname(unlist(mapply(rep, names(DataList), times=n)))
> DataTable <- table(hits, Regulator)
>
> #            Regulator
> # hits        AT1G69490 AT2G55980
> #  AT1G05675         1         0
> #  AT1G26380         1         0
> #  AT2G85403         0         1
> #  AT4G31950         1         0
> #  AT4G89223         0         1
> #  AT5G24110         1         0
>
> Now the Regulators and the hits will be listed in alphabetical order. The
table has 0's for Regulators that do not have a particular hit. If you want
NAs:
>
> DataTable[DataTable==0] <- NA
> print(DataTable, na.print="NA")
> #            Regulator
> # hits        AT1G69490 AT2G55980
> #   AT1G05675         1        NA
> #   AT1G26380         1        NA
> #   AT2G85403        NA         1
> #   AT4G31950         1        NA
> #   AT4G89223        NA         1
> #   AT5G24110         1        NA
>
> If you need a data frame instead of a table:
>
> as.data.frame.matrix(DataTable)
>
> ----------------------------------------
> David L Carlson
> Department of Anthropology
> Texas A&M University
> College Station, TX 77843-4352
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Matthew
> Sent: Tuesday, April 30, 2019 4:31 PM
> To: r-help at r-project.org
> Subject: [R] Fwd: Re: transpose and split dataframe
>
> Thanks for your reply. I was trying to simplify it a little, but must
> have got it wrong. Here is the real dataframe, TF2list:
>
>   str(TF2list)
> 'data.frame':    152 obs. of  2 variables:
>   $ Regulator: Factor w/ 87 levels
"AT1G02065","AT1G13960",..: 17 6 6 54
> 54 82 82 82 82 82 ...
>   $ hits     : Factor w/ 97 levels
>
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
> __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
>
>     And the first few lines resulting from dput(head(TF2list)):
>
> dput(head(TF2list))
> structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
> 82L), .Label = c("AT1G02065", "AT1G13960",
"AT1G18860", "AT1G23380",
> "AT1G29280", "AT1G29860", "AT1G30650",
"AT1G55600", "AT1G62300",
> "AT1G62990", "AT1G64000", "AT1G66550",
"AT1G66560", "AT1G66600",
> "AT1G68150", "AT1G69310", "AT1G69490",
"AT1G69810", "AT1G70510", ...
>
> This is another way of looking at the first 4 entries (Regulator is
> tab-separated from hits):
>
> Regulator
>    hits
> 1
> AT1G69490
>  
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> 2
> AT1G29860
>  
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
>
> 3
> AT1G2986
>  
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
>
>     So, the goal would be to
>
> first: Transpose the existing dataframe so that the factor Regulator
> becomes a column name (column 1 name = AT1G69490, column2 name
> AT1G29860, etc.) and the hits associated with each Regulator become
> rows. Hits is a comma separated 'list' ( I do not not know if
> technically it is an R list.), so it would have to be comma
> 'unseparated' with each entry becoming a row (col 1 row 1 =
AT4G31950,
> col 1 row 2 - AT5G24410, etc); like this :
>
> AT1G69490
> AT4G31950
> AT5G24110
> AT1G05675
> AT5G64905
>
> ... I did not include all the rows)
>
> I think it would be best to actually make the first entry a separate
> dataframe ( 1 column with name = AT1G69490 and number of rows depending
> on the number of hits), then make the second column (column name >
AT1G29860, and number of rows depending on the number of hits) into a
> new dataframe and do a full join of of the two dataframes; continue by
> making the third column (column name = AT1G2986) into a dataframe and
> full join it with the previous; continue for the 152 observations so
> that then end result is a dataframe with 152 columns and number of rows
> depending on the entry with the greatest number of hits. The full joins
> I can do with dplyr, but getting up to that point seems rather difficult.
>
> This would get me what my ultimate goal would be; each Regulator is a
> column name (152 columns) and a given row has either NA or the same hit.
>
>     This seems very difficult to me, but I appreciate any attempt.
>
> Matthew
>
> On 4/30/2019 4:34 PM, David L Carlson wrote:
> >          External Email - Use Caution
> >
> > I think we need more information. Can you give us the structure of the
data with str(YourDataFrame). Alternatively you could copy a small piece into
your email message by copying and pasting the results of the following code:
> >
> > dput(head(YourDataFrame))
> >
> > The data frame you present could not be a data frame since you say
"hits" is a factor with a variable number of elements. If each value
of "hits" was a single character string, it would only have 2 factor
levels not 6 and your efforts to parse the string would make more sense.
Transposing to a data frame would only be possible if each column was padded
with NAs to make them equal in length. Since your example tries use the name
TF2list, it is possible that you do not have a data frame but a list and you
have no factor levels, just character vectors.
> >
> > If you are not familiar with R, it may be helpful to tell us what your
overall goal is rather than an intermediate step. Very likely R can easily
handle what you want by doing things a different way.
> >
> > ----------------------------------------
> > David L Carlson
> > Department of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> >
> >
> > -----Original Message-----
> > From: R-help<r-help-bounces at r-project.org>  On Behalf Of
Matthew
> > Sent: Tuesday, April 30, 2019 2:25 PM
> > To: r-help (r-help at r-project.org)<r-help at r-project.org>
> > Subject: [R] transpose and split dataframe
> >
> > I have a data frame that is a lot bigger but for simplicity sake we
can
> > say it looks like this:
> >
> > Regulator    hits
> > AT1G69490    AT4G31950,AT5G24110,AT1G26380,AT1G05675
> > AT2G55980    AT2G85403,AT4G89223
> >
> >      In other words:
> >
> > data.frame : 2 obs. of 2 variables
> > $Regulator: Factor w/ 2 levels
> > $hits         : Factor w/ 6 levels
> >
> >     I want to transpose it so that Regulator is now the column
headings
> > and each of the AGI numbers now separated by commas is a row. So,
> > AT1G69490 is now the header of the first column and AT4G31950 is row 1
> > of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header
of
> > column 2 and AT2G85403 is row 1 of column 2, etc.
> >
> >     I have tried playing around with strsplit(TF2list[2:2]) and
> > strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
> >
> > Matthew
> >
> > ______________________________________________
> > R-help at r-project.org  mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting
guidehttp://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Jim Lemon

2019-May-03 00:40 UTC

head link

[R] Fwd: Re: transpose and split dataframe

Hi again,
Just noticed that the NA fill in the original solution is unnecessary, thus:

# split the second column at the commas
hitsplit<-strsplit(mmdf$hits,",")
# get all the sorted hits
allhits<-sort(unique(unlist(hitsplit)))
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
# change the names of the list
names(tmmdf)<-mmdf$Regulator
for(column in 1:length(hitsplit)) {
 hitmatches<-match(hitsplit[[column]],allhits)
 hitmatches<-hitmatches[!is.na(hitmatches)]
 tmmdf[hitmatches,column]<-allhits[hitmatches]
}

Jim

On Fri, May 3, 2019 at 10:32 AM Jim Lemon <drjimlemon at gmail.com>
wrote:>
> Hi Matthew,
> I'm not sure whether you want something like your initial request or
> David's solution. The result of this can be transformed into the
> latter:
>
> mmdf<-read.table(text="Regulator hits
> AT1G69490
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> AT1G29860
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
> AT1G2986
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
> header=TRUE,stringsAsFactors=FALSE)
> # split the second column at the commas
> hitsplit<-strsplit(mmdf$hits,",")
> # define a function that will fill with NAs
> NAfill<-function(x,n) return(x[1:n])
> # get the maximum length of hits
> maxlen<-max(unlist(lapply(hitsplit,length)))
> # fill the list with NAs
> hitsplit<-lapply(hitsplit,NAfill,maxlen)
> # get all the sorted hits
> allhits<-sort(unique(unlist(hitsplit)))
>
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
> # change the names of the list
> names(tmmdf)<-mmdf$Regulator
> # replace all NA values in tmmdf where they appear in hitsplit
> for(column in 1:length(hitsplit)) {
>  hitmatches<-match(hitsplit[[column]],allhits)
>  hitmatches<-hitmatches[!is.na(hitmatches)]
>  tmmdf[hitmatches,column]<-allhits[hitmatches]
> }
>
> Jim
>
> On Fri, May 3, 2019 at 12:43 AM David L Carlson <dcarlson at
tamu.edu> wrote:
> >
> > We still have only the toy version of your data from your first email.
The second email used dput() as I suggested, but you truncated the results so it
is useless for testing purposes.
> >
> > Use the following code after creating DataList (up to mx <- ... )
in my earlier answer:
> >
> > n <- sapply(DataList, length)
> > hits <- unname(unlist(DataList))
> > Regulator <- unname(unlist(mapply(rep, names(DataList), times=n)))
> > DataTable <- table(hits, Regulator)
> >
> > #            Regulator
> > # hits        AT1G69490 AT2G55980
> > #  AT1G05675         1         0
> > #  AT1G26380         1         0
> > #  AT2G85403         0         1
> > #  AT4G31950         1         0
> > #  AT4G89223         0         1
> > #  AT5G24110         1         0
> >
> > Now the Regulators and the hits will be listed in alphabetical order.
The table has 0's for Regulators that do not have a particular hit. If you
want NAs:
> >
> > DataTable[DataTable==0] <- NA
> > print(DataTable, na.print="NA")
> > #            Regulator
> > # hits        AT1G69490 AT2G55980
> > #   AT1G05675         1        NA
> > #   AT1G26380         1        NA
> > #   AT2G85403        NA         1
> > #   AT4G31950         1        NA
> > #   AT4G89223        NA         1
> > #   AT5G24110         1        NA
> >
> > If you need a data frame instead of a table:
> >
> > as.data.frame.matrix(DataTable)
> >
> > ----------------------------------------
> > David L Carlson
> > Department of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces at r-project.org> On Behalf Of
Matthew
> > Sent: Tuesday, April 30, 2019 4:31 PM
> > To: r-help at r-project.org
> > Subject: [R] Fwd: Re: transpose and split dataframe
> >
> > Thanks for your reply. I was trying to simplify it a little, but must
> > have got it wrong. Here is the real dataframe, TF2list:
> >
> >   str(TF2list)
> > 'data.frame':    152 obs. of  2 variables:
> >   $ Regulator: Factor w/ 87 levels
"AT1G02065","AT1G13960",..: 17 6 6 54
> > 54 82 82 82 82 82 ...
> >   $ hits     : Factor w/ 97 levels
> >
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
> > __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
> >
> >     And the first few lines resulting from dput(head(TF2list)):
> >
> > dput(head(TF2list))
> > structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
> > 82L), .Label = c("AT1G02065", "AT1G13960",
"AT1G18860", "AT1G23380",
> > "AT1G29280", "AT1G29860", "AT1G30650",
"AT1G55600", "AT1G62300",
> > "AT1G62990", "AT1G64000", "AT1G66550",
"AT1G66560", "AT1G66600",
> > "AT1G68150", "AT1G69310", "AT1G69490",
"AT1G69810", "AT1G70510", ...
> >
> > This is another way of looking at the first 4 entries (Regulator is
> > tab-separated from hits):
> >
> > Regulator
> >    hits
> > 1
> > AT1G69490
> >  
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
> > 2
> > AT1G29860
> >  
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
> >
> > 3
> > AT1G2986
> >  
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
> >
> >     So, the goal would be to
> >
> > first: Transpose the existing dataframe so that the factor Regulator
> > becomes a column name (column 1 name = AT1G69490, column2 name
> > AT1G29860, etc.) and the hits associated with each Regulator become
> > rows. Hits is a comma separated 'list' ( I do not not know if
> > technically it is an R list.), so it would have to be comma
> > 'unseparated' with each entry becoming a row (col 1 row 1 =
AT4G31950,
> > col 1 row 2 - AT5G24410, etc); like this :
> >
> > AT1G69490
> > AT4G31950
> > AT5G24110
> > AT1G05675
> > AT5G64905
> >
> > ... I did not include all the rows)
> >
> > I think it would be best to actually make the first entry a separate
> > dataframe ( 1 column with name = AT1G69490 and number of rows
depending
> > on the number of hits), then make the second column (column name >
> AT1G29860, and number of rows depending on the number of hits) into a
> > new dataframe and do a full join of of the two dataframes; continue by
> > making the third column (column name = AT1G2986) into a dataframe and
> > full join it with the previous; continue for the 152 observations so
> > that then end result is a dataframe with 152 columns and number of
rows
> > depending on the entry with the greatest number of hits. The full
joins
> > I can do with dplyr, but getting up to that point seems rather
difficult.
> >
> > This would get me what my ultimate goal would be; each Regulator is a
> > column name (152 columns) and a given row has either NA or the same
hit.
> >
> >     This seems very difficult to me, but I appreciate any attempt.
> >
> > Matthew
> >
> > On 4/30/2019 4:34 PM, David L Carlson wrote:
> > >          External Email - Use Caution
> > >
> > > I think we need more information. Can you give us the structure
of the data with str(YourDataFrame). Alternatively you could copy a small piece
into your email message by copying and pasting the results of the following
code:
> > >
> > > dput(head(YourDataFrame))
> > >
> > > The data frame you present could not be a data frame since you
say "hits" is a factor with a variable number of elements. If each
value of "hits" was a single character string, it would only have 2
factor levels not 6 and your efforts to parse the string would make more sense.
Transposing to a data frame would only be possible if each column was padded
with NAs to make them equal in length. Since your example tries use the name
TF2list, it is possible that you do not have a data frame but a list and you
have no factor levels, just character vectors.
> > >
> > > If you are not familiar with R, it may be helpful to tell us what
your overall goal is rather than an intermediate step. Very likely R can easily
handle what you want by doing things a different way.
> > >
> > > ----------------------------------------
> > > David L Carlson
> > > Department of Anthropology
> > > Texas A&M University
> > > College Station, TX 77843-4352
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: R-help<r-help-bounces at r-project.org>  On Behalf Of
Matthew
> > > Sent: Tuesday, April 30, 2019 2:25 PM
> > > To: r-help (r-help at r-project.org)<r-help at
r-project.org>
> > > Subject: [R] transpose and split dataframe
> > >
> > > I have a data frame that is a lot bigger but for simplicity sake
we can
> > > say it looks like this:
> > >
> > > Regulator    hits
> > > AT1G69490    AT4G31950,AT5G24110,AT1G26380,AT1G05675
> > > AT2G55980    AT2G85403,AT4G89223
> > >
> > >      In other words:
> > >
> > > data.frame : 2 obs. of 2 variables
> > > $Regulator: Factor w/ 2 levels
> > > $hits         : Factor w/ 6 levels
> > >
> > >     I want to transpose it so that Regulator is now the column
headings
> > > and each of the AGI numbers now separated by commas is a row. So,
> > > AT1G69490 is now the header of the first column and AT4G31950 is
row 1
> > > of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is
header of
> > > column 2 and AT2G85403 is row 1 of column 2, etc.
> > >
> > >     I have tried playing around with strsplit(TF2list[2:2]) and
> > > strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
> > >
> > > Matthew
> > >
> > > ______________________________________________
> > > R-help at r-project.org  mailing list -- To UNSUBSCRIBE and more,
see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting
guidehttp://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible
code.
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.

Matthew

2019-May-06 20:12 UTC

head link

[R] Fwd: Re: transpose and split dataframe

Thank you very much Jim and David for your scripts and accompanying 
explanations.

I was intrigued at the results that came from David's script.? As seen 
below where I have taken a small piece of his DataTable:

AT1G69490 AT1G29860??? AT4G18170 *AT5G46350*
AT1G01560??? 0??? 0??? 0??? 1
*AT1G02920*??? 1??? 2??? 2??? 4
AT1G02930??? 1??? 2??? 2??? 4
AT1G05675??? 1??? 1??? 1??? 2

 ?? There are numbers other than 1 or 0, which was not what I was 
expecting. The data I am working with come from downloading results of 
an analysis done at a particular web site. I looked at Jim's solution, 
and the equivalent of the above would be:

 ??? AT1G69490 _AT1G29860_ _AT1G29860_ AT4G18170??? AT4G18170 
*AT5G46350??? AT5G46350 AT5G46350??? AT5G46350??? AT5G46350*
AT1G01560??? NA??? NA??? NA??? NA??? NA??? NA NA??? NA??? AT1G01560??? NA
*AT1G02920*??? AT1G02920??? AT1G02920 AT1G02920??? AT1G02920??? 
AT1G02920??? AT1G02920??? AT1G02920 AT1G02920??? AT1G02920??? NA
AT1G02930??? AT1G02930??? AT1G02930??? AT1G02930 AT1G02930??? 
AT1G02930??? AT1G02930??? AT1G02930??? AT1G02930 AT1G02930??? NA
AT1G05675??? AT1G05675??? AT1G05675??? NA AT1G05675??? NA??? 
AT1G05675??? AT1G05675??? NA??? NA??? NA

 ? The above is the format that I was desiring, but I was not expecting 
that a single ATG number would be the name of multiple columns. As shown 
above, _AT1G2960_ is the name of two columns and *AT5G46350* is the name 
of 5 columns (You may have to widen the e-mail across the screen to see 
it clearly). When a single ATG number, such as AT5G46350, names multiple 
columns, then the contents of each of those columns may or may not be 
the same. For example, going across a single row looking at *AT1G02920*, 
it occurs in the first column, hence the 1 in David's DataTable. It 
occurs in both AT1G29860 columns, hence the 2 in the DataTable. It again 
occurs in both AT4G18170 columns, so another 2 in the DataTable, and 
finally it occurs in only 4 of the 5 AT5G46350 columns, so the 4 in the 
DataTable.

 ??? When the same ATG number names multiple columns it is because 
different methods were used to determine the content of each column. So, 
if an ATG number such as AT1G05675 occurs in all columns with the same 
name, I then know that it was by multiple methods that this has been 
shown, and if it only occurs in some of the columns, I know that all 
methods did not associate it with the column name ATG.? David's result 
complements Jim's, and both end up being very helpful to me.

 ? Thanks again to both of you for your time and help.

Matthew


On 5/2/2019 8:40 PM, Jim Lemon wrote:>          External Email - Use Caution
>
> Hi again,
> Just noticed that the NA fill in the original solution is unnecessary,
thus:
>
> # split the second column at the commas
> hitsplit<-strsplit(mmdf$hits,",")
> # get all the sorted hits
> allhits<-sort(unique(unlist(hitsplit)))
>
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
> # change the names of the list
> names(tmmdf)<-mmdf$Regulator
> for(column in 1:length(hitsplit)) {
>   hitmatches<-match(hitsplit[[column]],allhits)
>   hitmatches<-hitmatches[!is.na(hitmatches)]
>   tmmdf[hitmatches,column]<-allhits[hitmatches]
> }
>
> Jim
>
> On Fri, May 3, 2019 at 10:32 AM Jim Lemon <drjimlemon at gmail.com>
wrote:
>> Hi Matthew,
>> I'm not sure whether you want something like your initial request
or
>> David's solution. The result of this can be transformed into the
>> latter:
>>
>> mmdf<-read.table(text="Regulator hits
>> AT1G69490
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
>> AT1G29860
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
>> AT1G2986
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830",
>> header=TRUE,stringsAsFactors=FALSE)
>> # split the second column at the commas
>> hitsplit<-strsplit(mmdf$hits,",")
>> # define a function that will fill with NAs
>> NAfill<-function(x,n) return(x[1:n])
>> # get the maximum length of hits
>> maxlen<-max(unlist(lapply(hitsplit,length)))
>> # fill the list with NAs
>> hitsplit<-lapply(hitsplit,NAfill,maxlen)
>> # get all the sorted hits
>> allhits<-sort(unique(unlist(hitsplit)))
>>
tmmdf<-as.data.frame(matrix(NA,ncol=length(hitsplit),nrow=length(allhits)))
>> # change the names of the list
>> names(tmmdf)<-mmdf$Regulator
>> # replace all NA values in tmmdf where they appear in hitsplit
>> for(column in 1:length(hitsplit)) {
>>   hitmatches<-match(hitsplit[[column]],allhits)
>>   hitmatches<-hitmatches[!is.na(hitmatches)]
>>   tmmdf[hitmatches,column]<-allhits[hitmatches]
>> }
>>
>> Jim
>>
>> On Fri, May 3, 2019 at 12:43 AM David L Carlson <dcarlson at
tamu.edu> wrote:
>>> We still have only the toy version of your data from your first
email. The second email used dput() as I suggested, but you truncated the
results so it is useless for testing purposes.
>>>
>>> Use the following code after creating DataList (up to mx <- ...
) in my earlier answer:
>>>
>>> n <- sapply(DataList, length)
>>> hits <- unname(unlist(DataList))
>>> Regulator <- unname(unlist(mapply(rep, names(DataList),
times=n)))
>>> DataTable <- table(hits, Regulator)
>>>
>>> #            Regulator
>>> # hits        AT1G69490 AT2G55980
>>> #  AT1G05675         1         0
>>> #  AT1G26380         1         0
>>> #  AT2G85403         0         1
>>> #  AT4G31950         1         0
>>> #  AT4G89223         0         1
>>> #  AT5G24110         1         0
>>>
>>> Now the Regulators and the hits will be listed in alphabetical
order. The table has 0's for Regulators that do not have a particular hit.
If you want NAs:
>>>
>>> DataTable[DataTable==0] <- NA
>>> print(DataTable, na.print="NA")
>>> #            Regulator
>>> # hits        AT1G69490 AT2G55980
>>> #   AT1G05675         1        NA
>>> #   AT1G26380         1        NA
>>> #   AT2G85403        NA         1
>>> #   AT4G31950         1        NA
>>> #   AT4G89223        NA         1
>>> #   AT5G24110         1        NA
>>>
>>> If you need a data frame instead of a table:
>>>
>>> as.data.frame.matrix(DataTable)
>>>
>>> ----------------------------------------
>>> David L Carlson
>>> Department of Anthropology
>>> Texas A&M University
>>> College Station, TX 77843-4352
>>>
>>> -----Original Message-----
>>> From: R-help <r-help-bounces at r-project.org> On Behalf Of
Matthew
>>> Sent: Tuesday, April 30, 2019 4:31 PM
>>> To: r-help at r-project.org
>>> Subject: [R] Fwd: Re: transpose and split dataframe
>>>
>>> Thanks for your reply. I was trying to simplify it a little, but
must
>>> have got it wrong. Here is the real dataframe, TF2list:
>>>
>>>    str(TF2list)
>>> 'data.frame':    152 obs. of  2 variables:
>>>    $ Regulator: Factor w/ 87 levels
"AT1G02065","AT1G13960",..: 17 6 6 54
>>> 54 82 82 82 82 82 ...
>>>    $ hits     : Factor w/ 97 levels
>>>
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
>>> __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
>>>
>>>      And the first few lines resulting from dput(head(TF2list)):
>>>
>>> dput(head(TF2list))
>>> structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
>>> 82L), .Label = c("AT1G02065", "AT1G13960",
"AT1G18860", "AT1G23380",
>>> "AT1G29280", "AT1G29860",
"AT1G30650", "AT1G55600", "AT1G62300",
>>> "AT1G62990", "AT1G64000",
"AT1G66550", "AT1G66560", "AT1G66600",
>>> "AT1G68150", "AT1G69310",
"AT1G69490", "AT1G69810", "AT1G70510", ...
>>>
>>> This is another way of looking at the first 4 entries (Regulator is
>>> tab-separated from hits):
>>>
>>> Regulator
>>>     hits
>>> 1
>>> AT1G69490
>>>   
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
>>> 2
>>> AT1G29860
>>>   
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
>>>
>>> 3
>>> AT1G2986
>>>   
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
>>>
>>>      So, the goal would be to
>>>
>>> first: Transpose the existing dataframe so that the factor
Regulator
>>> becomes a column name (column 1 name = AT1G69490, column2 name
>>> AT1G29860, etc.) and the hits associated with each Regulator become
>>> rows. Hits is a comma separated 'list' ( I do not not know
if
>>> technically it is an R list.), so it would have to be comma
>>> 'unseparated' with each entry becoming a row (col 1 row 1 =
AT4G31950,
>>> col 1 row 2 - AT5G24410, etc); like this :
>>>
>>> AT1G69490
>>> AT4G31950
>>> AT5G24110
>>> AT1G05675
>>> AT5G64905
>>>
>>> ... I did not include all the rows)
>>>
>>> I think it would be best to actually make the first entry a
separate
>>> dataframe ( 1 column with name = AT1G69490 and number of rows
depending
>>> on the number of hits), then make the second column (column name
>>> AT1G29860, and number of rows depending on the number of hits) into
a
>>> new dataframe and do a full join of of the two dataframes; continue
by
>>> making the third column (column name = AT1G2986) into a dataframe
and
>>> full join it with the previous; continue for the 152 observations
so
>>> that then end result is a dataframe with 152 columns and number of
rows
>>> depending on the entry with the greatest number of hits. The full
joins
>>> I can do with dplyr, but getting up to that point seems rather
difficult.
>>>
>>> This would get me what my ultimate goal would be; each Regulator is
a
>>> column name (152 columns) and a given row has either NA or the same
hit.
>>>
>>>      This seems very difficult to me, but I appreciate any attempt.
>>>
>>> Matthew
>>>
>>> On 4/30/2019 4:34 PM, David L Carlson wrote:
>>>>           External Email - Use Caution
>>>>
>>>> I think we need more information. Can you give us the structure
of the data with str(YourDataFrame). Alternatively you could copy a small piece
into your email message by copying and pasting the results of the following
code:
>>>>
>>>> dput(head(YourDataFrame))
>>>>
>>>> The data frame you present could not be a data frame since you
say "hits" is a factor with a variable number of elements. If each
value of "hits" was a single character string, it would only have 2
factor levels not 6 and your efforts to parse the string would make more sense.
Transposing to a data frame would only be possible if each column was padded
with NAs to make them equal in length. Since your example tries use the name
TF2list, it is possible that you do not have a data frame but a list and you
have no factor levels, just character vectors.
>>>>
>>>> If you are not familiar with R, it may be helpful to tell us
what your overall goal is rather than an intermediate step. Very likely R can
easily handle what you want by doing things a different way.
>>>>
>>>> ----------------------------------------
>>>> David L Carlson
>>>> Department of Anthropology
>>>> Texas A&M University
>>>> College Station, TX 77843-4352
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: R-help<r-help-bounces at r-project.org>  On Behalf
Of Matthew
>>>> Sent: Tuesday, April 30, 2019 2:25 PM
>>>> To: r-help (r-help at r-project.org)<r-help at
r-project.org>
>>>> Subject: [R] transpose and split dataframe
>>>>
>>>> I have a data frame that is a lot bigger but for simplicity
sake we can
>>>> say it looks like this:
>>>>
>>>> Regulator    hits
>>>> AT1G69490    AT4G31950,AT5G24110,AT1G26380,AT1G05675
>>>> AT2G55980    AT2G85403,AT4G89223
>>>>
>>>>       In other words:
>>>>
>>>> data.frame : 2 obs. of 2 variables
>>>> $Regulator: Factor w/ 2 levels
>>>> $hits         : Factor w/ 6 levels
>>>>
>>>>      I want to transpose it so that Regulator is now the column
headings
>>>> and each of the AGI numbers now separated by commas is a row.
So,
>>>> AT1G69490 is now the header of the first column and AT4G31950
is row 1
>>>> of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is
header of
>>>> column 2 and AT2G85403 is row 1 of column 2, etc.
>>>>
>>>>      I have tried playing around with strsplit(TF2list[2:2])
and
>>>> strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
>>>>
>>>> Matthew
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org  mailing list -- To UNSUBSCRIBE and
more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting
guidehttp://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible
code.
>>>          [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
	[[alternative HTML version deleted]]

R help - May 2019 - Fwd: Re: transpose and split dataframe

[R] Fwd: Re: transpose and split dataframe

[R] Fwd: Re: transpose and split dataframe

[R] Fwd: Re: transpose and split dataframe