Bansal, Vikas
2011-Aug-07 19:57 UTC
[R] Removing funny characters from a column of a data frame
Dear all,

The 5th column of my data frame looks like this:

.$.$.$.$.$,$,$...,,,,,.,,.,,...,,,,.,,....,,,T...,,,,,,,,,,,.,,,,,....,,...,,
,..,,....,,,,,...,,,..,,......,,,,,,,....,,,.,,,,....,,...G.,,,,,,,,...,,,,,,.,,
,t.,,c,,.a.,,,.A,,,,....,,,.....,,,,..........,,,,,..,,,.,,,....,,,,,...,,,$....
.,,,,..,,,...,,,,,..,,,,,,.............$..,,,,,,...,,..,,$,...,,,,,,,....,,,,,,.
,,,,......,,,,.,,.......,.....,,,,,,.,,..,,...,,,,,.,......,.......,,....,,,,..,,
,,,,.........,,,,,.....,,,,...,,,.....,,.....,,......,....,,......,.,,..,,,,...,,
H.,,,..,,.....,,,,..,,,,,,,,,^~.^~.^\".^~.^~.^~.^~,^~,^~,^~,"

I just want to keep A, a, C, c, G, g, T, t, dot and comma in this column.

For example, the first row should become:

.....,,...,,,,,.,,.,,...,,,,.,,....,,,T...,,,,,,,,,,,.,,,,,....,,...,,

Currently I am using this code:

    df$V5 <- apply(df, 1, function(x) gsub("\\:|\\$|\\^|!|\\-|1|2|3|4|5|6|7|8|10|~|H", "", x[5]))

This use of gsub looks odd to me. The result is correct, but I want something faster because the data is large. I want something like this: delete everything except A, a, C, c, G, g, T, t, dot and comma.

Any suggestions, please?

Thanking you,
Warm Regards
Vikas Bansal
MSc Bioinformatics
King's College London
Joshua Wiley
2011-Aug-07 20:08 UTC
[R] Removing funny characters from a column of a data frame
Hi Vikas,

You're overworking yourself here: gsub() is vectorized!

    df$V5 <- gsub("[^AaCcGgTt\\.,]", "", df$V5)

This will be *substantially* faster than looping (using apply) over every row of your data frame, since you only care about the 5th column anyway. Also, I switched your regexp for one that removes every character that is not one of AaCcGgTt, a dot, or a comma.

Cheers,

Josh
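To make the negated character class concrete, here is a minimal sketch on a toy data frame. The data frame, its three short strings, and the name `cleaned` are made up for illustration and are not from the original post. The pattern is written without the backslash before the dot, on the understanding that a dot is literal inside a bracket expression; the escaped form above should behave the same way.

```r
# Hypothetical toy data; V5 imitates the kind of strings shown in the question.
df <- data.frame(
  V1 = 1:3,
  V5 = c(".$.$,$T^~.", "H.,,a^\".", "t.,,c,,.a.,,,.A"),
  stringsAsFactors = FALSE
)

# Negated character class: remove every character that is NOT one of
# A, C, G, T (upper or lower case), a dot, or a comma.
# gsub() is vectorized over its third argument, so the whole column is
# processed in a single call, with no apply() loop over rows.
cleaned <- gsub("[^ACGTacgt.,]", "", df$V5)

print(cleaned)
```

The third string contains only allowed characters, so it should come back unchanged, while `$`, `^`, `~`, `"` and `H` are dropped from the first two.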
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/