thr3ads.net - R help - [R] Removing funny characters from a column of a data frame [Aug 2011]

Bansal, Vikas

2011-Aug-07 19:57 UTC

[R] Removing funny characters from a column of a data frame

Dear all,

The 5th column of my data frame looks like this:

.$.$.$.$.$,$,$...,,,,,.,,.,,...,,,,.,,....,,,T...,,,,,,,,,,,.,,,,,....,,...,,
,..,,....,,,,,...,,,..,,......,,,,,,,....,,,.,,,,....,,...G.,,,,,,,,...,,,,,,.,,
,t.,,c,,.a.,,,.A,,,,....,,,.....,,,,..........,,,,,..,,,.,,,....,,,,,...,,,$....
.,,,,..,,,...,,,,,..,,,,,,.............$..,,,,,,...,,..,,$,...,,,,,,,....,,,,,,.
,,,,......,,,,.,,.......,.....,,,,,,.,,..,,...,,,,,.,......,.......,,....,,,,..,,
,,,,.........,,,,,.....,,,,...,,,.....,,.....,,......,....,,......,.,,..,,,,...,,
H.,,,..,,.....,,,,..,,,,,,,,,^~.^~.^\".^~.^~.^~.^~,^~,^~,^~,"  

I want to keep only A, a, C, c, G, g, T, t, dot and comma in this column.

For example, the first row should become:

.....,,...,,,,,.,,.,,...,,,,.,,....,,,T...,,,,,,,,,,,.,,,,,....,,...,,


Currently I am using this code:

df$V5 <-  apply(df, 1, function(x)
gsub("\\:|\\$|\\^|!|\\-|1|2|3|4|5|6|7|8|10|~|H", "",x[5]))

This use of gsub looks odd to me. The result is correct, but I want
something faster because the data is large. I want something like this:

delete everything except A, a, C, c, G, g, T, t, dot and comma.

Any suggestions, please?


        
Thanking you,
Warm Regards
Vikas Bansal
MSc Bioinformatics
King's College London

Joshua Wiley

2011-Aug-07 20:08 UTC

Hi Vikas,

You're overworking yourself here: gsub is vectorized!

df$V5 <- gsub("[^AaCcGgTt\\.,]", "", df$V5)

This will be *substantially* faster than looping (using apply) over
every row of your data frame, since you only care about the 5th column
anyway.  Also, I switched your regexp for a negated character class,
which removes every character that is not one of AaCcGgTt.,
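
As a small illustration of the point above, here is a sketch on a made-up
two-row data frame (the values in V1 and V5 are hypothetical, chosen only
to resemble the column in the question). It applies the same pattern once
to the whole column and once row by row with apply, and checks that both
give the same result.

```r
# Hypothetical two-row data frame; V5 mimics the column from the question.
df <- data.frame(
  V1 = c("chr1", "chr1"),
  V5 = c(".$.$,$...,,T..", "^~.^~,H.,,a.G$"),
  stringsAsFactors = FALSE
)

# gsub() is vectorized over its third argument, so the whole column is
# cleaned in a single call. [^...] matches any character NOT listed.
cleaned <- gsub("[^AaCcGgTt\\.,]", "", df$V5)

# The same pattern applied one row at a time, for comparison only.
# unname() drops any names that apply() or gsub() may carry over.
looped <- unname(apply(df, 1, function(x) gsub("[^AaCcGgTt\\.,]", "", x["V5"])))

print(cleaned)
print(identical(cleaned, looped))
```

Both versions should return the same strings; the vectorized call simply
avoids converting the whole data frame to a matrix and calling gsub once
per row.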

Cheers,

Josh

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
