thr3ads.net - R help - [R] Comparing two columns in an Excel file [Aug 2013]

arun

2013-Aug-14 13:57 UTC

[R] Comparing two columns in an Excel file

Hi,
Try:
set.seed(42)
dat1<- as.data.frame(matrix(sample(LETTERS,2*1e6,replace=TRUE),ncol=2))
dat2<- dat1[1:1e5,]
dat3<- dat1
library(data.table)
dt1<- data.table(dat1)
system.time(dat1$sat<- 1*(dat1[,1]==dat1[,2]))
#   user  system elapsed
#  0.148   0.004   0.152
library(car)
system.time({dat3$sat<- 1*(dat3[,1]==dat3[,2])
    dat3$sat<- recode(dat3$sat,'0="no flag";1="flag"')})
#   user  system elapsed
#  1.140   0.000   1.137

head(dat3)
#  V1 V2     sat
#1  X  M no flag
#2  Y  K no flag
#3  H  W no flag
#4  V  E no flag
#5  Q  N no flag
#6  N  K no flag

#or
system.time(dt1[,sat:=1*(V1==V2)])
#   user  system elapsed
#  0.104   0.000   0.103
identical(as.data.frame(dt1),dat1)
#[1] TRUE


#your method on a subset of dat1.
na1<- nrow(dat2)
sat <- c(rep(0,na1)) 
dat2 <- cbind(dat2,sat) 
system.time({
for(i in c(1:na1)){
  if( dat2[i,1] == dat2[i,2]) {
      dat2[i,3] <- 1
    }
  }
})
#   user  system elapsed
# 18.756   0.000  18.792
identical(dat2,dat1[1:1e5,])
#[1] TRUE
A.K.
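
A possible shortcut, not part of the timings above: base R's ifelse() can write the labels directly, so car::recode is not strictly needed for a two-way flag. This is a minimal sketch on made-up data. The data frame `aa` and its column names `ref` and `alt` are hypothetical, and treating a missing value as "no flag" is an assumption about what is wanted.

```r
# Hypothetical toy data standing in for the two amino-acid columns.
aa <- data.frame(ref = c("S", "A", "K", NA),
                 alt = c("T", "L", "K", "K"),
                 stringsAsFactors = FALSE)

# Vectorized comparison over whole columns; NA where either value is missing.
same <- aa$ref == aa$alt

# Map TRUE to "flag" and everything else (FALSE or NA) to "no flag".
# Assumption: a row with a missing value should not be flagged.
aa$sat <- ifelse(!is.na(same) & same, "flag", "no flag")

print(aa)
```

If 0/1 is preferred over text labels, `as.integer(!is.na(same) & same)` should give the same flags as numbers.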



Hi, 

I have received NGS (next generation sequencing) data in an Excel file and would
like to flag rows with synonymous mutations.

The Excel file has 48 columns and my columns of interest are the 28th and 31st.

The 28th and 31st columns each contain a one-letter amino acid code, and I'd like
to flag a row if both columns have the same letter.

Below is an example 

28th column    31st column    sat
S              T              no flag
A              L              no flag
K              K              flag

Here is the code I made and please don't laugh at it. I just started R two
weeks ago.

#_______________________________________ 
a1 <- read.csv(file.choose(),header=TRUE) 

na1 <- nrow(a1) 

sat <- c(rep(0,na1)) 
a1[,28] <- as.character(a1[,28]) 
a1[,31] <- as.character(a1[,31]) 
a1 <- cbind(a1,sat) 


for(i in c(1:na1)){ 
  if( a1[i,28] == a1[i,31]) {
      a1[i,49] <- 1
    }
  }

write.csv(a1,file.choose(), row.names = FALSE) 
#_______________________________________ 

I test-ran this code with a test Excel file with 30 rows without any problem.

But a problem arose when I ran this code with an NGS Excel file with more than
80,000 rows. It ran forever.

Does anybody know how to shorten the running time? 

Any input would be appreciated. 

Thanks. 


SY
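
For reference, the script above can be sketched in vectorized form as follows. The 48-column data frame is simulated and the temporary file paths are placeholders, since the real NGS file is not available. The per-row loop is likely slow because each a1[i,49] <- 1 is a separate data-frame assignment, whereas a single == over whole columns does the comparison in one call.

```r
# Simulated stand-in for the 48-column NGS export (assumption: the real
# file has one-letter codes in columns 28 and 31, as described above).
set.seed(1)
n <- 1000
sim <- as.data.frame(matrix(sample(LETTERS, 48 * n, replace = TRUE), ncol = 48),
                     stringsAsFactors = FALSE)

# Placeholder paths; the original script used file.choose() instead.
in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
write.csv(sim, in_file, row.names = FALSE)

# Read the file, keeping text columns as character rather than factor.
a1 <- read.csv(in_file, header = TRUE, stringsAsFactors = FALSE)

# One vectorized comparison replaces the whole for-loop:
# sat is 1 where columns 28 and 31 match, 0 otherwise.
a1$sat <- 1 * (as.character(a1[, 28]) == as.character(a1[, 31]))

write.csv(a1, out_file, row.names = FALSE)
```

The new `sat` column lands in position 49, matching the a1[i,49] index used in the loop.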
