Hello,
this is probably trivial, but I could not find this
particular snippet of code.

What I have:
my_dataframe (contains, say, 40k rows and 4 columns)
distances (vector of Euclidean distances between a
query vector and each row of my_dataframe)

What I do:
After scaling my_dataframe I calculate the distances,
order them, then extract the top five hits:
my_dataframe <- read.table("myDB.csv", header=FALSE, dec=".", sep=";", row.names=1)
#reads the whole file
scaled_DB <- scale(my_dataframe, center=FALSE)
#scales the values
require(hopach)
#loads the necessary R package
distances <- order(distancevector(scaled_DB, scaled_DB['query',], d="euclid"))
#calculates distances and orders the results from lowest
for(i in distances[1:5]) print(my_dataframe[i,])
#prints top five hits just for debugging

What I want to do:
1) create a small top_five data frame.
Sadly, this does not work:
for(i in distances[1:5]) top_five[i,] <- my_dataframe[i,]
2) once I have top_five, I would like to get the index
of my query entry, something along the lines of Python's
top_five.index('query_string')
3) possibly combine the values in distances with the row names
from my_dataframe:
row_1 distance_from_query1
row_2 distance_from_query2

Thank you very much for your help
Darek Kedra

Two missing things:

> distances
 [1] 13 14 10 11  2  4  6  1  3  9  8 12  7  5

#numbers correspond to rows in my_dataframe

> my_dataframe
                      V2         V3         V4         V5         V6
ENSP00000354687 35660.45 0.04794521 0.05479452 0.06849315 0.07534247
ENSP00000355046 38942.77 0.02967359 0.04451039 0.04451039 0.06824926
ENSP00000354499 57041.21 0.04700855 0.08760684 0.11965812 0.06196581

ENSP00000354687 etc. are row names.

I am trying to get the top five row names with the smallest
distances from a given vector, as calculated by
distancevector from hopach.

Darek Kedra

Hi!

> distances <- order(distancevector(scaled_DB, scaled_DB['query',],
> d="euclid"))

Just compute the distances WITHOUT ordering here. And then

> 1) create a small top_five frame

top = scaled_DB[rank(distances)<=5, ]

rank() is better for this than order() in case there are ties.

> 2) after I got top_five I would like to get the index
> of my query entry, something along Python's
> top_five.index('query_string')

You mean by row name?

which(row.names(scaled_DB)=='query_string')

But why would you need the index? If you want to get the respective
row, index by row name:

my_dataframe['query_string', ]

> 3) possibly combine values in distances with row names
> from my_dataframe:
> row_1 distance_from_query1
> row_2 distance_from_query2

The easiest way to store the distances along with the original names
and data would be to simply make distances a column in your data frame,
which is what I would have done to begin with. The entire procedure
would then look like this:

my_dataframe = read.table( ... )
scaled_DB <- scale(my_dataframe, center=FALSE)
scaled_DB$dist1 = distancevector(scaled_DB, scaled_DB['query1',], ...)
scaled_DB$dist2 = distancevector(scaled_DB, scaled_DB['query2',], ...)
scaled_DB$dist3 = distancevector(scaled_DB, scaled_DB['query3',], ...)
...
top1 = scaled_DB[rank(scaled_DB$dist1)<=5, ]
...

cu
Philipp

--
Dr. Philipp Pagel                            Tel.  +49-8161-71 2131
Dept. of Genome Oriented Bioinformatics      Fax.  +49-8161-71 2186
Technical University of Munich
Science Center Weihenstephan
85350 Freising, Germany

 and

Institute for Bioinformatics / MIPS          Tel.  +49-89-3187 3675
GSF - National Research Center               Fax.  +49-89-3187 3585
      for Environment and Health
Ingolstädter Landstrasse 1
85764 Neuherberg, Germany
http://mips.gsf.de/staff/pagel

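
A minimal sketch of the procedure described above, under some loudly stated assumptions: the data are mock values rather than myDB.csv, the row names ("query", "row_2", ...) are made up, and the Euclidean distance is computed with base R as a stand-in for hopach's distancevector, not a copy of it. One detail the sketch makes explicit is that scale() returns a matrix, so the result is converted back to a data frame before a column is added with `$`, and the distances are computed before that column exists.

```r
# Mock data standing in for myDB.csv (hypothetical values and row names).
set.seed(1)
my_dataframe <- data.frame(matrix(runif(14 * 5), nrow = 14, ncol = 5))
row.names(my_dataframe) <- c("query", paste0("row_", 2:14))

# scale() returns a numeric matrix, not a data frame.
scaled_mat <- scale(my_dataframe, center = FALSE)

# Base-R Euclidean distance from every row to the query row
# (an assumption: used here in place of hopach::distancevector).
query <- scaled_mat["query", ]
dist_to_query <- sqrt(rowSums(sweep(scaled_mat, 2, query)^2))

# Convert to a data frame so the distances can be stored as a column.
scaled_DB <- as.data.frame(scaled_mat)
scaled_DB$dist1 <- dist_to_query

# 1) the five rows closest to the query (the query itself has distance 0)
top_five <- scaled_DB[rank(scaled_DB$dist1) <= 5, ]
top_five <- top_five[order(top_five$dist1), ]

# 2) position of the query row inside top_five
query_index <- which(row.names(top_five) == "query")

# 3) row names paired with their distances
name_dist <- data.frame(row = row.names(scaled_DB),
                        distance = scaled_DB$dist1)
```

Because the query row is part of the data, it always appears among the top hits; dropping it first would give the five nearest other rows instead.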
Neuro LeSuperHéros
2006-Dec-03 16:16 UTC
[R] newbie: new_data_frame <- selected set of rows
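#Mock df creation
my_dataframe <- data.frame(matrix(runif(14*5), 14, 5))
row.names(my_dataframe) <- paste("ENSP", 1:14, sep="")
#treated here as raw distance values, one per row
distances <- c(13, 14, 10, 11, 2, 4, 6, 1, 3, 9, 8, 12, 7, 5)
#five rows with the smallest distances
head(my_dataframe[order(distances),], 5)
#if distances already holds the output of order(), use my_dataframe[distances[1:5],] instead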
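
A short sketch of the difference between indexing with raw distances and indexing with row numbers that order() has already produced, which also pairs row names with distances as asked in point 3 of the original question. The values in raw_dist are made up for illustration, and the names idx, top_a, top_b and hits are hypothetical.

```r
# Mock data frame with 14 named rows (hypothetical values).
set.seed(42)
my_dataframe <- data.frame(matrix(runif(14 * 5), 14, 5))
row.names(my_dataframe) <- paste("ENSP", 1:14, sep = "")

# Made-up raw distances, one per row (NOT the output of order()).
raw_dist <- c(0.9, 0.2, 1.4, 0.0, 2.1, 0.7, 1.1,
              0.3, 1.8, 0.5, 2.5, 1.2, 0.8, 1.6)

# order() returns row numbers sorted by increasing distance.
idx <- order(raw_dist)

# Case A: starting from raw distances, order them and keep the first five.
top_a <- head(my_dataframe[order(raw_dist), ], 5)

# Case B: starting from already ordered row numbers, index with them directly.
top_b <- my_dataframe[idx[1:5], ]

# Row names paired with their distances, nearest first.
hits <- data.frame(row = row.names(my_dataframe)[idx],
                   distance = raw_dist[idx])
```

Both cases select the same five rows; applying order() a second time to idx would not, since idx holds row numbers rather than distances.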
>From: Darek Kedra <darked90 at yahoo.com>
>To: r-help at stat.math.ethz.ch
>Subject: Re: [R] newbie: new_data_frame <- selected set of rows
>Date: Fri, 1 Dec 2006 14:52:25 -0800 (PST)
>
>Two missing things:
>
> >distances
> [1] 13 14 10 11 2 4 6 1 3 9 8 12 7 5
>
>#numbers correspond to rows in my_dataframe
>
> > my_dataframe
>                      V2         V3         V4         V5         V6
>ENSP00000354687 35660.45 0.04794521 0.05479452 0.06849315 0.07534247
>ENSP00000355046 38942.77 0.02967359 0.04451039 0.04451039 0.06824926
>ENSP00000354499 57041.21 0.04700855 0.08760684 0.11965812 0.06196581
>
>ENSP00000354687 etc are rownames.
>
>I am trying to get top five row names with smallest
>distances from a given vector as calculated by
>distancevector from hopach.
>
>Darek Kedra