Here is how I processed the files you sent. I used XLConnect to access
EXCEL.
> require(XLConnect)
> # I happen to use XLConnect for accessing EXCEL
> miRNA <- readWorksheetFromFile("/temp/miRNA.xls", 1) # read
in the
spreadsheets> genes <- readWorksheetFromFile("/temp/gene.xls", 1)
> str(miRNA) # see what the structures are so I know the names
'data.frame': 8 obs. of 2 variables:
$ MiRNA.name : chr "mir256" "mir785" "mir415"
"mir1245" ...
$ position.start: num 458965 896 9689877 11525478 11689689
...> str(genes)
'data.frame': 10 obs. of 4 variables:
$ gene.name : chr "AA" "AB" "AC"
"AD" ...
$ chromosome : num 1 1 2 4 8 9 8 11 12 13
$ positon.Start: num 251 4256 245226 449623 8214587 ...
$ position.end : num 2586 8922 247899 490256 8358964
...>
> require(sqldf)
> # the use of backquotes (`) is because the names include a period (.)
that is special to SQL> matches <- sqldf("
+ select m.*, g.*
+ from miRNA as m
+ join genes as g
+ on m.`position.start` between g.`positon.Start` and
g.`position.end`
+ ")>
> matches
MiRNA.name position.start gene.name chromosome positon.Start position.end
1 mir256 458965 AD 4 449623 490256
2 mir785 896 AA 1 251 2586
3 mir415 9689877 AF 9 9545878 9896698
4 mir1245 11525478 AH 11 11458789 11895251
5 mir96 11689689 AH 11 11458789 11895251
6 mir145 5890 AB 1 4256 8922
7 mir785 12247847 AI 12 12047896 12365478
8 mir895 246789 AC 2 245226
247899>
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Sun, Mar 15, 2015 at 6:29 PM, cyrille sage <
cyrille.laurent.sage at gmail.com> wrote:
> HI Jim
>
> thanks a lot for your answer
>
> I have some question
>
> I attached two excel files name gene and miRNA as an example of what kind
> of data I will have to deal with. this two files are just example.
>
> I was thinking that I will have first to load the file ine R using
>
> Library(gdata)
> gene<-read.xls("path:/gene.xls")
> mir<-read.xls("path:/miRNA.xls")
>
> so now I have two list (gene and miRNA) in R
>
> but I do not know how to call specifically the column (3 and 4) in the
> gene file and the column 2 for the MiRNA list for the comparison and to do
> the comparison according to the number of miRNA in my list.
>
> for example can I do this
>
> start<-sample(gene[3, ],length (gene [3]))
>
> I told you i am a biginner
>
> Papy
>
> On Sun, Mar 15, 2015 at 5:16 PM, jim holtman <jholtman at gmail.com>
wrote:
>
>> You didn't provide any test data, so I made some up with the sizes
you
>> gave. This uses the 'sqldf' package and took about 2 minutes
to come up
>> with the matches.
>>
>> > n <- 200000
>> > mi <- 4500
>> > start <- sample(n * 10, n) # start times
>> > int <- sample(1000, n, TRUE) # interval between start and end
>> > genes <- data.frame(gene = paste0('gene', 1:n)
>> + , start = start
>> + , end = start + int
>> + , stringsAsFactors = FALSE
>> + )
>> > miRNA <- data.frame(name = paste0('mi', 1:mi)
>> + , pos = sample(n * 9, mi)
>> + , stringsAsFactors = FALSE
>> + )
>> > require(sqldf)
>> Loading required package: sqldf
>> Loading required package: gsubfn
>> Loading required package: proto
>> Loading required package: RSQLite
>> Loading required package: DBI
>> > matches <- sqldf("
>> + select m.*, g.*
>> + from miRNA as m
>> + join genes as g
>> + on m.pos between g.start and g.end
>> + ")
>> Loading required package: tcltk
>> >
>> > str(matches)
>> 'data.frame': 225045 obs. of 5 variables:
>> $ name : chr "mi1" "mi1" "mi1"
"mi1" ...
>> $ pos : int 279341 279341 279341 279341 279341 279341 279341 279341
>> 279341 279341 ...
>> $ gene : chr "gene3133" "gene14326"
"gene14997" "gene17652" ...
>> $ start: int 279000 278623 279157 279296 278379 279055 279180 279273
>> 278938 278960 ...
>> $ end : int 279924 279444 280150 279930 279347 279861 279782 280268
>> 279791 279796 ...
>> > head(matches)
>> name pos gene start end
>> 1 mi1 279341 gene3133 279000 279924
>> 2 mi1 279341 gene14326 278623 279444
>> 3 mi1 279341 gene14997 279157 280150
>> 4 mi1 279341 gene17652 279296 279930
>> 5 mi1 279341 gene21208 278379 279347
>> 6 mi1 279341 gene30889 279055 279861
>>
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Sun, Mar 15, 2015 at 3:41 PM, Papysounours <
>> Cyrille.laurent.sage at gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am just starting R programming because i need it to analyse new
>>> sequencing
>>> data. I got two list of data (excel table) one is gene list with
>>> chromosomal
>>> position (like start:123456 end:124567), the other is miRNA list
with
>>> only
>>> one position (like 123789).
>>> In the first liste i have around 20000 row (meaning 20000 gene
name to
>>> compare to) and for the second around 4500 row (4500 miRNA).
>>> I want to compare the position of each individual miRNA position (
>>> genestart<=miRNA<=geneend ) to the entire list of gene in
order to get
>>> in a
>>> new table the name of the miRNA (first colum of the miRNA list) and
the
>>> name
>>> of the gene (first colum of the gene list) related to the miRNA.
>>> Hope thisis not to much to ask.
>>> Papy
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/R-beginner-tp4704684.html
>>> Sent from the datatable-help mailing list archive at Nabble.com.
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>>
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>
>>
>
[[alternative HTML version deleted]]