thr3ads.net - R help - [R] Data-mining using R [May 2003]


Fernando Henrique Ferraz Pereira da Rosa

2003-May-09 00:35 UTC

[R] Data-mining using R

Is it possible to use R as a data-mining tool? Here's the problem I've
got. I have a couple of data sets consisting of results from a cDNA
microarray experiment - the details about the biology don't really matter
here; the same theory applies to any other data-mining task (that's why I
thought it'd be more appropriate to post this on r-user). Each of these
datasets consists of about 30000 rows by 20 to 30 columns. Let's say that
each row represents (very roughly speaking) a gene, and the columns are
details about its level of expression, reliability of the measurement,
coordinates and so on.
      The main objective here is to identify some genes (rows) according to
some criteria. To do that, I want to be able to selectively filter the rows,
graph some convenient variables, do some further filtering and so on.
      Let me take a more concrete example to make myself clear. Let's say
that I load a given dataset into a data frame, namely expr1. This data frame
would have the fields expr1$name, expr1$expression, expr1$reliability,
expr1$x, expr1$y and so on, containing, for instance, 26000 rows. Now from
these 26000 I'd like to select only those satisfying expr1$expression > 2000
and expr1$reliability == 100, and plot expr1$x against expr1$y for them. I'd
then have a reduced version of the first dataset. Let's say now that I want
to narrow my filter even more, selecting only (among the ones I have already
selected) the ones where expr1$x > 20.
      This would be done many times and in different orders. I'd like to be
able to, among those 26000 rows, take only the 100 with the greatest values
of expr1$x. And so on, many times, until I find a set of suitable rows.
      What is the proper way to do that using R, if any? I've played a
little with data frames (I could for instance use: expr1$name[expr1$x > 20]
to get the names of those genes whose x > 20) but it seemed a little clumsy.
Should I keep trying to manipulate the data frame directly, or perhaps should
I save it in a MySQL database and do the queries using RMySQL? Or maybe there
is a better option?
      I know that the things I've described are pretty easy to do using,
for instance, M$ Excel (I've seen them working on it). You just select
drop-down menus and filter the rows to your liking. But I really would like
to be able to accomplish this task using R and other open source tools like
MySQL, Perl, etc.
      

Thank you in advance,

--

A.J. Rossini

2003-May-09 02:27 UTC


[R] Data-mining using R

See www.bioconductor.org for one reasonably full featured approach.

There are others (Rmaanova, etc, etc).


-- 
A.J. Rossini rossini at u.washington.edu http://software.biostat.washington.edu/
Biostatistics, U Washington and Fred Hutchinson Cancer Research Center


Adaikalavan Ramasamy

2003-May-09 03:00 UTC


[R] Data-mining using R

Yes, all of this is possible in R, and more.

You might find the which() command helpful for subsetting. You could
write a simple function to automate this. For graphing facilities, see
plot(), par(), postscript() etc.
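
For illustration, here is a minimal sketch of that suggestion. The data
frame below is simulated (the column names follow the original message, the
values are made up), and filter_and_plot() is a hypothetical helper name,
not an existing R function.

```r
# Simulated stand-in for the poster's data frame; values are invented.
set.seed(1)
n <- 26000
expr1 <- data.frame(
  name        = paste0("gene", seq_len(n)),
  expression  = runif(n, 0, 5000),
  reliability = sample(c(50, 75, 100), n, replace = TRUE),
  x           = runif(n, 0, 100),
  y           = runif(n, 0, 100),
  stringsAsFactors = FALSE
)

# which() returns the indices of the rows where the condition is TRUE.
keep <- which(expr1$expression > 2000 & expr1$reliability == 100)
sel1 <- expr1[keep, ]

# Hypothetical helper: apply the same filter and plot x against y.
filter_and_plot <- function(df, min_expression, reliability) {
  idx <- which(df$expression > min_expression & df$reliability == reliability)
  out <- df[idx, ]
  plot(out$x, out$y, xlab = "x", ylab = "y",
       main = paste(nrow(out), "rows selected"))
  invisible(out)  # return the filtered rows without printing them
}

sel2 <- filter_and_plot(expr1, 2000, 100)
```

Because the helper returns the filtered rows, its result can be passed on
to a further filtering step or to another plot.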

In my opinion, it might not be worth the effort and time to save it to
MySQL if you only want to perform a couple of queries. Plus R has
excellent graphing facilities. If you really want to automate the
process, then Perl together with gnuplot seems like a good
combination. The choice depends on which software you are most
comfortable with.

Another advantage R has is that it is an interactive language. So it is
great for exploratory analysis with minimum effort (unlike Excel in
which you spend 90% of your time dragging the mouse and sorting the
data).
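
As a rough sketch of what such an interactive session could look like for
the steps in the original question (the data are simulated, and the object
names step1, step2 and top100 are made up for this example):

```r
# Small simulated data frame with the columns named in the question.
set.seed(2)
n <- 1000
expr1 <- data.frame(
  name        = paste0("gene", seq_len(n)),
  expression  = runif(n, 0, 5000),
  reliability = sample(c(50, 100), n, replace = TRUE),
  x           = runif(n, 0, 100),
  y           = runif(n, 0, 100),
  stringsAsFactors = FALSE
)

# First filter: subset() evaluates the condition inside the data frame.
step1 <- subset(expr1, expression > 2000 & reliability == 100)

# Narrow the previous selection further.
step2 <- subset(step1, x > 20)

# The 100 rows with the greatest x: sort by x, descending, and keep the top.
top100 <- head(expr1[order(expr1$x, decreasing = TRUE), ], 100)

# Names of the selected genes.
top100$name
```

Each step keeps its result in a separate object, so the filters can be
applied in a different order or revisited without reloading the data.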

See the Bioconductor project, which focuses on genomic and expression
data and has many great functions specifically designed for microarrays
and similar data. I doubt you will be able to find such a vast collection
of tools for free anywhere else.

Good luck.

