Is it possible to use R as a data-mining tool? Here's the problem I've got. I have a couple of data sets consisting of results from a cDNA microarray experiment. The details of the biology don't really matter here; the same theory applies to any other data-mining task (that's why I thought it would be more appropriate to post this on r-user). Each of these data sets consists of about 30000 rows by 20 to 30 columns. Let's say that each row represents (very roughly speaking) a gene, and the columns are details about its level of expression, the reliability of the measurement, coordinates and so on.

The main objective here is to identify some genes (rows) according to some criteria. To do that, I want to be able to selectively filter the rows, graph some convenient variables, do some further filtering, and so on.

Let me take a more concrete example to make myself clear. Let's say that I load a given data set into a data frame, namely expr1. This data frame would have the fields expr1$name, expr1$expression, expr1$reliability, expr1$x, expr1$y and so on, containing, for instance, 26000 rows. Now from these 26000 rows I'd like to select only those satisfying expr1$expression > 2000 and expr1$reliability == 100, and plot expr1$x against expr1$y for them. I would then have a reduced version of the first data set. Let's say now that I want to narrow my filter even more, selecting (among the rows I have already selected) only those where expr1$x > 20.

This would be done many times and in different orders. I'd also like to be able to take, among those 26000 rows, only the 100 with the greatest values of expr1$x. And so on, many times, until I have found a set of suitable rows.

What is the proper way to do that using R, if any? I've played a little with data frames (I could, for instance, use expr1$name[expr1$x > 20] to get the names of those genes whose x > 20), but it seemed a little clumsy. Should I keep trying to manipulate the data frame directly, or should I perhaps save it in a MySQL database and do the queries using RMySQL? Or maybe there is a better option?

I know that the things I've described are pretty easy to do using, for instance, M$ Excel (I've seen people doing them with it). You just select drop-down menus and filter the rows to your liking. But I would really like to be able to accomplish this task using R and other open source tools like MySQL, Perl, etc.

Thank you in advance,
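The workflow described in the question above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data frame is filled with made-up random values, and only the column names (name, expression, reliability, x, y) come from the post.

```r
# Synthetic stand-in for the microarray table; all values are made up.
set.seed(1)
n <- 26000
expr1 <- data.frame(
  name        = paste0("gene", seq_len(n)),
  expression  = runif(n, min = 0, max = 5000),
  reliability = sample(c(50, 75, 100), n, replace = TRUE),
  x           = runif(n, min = 0, max = 40),
  y           = runif(n, min = 0, max = 40),
  stringsAsFactors = FALSE
)

# Step 1: keep rows with high expression and full reliability.
sel1 <- subset(expr1, expression > 2000 & reliability == 100)

# Step 2: narrow the previous selection further.
sel2 <- subset(sel1, x > 20)

# The 100 rows with the largest x, taken from the full table.
top100 <- head(expr1[order(expr1$x, decreasing = TRUE), ], 100)
```

Because each step returns an ordinary data frame, the filters can be applied in any order, and a call such as plot(sel2$x, sel2$y) would graph whichever selection is current.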
See www.bioconductor.org for one reasonably full-featured approach. There are others (Rmaanova, etc.).

--
A.J. Rossini
rossini at u.washington.edu
http://software.biostat.washington.edu/
Biostatistics, U Washington and Fred Hutchinson Cancer Research Center
Yes, all of this is possible in R, and more. You might find the which() command helpful for subsetting, and you could write a simple function to automate the filtering. For graphing facilities, see plot(), par(), postscript(), etc.

In my opinion, it might not be worth the time and effort to save the data to MySQL if you only want to perform a couple of queries. Plus, R has excellent graphing facilities. If you really want to automate the process, then Perl together with gnuplot seems like a good combination. The choice depends on which software you are most comfortable with.

Another advantage of R is that it is an interactive language, so it is great for exploratory analysis with minimal effort (unlike Excel, in which you spend 90% of your time dragging the mouse and sorting the data).

See the Bioconductor project, which focuses on genomic and expression data and has many great functions specifically designed for microarrays, etc. I doubt you will be able to find such a vast collection of tools for free.

Good luck.

-----Original Message-----
From: Fernando Henrique Ferraz Pereira da Rosa [mailto:mentus at gmx.de]
Sent: Friday, May 09, 2003 8:35 AM
To: r-help at stat.math.ethz.ch
Subject: [R] Data-mining using R
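The which() suggestion in the reply above, wrapped in "a simple function to automate this", might look something like the sketch below. The helper name filter_rows and its arguments are hypothetical, and the five-row table is made up purely for illustration; only the column names come from the original question.

```r
# Hypothetical helper: keep the rows where `keep` is TRUE and optionally
# plot the x/y coordinates of the rows that survive.
filter_rows <- function(df, keep, do_plot = interactive()) {
  idx <- which(keep)  # which() skips NA conditions instead of yielding NA rows
  out <- df[idx, , drop = FALSE]
  if (do_plot && nrow(out) > 0) {
    plot(out$x, out$y, xlab = "x", ylab = "y",
         main = paste(nrow(out), "rows selected"))
  }
  out
}

# Small made-up table using the column names from the question.
expr1 <- data.frame(
  name        = c("g1", "g2", "g3", "g4", "g5"),
  expression  = c(2500, 1800, 3100, NA, 4000),
  reliability = c(100, 100, 100, 100, 80),
  x           = c(25, 30, 10, 22, 35),
  y           = c(5, 7, 9, 11, 13),
  stringsAsFactors = FALSE
)

# Each call filters the result of the previous one.
step1 <- filter_rows(expr1, expr1$expression > 2000 & expr1$reliability == 100)
step2 <- filter_rows(step1, step1$x > 20)
```

One reason to prefer which() over bare logical indexing here is that a missing value in a condition (as for g4) would otherwise produce a row of NAs in the result. The do_plot default means a plot is drawn only in an interactive session.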