thr3ads.net - similar to: "read large amount of data"

Displaying 20 results from an estimated 1000 matches similar to: "read large amount of data"

2005 Jul 13

read.table

Hi, I have a question on read.table. I have a dataset with 273,000 lines and 195 columns. I used read.table to load the data into R:

trn <- read.table('train1.dat', header=F, sep='|', na.strings='.')

It takes forever. Then I ran it on 1/10 of the data (test) using read.table again, and this time it finished quickly. So, there might be something wrong in my data
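A minimal sketch of read.table settings that often help with large delimited files, not a diagnosis of the poster's file: stray quote or comment characters in the data can make parsing slow or wrong, and declaring column types avoids type guessing. The temporary file, its three columns, and the column classes below are assumptions made for illustration only.

```r
# Hypothetical stand-in for 'train1.dat': pipe-separated, '.' marks a missing value.
path <- tempfile(fileext = ".dat")
writeLines(c("1|2.5|a", "2|.|b's", "3|4.0|c#d"), path)

trn <- read.table(
  path,
  header = FALSE,
  sep = "|",
  na.strings = ".",
  quote = "",          # treat ' and " as ordinary characters
  comment.char = "",   # treat # as an ordinary character
  colClasses = c("integer", "numeric", "character"),  # skip type guessing
  nrows = 3            # known (or slightly overestimated) row count
)

str(trn)
```

If an unbalanced quote is the cause, comparing count.fields(path, sep = "|", quote = "") against the expected number of columns is one way to locate the offending lines.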

2005 Aug 12

need help

Hi, there: I think I need to rephrase my question, since last time I did not get any reply. I think the question is not that hard; probably I did not make it clear. I want to find cases like 35, 90, 330, 330, 335 among the rest, which look like 3, 3, 3, 3.2, 3.3 or 4, 4.4, 4.5, 4.6, 4.7, .... Basically, there is one (or more) big 'gap' in the cases I seek. Thanks, weiwei --
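One possible reading of the question, sketched under assumptions: treat each case as a numeric vector and flag it when the largest difference between neighbouring sorted values exceeds a cutoff. The list layout, the helper name largest_gap, and the threshold of 10 are all hypothetical.

```r
# Hypothetical cases taken from the numbers quoted in the post.
cases <- list(
  a = c(35, 90, 330, 330, 335),
  b = c(3, 3, 3, 3.2, 3.3),
  c = c(4, 4.4, 4.5, 4.6, 4.7)
)

# Largest difference between consecutive values after sorting.
largest_gap <- function(x) max(diff(sort(x)))

gap_threshold <- 10  # assumed cutoff; depends on the scale of the data
has_gap <- vapply(cases, function(x) largest_gap(x) > gap_threshold, logical(1))

names(cases)[has_gap]
```

An absolute cutoff depends on the scale of the data; dividing the largest gap by the median gap or by the range of each case would be a scale-free alternative.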

2005 Oct 11

a problem in random forest

Hi, there: I spent some time on this but I really cannot figure it out; maybe I missed something here. My data looks like this:

> dim(trn3)
[1] 7361 209
> dim(val3)
[1] 7427 209
> mg.rf2<-randomForest(x=trn3[,1:208], y=trn3[,209], data=trn3, xtest=val3[, 1:208], ytest=val3[,209], importance=T)

My test data has 7427 observations, but after prediction,

> dim(mg.rf2$votes)

2005 Oct 04

generalized linear model and missing handling

Hi, I have a dataset and want to build a generalized linear model on it. Unfortunately, complete.cases(df) returns null, which means I have to find a way to "fill" those missing values. One way, following my previous post, is to replace them with the median (or, for categorical variables, with the most frequent level), but I am wondering if there are other ways, when glm or something like it is

2006 Dec 12

Re : Re : implementation of t.test

Sorry, I made a mistake in my previous mail. Type stats:::t.test.default. The formal way is to use getAnywhere(t.test). Justin BEM, Student Statistician-Economist Engineer, BP 294 Yaoundé. Tel (00237)9597295. ----- Original message ---- From: justin bem <justin_bem@yahoo.fr> To: Weiwei Shi <helprhelp@gmail.com> Cc: R-help@stat.math.ethz.ch Sent: Tuesday, 12 December 2006,
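A short sketch of the two lookups mentioned in the reply, using only base R. The variable names ga and f are introduced here for illustration.

```r
# Direct access to the unexported method in the stats namespace.
f <- stats:::t.test.default

# getAnywhere() searches attached packages and loaded namespaces,
# and reports where each matching object was found.
ga <- getAnywhere("t.test.default")
ga$where

# The generic itself is exported, so it can be inspected by name.
methods(t.test)
```

The triple colon reaches into a namespace directly, while getAnywhere() is handy when the owning package is not known in advance.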

2005 Jul 08

"more" and "tab" functionalities in R under linux

Hi, forgive me if this is due to my "laziness" :) I am wondering if there are functionalities in R that work like "more" and "tab" in linux. more: more(one.data.frame), so I can browse through it. Sometimes I can use one.data.frame[1:100,], but it is still not as good as "more" in linux. tab: can I use tab to auto-complete a defined object name in R so I don't

2005 Oct 05

pca in dimension reduction

Hi, there: I am wondering if anyone here can provide an example of using pca for dimension reduction on a dataset. The dataset can be n*q (n>=q or n<=q). As for dimension reduction, are there implementations of other methods such as ICA, Isomap, Locally Linear Embedding... Thanks, weiwei -- Weiwei Shi, Ph.D "Did you always know?" "No, I did not. But I believed..." ---Matrix III

2005 Oct 11

an error in my using of nnet

Hi, there: I am trying nnet as follows:

> mg.nnet<-nnet(x=trn3[,r.v[1:100]], y=trn3[,209], size=5, decay = 5e-4, maxit = 200)
# weights: 511
initial value 13822.108453
iter 10 value 7408.169201
iter 20 value 7362.201934
iter 30 value 7361.669408
iter 40 value 7361.294379
iter 50 value 7361.045190
final value 7361.038121
converged
Error in y - tmp : non-numeric argument to binary operator

2005 Jul 25

cluster

Dear listers: I have a question on the clustering methods available in R. I am trying to down-sample the majority class in a classification problem on an imbalanced dataset. Since I don't want to lose information from the original dataset, I don't want to use naive down-sampling: I think using clustering on the majority class to select "representative" samples might

2005 Jul 07

randomForest

> From: Weiwei Shi
>
> it works.
> thanks,
>
> but: (just curious)
> why i tried previously and i got
>
> > is.vector(sample.size)
> [1] TRUE

Because a list is also a vector:

> a <- c(list(1), list(2))
> a
[[1]]
[1] 1

[[2]]
[1] 2

> is.vector(a)
[1] TRUE
> is.numeric(a)
[1] FALSE

Actually, the way I initialize a list of known length is by

2011 May 27

network package in R

Hi there, I need a network builder that can change the node size and color; I am not sure whether the network package in R can do this. The other functions I wanted have been found in that package. BTW, if there is another package in R relating to this, please suggest it too. Thanks, Weiwei -- Weiwei Shi, Ph.D Research Scientist "Did you always know?" "No, I did not. But I

2005 Aug 08

computationally singular

Hi, I have a dataset which has around 138 variables and 30,000 cases. I am trying to calculate a mahalanobis distance matrix for them, and my procedure is like this. Suppose my data is stored in mymatrix:

> S<-cov(mymatrix) # this is fine
> D<-sapply(1:nrow(mymatrix), function(i) mahalanobis(mymatrix, mymatrix[i,], S))
Error in solve.default(cov, ...) : system is computationally
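A hedged sketch of one way to check whether a covariance matrix is rank deficient, which is a common reason for solve() to fail inside mahalanobis. The small random matrix and its deliberately redundant fifth column are assumptions for illustration; nothing here is known about the poster's actual data.

```r
set.seed(1)

# Hypothetical data: four independent columns plus one exact linear combination.
m <- matrix(rnorm(200), ncol = 4)
m <- cbind(m, m[, 1] + m[, 2])

S <- cov(m)

# A rank below the number of columns means S cannot be inverted reliably.
rank_S <- qr(S)$rank
rank_S < ncol(S)
```

When the rank is deficient, dropping redundant variables or projecting onto principal components before computing distances are two common workarounds.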

2005 Dec 15

question on write.table

Hi, I have a question on write.table. I have a data.frame called t7 as below:

> dim(t7)
[1] 14015184 6
> t7[1:5,]
  uci uce par line graphical.forms stems
1   0   0   0    0          active activ
2   0   0   0    0          policy polici
3   0   0   0    0              wc    PC
4   0   0   0    0             eff   elf
5   0   0   0    0             icn   ICC

I want to write the

2011 Oct 24

heatmap for plotting categorical matrix

Hi there, I have a matrix like this:

> a4[1:20, 1:5]
           194 211 294 314 315
GO:0000003   1   1   1   1   1
GO:0000072   0   0   0   0   0
GO:0000076   1   0   0   0   0
GO:0000082   1   3   1   1   1
GO:0000083   1   0   0   0   1
GO:0000086   0   1   0   1   1
GO:0000114   0   0   0   0   0
GO:0000115   0   0   0   0   0
GO:0000117   0   0   0   0   0
GO:0000160   0   0   1   0   0

2005 Jun 03

factor vector manipulation

Hi, I have one question on factor vectors. I have 3 factor vectors:

a<-factor(c("1", "2", "3"))
b<-factor(c("a", "b", "c"))
c<-factor(c("b", "a", "c"))

What I want is like:

  c x
1 b 2
2 a 1
3 c 3

which means I use b as keys and vector a as values, and I find the values for c. I used the following
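One way to get the table described above, sketched with match(); this is not the poster's own (truncated) attempt. The names idx and result are introduced here for illustration.

```r
a <- factor(c("1", "2", "3"))
b <- factor(c("a", "b", "c"))
c <- factor(c("b", "a", "c"))  # shadows the name c, but c(...) calls still find the function

# Position of each element of c within the keys b, compared as character labels.
idx <- match(as.character(c), as.character(b))

# Use those positions to pick the corresponding values from a.
result <- data.frame(c = c, x = a[idx])
result
```

Comparing the character labels rather than the factors themselves keeps the lookup independent of how the factor levels happen to be ordered.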

2006 Jun 03

time series clustering

Dear Listers: I have a problem requiring time-series clustering, since the clusters will change with time (data that is too old needs to be removed as new data comes in). I am wondering if there is a paper or reference on this topic, and whether there is some kind of implementation in R. Thanks, Weiwei -- Weiwei Shi, Ph.D "Did you always know?" "No, I did not. But I

2009 Jul 22

margins defined in randomForest and supclust

Hi there, How can I resolve a conflict over the same object between two packages, for example, margins in both randomForest and supclust? When both libraries are installed, supclust will complain that "margins" is defined in randomForest. I can only solve it by restarting R, which is very inconvenient. Is there a clever way? Thanks, Weiwei -- Weiwei Shi, Ph.D Research Scientist GeneGO, Inc.

2006 Jan 09

Looking for packages to do Feature Selection and Classification

Hi, You should also check my msc.features.select from the caMassClass package. It has a feature selection algorithm that I found useful for mass-spectra data. It performs individual feature selection and/or removes highly correlated neighbor features. Jarek -----Original Message----- From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] Sent: Friday, January

2005 Aug 04

some thoughts on outlier detection, need help!

Dear listers: I have an idea for outlier detection and I need to use R to implement it first. I hope I can get some input from all the gurus here. I chose a distance-based approach:

step 1: calculate the distance between any two rows of a dataframe. Considering the scaling among different variables, I choose mahalanobis, using variance as the scaler.
step 2: Let k be the number of

2006 Apr 24

regression modeling

Hi, there: I am looking for a regression modeling approach (like regression trees) for a large-scale industry dataset. Any suggestion of a package, from R or from other sources, that has decent accuracy and scalability? Any recommendation from experience is highly appreciated. Thanks, Weiwei -- Weiwei Shi, Ph.D "Did you always know?" "No, I did not. But I believed..."
