Hi team,

I am new to R, so please help me with this task.

Please find the attached data sample. But in the original data frame I have 350 features and 400000 observations.

I need to carry out these tasks:

1. How to identify the features (names) that have all zeros?
2. How to remove the features that have all zeros from the dataset?
3. How to identify the features (names) that have outliers such as 99999 or -1 in the data frame?
4. How to remove the outliers?

Many thanks
> On Mar 30, 2016, at 3:56 PM, Norman Pat <normanmath1 at gmail.com> wrote:
>
> Hi team
>
> I am new to R so please help me to do this task.
>
> Please find the attached data sample.

No. Nothing attached. Please read the Rhelp Info page and the Posting Guide.

> But in the original data frame I
> have 350 features and 400000 observations.
>
> I need to carryout these tasks.

Who is assigning you this task? Homework? (Read the Posting Guide.)

> 1. How to Identify features (names) that have all zeros?

That's generally pretty simple if "names" refers to columns in a dataframe.

> 2. How to remove features that have all zeros from the dataset?

But maybe you mean to process by rows?

> 3. How to identify features (names) that have outliers such as 99999,-1 in
> the data frame.
>
> 4. How to remove outliers?

You could start by defining "outliers" in something other than vague examples. If this is data from a real-life data gathering effort, then defining outliers would start with an explanation of the context.

> Many thanks

Please at least do the following "homework".

> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA
I strongly suggest checking out some R tutorials. Most of these tasks are basic data management and are likely covered in just about any tutorial. I'm afraid that this isn't the appropriate forum for such basics.
Hi David,

On Thu, Mar 31, 2016 at 12:20 PM, David Winsemius <dwinsemius at comcast.net> wrote:

> > Please find the attached data sample.
>
> No. Nothing attached. Please read the Rhelp Info page and the Posting Guide.

I attached it. Anyway, I have attached it again (sample train.xlsx).

> Who is assigning you this task? Homework? (Read the Posting Guide.)

This is my new job role, so I have to do it. I know some basic R.

> > 1. How to Identify features (names) that have all zeros?
>
> That's generally pretty simple if "names" refers to columns in a data frame.

You mean something like names(data.nrow(means==0))?

> > 2. How to remove features that have all zeros from the dataset?
>
> But maybe you mean to process by rows?

In a column (feature).

> > 3. How to identify features (names) that have outliers such as 99999,-1 in
> > the data frame.

Please refer to the attached Excel file.

> > 4. How to remove outliers?
>
> You could start by defining "outliers" in something other than vague
> examples. If this is data from a real-life data gathering effort, then
> defining outliers would start with an explanation of the context.

By looking at the data I need to find the outliers.

Thanks
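The expression names(data.nrow(means==0)) quoted above is not valid R, since data.nrow is not a function. A minimal sketch of what it seems to be reaching for is given here, on a made-up toy data frame. The names dat, f1, f2, f3, is_all_zero, zero_features and dat_reduced are all invented for illustration, and treating a column that contains NA as "not all zero" is an assumption of this sketch.

```r
# Toy data; the column names are assumed for illustration only.
dat <- data.frame(f1 = c(0, 0, 0), f2 = c(1, 0, 2), f3 = c(0, 0, 0))

# TRUE for each numeric column whose values are all exactly zero.
# isTRUE() turns the NA that all() returns for columns with missing
# values into FALSE, so such columns are not reported as all zero.
is_all_zero <- vapply(
  dat,
  function(col) is.numeric(col) && isTRUE(all(col == 0)),
  logical(1)
)

# Task 1: the names of the all-zero features.
zero_features <- names(dat)[is_all_zero]

# Task 2: the data frame without them (drop = FALSE keeps a data frame
# even if only one column is left).
dat_reduced <- dat[, !is_all_zero, drop = FALSE]
```

Here zero_features holds "f1" and "f3", and dat_reduced keeps only f2. vapply() is used rather than sapply() so that the result is guaranteed to be a logical vector with one entry per column.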
Hi Norman,

To check whether all values of an object (say "x") fulfill a certain condition (==0):

    all(x==0)

If your object (X) is indeed a data frame, you can only do this by column, so if you want to get the results:

    X<-data.frame(A=c(0,1:10),B=c(0,2:10,99999),
     C=c(0,-1,3:11),D=rep(0,11))
    all_zeros<-function(x) return(all(x==0))
    which_cols<-unlist(lapply(X,all_zeros))

If your data frame (or a subset) contains all numeric values, you can finesse the problem like this:

    which_rows<-apply(as.matrix(X),1,all_zeros)

What you get is a list of logical (TRUE/FALSE) values from lapply, so it has to be unlisted to get a vector of logical values like you get with "apply".

You can then use that vector to index (subset) the original data frame by logically inverting it with ! (NOT):

    X[,!which_cols]
    X[!which_rows,]

Your "outliers" look suspiciously like missing values from certain statistical packages. If you know the values you are looking for, you can do something like:

    NA99999<-X==99999

and then "remove" them by replacing those values with NA:

    X[NA99999]<-NA

Be aware that all these hackles (diminutive of hacks) are pretty specific to this example. Also remember that if this is homework, your karma has just gone down the cosmic sinkhole.

Jim
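Jim's logical-matrix approach can be extended to answer the "which features" half of question 3. The following is only a sketch: it reuses his example data frame X and assumes that 99999 and -1 really are missing-value codes, which the thread never confirms. The names codes, has_code and X_clean are made up here.

```r
# Jim's example data frame from the message above.
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# Assumed sentinel codes; adjust to whatever the data documentation says.
codes <- c(99999, -1)

# TRUE for each column that contains at least one of the codes.
has_code <- vapply(X, function(col) any(col %in% codes), logical(1))
names(X)[has_code]

# Recode the sentinels to NA column by column. Assigning into X_clean[]
# keeps the result a data frame with the original column names.
X_clean <- X
X_clean[] <- lapply(X, function(col) replace(col, col %in% codes, NA))
```

With this X, the columns reported are B and C, and X_clean has two NA values in place of the codes. Working column by column with %in% avoids building one large logical matrix per code, which may matter with 350 features and 400000 observations.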
Hi Jim,

Thanks for your reply. I know this basic stuff in R.

But what I want to know is this: let's say you have a data frame X with 300 features. From those 300 features I need to pull out the name of each feature that has zero values for all the observations in the sample. I am looking for a package or a function to do that.

And how do I know whether there are abnormal values in each feature? Let's say I have 300 features and 100000 observations. It is hard to look at everything in the Excel file, so I am looking for a package that does the work.

I hope you understand.

Thanks a lot

Cheers
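No special package appears to be needed for this kind of screening; base R can produce one row of statistics per feature, which is easier to scan than the raw spreadsheet. The sketch below is not something proposed in the thread. The helper name screen_columns is hypothetical, and using Tukey's 1.5 * IQR fences as the rule for "abnormal" values is an assumption that may not suit the real data.

```r
# Jim's example data frame, reused as test input.
X <- data.frame(A = c(0, 1:10), B = c(0, 2:10, 99999),
                C = c(0, -1, 3:11), D = rep(0, 11))

# Hypothetical helper: one row of screening statistics per numeric column.
screen_columns <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  one_column <- function(col) {
    q <- quantile(col, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    fence <- 1.5 * (q[2] - q[1])          # assumed rule: Tukey's fences
    c(all_zero  = isTRUE(all(col == 0)),  # stored as 1 (TRUE) or 0 (FALSE)
      n_na      = sum(is.na(col)),
      min       = min(col, na.rm = TRUE),
      max       = max(col, na.rm = TRUE),
      n_outside = sum(col < q[1] - fence | col > q[2] + fence, na.rm = TRUE))
  }
  template <- c(all_zero = 0, n_na = 0, min = 0, max = 0, n_outside = 0)
  stats <- t(vapply(num, one_column, template))
  data.frame(feature = names(num), stats, row.names = NULL)
}

res <- screen_columns(X)
res[res$all_zero == 1 | res$n_outside > 0, ]
```

On this example the filtered table lists B (one value outside the fences, maximum 99999) and D (all zero). Note that the -1 in column C is not flagged, because it lies within the fences. Codes like that have to be identified from the context of the data, as David pointed out, rather than by a generic rule.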