Hello all,

I'm doing some data analysis using MARS. I have a matrix of 65 independent variables, from which I'm trying to predict 71 dependent variables. I have 900+ data points (so roughly a 900 x 136 matrix of data, plus a pass/fail label column in the CSV), which I randomly split into training and validation sets of ~450 points each.

Sometimes this works well and I get decent predictions. Quite often, though, MARS predicts wildly wrong values for the entire matrix of dependent variables. For example:

           y1            y2            y3            y4    ...
 -1248145.629   1272399.812   9687.904417  -17713.04301
 -1289951.702    1234426.24  -7355.868156  -17713.00275
 -1268022.079   1245287.516  -1169.938246  -17713.32342
 -1252243.171   1304869.002   19119.56255  -17713.32218
 -1275335.038     1241681.7  -3269.268145  -17713.12027
 -1251563.638   1299513.864   17509.25065  -17712.68469
          ...           ...           ...           ...

whereas the average values of these variables are actually more like:

  y1 = ~19.89
  y2 = ~33.64
  y3 = ~1.51
  y4 = ~1.52

I think it may be related to the distribution of my data: the vast majority of it (~850 of the 900+ points) lies very close to the average, while the remainder is scattered widely around the measurement space, often very far from the average. If I limit my training set to "good" points only, the model is good (<10% error). As I add the "bad" points to the training set, there is a certain number I can include, beyond which MARS predicts extremely wrong values like those above.

Is this a bug in the MARS implementation in the mda package, or a limitation of MARS itself when it is trained on data containing outliers? My code is shown below:

library(mda)

measurements <- read.table("clean_measurements.csv", header = TRUE,
                           colClasses = "numeric", sep = ",")

# Divide "good" and "bad" data points based on the 1/-1 label in column 138
selection <- which(measurements[, 138] == 1)
passing_Devices <- measurements[selection, 1:138]
failing_Devices <- measurements[-selection, 1:138]

# Number of passing/failing devices
num_Passing_Devices <- dim(passing_Devices)[1]
num_Failing_Devices <- dim(failing_Devices)[1]

# Draw random index vectors for an approximately 50/50 split ...
pass_Selection <- which(runif(num_Passing_Devices) > 0.5)
fail_Selection <- which(runif(num_Failing_Devices) > 0.5)

# ... which are then used to build training and validation sets, each
# containing ~50% of the "good" and ~50% of the "bad" data points
training_Set   <- rbind(passing_Devices[pass_Selection, ],
                        failing_Devices[fail_Selection, ])
validation_Set <- rbind(passing_Devices[-pass_Selection, ],
                        failing_Devices[-fail_Selection, ])

# Columns 2 to 66 are independent variables ...
x <- training_Set[, 2:66]
# ... and 67 to 137 are dependent
y <- training_Set[, 67:137]
model <- mars(x, y)

x_v <- validation_Set[, 2:66]
y_v <- validation_Set[, 67:137]
y_p <- predict(model, x_v)
percent_Error <- abs((y_p - y_v) / y_v)

Thanks in advance for any help or suggestions you might have, I appreciate it.

~Nate
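P.S. In case it helps to see exactly what I mean by adding the "bad" points a few at a time, here is a sketch of that experiment. The set.seed call, the step size of 10, and the median-absolute-error summary are illustrative choices for the sketch, not my exact script:

set.seed(1)  # fix the random draws so runs are comparable
for (n_bad in seq(0, num_Failing_Devices, by = 10)) {
    # train on the selected "good" points plus n_bad random "bad" points
    bad_idx <- sample(num_Failing_Devices, n_bad)
    x_tr <- rbind(passing_Devices[pass_Selection, 2:66],
                  failing_Devices[bad_idx, 2:66])
    y_tr <- rbind(passing_Devices[pass_Selection, 67:137],
                  failing_Devices[bad_idx, 67:137])
    fit <- mars(x_tr, y_tr)
    # evaluate on held-out "good" points only, so the error measure is stable
    y_hat <- predict(fit, passing_Devices[-pass_Selection, 2:66])
    err <- abs(as.matrix(y_hat) -
               as.matrix(passing_Devices[-pass_Selection, 67:137]))
    cat(n_bad, "bad points in training, median abs. error:", median(err), "\n")
}

The pattern I described above shows up as the error staying small up to some number of "bad" points and then jumping by several orders of magnitude.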
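P.P.S. My guess about the scattered points is that the predictions blow up when validation inputs fall outside the range MARS was trained on, since the fitted hinge functions keep growing linearly beyond the outermost knots. Here is the rough diagnostic I've been using to check that (plain base R, nothing from mda; the variable names just follow my script above):

xm  <- as.matrix(x)           # training inputs
xvm <- as.matrix(x_v)         # validation inputs
rng <- apply(xm, 2, range)    # row 1 = per-column min, row 2 = per-column max
outside <- sweep(xvm, 2, rng[1, ], "<") | sweep(xvm, 2, rng[2, ], ">")
# for each validation row, count in how many of the 65 columns it is out of range
table(rowSums(outside))

If the rows with the huge predicted values are also the ones that are out of range in many columns, that would point to extrapolation rather than a bug in the package.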