thr3ads.net - R help - [R] Statistical analysis of olive dataset [Mar 2016]

Axel

2016-Mar-12 17:39 UTC

[R] Statistical analysis of olive dataset

Hi to all the members of the list!

I am a novice at statistical analysis and at using R, so I am experimenting
with the dataset "olive" included in the package "tourr".
This dataset contains the results of the determination of the fatty acids in
572 samples of olive oil from Italy (columns 3 to 10), along with the area
and the region of origin of the oil (respectively, column 1 and column 2).

The main goal of my analysis is to determine which fatty acids characterize
the origin of an oil. As a secondary goal, I would like to insert the results
of the chemical analysis of an oil that I analyzed (I am a Chemistry student)
in order to determine its region of production. I do not know if this last
thing is possible.

I am using R 3.2.4 on Mac OS X El Capitan with the packages "tourr" and
"psych" loaded.
Here are the commands I have used up to now:

olivenum <- olive[,c(3:10)]
mean <- colMeans(olivenum)
sd <- sapply(olivenum,sd)
describeBy(olivenum,olive[2])
pairs(olivenum)
R <- cor(olivenum)
eigen(R)
# Since the first three eigenvalues are greater than 1, these are the main
# components (columns 1, 2 and 3). But I can also determine them using a
# scree diagram as follows. Right?

autoval <- eigen(R)$values
autovec <- eigen(R)$vectors
pvarsp <- autoval/ncol(olivenum)
plot(autoval,type="b",main="Scree diagram",xlab="Number of components",ylab="Eigenvalues")
abline(h=1,lwd=3,col="red")

eigen(R)$vectors[,1:3]
olive.scale <- scale(olivenum,T,T)
points <- olive.scale%*%autovec[,1:3]


# Since I selected three main components (three columns), how should I plot
# the dispersion graph? I do not think that what I have done is right:
plot(points,main="Dispersion graph",xlab="Component 1",ylab="Component 2")
princomp(olivenum,cor=T)
# With the following command I obtain a summary of the importance of the
# components. For example, the proportion of variance of component 1 is about
# 0.465, of component 2 is 0.220 and of component 3 is 0.127, with a
# cumulative proportion of 0.812. This means that the values in the first
# three columns of the matrix "olivenum" mostly characterize the differences
# between the observations. Right?
summary(princomp(olivenum,cor=T))
screeplot(princomp(olivenum,cor=T))

plot(princomp(olivenum,cor=T)$scores,rownames(olivenum))
abline(h=0,v=0)

I determined that three components can explain a great part of the
variability, but I don't know which these components are. How should I
continue?

Thank you for your attention,
Axel

Bert Gunter

2016-Mar-13 04:49 UTC

[R] Statistical analysis of olive dataset

Inline.

Cheers,
Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Mar 12, 2016 at 9:39 AM, Axel <axeldibert at alice.it> wrote:
> Hi to all the members of the list!
>
> I am a novice as regards to statistical
> analysis and the use of the R software, so I am experimenting with the
> dataset "olive" included in the package "tourr".

Stop experimenting and spend time with an R tutorial or two? There are
many good ones on the Web. See also
https://www.rstudio.com/online-learning/#R  for some recommendations.




Jim Lemon

2016-Mar-13 07:22 UTC

[R] Statistical analysis of olive dataset

Hi Axel,
It seems to me that cluster analysis could be what you are seeking.
Identify the clusters of different combinations of fatty acids in the
oils. Do they correspond to location? If so, is there a method to
predict the cluster membership of a new set of measurements? Have a
look at the cluster package, which you should have.
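
A minimal sketch of that suggestion, assuming the tourr package is installed and that the data frame has a column named region (as used later in this thread). The helper name cluster_vs_origin is made up for illustration, and pam() is only one of several clustering functions in the cluster package that could be tried here.

```r
library(cluster)  # recommended package shipped with R; provides pam()

# Hypothetical helper: cluster the standardized measurements into k groups
# and cross-tabulate the clusters against a known grouping.
cluster_vs_origin <- function(x, origin, k) {
  fit <- pam(scale(x), k = k)  # partitioning around medoids
  table(cluster = fit$clustering, origin = origin)
}

# Usage with the olive data, only if tourr is available (assumption).
if (requireNamespace("tourr", quietly = TRUE)) {
  data(olive, package = "tourr")
  print(cluster_vs_origin(olive[, 3:10], olive$region, k = 3))
}
```

If most samples from one origin fall into a single cluster, the fatty acid profile carries information about location; a table with counts spread evenly across cells would suggest it does not.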

Jim



Michael Dewey

2016-Mar-13 09:23 UTC

[R] Statistical analysis of olive dataset

Dear Axel

Since you are using princomp (among other things) you might find the 
biplot function useful on the output of princomp.
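
A short sketch of what that could look like, assuming tourr is installed. Each component is a weighted combination of all eight fatty acids, so the loadings (the weights) are what indicate which acids matter most for a given component. The helper name top_loadings is hypothetical.

```r
# Hypothetical helper: fit a correlation-based PCA and return the
# loadings (variable weights) of the first k components.
top_loadings <- function(x, k = 3) {
  pc <- princomp(x, cor = TRUE)
  unclass(loadings(pc))[, seq_len(k), drop = FALSE]
}

# Usage with the olive data, only if tourr is available (assumption).
if (requireNamespace("tourr", quietly = TRUE)) {
  data(olive, package = "tourr")
  olivenum <- olive[, 3:10]
  print(round(top_loadings(olivenum), 2))  # weights of each fatty acid
  biplot(princomp(olivenum, cor = TRUE))   # scores and variable arrows together
}
```

In the biplot, variables whose arrows point far along the first axis are the ones with large weights on component 1.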


I have not studied your code in detail but you do seem to be doing 
several things in multiple ways using functions from different sources. 
I wonder whether it might be better to stick to fewer functions.
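
As an illustration of that overlap, the sketch below checks that the eigenvalues of the correlation matrix and the component variances from princomp(cor = TRUE) are the same numbers, so only one of the two routes is needed. The helper name compare_pca is made up for this example.

```r
# Hypothetical helper: put the eigenvalues of the correlation matrix next to
# the component variances reported by princomp(), plus the proportion of
# variance each component explains.
compare_pca <- function(x) {
  ev <- eigen(cor(x))$values
  pc <- princomp(x, cor = TRUE)
  data.frame(eigen    = ev,
             princomp = unname(pc$sdev)^2,
             prop_var = ev / sum(ev))
}

# Usage with the olive data, only if tourr is available (assumption).
if (requireNamespace("tourr", quietly = TRUE)) {
  data(olive, package = "tourr")
  print(round(compare_pca(olive[, 3:10]), 3))
}
```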

-- 
Michael
http://www.dewey.myzen.co.uk/home.html

Michael Friendly

2016-Mar-13 15:24 UTC

[R] Statistical analysis of olive dataset

On 3/12/2016 12:39 PM, Axel wrote:
> The main goal of my analysis is to determine which are the fatty acids that
> characterize the origin of an oil. As a secondary goal, I would like to
> insert the results of the chemical analysis of an oil that I analyzed (I am
> a Chemistry student) in order to determine its region of production. I do
> not know if this last thing is possible.

There are already plenty of tools for this; don't bother trying to 
re-invent an already well-working wheel.

* PCA + a biplot will give you a good overview.  With groups, I 
recommend ggbiplot, with data ellipses for the groups.
This shows clear separation along PC1.

data(olive, package="tourr")
library(ggbiplot)
olivenum <- olive[,c(3:10)]

olive.pca <- prcomp(olivenum, scale.=TRUE)
summary(olive.pca)

# region should be a factor (area has 9 levels, maybe too confusing)
olive$region <- factor(olive$region, labels=c("North", "Sardinia", "South"))

ggbiplot(olive.pca, obs.scale = 1, var.scale = 1,
          groups = olive$region, ellipse = TRUE, varname.size=4,
          circle = TRUE) +
          theme_bw() +
          theme(legend.direction = 'horizontal',
                legend.position = 'top')

* Discrimination among regions by chemical composition:
A canonical discriminant analysis will show you this in
a low-rank view.  The biggest difference is between the North
vs. the other 2.

# MLM
olive.mlm <- lm(as.matrix(olive[,c(3:10)]) ~ olive$region, data=olive)

# Canonical discriminant analysis

# (need devel. version for ellipses)
# install.packages("candisc", repos="http://R-Forge.R-project.org")
library(candisc)
olive.can <- candisc(olive.mlm)
olive.can
plot(olive.can, ellipse=TRUE)

* You can probably use the predict() method for MASS::lda() to predict
the class for new samples.
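
A minimal sketch of that last step, assuming tourr is installed and that a new oil has been measured on the same eight fatty acids, in the same units and column order as the olive data. The helper name predict_origin is hypothetical, and the "new" oil below is just a row held out from the dataset for illustration.

```r
library(MASS)  # recommended package shipped with R; provides lda()

# Hypothetical helper: fit a linear discriminant model on labelled samples
# and predict the class of new, unlabelled measurements.
predict_origin <- function(train_x, train_class, new_x) {
  fit <- lda(x = as.matrix(train_x), grouping = factor(train_class))
  predict(fit, newdata = as.matrix(new_x))
}

# Usage with the olive data, only if tourr is available (assumption).
if (requireNamespace("tourr", quietly = TRUE)) {
  data(olive, package = "tourr")
  new_oil <- olive[1, 3:10]  # stand-in for a newly analyzed oil
  pred <- predict_origin(olive[-1, 3:10], olive$region[-1], new_oil)
  print(pred$class)                # predicted region
  print(round(pred$posterior, 3))  # posterior probability per region
}
```

The posterior probabilities give a rough sense of how confident the assignment is; a new sample that is unlike all training oils will still be assigned to some region, so the result should be read with care.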

hope this helps,
-Michael
