Hi,

I have a few methodological and implementation questions for you all. Thank you in advance for your help. I have a dataset that reflects people's preference choices, and I want to see whether there is any kind of clustering effect among certain preference choices (e.g. do people who pick choice A also pick choice D?).

The data set has one record per user ID, per preference choice. It's a "long" data set that looks like this:

ID  | Page
123 | Choice A
123 | Choice B
456 | Choice A
456 | Choice B
...

I thought that I should do the following:

1. Make the data set "wide", counting the observations so the data look like this:

ID  | Count of Preference A | Count of Preference B
123 | 1                     | 1
...

using

table1 <- dcast(data, ID ~ Page, fun.aggregate = length, value_var = 'Page')

2. Create a correlation matrix of the preferences:

cor(table1[, -1])

How would I restrict my correlation to preferences that meet a minimum sample threshold? Can you confirm whether the two following commands do the same thing? And what would I do from here (or am I taking the wrong approach)?

table1 <- dcast(data, Page ~ Page, fun.aggregate = length, value_var = 'Page')
table2 <- with(data, table(Page, Page))

Many thanks,
Peter
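For reference, a minimal sketch of the reshape-and-correlate approach described above, assuming the long data frame is named data with columns ID and Page, and using reshape2's current value.var argument spelling; the threshold of 30 users is purely illustrative:

library(reshape2)

## Step 1: one row per ID, one column per choice, cells = count of that choice
wide <- dcast(data, ID ~ Page, fun.aggregate = length, value.var = "Page")

## Keep only choices picked by at least min_n distinct users
min_n <- 30
keep  <- colSums(wide[, -1] > 0) >= min_n   # drop the ID column before counting
wide  <- wide[, c(TRUE, keep)]

## Step 2: correlation matrix of the remaining choice columns
cor(wide[, -1])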
On a methodological level, if the choices do not correspond to a cardinal, or at least ordinal, scale, you don't want to use correlations. Instead you should probably use Cramér's V, in particular if the choices are multinomial. Whether the wide format is necessary will depend on the input format that the function you end up using expects.

HTH,
Daniel
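For concreteness, a minimal sketch of Cramér's V for a pair of choices, assuming the 0/1 indicator columns produced by the dcast step above; the column names "Choice A" and "Choice B" are placeholders:

## Cramér's V for two categorical variables (here 0/1 membership indicators)
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  unname(sqrt(chi2 / (n * (min(dim(tab)) - 1))))
}

## e.g. association between picking choice A and picking choice B
cramers_v(wide[["Choice A"]] > 0, wide[["Choice B"]] > 0)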
Your question sounds like association rule mining for frequent item sets. Check out the arules package.

Weidong Gu
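A minimal sketch of the arules approach, again assuming a long data frame named data with columns ID and Page; the support and confidence thresholds are illustrative:

library(arules)

## One transaction per user: the set of choices that user picked
trans <- as(split(as.character(data$Page), data$ID), "transactions")

## Sets of choices that frequently co-occur
itemsets <- apriori(trans,
                    parameter = list(target = "frequent itemsets",
                                     supp = 0.05, minlen = 2))
inspect(head(itemsets, n = 10, by = "support"))

## Or association rules of the form "people who pick A also pick D"
rules <- apriori(trans,
                 parameter = list(supp = 0.05, conf = 0.5, minlen = 2))
inspect(head(rules, n = 10, by = "lift"))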
Hi Peter,

An easy way to visualize set intersections is the intersectDiagram function in the plotrix package. It displays the counts or percentages of each type of intersection. Your data could be passed like this:

library(plotrix)
choices <- data.frame(IDs = sample(1:20, 50, TRUE),
                      sample(LETTERS[1:4], 50, TRUE))
intersectDiagram(choices)

This example is a bit messy, as it generates quite a few repeated choices that will be ignored by intersectDiagram, but it should give you the idea.

Jim
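Applied to the original long-format data (assuming a data frame named data with columns ID and Page), the analogue of the example above would presumably be:

library(plotrix)
## pass the ID column plus the categorical choice column, as in the example above
intersectDiagram(data[, c("ID", "Page")])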