thr3ads.net - R help - [R] Operating on count lists of non-equal lengths [Jan 2011]

Kari Manninen

2011-Jan-09 06:19 UTC

[R] Operating on count lists of non-equal lengths

This is my first post to R-help, and I look forward to receiving some
advice for a novice like me...

I've got simple repeated (4 periods so far) 10-question survey data
that are very easy to work with in Excel. However, I'd like to move the
compilation to R, but I'm having some trouble operating on count-list
data in a neat way.

The data C:

> str(C)
'data.frame':	551 obs. of  13 variables:
  $ TIME   : int  1 1 1 1 1 1 1 1 1 1 ...
  $ Sector : Factor w/ 6 levels "D","F","G","H",..: 1 1 1 1 1 1 1 1 1 1 ...
  $ COMP   : Factor w/ 196 levels " (_____ __ _____) ",..: 73 133 128 109 153 147 56 26 142 34 ...
  $ Q1     : int  0 0 1 1 0 -1 -1 1 1 -1 ...
  $ Q2     : int  0 0 0 -1 0 -1 0 0 1 -1 ...
  $ Q3     : int  0 0 0 1 0 -1 -1 1 1 -1 ...
  $ Q4     : int  -1 0 0 0 0 -1 0 -1 0 -1 ...
  $ Q5     : int  0 0 0 -1 0 -1 0 -1 0 0 ...
  $ Q6     : int  0 0 0 1 0 -1 0 -1 0 0 ...
  $ Q7     : int  0 1 1 0 0 0 1 0 1 1 ...
  $ Q8     : int  0 0 0 0 0 -1 0 0 1 0 ...
  $ Q9     : int  0 1 0 0 0 -1 0 -1 1 -1 ...
  $ Q10    : int  0 0 0 0 -1 -1 0 -1 0 0 ...

> summary(C)
       TIME       Sector  COMP        Q1               Q2
  Min.   :1.000   D:130   A:  4   Min.   :-1.000   Min.   :-1.0000
  1st Qu.:2.000   F:126   B:  4   1st Qu.: 0.000   1st Qu.: 0.0000
  Median :3.000   G:158   C:  4   Median : 1.000   Median : 0.0000
  Mean   :2.684   H: 26   D:  4   Mean   : 0.446   Mean   : 0.2178
  3rd Qu.:4.000   I: 20   E:  4   3rd Qu.: 1.000   3rd Qu.: 1.0000
  Max.   :4.000   J: 91   F:  4   Max.   : 1.000   Max.   : 1.0000
                   (Other):527   NA's   :60.000   NA's   :69.0000



The aim is to produce balance scores between the shares of positive and
negative answers in the data. The first step is to count the -1, 0 and 1
answers (negative, neutral, positive) and the missing values (NA) for
each question Q1-Q10, for each period (TIME), in the 6 Sectors (it would
be so much simpler without the missing values):

b<-apply(C[,4:13], 2, function (x) tapply(x,C[,1:2], count))

I know that b is a list of data.frames, dim(4x6), one for each question,
where each "cell" is a count list.

For example, for Question 1, Time period 2, Sector 1:

> str(b$Q1[2,1])
List of 1
  $ :'data.frame':  4 obs. of 2 variables:
    ..$ x    : int [1:4]  -1 0 1  NA
    ..$ freq : int [1:4]  3  9 12 2
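
One possible way to keep such count vectors the same length in every cell is to fix the factor levels before tabulating, so that an answer that never occurs still gets a zero count. This is only a minimal sketch with made-up answers; the vector `x` is illustrative and not taken from the survey data.

```r
# Made-up answers for one cell: no -1 at all, one missing value
x <- c(0, 0, 1, 1, NA, 1)

# Fixing the levels forces an entry for -1, 0 and 1 even when a value is
# absent; useNA = "always" adds a fourth entry for the NA count
counts <- table(factor(x, levels = c(-1, 0, 1)), useNA = "always")
counts

# Plain integer vector, in the order -1, 0, 1, NA
as.vector(counts)
```

With equal-length counts, cells can be added or bound together without checking which answers happened to occur.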

Now I would like to group the questions (C[, 4:6], C[, 7], C[, 8:9],
C[, 10:11] and C[, 12:13]), sum the counts (-1, 0, 1) within these groups
and present them in percentage terms. I don't know how to do this
efficiently for the whole data set. I would not like to go through each
cell separately.


Then I'd give each group a balance score based on something like:

Score = 100 + 100*[ pos% - neg%] for each group by TIME, Sector, while  
excluding the missing observations.
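
As a quick check of this formula, here is a small sketch that applies it to the counts of the example cell shown earlier (3 negative, 9 neutral, 12 positive, with the 2 missing answers left out). The variable names are illustrative only.

```r
# Counts from the example cell b$Q1[2, 1]; the 2 NAs are excluded
counts <- c(neg = 3, neutral = 9, pos = 12)

# Shares among the non-missing answers
shares <- counts / sum(counts)

# Balance score: 100 + 100 * (pos% - neg%)
score <- 100 + 100 * (shares[["pos"]] - shares[["neg"]])
score   # 137.5
```

A score of 100 means positive and negative answers balance out; the possible range is 0 to 200.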

### This is not working
Score <- 100 + 100*[ sum(count( =="1"))/sum(count(list("-1", "0", "1")))
                   - sum(count( =="-1"))/sum(count(list("-1", "0", "1"))) ]
for each of the 5 groups defined above and by TIME, Sector

I would greatly appreciate your help on this.

Regards,
- Kari Manninen

Dennis Murphy

2011-Jan-09 10:22 UTC

[R] Operating on count lists of non-equal lengths

Hi:

This is an abridged version of the reply I sent privately to the OP.

#### Generate an artificial data frame
# function to randomly generate one of the Q* columns with length 1000
mysamp <- function() sample(c(-1, 0, 1, NA), 1000, prob = c(0.35, 0.2, 0.4,
0.05), replace = TRUE)

# use the above function to randomly generate 10 questions and assign them
# names in the workspace
for(i in 1:10) assign(paste('Q', i, sep = ''), mysamp())
# create a data frame from the generated questions
C <- data.frame(time = rep(1:4, each = 250),
                sector = sample(LETTERS[1:6], 1000, replace = TRUE),
                Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10)
####

# A function to generate the scores from the combined questions
# for an arbitrary input data frame d:
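scorefun <- function(d) {
     # 3 x 10 matrix of counts of -1, 0, 1 per question; fixing the factor
     # levels keeps the row order stable and gives 0 for absent answers,
     # while NAs are dropped
     dm <- sapply(d[-(1:2)], function(x) table(factor(x, levels = c(-1, 0, 1))))
     # pool the counts over the five question groups
     tsums <- cbind(rowSums(dm[, 1:3]), dm[, 4],
                    rowSums(dm[, 5:6]), rowSums(dm[, 7:8]),
                    rowSums(dm[, 9:10]) )
     # balance: (positives - negatives) / non-missing answers
     dprop <- function(x) (x[3] - x[1])/sum(x)
     100 * (1 + apply(tsums, 2, dprop))
   }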

library(plyr)
# Apply scorefun() to each sub-data frame corresponding to a time-sector
# combination
ddply(C, .(time, sector), scorefun)
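
For comparison, the same balance scores can probably also be obtained in base R without plyr. The following is only a sketch under stated assumptions: the data frame `C2`, the group labels `G1` to `G5` and the helpers `balance()` and `score_groups()` are made up for illustration, and the data are simulated in the same way as above.

```r
set.seed(1)
n <- 1000

# Simulated data: 4 periods, 6 sectors, 10 questions with some NAs
C2 <- data.frame(time = rep(1:4, each = 250),
                 sector = sample(LETTERS[1:6], n, replace = TRUE))
for (i in 1:10)
  C2[[paste0("Q", i)]] <- sample(c(-1, 0, 1, NA), n, replace = TRUE,
                                 prob = c(0.35, 0.2, 0.4, 0.05))

# Hypothetical labels for the five question groups
groups <- list(G1 = c("Q1", "Q2", "Q3"), G2 = "Q4", G3 = c("Q5", "Q6"),
               G4 = c("Q7", "Q8"), G5 = c("Q9", "Q10"))

# Balance score of a vector of answers, ignoring missing values
balance <- function(x) {
  x <- x[!is.na(x)]
  100 + 100 * (mean(x == 1) - mean(x == -1))
}

# Pool all answers of each question group, then score the pooled answers
score_groups <- function(d)
  sapply(groups, function(q) balance(unlist(d[q], use.names = FALSE)))

# One piece per time-sector combination; rows are named like "1.A"
pieces <- split(C2, list(C2$time, C2$sector), drop = TRUE)
scores <- t(sapply(pieces, score_groups))
head(scores)
```

Pooling the raw answers of a group and taking the difference of the two proportions gives the same number as summing the counts first, because both divide by the number of non-missing answers in the group.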

Dennis

