thr3ads.net - R help - [R] simple generation of artificial data with defined features [Aug 2008]

drflxms

2008-Aug-22 12:12 UTC

[R] simple generation of artificial data with defined features

Dear R-colleagues,

I am quite new to R and am struggling with what is probably a simple
problem: generating artificial data with defined features.

I am conducting a study of inter-observer agreement in bronchoscopy in
children. One of the most important measures is Kappa according to
Fleiss, which is conveniently available in R through the irr package.
Unfortunately, medical doctors like me don't really understand much
about statistics. Therefore I'd like to give the reader an easily
understandable example of Fleiss' Kappa in the Methods part. To this
end, I obtained a table with the results of the German election of 2005:

party     number of votes    percent

SPD          16194665          34.2
CDU          13136740          27.8
CSU           3494309           7.4
Gruene        3838326           8.1
FDP           4648144           9.8
PDS           4118194           8.7
I want to show the agreement of voters measured by Fleiss-Kappa. To
calculate this with the kappam.fleiss-function of irr, I need a
data.frame like this:

          (id of 1st voter)    (id of 2nd voter)

party           spd                  cdu

Of course I don't plan to calculate this with the millions of cases
listed in the table above (I am working on a small laptop). Dividing the
counts by 1000 would be more than sufficient for this example. The exact
format of the table is not so important, as I could reshape nearly any
format with the help of the reshape package.

Unfortunately I could not figure out how to create such a
fictitious/artificial dataset. Any data.frame that keeps at least the
percentages would be fine. The string IDs of the parties could of course
be replaced by numbers (which would be even better for the function
kappam.fleiss in irr!).

I would appreciate any kind of help very much indeed.
Greetings from Munich,

Felix Mueller-Sarnowski

Greg Snow

2008-Aug-22 16:40 UTC

head link

[R] simple generation of artificial data with defined features

I don't think that the election data is the right data to demonstrate Kappa,
you need subjects that are classified by 2 or more different raters/methods. 
The election data could be considered classifying the voters into which party
they voted for, but you only have 1 rater.  Maybe if you had some survey data
that showed which party each voter voted for in 2 or more elections, then that
may be a good example dataset.  Otherwise you may want to stick with the sample
datasets.
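
To make the layout Greg describes concrete, here is a minimal sketch in base R. The ratings are hypothetical toy values invented for illustration (5 subjects, each classified by 3 raters), and fleiss_kappa() is a hand-written helper following the standard Fleiss formula, not a function from the irr package.

```r
# Hypothetical toy data: 5 subjects (rows) rated by 3 raters (columns).
# The values are made up purely for illustration.
ratings <- data.frame(
  rater1 = c("spd", "cdu", "fdp", "spd", "cdu"),
  rater2 = c("spd", "cdu", "fdp", "cdu", "cdu"),
  rater3 = c("spd", "cdu", "spd", "cdu", "cdu")
)

# Fleiss' kappa computed by hand (illustrative helper, not part of irr).
fleiss_kappa <- function(ratings) {
  m <- as.matrix(ratings)
  n_raters <- ncol(m)
  cats <- sort(unique(as.vector(m)))
  # counts[i, j] = number of raters who put subject i into category j
  counts <- sapply(cats, function(k) rowSums(m == k))
  counts <- matrix(counts, nrow = nrow(m))
  p_j   <- colSums(counts) / sum(counts)   # overall share of each category
  P_i   <- (rowSums(counts^2) - n_raters) / (n_raters * (n_raters - 1))
  P_obs <- mean(P_i)                       # mean observed agreement
  P_exp <- sum(p_j^2)                      # agreement expected by chance
  (P_obs - P_exp) / (1 - P_exp)
}

kappa_toy <- fleiss_kappa(ratings)
print(kappa_toy)
```

With several subjects the raters can agree on some and disagree on others, which is the variation kappa needs. A data frame in this subjects-by-raters shape is also what kappam.fleiss in irr expects as input.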

There are other packages that compute Kappa values as well (I don't know if
others calculate this particular version), but some of those take the summary
data as input rather than the raw data, which may be easier if you just have the
summary tables.


--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111



drflxms

2008-Aug-23 11:25 UTC

Dear Mr. Christos Hatzis,

thank you so much for your answer, which is in my eyes just brilliant! I
followed it step by step (great and detailed explanation) and nearly
everything is fine, except for a problem at the very end for which I
haven't found a solution so far (despite playing around quite a lot...).
Please let me explain:
> election.2005 <- c(16194,13136,3494,3838,4648,4118) # cut off the last 3 digits, because my laptop can't handle millions of rows...
> attr(election.2005, "class") <- "table"
> attr(election.2005, "dim") <- c(1,6)
> attr(election.2005, "dimnames") <- list(c("votes"), c("spd", "cdu", "csu", "gruene", "fdp", "pds"))
> head(election.2005)
        spd   cdu  csu gruene  fdp  pds
votes 16194 13136 3494   3838 4648 4118
> el.dt <- as.data.frame(election.2005)
> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
> dim(el.dt.exp)
[1] 45428     2
> head(el.dt.exp)
     Var1 Var2
1   votes  spd
1.1 votes  spd
1.2 votes  spd
1.3 votes  spd
1.4 votes  spd
1.5 votes  spd

My problem now is that I would need either an auto-incrementing
identifier instead of "votes" in Var1, or a way to access the row
numbering by a column name (e.g. Var0). In addition I need a third
variable for the year of the election (2005, which is the same for all
rows, but needed later on). So this is what it should look like:

     voter.id   party   election.year
1        1       spd        2005
1.1      2       spd        2005
1.2      3       spd        2005
1.3      4       spd        2005
1.4      5       spd        2005
1.5      6       spd        2005

The reason for that is the input format of the kappam.fleiss function of
the irr package I use for the calculation. It accepts a data.frame with
the subjects as rows (here we would have only one subject: the year of
the election) and the raters (here the voters) as columns. The
data.frame then holds the chosen party for each combination of election
year and voter.
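
The requested columns and the one-row-per-year layout can be sketched in base R as follows. This is only an illustration under stated assumptions: the names votes, voters and kappa.frame are chosen here, the counts are the thousands from the transcript above, and the parties are stored as integer codes 1 to 6.

```r
# Party counts in thousands, as in the transcript above.
votes <- c(spd = 16194, cdu = 13136, csu = 3494,
           gruene = 3838, fdp = 4648, pds = 4118)

# One row per voter: repeat each party name according to its count,
# then add a running voter id and a constant election year.
voters <- data.frame(
  voter.id      = seq_len(sum(votes)),
  party         = factor(rep(names(votes), times = votes),
                         levels = names(votes)),
  election.year = 2005
)

head(voters)
table(voters$party)

# Wide layout: one row per election year, one column per voter.
# as.integer() on the factor gives the codes 1 = spd, ..., 6 = pds.
kappa.frame <- matrix(as.integer(voters$party), nrow = 1,
                      dimnames = list("2005", as.character(voters$voter.id)))
dim(kappa.frame)
```

Building the id with seq_len() also gives plain row names 1, 2, 3, ... instead of 1, 1.1, 1.2, ..., because the data frame is created fresh rather than by indexing with repeated row numbers.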

This format can easily be achieved using the reshape package. Assuming
voter.id is an auto-incrementing identifier, the commands should be:
> library(reshape)
> el.dt.exp.molten <- melt(el.dt.exp, id=c("voter.id")) # which would probably not really change anything in this case, because the data is already in a "molten" form
> kappa.frame <- cast(el.dt.exp.molten, election.year ~ voter.id, subset=variable=="party")

I'd be extremely happy in case you might help me out again!
Have a nice weekend and many thanks so far!
Greetings from Munich,

Felix Mueller-Sarnowski


Christos Hatzis wrote:
> On the general question of how to create a dataset that matches the
> frequencies in a table, the function as.data.frame can be useful.  It takes
> as argument an object of class 'table' and returns a data frame of
> frequencies.
>
> Consider for example table 6.1 of Fleiss et al (3rd Ed):
>
> > birth.weight <- c(10,15,40,135)
> > attr(birth.weight, "class") <- "table"
> > attr(birth.weight, "dim") <- c(2,2)
> > attr(birth.weight, "dimnames") <- list(c("A", "Ab"), c("B", "Bb"))
> > birth.weight
>      B  Bb
> A   10  40
> Ab  15 135
> > summary(birth.weight)
> Number of cases in table: 200
> Number of factors: 2
> Test for independence of all factors:
>         Chisq = 3.429, df = 1, p-value = 0.06408
> > bw.dt <- as.data.frame(birth.weight)
>
> Observations (rows) in this table can then be replicated according to their
> corresponding frequencies to yield the expanded dataset that conforms with
> the original table.
>
> > bw.dt.exp <- bw.dt[rep(1:nrow(bw.dt), bw.dt$Freq), -ncol(bw.dt)]
> > dim(bw.dt.exp)
> [1] 200   2
> > table(bw.dt.exp)
>     Var2
> Var1   B  Bb
>   A   10  40
>   Ab  15 135
>
> The above approach is not restricted to 2x2 tables, and it should be
> straightforward to generate datasets that conform to arbitrary n x m
> frequency tables.
>
> -Christos Hatzis
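
The same expansion can also be sketched with as.table() instead of setting the class, dim and dimnames attributes by hand. This is a variation on the quoted approach rather than the code from the message itself. It reuses the quoted variable names and selects the columns as Var1 and Var2, which are the default names as.data.frame gives to unnamed table dimensions.

```r
# Build the 2x2 frequency table directly from a matrix (filled column-wise).
birth.weight <- as.table(matrix(c(10, 15, 40, 135), nrow = 2,
                                dimnames = list(c("A", "Ab"), c("B", "Bb"))))

# Long form: one row per cell, with columns Var1, Var2 and Freq.
bw.dt <- as.data.frame(birth.weight)

# Repeat each row Freq times, keeping only the two classification columns.
bw.dt.exp <- bw.dt[rep(seq_len(nrow(bw.dt)), bw.dt$Freq), c("Var1", "Var2")]
rownames(bw.dt.exp) <- NULL   # plain row numbers instead of 1, 1.1, 1.2, ...

dim(bw.dt.exp)
table(bw.dt.exp)              # cross-tabulating recovers the original counts
```

Resetting the row names at this point avoids the 1, 1.1, 1.2, ... numbering that appears when rows are replicated by index.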

drflxms

2008-Aug-24 11:01 UTC

Hello all,

besides thanking you again for your help, I'd like to present the
final solution to my problem and the results of the kappa calculation:
> election.2005 <- c(16194,13136,3494,3838,4648,4118)
# data obtained via the GENESIS database of the "Statistisches Bundesamt", www.destatis.de
# simply cut off the last 3 digits because of the limited computing power of my laptop
> attr(election.2005, "class") <- "table"
> attr(election.2005, "dim") <- c(1,6)
> attr(election.2005, "dimnames") <- list(c("votes"), c(1,2,3,4,5,6))
# used numbers instead of party names for easier handling later on
# 1=spd, 2=cdu, 3=csu, 4=gruene, 5=fdp, 6=pds
> head(election.2005)
      [,1]  [,2] [,3] [,4] [,5] [,6]
[1,] 16194 13136 3494 3838 4648 4118
# convert the table to a data frame of frequencies (as in my previous message):
> el.dt <- as.data.frame(election.2005)
# replicate rows according to the frequency table:
> el.dt.exp <- el.dt[rep(1:nrow(el.dt), el.dt$Freq), -ncol(el.dt)]
> el.dt.exp$id=seq(1:nrow(el.dt.exp)) # add voter id
> el.dt.exp$year=2005 # add column with year of election
# remove a column we don't need:
> el.dt.exp<-subset(el.dt.exp, select=-c(Var1))
> dim(el.dt.exp)
[1] 45428     3
> head(el.dt.exp)
    Var2 id year
1      1  1 2005
1.1    1  2 2005
1.2    1  3 2005
1.3    1  4 2005
1.4    1  5 2005
1.5    1  6 2005
# get rid of the unusual numbering of rows:
> el.dt.exp<-as.data.frame(el.dt.exp, row.names=seq(1:nrow(el.dt.exp)))
> head(el.dt.exp)
  Var2 id year
1    1  1 2005
2    1  2 2005
3    1  3 2005
4    1  4 2005
5    1  5 2005
6    1  6 2005
> summary(el.dt.exp)
 Var2            id             year
 1:16194   Min.   :    1   Min.   :2005
 2:13136   1st Qu.:11358   1st Qu.:2005
 3: 3494   Median :22715   Median :2005
 4: 3838   Mean   :22715   Mean   :2005
 5: 4648   3rd Qu.:34071   3rd Qu.:2005
 6: 4118   Max.   :45428   Max.   :2005

Var2 is not numeric (the summary above lists it by level, i.e. as a
factor), which is inconvenient for further processing. I changed its
type to numeric with the data editor, using fix(el.dt.exp).

# create the data frame for the calculation of kappa
> library(reshape)
> el.dt.exp.molten<-melt(el.dt.exp, id=c(2,3), na.rm=FALSE)
> kappa.frame<-cast(el.dt.exp.molten, year ~ id)
> dim(kappa.frame)
[1]     1 45429
# calculate kappa
> library(irr)
> kappam.fleiss(kappa.frame, exact=FALSE, detail=TRUE)
 Fleiss' Kappa for m Raters

 Subjects = 1
   Raters = 45428
    Kappa = -2.2e-05

        z = -1.35
  p-value = 0.176

   Kappa      z p.value
1  0.000 -0.707   0.479
2  0.000 -0.707   0.479
3  0.000 -0.707   0.479
4  0.000 -0.707   0.479
5  0.000 -0.707   0.479
6  0.000 -0.707   0.479

What a surprise! So Greg was absolutely right that this is probably not
a good example for Kappa. But still a very interesting one, if you ask me!

My theory: Kappa doesn't simply express agreement. As far as I learned
from the Handbook of Inter-Rater Reliability (Gwet, Kilem 2001; STATAXIS
Publishing Company;  www.stataxis.com), Kappa tries to measure how
different an observed agreement is from the agreement that would arise
by chance.
So in this case this probably means that the results of the 2005
election are not significantly different from results that could have
arisen by chance.
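
One way to probe this interpretation is to redo the calculation by hand for the single-subject case. The sketch below assumes the standard Fleiss formula. single_subject_kappa() is an illustrative helper written for this note, and the "lopsided" counts are hypothetical. With only one subject, the chance agreement is estimated from the same counts as the observed agreement, and the algebra reduces to kappa = -1/(n - 1) for n raters, whatever the distribution of votes.

```r
# Fleiss' kappa when there is exactly one subject rated by sum(counts) raters.
# Illustrative helper, not a function from the irr package.
single_subject_kappa <- function(counts) {
  n     <- sum(counts)                              # number of raters
  P_obs <- (sum(counts^2) - n) / (n * (n - 1))      # observed pairwise agreement
  P_exp <- sum((counts / n)^2)                      # agreement expected by chance
  (P_obs - P_exp) / (1 - P_exp)
}

votes    <- c(16194, 13136, 3494, 3838, 4648, 4118) # counts used in the thread
lopsided <- c(45423, 1, 1, 1, 1, 1)                 # hypothetical: near-unanimous vote

k_votes    <- single_subject_kappa(votes)
k_lopsided <- single_subject_kappa(lopsided)
k_formula  <- -1 / (sum(votes) - 1)

print(c(k_votes, k_lopsided, k_formula))
```

All three numbers agree with the Kappa = -2.2e-05 reported above, which suggests that the result reflects the one-subject layout rather than anything about how the votes were distributed.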

Anyway I personally learned a very interesting lesson about Kappa and R.
Thank you all for your professional and quick help to a newbie!
Greetings from Munich,

Felix
