thr3ads.net - R help - [R] Discrepancy in the PBC data set [Nov 2008]


Terry Therneau

2008-Nov-24 13:39 UTC

[R] Discrepancy in the PBC data set

The data set in R is wrong. I've found mistakes on 2 lines in a quick look.

  I don't know if the data is incorrect in the Appendix of Fleming and
Harrington as well (someone seems to have borrowed my copy), which is where the
data set appears to have been taken from, given all the "-9" codes in it.
(Note, Tom Fleming originally got the data from me, so I'm fairly confident in
calling my Mayo version the authoritative one).  I'll make sure this gets fixed.
  
  You can grab a correct data set from our department web page.  Code is below.
  
  	Terry Therneau
  	
  
library(survival)  # provides coxph() and Surv()

pbcurl <- "http://mayoresearch.mayo.edu/mayo/research/biostat/upload/therneau_upload/pbc.dat"

pbc <- read.table(pbcurl, header=FALSE,
                  col.names=c('id', 'time', 'status', 'trt', 'age', 'sex',
                              'ascites', 'hepato', 'spiders', 'edema',
                              'bili', 'chol', 'albumin', 'copper',
                              'alk.phos', 'ast', 'trig', 'platelet',
                              'protime', 'stage'),
                  na.strings='.')
pbc$age <- pbc$age/365.25   # convert age from days to years

newfit <- coxph(Surv(time, status==2) ~ age + edema + log(bili) +
                log(protime) + log(albumin), data=pbc)

newfit
                coef exp(coef) se(coef)     z       p
age           0.0396    1.0404  0.00767  5.16 2.4e-07
edema         0.8963    2.4505  0.27141  3.30 9.6e-04
log(bili)     0.8636    2.3716  0.08294 10.41 0.0e+00
log(protime)  2.3868   10.8791  0.76851  3.11 1.9e-03
log(albumin) -2.5069    0.0815  0.65292 -3.84 1.2e-04

Likelihood ratio test=231  on 5 df, p=0  n=416 (2 observations deleted due to missingness)
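One way to locate the kind of line-level mistakes mentioned above is to compare two versions of the data cell by cell. The following is only a minimal sketch, not the method used in this thread: the helper name `find_discrepancies`, its arguments, and the two toy data frames are hypothetical, and the toy values are made up rather than taken from the PBC data. It assumes both versions share an `id` column and that missing values may appear either as `NA` or as the "-9" code.

```r
# Hypothetical helper: list cells that differ between two versions of a data set.
# Assumption: rows are matched on a shared key column, and a numeric missing-value
# code (here -9) in either version is treated the same as NA.
find_discrepancies <- function(a, b, key = "id", missing_code = -9) {
  cols <- setdiff(intersect(names(a), names(b)), key)
  ids  <- intersect(a[[key]], b[[key]])
  a <- a[match(ids, a[[key]]), , drop = FALSE]
  b <- b[match(ids, b[[key]]), , drop = FALSE]

  out <- list()
  for (col in cols) {
    x <- a[[col]]
    y <- b[[col]]
    # Recode the missing-value code to NA in both versions
    x[!is.na(x) & x == missing_code] <- NA
    y[!is.na(y) & y == missing_code] <- NA
    # A cell differs if exactly one side is missing, or both are present and unequal
    differs <- xor(is.na(x), is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
    if (any(differs)) {
      out[[col]] <- data.frame(id = ids[differs], variable = col,
                               value_a = as.character(x[differs]),
                               value_b = as.character(y[differs]),
                               stringsAsFactors = FALSE)
    }
  }
  if (length(out) == 0) {
    return(data.frame(id = ids[0], variable = character(0),
                      value_a = character(0), value_b = character(0),
                      stringsAsFactors = FALSE))
  }
  res <- do.call(rbind, out)
  row.names(res) <- NULL
  res
}

# Toy data with made-up values (NOT the real PBC records)
version_a <- data.frame(id = 1:4,
                        bili = c(1.0, 2.0, 3.0, 4.0),
                        chol = c(200, 210, -9, 230))
version_b <- data.frame(id = 1:4,
                        bili = c(1.0, 2.5, 3.0, 4.0),
                        chol = c(200, 210, NA, 235))

diffs <- find_discrepancies(version_a, version_b)
print(diffs)
```

In the toy example the "-9" versus `NA` pair for id 3 is not reported, because both are treated as missing; only the two cells whose recorded values actually differ are listed. With real data, `version_a` and `version_b` would be replaced by the two copies being compared, and exact equality on numeric columns may need to be relaxed to a tolerance.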

Ravi Varadhan

2008-Nov-24 14:28 UTC


[R] Discrepancy in the PBC data set

Dear Terry,

Thank you very much for taking your time to address this problem!  

I did check the data in F&H.  I couldn't detect any differences between the
R data set and the one in the Appendix.  The preface in F&H acknowledges
that the data set was obtained from Roland Dickinson.  Was the data set in R
created by Tom Fleming based on the original Mayo data?

Where do the papers that reference this data set get their data from?  Do
they get it from the URL that you gave me?  It is impossible to tell from
the papers because they just cite the F&H appendix as the source of the
data, but obviously they must have gotten it as an electronic version from
somewhere.  If so, is the electronic version the same as the R data set?

This is relevant for me because I am trying to compare the results of my
estimation algorithm to those in another paper (which, of course, simply
cites F&H for the data).

Best regards,
Ravi.

-----------------------------------------------------------------------------------

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology 

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email: rvaradhan at jhmi.edu

Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

 

-----------------------------------------------------------------------------------


