Greetings Everybody:
I generated a 1.2MB dta file based on the general social survey with
Stata8 for linux. The file can be re-opened with Stata, but when I bring
it into R, it says all the values are missing for most of the variables.
This dataset is called "morgen.dta" and I dropped a copy online in case
you are interested:
http://www.ku.edu/~pauljohn/R/morgen.dta
It looks like this to R (I tried various options on the read.dta command):
> myDat <- read.dta("morgen.dta")
> summary(myDat)
     CASEID              year            id          hrs1          hrs2
 Min.   :   19721   Min.   :1972   Min.   :   1   NAP :    0   NAP :    0
 1st Qu.: 1983475   1st Qu.:1978   1st Qu.: 445   DK  :    0   DK  :    0
 Median : 1996808   Median :1987   Median : 905   NA  :    0   NA  :    0
 Mean   : 9963040   Mean   :1986   Mean   : 990   NA's:40933   NA's:40933
 3rd Qu.:19872187   3rd Qu.:1994   3rd Qu.:1358
 Max.   :20002817   Max.   :2000   Max.   :3247

      prestige         agewed          age           educ         paeduc
 DK,NA,NAP:    0   NAP :    0   DK  :    0   NAP :    0   NAP :    0
 NA's     :40933   DK  :    0   NA  :    0   DK  :    0   DK  :    0
                   NA  :    0   NA's:40933   NA  :    0   NA  :    0
                   NA's:40933                NA's:40933   NA's:40933

    maeduc        speduc                income
 NAP :    0   NAP :    0   $25000 OR MORE:14525
 DK  :    0   DK  :    0   $10000 - 14999: 5022
 NA  :    0   NA  :    0   $15000 - 19999: 3869
 NA's:40933   NA's:40933   $20000 - 24999: 3664
                           REFUSED       : 1877
                           (Other)       : 8523
                           NA's          : 3453
>
Here's what Stata sees when I load the same thing:
summarize, detail
Case identification number
-------------------------------------------------------------
Percentiles Smallest
1% 197432 19721
5% 199649 19722
10% 1974116 19723 Obs 40933
25% 1983475 19724 Sum of Wgt. 40933
50% 1996808 Mean 9963040
Largest Std. Dev. 9006352
75% 1.99e+07 2.00e+07
90% 2.00e+07 2.00e+07 Variance 8.11e+13
95% 2.00e+07 2.00e+07 Skewness .18931
99% 2.00e+07 2.00e+07 Kurtosis 1.045409
GSS YEAR FOR THIS RESPONDENT
-------------------------------------------------------------
Percentiles Smallest
1% 1972 1972
5% 1973 1972
10% 1974 1972 Obs 40933
25% 1978 1972 Sum of Wgt. 40933
50% 1987 Mean 1986.421
Largest Std. Dev. 8.61136
75% 1994 2000
90% 1998 2000 Variance 74.15552
95% 2000 2000 Skewness -.0789223
99% 2000 2000 Kurtosis 1.799939
RESPONDENT ID NUMBER
-------------------------------------------------------------
Percentiles Smallest
1% 18 1
5% 89 1
10% 178 1 Obs 40933
25% 445 1 Sum of Wgt. 40933
50% 905 Mean 989.9129
Largest Std. Dev. 689.0596
75% 1358 3244
90% 2027 3245 Variance 474803.2
95% 2437 3246 Skewness .8359211
99% 2867 3247 Kurtosis 3.311248
NUMBER OF HOURS WORKED LAST WEEK
-------------------------------------------------------------
Percentiles Smallest
1% 6 0
5% 15 0
10% 21 0 Obs 23279
25% 37 0 Sum of Wgt. 23279
50% 40 Mean 41.05206
Largest Std. Dev. 13.95931
75% 48 89
90% 60 89 Variance 194.8624
95% 65 89 Skewness .195045
99% 82 89 Kurtosis 4.448998
NUMBER OF HOURS USUALLY WORK A WEEK
-------------------------------------------------------------
Percentiles Smallest
1% 4 0
5% 15 0
10% 20 1 Obs 774
25% 38 2 Sum of Wgt. 774
50% 40 Mean 39.79199
Largest Std. Dev. 13.43383
75% 45 89
90% 55 89 Variance 180.4677
95% 60 89 Skewness -.0002332
99% 80 89 Kurtosis 5.009869
RS OCCUPATIONAL PRESTIGE SCORE (1970)
-------------------------------------------------------------
Percentiles Smallest
1% 14 12
5% 17 12
10% 20 12 Obs 24267
25% 30 12 Sum of Wgt. 24267
50% 39 Mean 39.35645
Largest Std. Dev. 14.03712
75% 48 82
90% 60 82 Variance 197.0407
95% 62 82 Skewness .2927414
99% 76 82 Kurtosis 2.775553
AGE WHEN FIRST MARRIED
-------------------------------------------------------------
Percentiles Smallest
1% 15 12
5% 17 12
10% 17 12 Obs 25382
25% 19 12 Sum of Wgt. 25382
50% 21 Mean 22.09609
Largest Std. Dev. 4.813944
75% 24 63
90% 28 68 Variance 23.17405
95% 31 73 Skewness 2.002265
99% 39 73 Kurtosis 11.28279
AGE OF RESPONDENT
-------------------------------------------------------------
Percentiles Smallest
1% 19 18
5% 21 18
10% 24 18 Obs 40790
25% 30 18 Sum of Wgt. 40790
50% 42 Mean 45.14798
Largest Std. Dev. 17.53519
75% 58 89
90% 71 89 Variance 307.4828
95% 77 89 Skewness .4774907
99% 86 89 Kurtosis 2.239618
HIGHEST YEAR OF SCHOOL COMPLETED
-------------------------------------------------------------
Percentiles Smallest
1% 3 0
5% 7 0
10% 8 0 Obs 40806
25% 11 0 Sum of Wgt. 40806
50% 12 Mean 12.48152
Largest Std. Dev. 3.176226
75% 14 20
90% 16 20 Variance 10.08841
95% 18 20 Skewness -.3389303
99% 20 20 Kurtosis 3.960311
HIGHEST YEAR SCHOOL COMPLETED, FATHER
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 3 0
10% 4 0 Obs 29347
25% 8 0 Sum of Wgt. 29347
50% 11 Mean 10.20994
Largest Std. Dev. 4.342143
75% 12 20
90% 16 20 Variance 18.85421
95% 17 20 Skewness -.1628909
99% 20 20 Kurtosis 2.826482
HIGHEST YEAR SCHOOL COMPLETED, MOTHER
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 3 0
10% 6 0 Obs 34151
25% 8 0 Sum of Wgt. 34151
50% 12 Mean 10.41478
Largest Std. Dev. 3.709352
75% 12 20
90% 14 20 Variance 13.75929
95% 16 20 Skewness -.6324499
99% 18 20 Kurtosis 3.605715
HIGHEST YEAR SCHOOL COMPLETED, SPOUSE
-------------------------------------------------------------
Percentiles Smallest
1% 4 0
5% 7 0
10% 8 0 Obs 22780
25% 12 0 Sum of Wgt. 22780
50% 12 Mean 12.53095
Largest Std. Dev. 3.103418
75% 14 20
90% 16 20 Variance 9.631203
95% 18 20 Skewness -.287755
99% 20 20 Kurtosis 4.051822
TOTAL FAMILY INCOME
-------------------------------------------------------------
Percentiles Smallest
1% 1 1
5% 3 1
10% 5 1 Obs 37480
25% 9 1 Sum of Wgt. 37480
50% 11 Mean 9.75619
Largest Std. Dev. 2.994967
75% 12 13
90% 12 13 Variance 8.969825
95% 13 13 Skewness -1.29205
99% 13 13 Kurtosis 3.759778
.
--
Paul E. Johnson email: pauljohn at ku.edu
Dept. of Political Science http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas Office: (785) 864-9086
Lawrence, Kansas 66044-3177 FAX: (785) 864-5700
On Tue, 21 Sep 2004, Paul Johnson wrote:
> Greetings Everybody:
>
> I generated a 1.2MB dta file based on the general social survey with Stata8
> for linux. The file can be re-opened with Stata, but when I bring it into R,
> it says all the values are missing for most of the variables.

You need read.dta( ,convert.factors=FALSE)

You have variables with labels for some, but not all, of their values. When
these are converted to R factors you lose the unlabelled values. R does not
have a data type that is sometimes labelled and sometimes numeric.

When you use convert.factors=FALSE the label information is still read in
and returned as an attribute of the data frame, so you can set individual
variables to be factors.

	-thomas

> This dataset is called "morgen.dta" and I dropped a copy online in case you
> are interested
>
> http://www.ku.edu/~pauljohn/R/morgen.dta
>
[snip]

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle
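
The loss of unlabelled values can be reproduced without the file. What follows is only a minimal sketch in base R: the vector `hrs` and the label set `labs` are made up for illustration and are not taken from morgen.dta, and the `factor()` call is an approximation of what converting a partially labelled variable amounts to, not a copy of what read.dta does internally.

```r
# Hypothetical data: real hours worked plus two special codes (98, 99).
hrs <- c(40, 37, 98, 60, 99)

# Hypothetical partial label set: only the special codes carry labels.
labs <- c(DK = 98, "NA" = 99)

# Build a factor whose levels are just the labelled codes.
f <- factor(hrs, levels = labs, labels = names(labs))

# Values without a label match no level, so they turn into missing values.
print(f)
print(sum(is.na(f)))   # number of values lost in the conversion

# Left as a plain numeric vector, the real values are all still there.
print(summary(hrs))
```

When a file has had its special codes turned into Stata missing values already, none of the remaining values match a label, which would be consistent with every level showing a count of 0 in the summary above.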
I've had a similar problem once. What may have caused the problem then was a
variate for which value labels had been defined for the highest and lowest
values. What complicated things was that the file had originally been
converted from SPSS to Stata. A workaround was to set convert.factors=FALSE,
and that seems to work here too (using R 1.9.1 and the latest update for
foreign):

> m2<-read.dta("morgen.dta",convert.factors=FALSE)
> summary(m2)
     CASEID              year            id            hrs1
 Min.   :   19721   Min.   :1972   Min.   :   1   Min.   :    0.00
 1st Qu.: 1983475   1st Qu.:1978   1st Qu.: 445   1st Qu.:   37.00
 Median : 1996808   Median :1987   Median : 905   Median :   40.00
 Mean   : 9963040   Mean   :1986   Mean   : 990   Mean   :   41.05
 3rd Qu.:19872187   3rd Qu.:1994   3rd Qu.:1358   3rd Qu.:   48.00
 Max.   :20002817   Max.   :2000   Max.   :3247   Max.   :   89.00
                                                  NA's   :17654.00
      hrs2             prestige            agewed              age
 Min.   :    0.00   Min.   :   12.00   Min.   :   12.00   Min.   : 18.00
 1st Qu.:   38.00   1st Qu.:   30.00   1st Qu.:   19.00   1st Qu.: 30.00
 Median :   40.00   Median :   39.00   Median :   21.00   Median : 42.00
 Mean   :   39.79   Mean   :   39.36   Mean   :   22.10   Mean   : 45.15
 3rd Qu.:   45.00   3rd Qu.:   48.00   3rd Qu.:   24.00   3rd Qu.: 58.00
 Max.   :   89.00   Max.   :   82.00   Max.   :   73.00   Max.   : 89.00
 NA's   :40159.00   NA's   :16666.00   NA's   :15551.00   NA's   :143.00
      educ             paeduc              maeduc             speduc
 Min.   :  0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.00
 1st Qu.: 11.00   1st Qu.:    8.00   1st Qu.:   8.00   1st Qu.:   12.00
 Median : 12.00   Median :   11.00   Median :  12.00   Median :   12.00
 Mean   : 12.48   Mean   :   10.21   Mean   :  10.41   Mean   :   12.53
 3rd Qu.: 14.00   3rd Qu.:   12.00   3rd Qu.:  12.00   3rd Qu.:   14.00
 Max.   : 20.00   Max.   :   20.00   Max.   :  20.00   Max.   :   20.00
 NA's   :127.00   NA's   :11586.00   NA's   :6782.00   NA's   :18153.00
     income
 Min.   :   1.000
 1st Qu.:   9.000
 Median :  11.000
 Mean   :   9.756
 3rd Qu.:  12.000
 Max.   :  13.000
 NA's   :3453.000
>

--- Paul Johnson <pauljohn at ku.edu> wrote:
> Greetings Everybody:
>
> I generated a 1.2MB dta file based on the general social survey with
> Stata8 for linux. The file can be re-opened with Stata, but when I bring
> it into R, it says all the values are missing for most of the variables.
>
> This dataset is called "morgen.dta" and I dropped a copy online in case
> you are interested
>
> http://www.ku.edu/~pauljohn/R/morgen.dta
>
[snip]
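
Once the data are read with convert.factors=FALSE, a fully labelled variable such as income can be turned back into a factor one at a time. The sketch below is an illustration under assumptions, not a tested recipe for morgen.dta: it assumes the labels are stored in the "val.labels" and "label.table" attributes of the data frame, as I understand the foreign documentation to describe, and it uses a small hand-built stand-in for the data frame so that it runs without the file. The helper name `label_as_factor`, the codes and the label names LOW/MID/HIGH are all hypothetical.

```r
# Stand-in for the result of read.dta(..., convert.factors = FALSE).
# Values, label names and attribute layout are assumptions for illustration.
m2 <- data.frame(income = c(3, 2, 3, 1, NA),
                 hrs1   = c(40, 37, NA, 60, 45))
attr(m2, "val.labels")  <- c("income", "")   # label set used by each variable
attr(m2, "label.table") <- list(income = c(LOW = 1, MID = 2, HIGH = 3))

# Hypothetical helper: convert one variable to a factor, but only when
# every observed value has a label, so that nothing is silently lost.
label_as_factor <- function(df, var) {
  set_name <- attr(df, "val.labels")[match(var, names(df))]
  if (is.na(set_name) || set_name == "") {
    stop("no label set attached to ", var)
  }
  tab <- attr(df, "label.table")[[set_name]]
  x <- df[[var]]
  observed <- unique(x[!is.na(x)])
  if (!all(observed %in% tab)) {
    stop("some values of ", var, " are unlabelled; keep it numeric")
  }
  factor(x, levels = tab, labels = names(tab))
}

inc <- label_as_factor(m2, "income")
print(table(inc, useNA = "ifany"))
```

Calling the helper on hrs1 in this stand-in stops with an error instead of returning a factor full of NA's, which is the behaviour one would want for the partially labelled variables in this thread.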