An organization has asked me to comment on the validity of their
recent all-employee survey. Survey responses, by geographic region,
compared with the total number of employees in each region, were as
follows:

> ByRegion
          All.Employees Survey.Respondents
Region_1            735                142
Region_2            500                 83
Region_3            897                 78
Region_4            717                133
Region_5            167                 48
Region_6            309                  0
Region_7            806                125
Region_8            627                122
Region_9            858                177
Region_10           851                160
Region_11           336                 52
Region_12          1823                312
Region_13            80                  9
Region_14           774                121
Region_15           561                 24
Region_16           834                134

How well does the survey represent the employee population?
The chi-square test says: not very well.

> chisq.test(ByRegion)

        Pearson's Chi-squared test

data:  ByRegion
X-squared = 163.6869, df = 15, p-value < 2.2e-16

By striking three under-represented regions (3, 6, and 15), we get
a more reasonable, although still not convincing, result:

> chisq.test(ByRegion[setdiff(1:16,c(3,6,15)),])

        Pearson's Chi-squared test

data:  ByRegion[setdiff(1:16, c(3, 6, 15)), ]
X-squared = 22.5643, df = 12, p-value = 0.03166

This poses several questions:

1) Looking at a side-by-side barchart (proportion of responses vs.
proportion of employees, per region), the pattern of survey responses
appears, visually, to match the pattern of employees fairly well. Is
this a case where we trust the numbers and not the picture?

2) Part of the problem, ironically, is that there were too many
responses to the survey. If we had only one-tenth the responses, but in
the same proportions by region, the chi-square statistic would look much
better (though with a warning about possible inaccuracy):

data:  data.frame(ByRegion$All.Employees, 0.1 * (ByRegion$Survey.Respondents))
X-squared = 17.5912, df = 15, p-value = 0.2848

Is there a way of reconciling a large response rate with an
unrepresentative response profile? Or is the bad news that the survey
will give very precise results about a very ill-specified
sub-population?

(Of course, I would put it in softer terms, like "you need to assess the
degree of homogeneity across different regions".)

3) Is chi-squared really the right measure of how representative the
survey is?

<<<<<<< >>>>>>>>>

Thanks for any help you can give - hope these questions make sense -

George H.
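For readers following along, here is a minimal, reproducible sketch of the setup behind questions 1 and 2. The data frame is re-entered from the table in the post. The names `shares`, `full`, `tenth` and `ratio` are illustrative and not from the original message. Note that the test below uses the goodness-of-fit form (respondents against employee shares), which is an assumption and not the same call as `chisq.test(ByRegion)` in the post.

```r
# Data re-entered from the posted table.
ByRegion <- data.frame(
  All.Employees      = c(735, 500, 897, 717, 167, 309, 806, 627,
                         858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)
emp  <- ByRegion$All.Employees
resp <- ByRegion$Survey.Respondents

# Question 1: the two sets of proportions a side-by-side barchart compares,
# plus the response rate within each region.
shares <- data.frame(
  Employee.Share   = emp / sum(emp),
  Respondent.Share = resp / sum(resp),
  Response.Rate    = resp / emp,
  row.names = rownames(ByRegion)
)
print(round(shares, 3))

# Question 2: with the regional proportions held fixed, Pearson's
# goodness-of-fit statistic is proportional to the number of respondents,
# so one-tenth the responses gives one-tenth the statistic.
full  <- chisq.test(resp, p = emp / sum(emp))
# Small expected counts trigger the approximation warning; silence it here.
tenth <- suppressWarnings(chisq.test(0.1 * resp, p = emp / sum(emp)))
ratio <- unname(full$statistic / tenth$statistic)
print(ratio)
```

The first two columns of `shares` are what the barchart displays: `barplot(t(as.matrix(shares[, 1:2])), beside = TRUE)` draws them side by side. The ratio illustrates why a large sample makes even modest departures from proportionality "significant": the statistic measures evidence against exact proportionality, not the size of the departure.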
gheine wrote on 10/11/2011 02:31:46 PM:

> By striking three under-represented regions (3,6, and 15), we get
> a more reasonable, although still not convincing, result:
>
> > chisq.test(ByRegion[setdiff(1:16,c(3,6,15)),])
>
> Pearson's Chi-squared test
>
> data: ByRegion[setdiff(1:16, c(3, 6, 15)), ]
> X-squared = 22.5643, df = 12, p-value = 0.03166

You can't simply eliminate the three regions with the lowest response
rates (3, 6, and 15). These are the three largest contributors to the
chi-squared statistic, precisely because fewer people in those regions
responded than expected. In addition, more people in regions 1, 5, and 9
responded than expected. This should be clear in a bar chart, and the
chi-squared test confirms it.

Jean
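Jean's point about which regions drive the statistic can be checked directly. The sketch below is illustrative only: it assumes the expected count for a region is the total number of respondents times that region's share of employees, and the names `expected`, `contrib`, `out` and `top3` are not from the thread.

```r
# Data re-entered from the posted table.
ByRegion <- data.frame(
  All.Employees      = c(735, 500, 897, 717, 167, 309, 806, 627,
                         858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)
emp  <- ByRegion$All.Employees
resp <- ByRegion$Survey.Respondents

# Respondents expected in each region if everyone responded at the same rate.
expected <- sum(resp) * emp / sum(emp)

# Each region's contribution to Pearson's statistic: (O - E)^2 / E.
contrib <- (resp - expected)^2 / expected
names(contrib) <- rownames(ByRegion)

out <- data.frame(
  Observed     = resp,
  Expected     = round(expected, 1),
  Contribution = round(contrib, 2),
  Direction    = ifelse(resp > expected, "over", "under"),
  row.names    = rownames(ByRegion)
)

# Largest contributors first.
print(out[order(-contrib), ])

top3 <- names(sort(contrib, decreasing = TRUE))[1:3]
print(top3)
```

Sorting by contribution makes it easy to see that dropping the worst-fitting regions removes exactly the evidence the test is reporting.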
George,

Perhaps the site of the RISQ project (Representativity Indicators for
Survey Quality) might be of use: http://www.risq-project.eu/ . They also
provide R code to calculate their indicators.

HTH,
Jan

Quoting gheine at mathnmaps.com:

> An organization has asked me to comment on the validity of their
> recent all-employee survey.
> [...]
> 3) Is Chi-squared really the right measure of how representative is the
> survey?
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
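As a rough illustration of the kind of indicator Jan is pointing to, here is a sketch of an R-indicator of the form R = 1 - 2 * S(rho), where S(rho) is the standard deviation of the response propensities. This is not the RISQ project's own code. It assumes region is the only auxiliary variable, so each employee's estimated propensity is simply the response rate of their region, and it uses the population size as the denominator of the variance; both are assumptions made for the example.

```r
# Data re-entered from the posted table.
ByRegion <- data.frame(
  All.Employees      = c(735, 500, 897, 717, 167, 309, 806, 627,
                         858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)
emp  <- ByRegion$All.Employees
resp <- ByRegion$Survey.Respondents

# Estimated response propensity for every employee in a region
# (assumption: region is the only auxiliary information available).
rho     <- resp / emp
rho_bar <- sum(resp) / sum(emp)   # overall response rate

# Standard deviation of the propensities over the whole employee population,
# i.e. regional values weighted by the number of employees.
S_rho <- sqrt(sum(emp * (rho - rho_bar)^2) / sum(emp))

# R-indicator: 1 means identical propensities (fully representative with
# respect to region); lower values mean more variation in propensities.
R_indicator <- 1 - 2 * S_rho
print(R_indicator)
```

Unlike the chi-square p-value, this quantity does not grow or shrink merely because the number of respondents changes; it depends only on how much the response rates vary between regions.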
The chisq.test function expects a contingency table: one column should
hold the count of respondents and the other the count of
non-respondents. Yours appears to hold the total instead of the
non-respondents, so the input to the test is wrong to begin with.

A significant chi-square here just means that the proportion responding
differs among some of the regions; by itself it does not tell you
whether the sample is representative or not. What is more important (and
not in the data or in standard tests) is whether there is a relationship
between why someone chose to respond and the outcomes of interest.

If you are concerned about the different proportions responding, you
could use post-stratification to correct for the inequality when
computing other summaries or tests. Region 6 will still give you
problems: you will need to make some assumptions, possibly combining it
with another region that is similar.

Throwing away data is rarely, if ever, beneficial.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of gheine at mathnmaps.com
> Sent: Tuesday, October 11, 2011 1:32 PM
> To: r-help at r-project.org
> Subject: [R] Chi-Square test and survey results
>
> An organization has asked me to comment on the validity of their
> recent all-employee survey.
> [...]
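Greg's two suggestions can be sketched as follows. The first part builds the table in the shape chisq.test expects (respondents vs. non-respondents). The second part computes simple post-stratification weights as population share divided by sample share. Merging Region_6 into Region_7 is purely hypothetical, chosen here only so that every stratum has respondents; which regions are actually similar is a judgment the analyst has to make. The names `tab`, `stratum`, `N_h`, `n_h` and `w_h` are illustrative.

```r
# Data re-entered from the posted table.
ByRegion <- data.frame(
  All.Employees      = c(735, 500, 897, 717, 167, 309, 806, 627,
                         858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)
emp  <- ByRegion$All.Employees
resp <- ByRegion$Survey.Respondents

# 1) Contingency table: respondents vs. non-respondents, one row per region.
tab <- cbind(Respondents = resp, Non.Respondents = emp - resp)
rownames(tab) <- rownames(ByRegion)
print(chisq.test(tab))

# 2) Post-stratification weights.
# Region_6 has no respondents, so it cannot form its own stratum.
# Hypothetical choice for illustration: pool it with Region_7.
stratum <- rownames(ByRegion)
stratum[stratum == "Region_6"] <- "Region_7"

N_h <- tapply(emp, stratum, sum)    # employees per stratum
n_h <- tapply(resp, stratum, sum)   # respondents per stratum

# Weight per respondent: stratum's population share over its sample share.
# Under-represented strata get weights above 1, over-represented below 1.
w_h <- (N_h / sum(N_h)) / (n_h / sum(n_h))
print(round(w_h, 2))
```

Each respondent in a stratum would carry that stratum's weight when computing survey summaries, so the weighted sample matches the employee distribution across the (pooled) regions. This corrects only for differences between regions; it does nothing about differences between respondents and non-respondents within a region.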