Dear mailing list,

Sorry to bother you, but maybe you can help me out. I have been searching and searching for an appropriate test. I have a huge dataset of loan requests, with data at the portfolio level and an average portfolio size of 200 loans. I want to test whether portfolios are randomly drawn from the overall pool.

The problem is that my data are rather qualitative: I want to characterize whether loans are randomly selected using word counts. For each loan, I have a "sector", an "activity", and a "use description". The "use description" contains about 15 words; the "activity" description is usually only one or two words. What I have done so far is to compute the word counts in the overall pool, which is 110,000 loans. From these, knowing the size of a team portfolio, I can compute the expected frequency with which certain keywords should appear. The "sector" variable is categorical and can take only 17 values, whereas I found 180 different words used as activity descriptions in the overall distribution.

I now want to do some kind of goodness-of-fit test to see whether the portfolios are randomly selected or not. I would expect that certain portfolios are indeed randomly selected, whereas others aren't. I have run Pearson's chi-squared, the Freeman–Tukey test, and the G-test of goodness of fit. The problem is that these tests are constructed for categorical data, but if I use the "activity" word counts, the data need not be categorical. So I am wondering whether the tests are still appropriate. I may have a portfolio of 200 loans in which certain words never appear, and in that case I am not sure which degrees of freedom to use. Should I use the prescribed 179 degrees of freedom, since I have 180 "categories"? But these aren't real categories...

An example may look as follows; the word is on the left, followed by the observed and expected word counts:

     +---------------------------------------+
     | word             observed    expected |
     |---------------------------------------|
  1. | food                   54   57.511776 |
  2. | retail                 46    49.04432 |
  3. | agriculture            39   36.557732 |
  4. | services               23   15.867387 |
  5. | clothing               13   14.126975 |
     |---------------------------------------|
  6. | transportation         10   6.5851929 |
  7. | housing                 3     4.65019 |
  8. | construction            2   4.3173841 |
  9. | arts                    5   4.2500955 |
 10. | manufacturing           1   3.0170768 |
     |---------------------------------------|
 11. | health                  2    1.751323 |
 12. | use                     0   .55215646 |
 13. | personal                0   .25241221 |
 14. | education               2   .68743521 |
 15. | wholesale               0   .11241227 |
     |---------------------------------------|
 16. | entertainment           0   .32743521 |
 17. | green                   0   .42743782 |
     +---------------------------------------+

I can have R calculate the chi-squared statistic from this, but should I then use 16 degrees of freedom (17 "categories" minus one)? The problem is that this is not really categorical data! Do I instead have to make comparisons on a word-by-word basis, treating each word as a Bernoulli variable? I have been looking for other goodness-of-fit tests for this kind of data for days now, but I can't really find any.

I really appreciate your thoughts.

Best,
Thiemo

---
http://freigeist.devmag.net
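To make the question concrete, here is a small sketch of what I compute (in Python rather than R, purely for illustration; the logic is the same). The counts are the example table above. The Monte Carlo p-value at the end is one workaround I have been considering: instead of picking degrees of freedom, I resample portfolios of the same size under the null (multinomial with probabilities proportional to the expected counts) and compare the simulated statistics to the observed one.

```python
import random

# Observed counts in one example portfolio of 200 loans, and expected
# counts derived from word frequencies in the overall pool (the table above).
observed = {
    "food": 54, "retail": 46, "agriculture": 39, "services": 23,
    "clothing": 13, "transportation": 10, "housing": 3, "construction": 2,
    "arts": 5, "manufacturing": 1, "health": 2, "use": 0, "personal": 0,
    "education": 2, "wholesale": 0, "entertainment": 0, "green": 0,
}
expected = {
    "food": 57.511776, "retail": 49.04432, "agriculture": 36.557732,
    "services": 15.867387, "clothing": 14.126975, "transportation": 6.5851929,
    "housing": 4.65019, "construction": 4.3173841, "arts": 4.2500955,
    "manufacturing": 3.0170768, "health": 1.751323, "use": 0.55215646,
    "personal": 0.25241221, "education": 0.68743521, "wholesale": 0.11241227,
    "entertainment": 0.32743521, "green": 0.42743782,
}

words = list(observed)

def chi2_stat(obs, exp):
    """Pearson chi-squared statistic: sum over cells of (O - E)^2 / E."""
    return sum((obs[w] - exp[w]) ** 2 / exp[w] for w in words)

stat = chi2_stat(observed, expected)

# Monte Carlo p-value: draw portfolios of the same total size from the null
# (multinomial with cell probabilities proportional to the expected counts)
# and count how often the simulated statistic is at least as extreme.
n = sum(observed.values())
total_exp = sum(expected.values())
probs = [expected[w] / total_exp for w in words]

def simulate_once(rng):
    """One simulated portfolio of n loans drawn under the null."""
    counts = dict.fromkeys(words, 0)
    for _ in range(n):
        r = rng.random()
        cum = 0.0
        for w, p in zip(words, probs):
            cum += p
            if r < cum:
                counts[w] += 1
                break
        else:
            counts[words[-1]] += 1  # guard against floating-point rounding
    return chi2_stat(counts, expected)

rng = random.Random(42)
sims = [simulate_once(rng) for _ in range(2000)]
p_value = sum(s >= stat for s in sims) / len(sims)
print("chi2 =", round(stat, 2), "  simulated p =", round(p_value, 3))
```

(This is essentially what R's `chisq.test(..., simulate.p.value = TRUE)` does, as far as I understand, so it sidesteps the degrees-of-freedom question, though it does not resolve whether treating word counts as categories is sound in the first place.)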