Richard and Barbara Males
2008-May-15 19:27 UTC
[R] lattice histogram problem with integers values and nint
I have been puzzling over this for a day.

Summary

I have an integer variable to use with histogram, 170,000 rows. The value is day of year. hist works; lattice histogram with nint does not work (spurious spikes in the display); lattice histogram using breaks=c(0:365) works fine. The spike values appear to be the sum of two adjacent bins. I would like to know whether this is a familiar problem, and what the recommended work-around is. I would also like to know how to get the bin count from the lattice histogram object, as I would with hist$count.

Thanks in advance.

Detail

I have a dataset of approximately 170,000 rows, with a DayOfYear field. I want a histogram of the number of rows in each day of the year. I set up breaks from 0:365, and use this with hist and with the lattice histogram, e.g.

    histogram(dfTemp$DayOfYear,breaks=breaklist,type="count")

If I use hist to display this, all values are under 600 and everything is fine.

If I use lattice histogram on the full 365 days, either with nint=365 or with breaks set from (0:366), I get 26 equally spaced spurious peaks above 800 (that is, 26 days reported with bad values). A table command on this field shows me that the highest count of rows in a day is 553. When I do hist (not histogram), the plot looks fine. When I do plot(table(dataframe$DayOfYear)), the plot looks fine. If I subset the data to look only at days below 340, the plot looks fine. At days below 341, I get one of these spikes, at about 170 days, going up to about 900. With a subset of days below 342, I get two spikes, both over 900.

If I set breaks to 0:366 and add a small increment to my integer values, e.g.

    histogram(df2006NonRecVessels$DayOfYear+.001,breaks=breaklist,type="count")

all is well under this approach.

I have attempted to search to see if this is a known problem, but don't find anything.
Also, I can get the count in each bin for hist as

    > xx=hist(df2006NonRecVessels$DayOfYear,breaks=breaklist)
    > xx$count # this gives me the counts

I am unclear how to get equivalent information on bin contents from the lattice-generated histogram object. It appears to be in panel.args, but I am unclear on the exact syntax.

Richard M. Males
Cincinnati, Ohio, USA
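One possible way to keep each day in its own bin, sketched here with base R only, is to put the breaks at the half-integers so that no data value ever sits on a bin edge. The data below are simulated and the names day_of_year, per_day and unit_breaks are made up for illustration; they are not the poster's data frame or variables.

```r
# Simulated stand-in for the DayOfYear column (hypothetical data)
set.seed(1)
day_of_year <- sample(1:365, size = 170000, replace = TRUE)

# Reference counts per day; factor() keeps a slot for every day,
# even if some day happened to have no rows
per_day <- table(factor(day_of_year, levels = 1:365))

# Breaks at 0.5, 1.5, ..., 365.5: each bin holds exactly one integer day,
# and no integer value falls on a bin boundary
unit_breaks <- seq(0.5, 365.5, by = 1)
h <- hist(day_of_year, breaks = unit_breaks, plot = FALSE)

# The histogram counts agree with table() for every day
all(h$counts == as.vector(per_day))
max(h$counts)
```

The same unit_breaks vector can be passed to the lattice function as histogram(~ day_of_year, breaks = unit_breaks, type = "count"). Because the integers sit in the middle of each bin, the result does not depend on whether bins are closed on the left or on the right.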
Charilaos Skiadas
2008-May-15 19:43 UTC
[R] lattice histogram problem with integers values and nint
Two comments.

First of all, I don't see how you can be sure that, if you specify 365 bins, each bin will contain exactly one day. For that to hold, each bin must have width exactly 1, and you don't tell lattice to use such a width, so it is likely choosing something else. In fact, if you save your histogram in a temporary variable, say "pl", then the following will show you where lattice puts the breaks:

    pl$panel.args.common$breaks

In the example I tried, the difference between any two consecutive breaks was 1.077041.

The second point is that this would probably be better done with an xyplot, using type="h". So, assuming that x has those 170000 values and that all days occur, the following might be more satisfactory:

    xyplot(table(x)~1:365, type="h")

Personally, I don't see the benefit in having bars instead of single vertical lines.

Hope this helps,
Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
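The point about break widths, and the open question of how to read bin counts from the lattice object, can both be explored with a short sketch. It assumes simulated data in a vector x (hypothetical, not the poster's data) and that the trellis object stores the breaks in panel.args.common and the plotted values in panel.args[[1]]$x. The counts are recomputed with hist() rather than read from the object, so they should match the drawn bars only as long as the panel function uses the default bin closure.

```r
library(lattice)

# Simulated stand-in for the 170,000 day-of-year values (hypothetical data)
set.seed(1)
x <- sample(1:365, size = 170000, replace = TRUE)

# Build the histogram object without specifying breaks, only nint
pl <- histogram(~ x, nint = 365, type = "count")

# Where lattice put the breaks, and how wide the bins are
b <- pl$panel.args.common$breaks
range(diff(b))

# Recompute the bin counts from the stored data and breaks
counts <- hist(pl$panel.args[[1]]$x, breaks = b, plot = FALSE)$counts

# How many integer days fall into each bin; any bin holding two days
# would show up in the plot as a spike
days_per_bin <- table(cut(1:365, breaks = b, include.lowest = TRUE))
table(days_per_bin)
```

If the bin width comes out larger than 1, as in the width reported above, some bins necessarily collect two consecutive days, which would be consistent with the spikes looking like the sum of two adjacent bins.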