jonathanbriggs
2009-Jun-02 15:46 UTC
[R] Help deciding on data format for sales data (newbie)
Dear All,

I am beginning data mining and need some help working out the best way to represent my data. I have searched here and online and not found any real help.

Imagine that I have a file of order (sales) data:

OrderNo  CustomerNo  ItemsInOrder
1        1           a,b,c
2        1           d
3        2           a,d

I can represent this as a data.frame, but then I need to parse ItemsInOrder, which seems quite clumsy. Alternatively, I can try this sort of representation:

OrderNo  CustomerNo  a   b   c   d
1        1           1   1   1   NA
2        1           NA  NA  NA  1
3        2           1   NA  NA  1

Are these really the only two choices, and how well does the second representation scale? (I have 50,000 SKUs.)

Can anyone point me to some worked examples that take such data and manipulate it, looking for association rules and clusters?

Thanks
Jonathan

--
View this message in context: http://www.nabble.com/Help-deciding-on-data-format-for-sales-data-%28newbie%29-tp23835331p23835331.html
Sent from the R help mailing list archive at Nabble.com.
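For reference, the parsing step mentioned in the question can be fairly short in base R. The following is only a sketch on the toy data from the post; the object names (orders, items, long) are made up for illustration, and a real file would first need to be read in with something like read.table.

```r
# Toy data copied from the post (not the real order file).
orders <- data.frame(
  OrderNo      = c(1, 2, 3),
  CustomerNo   = c(1, 1, 2),
  ItemsInOrder = c("a,b,c", "d", "a,d"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of items.
items <- strsplit(orders$ItemsInOrder, ",", fixed = TRUE)

# Repeat each order row once per item, giving one row per (order, item) pair.
n_items <- lengths(items)
long <- data.frame(
  OrderNo    = rep(orders$OrderNo, n_items),
  CustomerNo = rep(orders$CustomerNo, n_items),
  Item       = unlist(items),
  stringsAsFactors = FALSE
)
print(long)
```

The result has one row per item sold, so its size grows with the number of order lines rather than with the number of distinct SKUs.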
Daniel Malter
2009-Jun-02 16:18 UTC
[R] Help deciding on data format for sales data (newbie)
Depending on what you want to do, the second (wide) format may be useful, but it can get very clumsy if there are many potential items. You may want to think about the long format:

OrderNo  Customer  Item
1        1         a
1        1         b
1        1         c
2        1         d
3        2         a
3        2         d

The long format (above) and the wide format (your second example) are the most widely used, and most software packages have functions to reshape a dataset from one into the other.

You may also want to think about reducing the dimensionality of your data. If, say, "a" were a right shoe and "b" were a left shoe, then "a" would almost always be sold with "b". You would then want to create one indicator from the two (admittedly, a pretty stupid example) to reduce the number of unique items. Alternatively, you may be interested in only a subset of the items (e.g. shoes only).

Hope this helps,
Daniel

-------------------------
cuncta stricte discussurus
-------------------------
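As a rough illustration of the two suggestions in this reply, the sketch below reshapes the toy long-format data to a wide order-by-item table and back, and then merges items "a" and "b" into one group. It uses base R only; the mapping item_group and the label "ab" are hypothetical, chosen just to mirror the shoe example. Note that table() fills absent items with 0 rather than NA.

```r
# Toy long-format data from the reply (one row per order/item pair).
long <- data.frame(
  OrderNo    = c(1, 1, 1, 2, 3, 3),
  CustomerNo = c(1, 1, 1, 1, 2, 2),
  Item       = c("a", "b", "c", "d", "a", "d"),
  stringsAsFactors = FALSE
)

# Long -> wide: cross-tabulate orders (rows) against items (columns).
wide <- unclass(table(long$OrderNo, long$Item))
print(wide)

# Wide -> long: recover the (order, item) pairs from the non-zero cells.
idx  <- which(wide > 0, arr.ind = TRUE)
back <- data.frame(
  OrderNo = rownames(wide)[idx[, "row"]],
  Item    = colnames(wide)[idx[, "col"]],
  stringsAsFactors = FALSE
)

# Dimensionality reduction: map items that always sell together to one label.
item_group <- c(a = "ab", b = "ab")  # hypothetical mapping for illustration
long$Group <- ifelse(long$Item %in% names(item_group),
                     item_group[long$Item],
                     long$Item)

# One row per (order, group); duplicates created by the merge are dropped.
reduced <- unique(long[, c("OrderNo", "Group")])
print(reduced)
```

With 50,000 SKUs a dense wide table of this kind would be almost entirely zeros, so keeping the data in long format and widening only the subset of items under study is probably the safer default.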