Architector Data Tools
2017-Aug-04 09:56 UTC
[R] Seeking to validate data quality requirements - should I develop a package?
I am planning to develop an R package to manage all aspects of data quality. I am very experienced in data quality, but fairly new to R. I have tried to find a suitable data quality package, and am surprised not to find much to suit my requirements. Developing the package would be an ambitious effort, involving several contributors (that I have already identified, and who also do not have much R experience yet). So I am seeking some confidence that the effort is worthwhile.

The package will be highly configurable so it can be applied to pretty much any situation, and will implement sophisticated data quality capabilities, including:

(a) DEFINITION: integration with a data dictionary (perhaps metaData), and with highly configurable and expressive data quality rules

(b) MONITORING & DETECTION: automated data quality monitoring and alerting against any data source. Automatically raise and update quality issues

(c) ANALYSIS & ROOT CAUSE: data quality dashboard, alerts, drill-downs, plot trends, including perhaps a machine learning aspect that detects noteworthy events in quality measurements for inclusion in executive reports

(d) WORKFLOW: basic data quality management workflow (i.e. implement 'inbox' and 'actions', probably via Shiny)

The requirements will be drawn from my professional experience (as interim head of data quality at a global bank), although this project is not sponsored either by my employer or any of my consulting clients. I do, however, expect the package to be of interest to financial service organisations who rely on good quality data for their financial and risk models, and for any other process that relies on good data.

To sum up, if anyone can point to a data quality package that means I don't have to develop one that would be great. Alternatively, any comments of support would also be very useful!

David

David Twaddell
Architector Data Tools
Tel: +44 20 3239 1099 | +44 7447 936 984
Web: www.architector.co.uk
Bert Gunter
2017-Aug-04 14:15 UTC
[R] Seeking to validate data quality requirements - should I develop a package?
Sounds like you'll be reinventing square wheels.

Searching "data quality package" on rseek.org brought up many hits.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
Architector Data Tools
2017-Aug-04 16:17 UTC
[R] Seeking to validate data quality requirements - should I develop a package?
Thanks Bert,

I will definitely look through rseek, and reuse wherever possible. Scanning through the first few pages, maybe "datacheck" can provide something. But I have in mind a complete DQ package, a sports car with 4 good wheels ;) and it still seems likely that I will need to develop something at this point.

Regards,
David