Data Quality for Analytics Defined
Part I - Data Quality Defined
- This paper uses a case study from statistical analysis in the sports area to illustrate the different data quality criteria:
- “Data Availability” starts with the question whether data are available in general.
- “Data Quantity” examines whether the amount of data is sufficient for the analysis.
- “Data Completeness” deals with the fact that available data fields may contain missing values.
- “Data Correctness” checks whether the available data are correct with respect to their definition.
- “Considerations for Predictive Modeling” discusses special requirements of predictive modeling methods.
- “Specifics of Analytics” shows additional requirements that arise from the interdependences between analytical methods and the data.
- “Process Considerations” finally shows the process aspect of data quality and also discusses aspects such as data relevancy and possible alternatives.
- The paper then uses the specifics of the case study to scope and define the topic “Data Quality for Analytics” and builds a bridge between the data quality problems in the case study and general data quality topics.
- Data Quality Definition
- Data quality in the context of this book is the degree of excellence of data to precisely and comprehensively describe the practical situation of interest in an unbiased and complete way. The data shall be appropriate to answer the business or functional question of interest without a reduction of the scope of the question and the applicability of the results. The data have a good status with respect to features such as availability, completeness, correctness, timeliness, sufficient quantity and stability over time.
- The data are also well suited to the analytical methods that shall be employed on them in order to answer the business or functional question. The data therefore comply with the analytical requirements of the respective methods. For predictive modeling and time series forecasting, the data have predictive power with respect to the values that shall be predicted or forecasted.
- Contribution of Analytics to Data Quality
- Profiling of the structure of missing values
- Imputing missing values
- Calculating a representative replacement value
- Advanced Outlier Detection: time series, predictive modeling, cluster analysis
- Quick assessment of variable importance
- Sample Size Planning
- Similarity analysis (de-duplication, record matching)
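
The first three contributions listed above (profiling the structure of missing values, imputing them, and calculating a representative replacement value) can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not taken from the paper. The records, the field names and the helper functions `missing_patterns` and `impute_median` are hypothetical, and the median is only one possible choice of replacement value.

```python
from collections import Counter
from statistics import median

# Hypothetical example records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": None, "income": 48000, "region": "south"},
    {"age": 29, "income": None, "region": None},
    {"age": 41, "income": 61000, "region": "north"},
    {"age": None, "income": None, "region": "south"},
]


def missing_patterns(rows, fields):
    """Count how often each combination of missing fields occurs."""
    return Counter(
        tuple(f for f in fields if row[f] is None) for row in rows
    )


def impute_median(rows, field):
    """Return copies of the rows with missing values of `field`
    replaced by the median of the observed values (an assumed choice)."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = median(observed)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]


fields = ["age", "income", "region"]

# Profile the structure of missing values: which fields are missing together?
patterns = missing_patterns(records, fields)
for pattern, count in patterns.most_common():
    print(pattern if pattern else "(complete)", count)

# Impute missing ages with a representative replacement value.
imputed = impute_median(records, "age")
print([row["age"] for row in imputed])
```

Counting patterns rather than per-field totals shows which fields tend to be missing together, which is the structural information that simple missing-value counts hide. The imputation returns new rows and leaves the original records unchanged, so the profile can still be compared before and after.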