Consequences of Poor Data Quality -- Simulation Studies
Part III - Consequences of Poor Data Quality -- Simulation Studies
- This part closes the loop by showing the possible consequences of poor data quality, for two reasons:
- It shows the consequences that poor data quality has on model performance. This information can support the decision whether to run an analysis on the specific data.
- The chapters also highlight the consequences from an accuracy and a monetary point of view, using a business case calculation. This information can support the decision whether to invest additional effort in data quality improvement.
- The simulation studies concentrate on two topics: predictive modeling and time series forecasting.
- For predictive modeling, the prediction of a binary event variable with a logistic regression model is analyzed.
- For time series forecasting, time series aggregated on a monthly basis are analyzed. The forecasting models used in this context depend only on the forecast variable itself. No additional explanatory variables are used as covariates in these models.
- For the two analytical domains, predictive modeling and time series forecasting, the following data quality criteria are analyzed:
- Random and systematic missing values in the input data.
- Random and systematic errors in the input data and the target variables.
- A reduction of the available data quantity. Quantity in this regard refers to the number of observations and the number of events in predictive modeling, and to the length of the available history in time series forecasting.
- The simulation studies in the following chapters indicate the size of the effect of data that do not fully meet the completeness, correctness, availability, and quantity criteria. The simulation studies answer the following questions:
- Does it make sense to run analyses on data that do not fulfill the data quality criteria?
- How much trust should be placed in results that are produced from these data?
- What is the expected loss in accuracy when dealing with such data?
- Which criteria have a strong effect on forecast accuracy? Where should the effort in data collection and data quality improvement go in order to improve forecast quality?
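
The distinction between random and systematic missing values listed among the criteria above can be illustrated with a minimal sketch. This is not the simulation code used in the chapters. It is a Python illustration under stated assumptions: the function names, the synthetic `incomes` variable, the missing rates, and the threshold are all hypothetical. Random missing values are removed with the same probability for every observation, while systematic missing values are removed only where the value itself exceeds a threshold.

```python
import random


def inject_random_missing(values, rate, rng):
    """Replace each value with None with the same probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def inject_systematic_missing(values, rate, threshold, rng):
    """Replace only values above `threshold` with None (probability `rate`).

    Missingness depends on the value itself, so it is not random.
    """
    return [None if v > threshold and rng.random() < rate else v for v in values]


def mean_of_observed(values):
    """Mean over the non-missing values only."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


# Hypothetical input variable: synthetic data, seeded for reproducibility.
rng = random.Random(42)
incomes = [rng.gauss(50_000, 10_000) for _ in range(10_000)]

random_missing = inject_random_missing(incomes, rate=0.3, rng=rng)
systematic_missing = inject_systematic_missing(
    incomes, rate=0.6, threshold=55_000, rng=rng
)

true_mean = mean_of_observed(incomes)
print(f"complete data:      {true_mean:10.0f}")
print(f"random missing:     {mean_of_observed(random_missing):10.0f}")
print(f"systematic missing: {mean_of_observed(systematic_missing):10.0f}")
```

In this sketch the randomly degraded data still yield a mean close to that of the complete data, whereas the systematically degraded data understate it, because only high values were removed. A simulation study in the spirit of this part would repeat such degradations at several rates and compare model performance on each degraded version.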