- Why is Data Quality Important
- Real World Data
- Changes in Data Collection Strategies
- Sources of Errors
- Dimensions of Quality
- How to improve Data Quality
- Data Cleaning - Types and Techniques
- The need for a "Human-in-the-loop"
- Cleaning Techniques
Why is Data Quality Important
- Impact on organisational decisions (missing or incorrect data can result in wrong decision making) 
- Legal obligations in certain business scenarios (e.g., plug type information required for selling electric devices in EU) 
- Impact on machine learning models (cleaner data can greatly improve model performance) 
- Potential for causing biased decisions in ML-based applications (not well understood, area of active research) 
- Operational stability: missing and inconsistent data can cause havoc in production systems
- Crashes (e.g., due to “NullPointerExceptions” for missing attributes)
- Wrong predictions (e.g., change of scale in attributes)
 
Real World Data
- Constantly changing
- Often hundreds of attributes
- Data originates from multiple sources / people / teams / systems
- Several potentially inconsistent copies
- Often too large to conveniently handle on a desktop machine
- Often difficult to access (e.g., data compressed and partitioned in a distributed filesystem)
Changes in Data Collection Strategies
Pre-Internet Era
- Data initially collected in transactional, relational databases
- Data Modelled before collection
- Backward-looking analysis of well understood business data
Internet Era
- Collect first, analyze later
- Vast amount of semi-structured data (images, text, audio)
- New data stores (k-v, document, file system: data lakes)
- relaxed consistency (as no distributed transactions)
- enforce few modelling decisions at collection time
- schema-on-read: the application must interpret the data
 
- Storage costs decreased
- Data becomes valuable as input to ML-based applications
Sources of Errors
- Data Entry errors
- Typos in Forms
- Different spellings for the same real-world entity
 
- Measurement errors
- outside interference in measurement process
- configuration/placement of sensors changed
- e-commerce -> mobile network instability
 
 
- Distillation errors
- errors in data summaries (logarithm)
- editorial bias in data summaries
 
- Data integration errors
- Resolution of inconsistencies w.r.t. duplicate entries
- unification of units, measurement periods
 
Dimensions of Quality
- Completeness
- Degree to which data required to describe a real-world object is available
 
- Consistency: Intra-Relation constraints
- Specific data type, interval for a numerical column, set of values for a categorical column
 
- Consistency: Inter-Relation constraints
- Validity of references to other data entries (foreign keys)
 
- Syntactic Accuracy
- values come from the allowed domain, e.g., red, blue (syntactically correct, but may not match the real value)
 
- Semantic Accuracy
- values match the real-world entity, e.g., the item really is size XL and red
 
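These consistency dimensions can be checked mechanically. Below is a minimal sketch with pandas; the example table, column names, value interval, and category set are made-up for illustration and not part of the notes.

```python
import pandas as pd

# Hypothetical example table; columns and constraints are assumptions for illustration.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "price": [19.99, -5.00, 42.50],   # intra-relation: price must lie in [0, 10000]
    "size": ["XL", "M", "XXXL"],      # intra-relation: size must come from a fixed set
    "category_id": [10, 11, 99],      # inter-relation: must reference an existing category
})
categories = pd.DataFrame({"category_id": [10, 11, 12]})

# Intra-relation constraints: numeric interval and categorical domain
bad_price = products[~products["price"].between(0, 10_000)]
bad_size = products[~products["size"].isin({"S", "M", "L", "XL"})]

# Inter-relation constraint: referential integrity (foreign key into `categories`)
dangling = products[~products["category_id"].isin(categories["category_id"])]

print(bad_price, bad_size, dangling, sep="\n")
```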
How to improve Data Quality
Data entry interface design
- Enforce Integrity constraints (e.g. constraints on numeric values, referential integrity)
- Caveat: can force users to "invent" dirty data (still need to care about it)
Organizational management
- Streamlining of processes for data collection and analysis (capture error earlier)
- Capturing of provenance and metadata
Automated data auditing and data cleaning
- Application of automated techniques to identify and rectify data errors
Exploratory data analysis and data cleaning
- a human-in-the-loop approach is necessary most of the time
- the interaction between data visualisation and data cleaning is an iterative process
Data Cleaning - Types and Techniques
Quantitative
- integers or floating-point numbers in different shapes (sets, tensors, time series)
- challenges: unit conversion (especially for volatile units like currency)
- foundation of cleaning techniques: outlier detection
Categorical
- names or codes to assign data into groups, no ordering or distance defined
- common problem: misspelling upon data entry
- foundation of cleaning techniques: normalization/deduplication
Postal Addresses
- special case of categorical data, typically entered as free text
- deduplication
Keys / Identifiers
- ensure referential integrity
The need for a "Human-in-the-loop"
Unrealistic assumption
- existence of error-detecting rules is assumed: integrity constraints, functional dependencies
- focus on the most efficient and accurate way to apply cleaning steps according to these rules
In practice
- EDA + Viz + cleaning
- Iterative approach
- OpenRefine, Trifacta (Alteryx)
Cleaning Techniques
Quantitative Data
- Robust Univariate Outlier Detection
- Robust Centers (robust alternatives to the mean that are less sensitive to outliers)
- Median
- K-trimmed mean
 
- Robust Dispersion
- MAD (median absolute deviation)
- the median of the absolute deviations of all values from the sample median
 
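A minimal sketch of robust univariate outlier detection based on the median and MAD; the 0.6745 constant and the 3.5 cutoff follow the common modified z-score convention and are assumptions added here, not taken from the notes.

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on median and MAD) exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)                      # robust center
    mad = np.median(np.abs(values - center))        # robust dispersion (MAD)
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)   # degenerate case: more than half the values identical
    modified_z = 0.6745 * (values - center) / mad
    return np.abs(modified_z) > threshold

print(mad_outliers([9.8, 10.1, 10.0, 9.9, 55.0]))   # -> [False False False False  True]
```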
Categorical Data
Normalization of String data
- Different spellings (Jérôme vs Jerome)
- Different punctuation (ACME Inc. vs ACME, Inc)
- Typos (Alice->Ailce)
- Misunderstandings (Rupert -> Robert)
Keying function
- Use keying function to form cluster with the computed key
- Automatic repair
- replace all strings in a cluster with the most frequent string in that cluster
 
- Manual repair (OpenRefine implementation)
- show clusters and stats to user
- have user decide about replacement
 
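The keying-function idea can be sketched in a few lines: compute a normalized key per string, cluster strings that share a key, and, for the automatic repair, map every member to the most frequent spelling in its cluster. The keying function below is a simplified fingerprint, not OpenRefine's exact implementation.

```python
import re
from collections import Counter, defaultdict

def fingerprint_key(s: str) -> str:
    # lowercase, drop punctuation, sort the unique tokens (simplified fingerprint keying)
    tokens = re.sub(r"[^\w\s]", " ", s.lower()).split()
    return " ".join(sorted(set(tokens)))

def normalize(strings):
    clusters = defaultdict(list)
    for s in strings:
        clusters[fingerprint_key(s)].append(s)
    repair = {}
    for members in clusters.values():
        # automatic repair: canonical spelling = most frequent string in the cluster
        canonical, _ = Counter(members).most_common(1)[0]
        for s in members:
            repair[s] = canonical
    return [repair[s] for s in strings]

print(normalize(["ACME Inc.", "ACME, Inc", "ACME Inc.", "acme inc"]))
# -> ['ACME Inc.', 'ACME Inc.', 'ACME Inc.', 'ACME Inc.']
```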
Candidate Key Detection
Unique row ratio
- number of distinct values / total number of rows
- problem: frequency outliers (e.g., lots of frequent -999 values)
- often caused by UIs forcing users to invent common dummy values like 0000 or -999
Unique value ratio
- number of unique values (values that occur only once) / number of distinct values
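A small sketch computing both ratios per column with pandas; the example DataFrame and column names are placeholders.

```python
import pandas as pd

def key_candidate_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: unique row ratio (distinct / rows) and unique value ratio (singletons / distinct)."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        n_distinct = len(counts)
        n_singletons = int((counts == 1).sum())
        rows.append({
            "column": col,
            "unique_row_ratio": n_distinct / len(df),                  # close to 1.0 -> key candidate
            "unique_value_ratio": n_singletons / max(n_distinct, 1),   # more robust to frequent dummy values like -999
        })
    return pd.DataFrame(rows).set_index("column")

df = pd.DataFrame({"id": [1, 2, 3, 4], "zip": ["10115", "-999", "-999", "10243"]})
print(key_candidate_stats(df))
```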
Missing value (Imputation with ML)
Missing data
- Missing Completely at Random (MCAR)
- as if we lost some entries by chance
 
- Missing at Random (MAR)
- missingness depends on observed attributes (e.g., job type)
- engineers more likely to not share income than managers
 
- Not Missing at Random (NMAR)
- missingness depends on the missing data itself
- people with low income do not want to share this information
 
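To make the three mechanisms concrete, here is a small simulation sketch using the income/job example from the notes; the probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "job": rng.choice(["engineer", "manager"], size=1000),
    "income": rng.normal(60_000, 15_000, size=1000),
})

# MCAR: every value is dropped with the same probability
mcar = df["income"].mask(rng.random(len(df)) < 0.2)

# MAR: missingness depends on an observed attribute (engineers hide income more often)
p_mar = np.where(df["job"] == "engineer", 0.4, 0.1)
mar = df["income"].mask(rng.random(len(df)) < p_mar)

# NMAR: missingness depends on the (unobserved) value itself (low incomes are hidden)
p_nmar = np.where(df["income"] < 50_000, 0.5, 0.05)
nmar = df["income"].mask(rng.random(len(df)) < p_nmar)

print(mcar.isna().mean(), mar.isna().mean(), nmar.isna().mean())
```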
Handling missing values
- remove examples with missing attributes
- placeholder symbol for categorical missing values
- impute missing values with the mode or mean of the observed values in the column
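A minimal sketch of these baseline strategies with pandas (the table and column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["engineer", "manager", None, "engineer"],
    "income": [52_000, None, 61_000, None],
})

dropped = df.dropna()                                    # remove examples with missing attributes

df["job"] = df["job"].fillna("__MISSING__")              # placeholder symbol for categorical values

df["income"] = df["income"].fillna(df["income"].mean())  # mean imputation for numerical columns
# (use .mode().iloc[0] instead of .mean() for categorical columns)
```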
Imputation as multi-class classification
- Idea: treat the imputation problem as a multi-class classification problem
- Tool: Datawig
- use rows with non-missing data in the target column as training data
- encode table data from feature columns into a numerical representation
- one-hot encoding of categorical columns
- standardisation of numerical columns
- character sequences for textual columns: LSTM (trained encoder output) / n-gram hashing (frequency-based, tf-idf)
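A sketch of this idea with Datawig's SimpleImputer, following the usage pattern from its documentation; file and column names are made-up placeholders, and details of the API may differ between versions.

```python
import datawig
import pandas as pd

df = pd.read_csv("products.csv")          # hypothetical table with a 'color' column to impute

train = df[df["color"].notna()]           # rows with observed target values become training data
missing = df[df["color"].isna()]

imputer = datawig.SimpleImputer(
    input_columns=["title", "description"],   # feature columns (text / categorical)
    output_column="color",                    # target column to impute
    output_path="imputer_model",              # directory for model artefacts
)
imputer.fit(train_df=train)

predictions = imputer.predict(missing)    # returns the frame with imputed values and a confidence score appended
```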

