- Why is Data Quality Important
- Real World Data
- Changes in Data Collection Strategies
- Sources of Errors
- Dimensions of Quality
- How to improve Data Quality
- Data Cleaning - Types and Techniques
- The need for a "Human-in-the-loop"
- Cleaning Techniques
Why is Data Quality Important
- Impact on organisational decisions (missing or incorrect data can result in wrong decision making) 
- Legal obligations in certain business scenarios (e.g., plug type information required for selling electric devices in EU) 
- Impact on machine learning models (cleaner data can greatly improve model performance) 
- Potential for causing biased decisions in ML-based applications (not well understood, area of active research) 
- Operational stability: missing and inconsistent data can cause havoc in production systems
- Crashes (e.g., due to “NullPointerExceptions” for missing attributes)
- Wrong predictions (e.g., change of scale in attributes)
 
Real World Data
- Constantly changing
- Often hundreds of attributes
- Data originates from multiple sources / people / teams / systems
- Several potentially inconsistent copies
- Often too large to conveniently handle on a desktop machine
- Often difficult to access (e.g., data compressed and partitioned in a distributed filesystem)
Changes in Data Collection Strategies
Pre-Internet Era
- Data initially collected in transactional, relational databases
- Data Modelled before collection
- Backward-looking analysis of well understood business data
Internet Era
- Collect first, analyze later
- Vast amount of semi-structured data (images, text, audio)
- New data stores (k-v, document, file system: data lakes)
- relaxed consistency (as no distributed transactions)
- enforce few modelling decisions at collection time
- schema-on-read: the application must interpret the data
 
- Storage costs decreased
- Data becomes valuable as input to ML-based applications
Sources of Errors
- Data Entry errors
- Typos in Forms
- Different spellings for the same real-world entity
 
- Measurement errors
- outside interference in measurement process
- configuration/placement of sensors changed
- e-commerce -> mobile network instability
 
 
- Distillation errors
- errors in data summaries (logarithm)
- editorial bias in data summaries
 
- Data integration errors
- Resolution of inconsistencies w.r.t. duplicate entries
- unification of units, measurement periods
 
Dimensions of Quality
- Completeness
- Degree to which data required to describe a real-world object is available
 
- Consistency: Intra-Relation constraints
- Specific data type, interval for a numerical column, set of values for a categorical column
 
- Consistency: Inter-Relation constraints
- Validity of references to other data entries (foreign keys)
 
- Syntactic Accuracy
- values come from the allowed domain, e.g., red, blue (syntactically correct, but may not match the real value)
 
- Semantic Accuracy
- values match the real-world entity, e.g., the item really is size XL and red
 
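These consistency dimensions can be checked mechanically. Below is a minimal sketch with pandas; the example table, column names, value interval, and category set are made-up for illustration and not part of the notes.

```python
import pandas as pd

# Hypothetical example table; columns and constraints are assumptions for illustration.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "price": [19.99, -5.00, 42.50],   # intra-relation: price must lie in [0, 10000]
    "size": ["XL", "M", "XXXL"],      # intra-relation: size must come from a fixed set
    "category_id": [10, 11, 99],      # inter-relation: must reference an existing category
})
categories = pd.DataFrame({"category_id": [10, 11, 12]})

# Intra-relation constraints: numeric interval and categorical domain
bad_price = products[~products["price"].between(0, 10_000)]
bad_size = products[~products["size"].isin({"S", "M", "L", "XL"})]

# Inter-relation constraint: referential integrity (foreign key into `categories`)
dangling = products[~products["category_id"].isin(categories["category_id"])]

print(bad_price, bad_size, dangling, sep="\n")
```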
How to improve Data Quality
Data entry interface design
- Enforce Integrity constraints (e.g. constraints on numeric values, referential integrity)
- Caveat: can force users to "invent" dirty data (still need to care about it)
Organizational management
- Streamlining of processes for data collection and analysis (capture error earlier)
- Capturing of provenance and metadata
Automated data auditing and data cleaning
- Application of automated techniques to identify and rectify data errors
Exploratory data analysis and data cleaning
- a human-in-the-loop approach is necessary most of the time
- the interaction between data visualisation and data cleaning is an iterative process
Data Cleaning - Types and Techniques
Quantitative
- integers or floating-point numbers in different shapes (sets, tensors, time series)
- challenges: unit conversion (especially for volatile units like currency)
- foundation of cleaning techniques: outlier detection
Categorical
- names or codes to assign data into groups, no ordering or distance defined
- common problem: misspelling upon data entry
- foundation of cleaning techniques: normalization/deduplication
Postal Addresses
- special case of categorical data, typically entered as free text
- deduplication
Keys / Identifiers
- ensure referential integrity
The need for a "Human-in-the-loop"
Unrealistic assumption
- existence of error-detecting rules is assumed: integrity constraints, functional dependencies
- focus on the most efficient and accurate way to apply cleaning steps according to these rules
In practice
- EDA + Viz + cleaning
- Iterative approach
- OpenRefine, Trifacta (Alteryx)
Cleaning Techniques
Quantitative Data
- Robust Univariate Outlier Detection
- Robust Centers (robust alternatives to the mean that are less sensitive to outliers)
- Median
- K-trimmed mean
 
- Robust Dispersion
- MAD (median absolute deviation)
- the median of the absolute deviations of all values from the sample median
 
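A minimal sketch of robust univariate outlier detection based on the median and MAD; the 0.6745 constant and the 3.5 cutoff follow the common modified z-score convention and are assumptions added here, not taken from the notes.

```python
import numpy as np

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on median and MAD) exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    center = np.median(values)                      # robust center
    mad = np.median(np.abs(values - center))        # robust dispersion (MAD)
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)   # degenerate case: more than half the values identical
    modified_z = 0.6745 * (values - center) / mad
    return np.abs(modified_z) > threshold

print(mad_outliers([9.8, 10.1, 10.0, 9.9, 55.0]))   # -> [False False False False  True]
```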
Categorical Data
Normalization of String data
- Different spellings (Jérôme vs Jerome)
- Different punctuation (ACME Inc. vs ACME, Inc)
- Typos (Alice->Ailce)
- Misunderstandings (Rupert -> Robert)
Keying function
- Use keying function to form cluster with the computed key
- Automatic repair
- replace all strings in a cluster with the most frequent string in that cluster
 
- Manual repair (OpenRefine implementation)
- show clusters and stats to user
- have user decide about replacement
 
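The keying-function idea can be sketched in a few lines: compute a normalized key per string, cluster strings that share a key, and, for the automatic repair, map every member to the most frequent spelling in its cluster. The keying function below is a simplified fingerprint, not OpenRefine's exact implementation.

```python
import re
from collections import Counter, defaultdict

def fingerprint_key(s: str) -> str:
    # lowercase, drop punctuation, sort the unique tokens (simplified fingerprint keying)
    tokens = re.sub(r"[^\w\s]", " ", s.lower()).split()
    return " ".join(sorted(set(tokens)))

def normalize(strings):
    clusters = defaultdict(list)
    for s in strings:
        clusters[fingerprint_key(s)].append(s)
    repair = {}
    for members in clusters.values():
        # automatic repair: canonical spelling = most frequent string in the cluster
        canonical, _ = Counter(members).most_common(1)[0]
        for s in members:
            repair[s] = canonical
    return [repair[s] for s in strings]

print(normalize(["ACME Inc.", "ACME, Inc", "ACME Inc.", "acme inc"]))
# -> ['ACME Inc.', 'ACME Inc.', 'ACME Inc.', 'ACME Inc.']
```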
Candidate Key Detection
Unique row ratio
- number of distinct values / total number of rows
- problem: frequency outliers (e.g., lots of frequent -999 values)
- often caused by UIs forcing users to invent common dummy values like 0000 or -999
Unique value ratio
- number of unique values (values that occur only once) / number of distinct values
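A small sketch computing both ratios per column with pandas; the example DataFrame and column names are placeholders.

```python
import pandas as pd

def key_candidate_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: unique row ratio (distinct / rows) and unique value ratio (singletons / distinct)."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=False)
        n_distinct = len(counts)
        n_singletons = int((counts == 1).sum())
        rows.append({
            "column": col,
            "unique_row_ratio": n_distinct / len(df),                  # close to 1.0 -> key candidate
            "unique_value_ratio": n_singletons / max(n_distinct, 1),   # more robust to frequent dummy values like -999
        })
    return pd.DataFrame(rows).set_index("column")

df = pd.DataFrame({"id": [1, 2, 3, 4], "zip": ["10115", "-999", "-999", "10243"]})
print(key_candidate_stats(df))
```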
Missing value (Imputation with ML)
Missing data
- Missing Completely at Random (MCAR)
- as if we lost some entries by chance
 
- Missing at Random (MAR)
- missingness depends on observed attributes (e.g., job type)
- engineers more likely to not share income than managers
 
- Not Missing at Random (NMAR)
- missingness depends on the missing data itself
- people with low income do not want to share this information
 
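To make the three mechanisms concrete, here is a small simulation sketch using the income/job example from the notes; the probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "job": rng.choice(["engineer", "manager"], size=1000),
    "income": rng.normal(60_000, 15_000, size=1000),
})

# MCAR: every value is dropped with the same probability
mcar = df["income"].mask(rng.random(len(df)) < 0.2)

# MAR: missingness depends on an observed attribute (engineers hide income more often)
p_mar = np.where(df["job"] == "engineer", 0.4, 0.1)
mar = df["income"].mask(rng.random(len(df)) < p_mar)

# NMAR: missingness depends on the (unobserved) value itself (low incomes are hidden)
p_nmar = np.where(df["income"] < 50_000, 0.5, 0.05)
nmar = df["income"].mask(rng.random(len(df)) < p_nmar)

print(mcar.isna().mean(), mar.isna().mean(), nmar.isna().mean())
```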
Handling missing values
- remove examples with missing attributes
- placeholder symbol for categorical missing values
- impute missing values with the mode or mean of the observed values in the column
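A minimal sketch of these baseline strategies with pandas (the table and column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "job": ["engineer", "manager", None, "engineer"],
    "income": [52_000, None, 61_000, None],
})

dropped = df.dropna()                                    # remove examples with missing attributes

df["job"] = df["job"].fillna("__MISSING__")              # placeholder symbol for categorical values

df["income"] = df["income"].fillna(df["income"].mean())  # mean imputation for numerical columns
# (use .mode().iloc[0] instead of .mean() for categorical columns)
```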
Imputation as multi-class classification
- Idea: treat the imputation problem as a multi-class classification problem
- Tool: Datawig
- use rows with non-missing data in the target column as training data
- encode table data from feature columns into a numerical representation
- one-hot encoding of categorical columns
- standardisation of numerical columns
- character sequences for textual columns: LSTM (trained encoder output) / n-gram hashing (frequency-based, tf-idf)
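A sketch of this idea with Datawig's SimpleImputer, following the usage pattern from its documentation; file and column names are made-up placeholders, and details of the API may differ between versions.

```python
import datawig
import pandas as pd

df = pd.read_csv("products.csv")          # hypothetical table with a 'color' column to impute

train = df[df["color"].notna()]           # rows with observed target values become training data
missing = df[df["color"].isna()]

imputer = datawig.SimpleImputer(
    input_columns=["title", "description"],   # feature columns (text / categorical)
    output_column="color",                    # target column to impute
    output_path="imputer_model",              # directory for model artefacts
)
imputer.fit(train_df=train)

predictions = imputer.predict(missing)    # returns the frame with imputed values and a confidence score appended
```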

