Image by Author
# Introduction
Data quality problems are everywhere. Missing values where there shouldn’t be any. Dates in the wrong format. Duplicate records that slip through. Outliers that skew your analysis. Text fields with inconsistent capitalization and spelling variations. These issues can break your analysis and pipelines, and often lead to incorrect business decisions.
Manual data validation is tedious. You need to check for the same issues repeatedly across multiple datasets, and it’s easy to miss subtle problems. This article covers five practical Python scripts that handle the most common data quality issues.
Link to the code on GitHub
# 1. Analyzing Missing Data
// The Pain Point
You receive a dataset expecting complete records, but scattered throughout are empty cells, null values, blank strings, and placeholder text like “N/A” or “Unknown”. Some columns are mostly empty, others have just a few gaps. You need to understand the extent of the problem before you can fix it.
// What the Script Does
Comprehensively scans datasets for missing data in all its forms. Identifies patterns in missingness (random vs. systematic), calculates completeness scores for each column, and flags columns with excessive missing data. It also generates visual reports showing where your data gaps are.
// How It Works
The script reads data from CSV, Excel, or JSON files and detects the various representations of missing values: None, NaN, empty strings, and common placeholders. It then calculates missing-data percentages by column and row, and identifies correlations between missing values across columns. Finally, it produces both summary statistics and detailed reports with recommendations for handling each type of missingness.
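The core of that workflow can be sketched in a few lines with pandas. This is a minimal sketch, not the actual script: the placeholder list and the `missing_report` function name are my own assumptions.

```python
import pandas as pd

# Placeholder strings commonly used to stand in for missing values
# (an assumed list -- extend it for your own data)
PLACEHOLDERS = ["", "N/A", "n/a", "NA", "Unknown", "unknown", "null", "-"]

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts, percentages, and completeness scores."""
    # Normalize placeholder strings to real NA values before counting
    cleaned = df.replace(PLACEHOLDERS, pd.NA)
    missing = cleaned.isna().sum()
    pct = missing / len(cleaned) * 100
    return pd.DataFrame({
        "missing_count": missing,
        "missing_pct": pct.round(1),
        "completeness": (100 - pct).round(1),
    })

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", "N/A", "", "b@y.com"],
    "age": [34, None, 29, 41],
})
print(missing_report(df))
```

Normalizing placeholders to `pd.NA` first is the key step: a plain `df.isna()` would report the `"N/A"` and `""` cells as present, silently understating the problem.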
⏩ Get the missing data analyzer script
# 2. Validating Data Types
// The Pain Point
Your dataset claims to have numeric IDs, but some are text. Date fields contain dates, times, or sometimes just random strings. The email column mostly holds email addresses, except for entries that aren’t valid emails at all. Such type inconsistencies cause scripts to crash or produce incorrect calculations.
// What the Script Does
Validates that each column contains the expected data type. Checks numeric columns for non-numeric values, date columns for invalid dates, email and URL columns for proper formatting, and categorical columns for unexpected values. The script also provides detailed reports on type violations with row numbers and examples.
// How It Works
The script accepts a schema definition specifying expected types for each column, uses regex patterns and validation libraries to check format compliance, identifies and reports rows that violate type expectations, calculates violation rates per column, and suggests appropriate data type conversions or cleaning steps.
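The schema-driven check can be sketched like this. The schema format here (column name mapped to a validator function) and the email regex are assumptions of mine, not the script's actual interface:

```python
import re

import pandas as pd

# A simplified email pattern for illustration; real validation is stricter
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

# Assumed schema format: column name -> predicate that each value must satisfy
SCHEMA = {
    "user_id": lambda v: str(v).isdigit(),
    "email": lambda v: bool(EMAIL_RE.match(str(v))),
}

def validate(df: pd.DataFrame, schema: dict) -> list[dict]:
    """Return one violation record per failing cell, with row numbers."""
    violations = []
    for col, check in schema.items():
        for idx, value in df[col].items():
            if not check(value):
                violations.append({"row": idx, "column": col, "value": value})
    return violations

df = pd.DataFrame({
    "user_id": ["101", "abc", "103"],
    "email": ["a@x.com", "b@y.com", "not-an-email"],
})
print(validate(df, SCHEMA))
```

Returning structured violation records (rather than just a pass/fail flag) is what makes the detailed per-row reports possible, and the violation rate per column is simply the record count divided by the row count.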
⏩ Get the data type validator script
# 3. Detecting Duplicate Records
// The Pain Point
Your database should have unique records, but duplicate entries keep appearing. Sometimes they’re exact duplicates, sometimes just a few fields match. Maybe it’s the same customer with slightly different spellings of their name, or transactions that were accidentally submitted twice. Finding these manually is super challenging.
// What the Script Does
Identifies duplicate and near-duplicate records using multiple detection strategies. Finds exact matches, fuzzy matches based on similarity thresholds, and duplicates within specific column combinations. Groups similar records together and calculates confidence scores for potential matches.
// How It Works
The script uses hash-based exact matching for perfect duplicates, applies fuzzy string matching algorithms using Levenshtein distance for near-duplicates, allows specification of key columns for partial matching, generates duplicate clusters with similarity scores, and exports detailed reports showing all potential duplicates with recommendations for deduplication.
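The fuzzy-matching half of that process can be sketched as follows. The article's script uses Levenshtein distance; to keep this sketch standard-library only, `difflib.SequenceMatcher` stands in as the similarity measure, and the function name and threshold are my own assumptions:

```python
from difflib import SequenceMatcher

def find_fuzzy_duplicates(names: list[str], threshold: float = 0.85) -> list[tuple]:
    """Pair up near-duplicate strings above a similarity threshold.

    SequenceMatcher.ratio() stands in for Levenshtein distance here;
    both measure how close two strings are, on different scales.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # Lowercase first so capitalization differences don't mask matches
            score = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if score >= threshold:
                pairs.append((names[i], names[j], round(score, 2)))
    return pairs

customers = ["Jon Smith", "John Smith", "Jane Doe", "Jane  Doe"]
print(find_fuzzy_duplicates(customers))
```

Exact duplicates are cheaper to find first (e.g. `df[df.duplicated(keep=False)]` in pandas), leaving the quadratic fuzzy comparison for the smaller remainder. On large datasets, blocking on a key column keeps the pairwise comparison tractable.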
⏩ Get the duplicate record detector script
# 4. Detecting Outliers
// The Pain Point
Your analysis results look wrong. You dig in and find someone entered 999 for age, a transaction amount is negative when it should be positive, or a measurement is three orders of magnitude larger than the rest. Outliers skew statistics, break models, and are often difficult to identify in large datasets.
// What the Script Does
Automatically detects statistical outliers using multiple methods. Applies z-score analysis, the interquartile range (IQR) method, and domain-specific rules. Identifies extreme values, impossible values, and values that fall outside expected ranges. Provides context for each outlier and suggests whether it’s likely an error or a legitimate extreme value.
// How It Works
The script analyzes numeric columns using configurable statistical thresholds, applies domain-specific validation rules, visualizes distributions with outliers highlighted, calculates outlier scores and confidence levels, and generates prioritized reports flagging the most likely data errors first.
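The IQR method mentioned above can be sketched in a few lines. The `k = 1.5` multiplier is the conventional default; the function name is my own:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# 999 is a classic "impossible age" data-entry error
ages = pd.Series([34, 29, 41, 38, 999, 27, 31])
print(iqr_outliers(ages))
```

The IQR rule is robust because the quartiles themselves barely move when an extreme value like 999 enters the data, whereas a z-score threshold depends on the mean and standard deviation, which the outlier itself inflates.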
⏩ Get the outlier detection script
# 5. Checking Cross-Field Consistency
// The Pain Point
Individual fields look fine, but relationships between fields are broken. Start dates after end dates. Shipping addresses whose country doesn’t match the billing address’s country code. Child records without corresponding parent records. Order totals that don’t match the sum of line items. These logical inconsistencies are harder to spot but just as damaging.
// What the Script Does
Validates logical relationships between fields based on business rules. Checks temporal consistency, referential integrity, mathematical relationships, and custom business logic. Flags violations with specific details about what’s inconsistent.
// How It Works
The script accepts a rules definition file specifying relationships to validate, evaluates conditional logic and cross-field comparisons, performs lookups to verify referential integrity, calculates derived values and compares to stored values, and produces detailed violation reports with row references and specific rule failures.
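A rules-as-predicates design like the one described can be sketched as follows. The rule names, the dict-of-lambdas format, and the tolerance on the totals comparison are illustrative assumptions:

```python
import pandas as pd

# Assumed rule format: rule name -> row-level predicate that must hold
RULES = {
    "start_before_end": lambda r: r["start_date"] <= r["end_date"],
    # Compare with a small tolerance to absorb floating-point currency math
    "total_matches_items": lambda r: abs(r["order_total"] - r["items_sum"]) < 0.01,
}

def check_rules(df: pd.DataFrame, rules: dict) -> list[dict]:
    """Return a (row, rule) record for every violated rule."""
    failures = []
    for idx, row in df.iterrows():
        for name, rule in rules.items():
            if not rule(row):
                failures.append({"row": idx, "rule": name})
    return failures

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-10"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-03-01"]),
    "order_total": [100.00, 59.99],
    "items_sum": [100.00, 49.99],
})
print(check_rules(df, RULES))
```

Keeping the rules as data rather than hard-coded conditionals is what lets the real script load them from a definition file: new business rules become new entries, not new code paths.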
⏩ Get the cross-field consistency checker script
# Wrapping Up
These five scripts help you catch data quality issues early, before they break your analysis or systems. Data validation should be automatic, comprehensive, and fast, and these scripts help with that.
So how do you get started? Download the script that addresses your biggest data quality pain point and install the required dependencies. Next, configure the validation rules for your specific data and run the script on a sample dataset to verify the setup. Then integrate it into your data pipeline to catch issues automatically.
Clean data is the foundation of everything else. Start validating systematically, and you’ll spend less time fixing problems. Happy validating!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

