Image by Author
# Introduction
Data quality problems are everywhere. Missing values where there shouldn’t be any. Dates in the wrong format. Duplicate records that slip through. Outliers that skew your analysis. Text fields with inconsistent capitalization and spelling variations. These issues can break your analysis and pipelines, and often lead to incorrect business decisions.
Manual data validation is tedious. You need to check for the same issues repeatedly across multiple datasets, and it’s easy to miss subtle problems. This article covers five practical Python scripts that handle the most common data quality issues.
Link to the code on GitHub
# 1. Analyzing Missing Data
// The Pain Point
You receive a dataset expecting complete records, but scattered throughout are empty cells, null values, blank strings, and placeholder text like “N/A” or “Unknown”. Some columns are mostly empty, others have just a few gaps. You need to understand the extent of the problem before you can fix it.
// What the Script Does
Comprehensively scans datasets for missing data in all its forms. Identifies patterns in missingness (random vs. systematic), calculates completeness scores for each column, and flags columns with excessive missing data. It also generates visual reports showing where your data gaps are.
// How It Works
The script reads data from CSV, Excel, or JSON files and detects the various representations of missing values: None, NaN, empty strings, and common placeholders. It then calculates missing-data percentages by column and row, and identifies correlations between missing values across columns. Finally, it produces both summary statistics and detailed reports with recommendations for handling each type of missingness.
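The core of that workflow can be sketched in a few lines with pandas. This is a minimal sketch, not the actual script: the placeholder list and the `missing_report` function name are my own assumptions.

```python
import pandas as pd

# Placeholder strings commonly used to stand in for missing values
# (an assumed list -- extend it for your own data)
PLACEHOLDERS = ["", "N/A", "n/a", "NA", "Unknown", "unknown", "null", "-"]

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts, percentages, and completeness scores."""
    # Normalize placeholder strings to real NA values before counting
    cleaned = df.replace(PLACEHOLDERS, pd.NA)
    missing = cleaned.isna().sum()
    pct = missing / len(cleaned) * 100
    return pd.DataFrame({
        "missing_count": missing,
        "missing_pct": pct.round(1),
        "completeness": (100 - pct).round(1),
    })

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", "N/A", "", "b@y.com"],
    "age": [34, None, 29, 41],
})
print(missing_report(df))
```

Normalizing placeholders to `pd.NA` first is the key step: a plain `df.isna()` would report the `"N/A"` and `""` cells as present, silently understating the problem.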
⏩ Get the missing data analyzer script
# 2. Validating Data Types
// The Pain Point
Your dataset claims to have numeric IDs, but some are text. Date fields contain dates, times, or sometimes just random strings. The email column mostly holds email addresses, except for entries that aren’t valid emails at all. Such type inconsistencies cause scripts to crash or produce incorrect calculations.
// What the Script Does
Validates that each column contains the expected data type. Checks numeric columns for non-numeric values, date columns for invalid dates, email and URL columns for proper formatting, and categorical columns for unexpected values. The script also provides detailed reports on type violations with row numbers and examples.
// How It Works
The script accepts a schema definition specifying expected types for each column, uses regex patterns and validation libraries to check format compliance, identifies and reports rows that violate type expectations, calculates violation rates per column, and suggests appropriate data type conversions or cleaning steps.
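The schema-driven check can be sketched like this. The schema format here (column name mapped to a validator function) and the email regex are assumptions of mine, not the script's actual interface:

```python
import re

import pandas as pd

# A simplified email pattern for illustration; real validation is stricter
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

# Assumed schema format: column name -> predicate that each value must satisfy
SCHEMA = {
    "user_id": lambda v: str(v).isdigit(),
    "email": lambda v: bool(EMAIL_RE.match(str(v))),
}

def validate(df: pd.DataFrame, schema: dict) -> list[dict]:
    """Return one violation record per failing cell, with row numbers."""
    violations = []
    for col, check in schema.items():
        for idx, value in df[col].items():
            if not check(value):
                violations.append({"row": idx, "column": col, "value": value})
    return violations

df = pd.DataFrame({
    "user_id": ["101", "abc", "103"],
    "email": ["a@x.com", "b@y.com", "not-an-email"],
})
print(validate(df, SCHEMA))
```

Returning structured violation records (rather than just a pass/fail flag) is what makes the detailed per-row reports possible, and the violation rate per column is simply the record count divided by the row count.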
⏩ Get the data type validator script
# 3. Detecting Duplicate Records
// The Pain Point
Your database should have unique records, but duplicate entries keep appearing. Sometimes they’re exact duplicates, sometimes just a few fields match. Maybe it’s the same customer with slightly different spellings of their name, or transactions that were accidentally submitted twice. Finding these manually is super challenging.
// What the Script Does
Identifies duplicate and near-duplicate records using multiple detection strategies. Finds exact matches, fuzzy matches based on similarity thresholds, and duplicates within specific column combinations. Groups similar records together and calculates confidence scores for potential matches.
// How It Works
The script uses hash-based exact matching for perfect duplicates, applies fuzzy string matching algorithms using Levenshtein distance for near-duplicates, allows specification of key columns for partial matching, generates duplicate clusters with similarity scores, and exports detailed reports showing all potential duplicates with recommendations for deduplication.
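The fuzzy-matching half of that process can be sketched as follows. The article's script uses Levenshtein distance; to keep this sketch standard-library only, `difflib.SequenceMatcher` stands in as the similarity measure, and the function name and threshold are my own assumptions:

```python
from difflib import SequenceMatcher

def find_fuzzy_duplicates(names: list[str], threshold: float = 0.85) -> list[tuple]:
    """Pair up near-duplicate strings above a similarity threshold.

    SequenceMatcher.ratio() stands in for Levenshtein distance here;
    both measure how close two strings are, on different scales.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # Lowercase first so capitalization differences don't mask matches
            score = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if score >= threshold:
                pairs.append((names[i], names[j], round(score, 2)))
    return pairs

customers = ["Jon Smith", "John Smith", "Jane Doe", "Jane  Doe"]
print(find_fuzzy_duplicates(customers))
```

Exact duplicates are cheaper to find first (e.g. `df[df.duplicated(keep=False)]` in pandas), leaving the quadratic fuzzy comparison for the smaller remainder. On large datasets, blocking on a key column keeps the pairwise comparison tractable.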
⏩ Get the duplicate record detector script
# 4. Detecting Outliers
// The Pain Point
Your analysis results look wrong. You dig in and find someone entered 999 for age, a transaction amount is negative when it should be positive, or a measurement is three orders of magnitude larger than the rest. Outliers skew statistics, break models, and are often difficult to identify in large datasets.
// What the Script Does
Automatically detects statistical outliers using multiple methods. Applies z-score analysis, the interquartile range (IQR) method, and domain-specific rules. Identifies extreme values, impossible values, and values that fall outside expected ranges. Provides context for each outlier and suggests whether it’s likely an error or a legitimate extreme value.
// How It Works
The script analyzes numeric columns using configurable statistical thresholds, applies domain-specific validation rules, visualizes distributions with outliers highlighted, calculates outlier scores and confidence levels, and generates prioritized reports flagging the most likely data errors first.
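The IQR method mentioned above can be sketched in a few lines. The `k = 1.5` multiplier is the conventional default; the function name is my own:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# 999 is a classic "impossible age" data-entry error
ages = pd.Series([34, 29, 41, 38, 999, 27, 31])
print(iqr_outliers(ages))
```

The IQR rule is robust because the quartiles themselves barely move when an extreme value like 999 enters the data, whereas a z-score threshold depends on the mean and standard deviation, which the outlier itself inflates.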
⏩ Get the outlier detection script
# 5. Checking Cross-Field Consistency
// The Pain Point
Individual fields look fine, but relationships between fields are broken. Start dates after end dates. Shipping addresses whose country doesn’t match the billing address’s country code. Child records without corresponding parent records. Order totals that don’t match the sum of line items. These logical inconsistencies are harder to spot but just as damaging.
// What the Script Does
Validates logical relationships between fields based on business rules. Checks temporal consistency, referential integrity, mathematical relationships, and custom business logic. Flags violations with specific details about what’s inconsistent.
// How It Works
The script accepts a rules definition file specifying relationships to validate, evaluates conditional logic and cross-field comparisons, performs lookups to verify referential integrity, calculates derived values and compares to stored values, and produces detailed violation reports with row references and specific rule failures.
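A rules-as-predicates design like the one described can be sketched as follows. The rule names, the dict-of-lambdas format, and the tolerance on the totals comparison are illustrative assumptions:

```python
import pandas as pd

# Assumed rule format: rule name -> row-level predicate that must hold
RULES = {
    "start_before_end": lambda r: r["start_date"] <= r["end_date"],
    # Compare with a small tolerance to absorb floating-point currency math
    "total_matches_items": lambda r: abs(r["order_total"] - r["items_sum"]) < 0.01,
}

def check_rules(df: pd.DataFrame, rules: dict) -> list[dict]:
    """Return a (row, rule) record for every violated rule."""
    failures = []
    for idx, row in df.iterrows():
        for name, rule in rules.items():
            if not rule(row):
                failures.append({"row": idx, "rule": name})
    return failures

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-10"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-03-01"]),
    "order_total": [100.00, 59.99],
    "items_sum": [100.00, 49.99],
})
print(check_rules(df, RULES))
```

Keeping the rules as data rather than hard-coded conditionals is what lets the real script load them from a definition file: new business rules become new entries, not new code paths.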
⏩ Get the cross-field consistency checker script
# Wrapping Up
These five scripts help you catch data quality issues early, before they break your analysis or systems. Data validation should be automatic, comprehensive, and fast, and these scripts help with that.
So how do you get started? Download the script that addresses your biggest data quality pain point and install the required dependencies. Next, configure the validation rules for your specific data and run the script on a sample dataset to verify the setup. Then integrate it into your data pipeline to catch issues automatically.
Clean data is the foundation of everything else. Start validating systematically, and you’ll spend less time fixing problems. Happy validating!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

