# Introduction
Excel remains relevant for data work, but a significant portion of the time spent using it is purely mechanical. Tasks like combining files from multiple sources, tracking down duplicate records, reformatting inconsistent exports, and splitting a master sheet into separate files are not complex, but they are time-consuming and prone to human error.
These five Python scripts help automate those tasks. Each one is self-contained, configurable, and designed to work with messy real-world data.
You can find all the scripts on GitHub.
# Merging Multiple Excel Files
// The Pain Point
When consolidating data from multiple Excel or comma-separated values (CSV) files, the manual process — opening each file, copying the data, and pasting into a master sheet — is slow and prone to misalignment errors, especially when column orders differ between files.
// What the Script Does
This script scans a folder for .xlsx and .csv files, stacks all their data into a single unified sheet, and writes a clean merged output file. It can optionally add a source column so you always know which row originated from which file, and it handles mismatched column orders automatically.
// How It Works
The script uses pandas to read every file in a target directory, aligns columns by name rather than position, and concatenates everything into one DataFrame. A configurable add_source_column flag appends the original filename to each row. Column mismatches are logged so you know if some files had extra or missing fields. The output is written with openpyxl and includes a summary tab showing file-by-file row counts.
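The core of that logic fits in a few lines. Here is a minimal sketch (the `merge_files` name and demo filenames are illustrative, not from the script itself); it relies on the fact that `pd.concat` aligns columns by name, so mismatched column orders are handled for free:

```python
import glob
import os
import tempfile

import pandas as pd

def merge_files(folder, add_source_column=True):
    """Stack every .csv/.xlsx file in `folder` into one DataFrame.

    Columns are aligned by name, not position, so files with different
    column orders merge correctly; fields missing from a file come
    through as NaN.
    """
    paths = sorted(
        glob.glob(os.path.join(folder, "*.csv"))
        + glob.glob(os.path.join(folder, "*.xlsx"))
    )
    frames = []
    for path in paths:
        df = pd.read_csv(path) if path.endswith(".csv") else pd.read_excel(path)
        if add_source_column:
            df["source_file"] = os.path.basename(path)
        frames.append(df)
    # pd.concat aligns on column names, which absorbs order mismatches
    return pd.concat(frames, ignore_index=True, sort=False)

# Tiny demonstration: two CSVs whose column orders differ
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"name": ["Ann"], "total": [10]}).to_csv(
    os.path.join(demo_dir, "north.csv"), index=False)
pd.DataFrame({"total": [20], "name": ["Bob"]}).to_csv(
    os.path.join(demo_dir, "south.csv"), index=False)
merged = merge_files(demo_dir)
```

Despite the reversed columns in the second file, `merged` ends up with `name` and `total` correctly aligned, plus a `source_file` column tracing each row back to its origin.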
⏩ Get the Excel files merger script
# Finding and Flagging Duplicate Rows
// The Pain Point
Duplicate records are common in datasets that have been exported and re-imported across systems. Exact matches are easy to find, but near-duplicates — same record, slightly different formatting or spacing — are harder to catch manually at scale.
// What the Script Does
This script scans an Excel file for duplicate rows based on columns you define, flags exact duplicates and near-duplicates through fuzzy matches on string fields, and writes an annotated output file highlighting every suspected duplicate group with color coding and a confidence score.
// How It Works
The script uses pandas for exact duplicate detection and RapidFuzz for fuzzy string matching on configurable key columns. Each row is assigned a duplicate group ID and a match confidence percentage. The output Excel file uses openpyxl formatting to highlight duplicate clusters. A separate summary sheet shows total duplicates found, broken down by match type.
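The grouping idea can be sketched as follows. This version substitutes the standard library's `difflib.SequenceMatcher` for RapidFuzz so it runs with no extra dependencies (the function name and demo data are illustrative); on real datasets you would swap in `rapidfuzz.fuzz.ratio` for far better speed:

```python
from difflib import SequenceMatcher

import pandas as pd

def flag_duplicates(df, key_cols, threshold=0.85):
    """Assign a duplicate group ID and 0-100 confidence to near-matching rows.

    Rows are compared on a normalized join of the key columns; difflib
    stands in here for RapidFuzz, which the real script uses.
    """
    keys = (df[key_cols].astype(str).agg(" ".join, axis=1)
            .str.lower().str.strip().tolist())
    group = [-1] * len(keys)   # -1 means "no suspected duplicate"
    conf = [0.0] * len(keys)
    next_id = 0
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            ratio = SequenceMatcher(None, keys[i], keys[j]).ratio()
            if ratio >= threshold:
                if group[i] == -1:
                    group[i] = next_id
                    next_id += 1
                group[j] = group[i]
                score = round(ratio * 100, 1)
                conf[i] = max(conf[i], score)
                conf[j] = max(conf[j], score)
    out = df.copy()
    out["dup_group"] = group
    out["match_confidence"] = conf
    return out

records = pd.DataFrame({
    "name": ["Acme Ltd", "ACME Ltd.", "Widget Co"],
    "city": ["London", "London", "Leeds"],
})
flagged = flag_duplicates(records, ["name", "city"])
```

Here the first two rows land in the same duplicate group with a high confidence score, while "Widget Co" stays unflagged. Note the pairwise comparison is O(n²); for large files, blocking on an exact key first keeps it tractable.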
⏩ Get the duplicate finder script
# Cleaning and Standardizing Messy Exported Data
// The Pain Point
Data exported from external systems often arrives with mixed date formats, inconsistent capitalization, phone numbers with varying separators, and trailing whitespace. Cleaning this manually before any analysis adds up quickly.
// What the Script Does
This script applies a configurable set of cleaning rules to an Excel or CSV file. These include standardizing dates, trimming whitespace, fixing capitalization, normalizing phone numbers and postcodes, removing blank rows, and flagging cells that appear incorrect. It outputs a cleaned file and a change log showing exactly what was modified.
// How It Works
The script reads a configuration file that maps column names to cleaning operations: date_format, title_case, strip_whitespace, phone_normalize, remove_blank_rows, and others. Each operation is applied in sequence. A side-by-side change log is written to a second sheet in the output, showing original versus cleaned values for every modified cell. Nothing is silently discarded. If a value cannot be parsed, it is flagged in a _clean_errors column.
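The operation-mapping pattern looks roughly like this (a simplified sketch: the `OPS` table, `clean` function, and demo data are illustrative, covering three of the operations named above):

```python
import pandas as pd

# Each named operation maps a string Series to its cleaned version
OPS = {
    "strip_whitespace": lambda s: s.str.strip(),
    "title_case": lambda s: s.str.title(),
    # Keep digits and a leading "+"; drop spaces, dashes, brackets
    "phone_normalize": lambda s: s.str.replace(r"[^\d+]", "", regex=True),
}

def clean(df, config):
    """Apply `config` ({column: [operation, ...]}) in sequence, logging changes."""
    out = df.copy()
    changes = []
    for col, ops in config.items():
        for op in ops:
            before = out[col].astype(str)
            out[col] = OPS[op](before)
            for idx in out.index[before != out[col]]:
                changes.append({"column": col, "row": idx, "operation": op,
                                "original": before[idx], "cleaned": out[col][idx]})
    return out, pd.DataFrame(changes)

raw = pd.DataFrame({
    "name": ["  alice smith ", "BOB JONES"],
    "phone": ["071-234 5678", "+44 7700 900123"],
})
cleaned, change_log = clean(raw, {
    "name": ["strip_whitespace", "title_case"],
    "phone": ["phone_normalize"],
})
```

The `change_log` DataFrame is what becomes the side-by-side change sheet: one row per modified cell, with the operation that touched it and the before/after values.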
⏩ Get the data cleaner script
# Splitting One Sheet into Separate Files by Column Value
// The Pain Point
A master dataset often needs to be distributed as separate files — such as one per region, department, or category. Doing this manually involves filtering, copying, and saving repeatedly, with a high risk of mixing up data between files.
// What the Script Does
This script reads a single Excel sheet and splits it into separate output files — one per unique value in a specified column. Each output file contains only the rows for that value, with the original formatting preserved. Filenames are generated automatically from the column values. Optionally, it can send each file as an email attachment using a name-to-email mapping you provide.
// How It Works
The script groups the DataFrame by the target column using pandas, then writes each group to its own .xlsx file using openpyxl. A naming template, like Sales_Report_{value}_{date}.xlsx, allows you to control the output filename format. Column headers, data types, and basic formatting are preserved in each output file. An optional email mode reads a CSV mapping of {value} → {email address} and sends each file via the Simple Mail Transfer Protocol (SMTP).
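The group-and-write loop is short. A minimal sketch (function name, template default, and demo data are illustrative; it assumes openpyxl is installed, since pandas uses it to write .xlsx):

```python
import os
import tempfile
from datetime import date

import pandas as pd

def split_by_column(df, column, out_dir,
                    template="Report_{value}_{date}.xlsx"):
    """Write one .xlsx file per unique value in `column`; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for value, group in df.groupby(column):
        fname = template.format(value=value, date=date.today().isoformat())
        path = os.path.join(out_dir, fname)
        group.to_excel(path, index=False)  # pandas delegates to openpyxl
        paths.append(path)
    return paths

sales = pd.DataFrame({
    "region": ["North", "South", "North"],
    "amount": [100, 50, 25],
})
out_dir = tempfile.mkdtemp()
files = split_by_column(sales, "region", out_dir)
```

This produces one file per region, each containing only that region's rows. In practice you would also sanitize `value` before using it in a filename, since column values can contain characters that are invalid in paths.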
⏩ Get the sheet splitter script
# Generating a Summary Pivot Report from Raw Data
// The Pain Point
Producing a summary report from raw data — totals by category, monthly trends, or top performers — involves building pivot tables, formatting them, and copying results to a presentable layout. When the source data updates regularly, this process is repeated from scratch each time.
// What the Script Does
This script reads a raw data Excel file, builds configurable pivot summaries, and writes a formatted multi-tab summary report. Charts are generated and embedded in the output file. You can re-run it any time the source data changes.
// How It Works
A configuration file defines the date field, the value field, grouping columns, and specific aggregations to run. The script uses pandas for all aggregation logic and openpyxl with Matplotlib for chart generation. Each summary type is given its own tab. Conditional formatting highlights the highest and lowest values. The report is designed for on-demand regeneration, and running the script again overwrites the previous output cleanly.
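The aggregation side of that flow can be sketched with `pd.pivot_table` (the `build_summaries` name, config shape, and demo data are illustrative; chart generation and formatting are omitted):

```python
import pandas as pd

def build_summaries(df, config):
    """Build one pivot table per config entry ({tab_name: spec}).

    Each spec names the grouping column(s), the value field, and an
    optional aggregation (defaulting to "sum").
    """
    summaries = {}
    for name, spec in config.items():
        summaries[name] = pd.pivot_table(
            df,
            index=spec["group_by"],
            values=spec["value"],
            aggfunc=spec.get("agg", "sum"),
        )
    return summaries

sales = pd.DataFrame({
    "region": ["North", "North", "South"],
    "month": ["Jan", "Feb", "Jan"],
    "amount": [10, 20, 5],
})
report = build_summaries(sales, {
    "totals_by_region": {"group_by": "region", "value": "amount"},
    "avg_by_month": {"group_by": "month", "value": "amount", "agg": "mean"},
})
```

Each entry in `report` then maps directly onto one tab of the output workbook, which is what makes the report cheap to regenerate whenever the source data changes.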
⏩ Get the pivot report generator script
# Wrapping Up
These five scripts cover common Excel tasks that are straightforward to automate but tedious to perform manually. Choose whichever one addresses the most frequent task in your workflow and start there. Here is a quick overview:
| Script Name | Purpose | Key Features | Best Use Case |
| --- | --- | --- | --- |
| Excel Files Merger | Combine multiple Excel/CSV files | Column alignment, source tracking, summary sheet | Consolidating data from multiple sources |
| Duplicate Finder | Identify exact and fuzzy duplicates | Fuzzy matching, confidence scores, color highlighting | Cleaning datasets with repeated records |
| Data Cleaner | Standardize messy exported data | Formatting rules, normalization, change log | Preprocessing raw external data |
| Sheet Splitter | Split one sheet into multiple files | Auto file naming, grouping, optional email sending | Distributing reports by category/region |
| Pivot Report Generator | Create summary reports from raw data | Automated pivots, charts, multi-tab output | Recurring reporting and dashboards |
Happy automating!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

