# Introduction
Working intensively with data in Python teaches all of us an important lesson: data cleaning usually doesn't feel much like performing data science, but rather like acting as a digital janitor. A typical workflow involves loading a dataset, discovering that many column names are messy, running into missing values, and accumulating a trail of temporary variables, only the last of which contains your final, clean dataset.
Pyjanitor provides a cleaner way to carry out these steps. Combined with method chaining, the library turns otherwise arduous data cleaning processes into pipelines that are elegant, efficient, and readable.
This article shows how and demystifies method chaining in the context of Pyjanitor and data cleaning.
# Understanding Method Chaining
Method chaining is nothing new in programming: it is a well-established coding pattern. It consists of calling multiple methods on an object in sequence, all within a single statement. This way, you don't need to reassign a variable after each step, because each method returns an object on which the next method in the chain is invoked.
The following example illustrates the concept at its core. Observe how we would apply several simple modifications to a small piece of text (a string) using "standard" step-by-step Python:
```python
text = " Hello World! "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")
```
The resulting value in text will be: “hello python!”.
Now, with method chaining, the same process would look like:
```python
text = " Hello World! "
cleaned_text = text.strip().lower().replace("world", "python")
```
Notice that the logical flow of operations applied goes from left to right: all in a single, unified chain of thought!
That is method chaining in a nutshell. Let's now translate this idea to the context of data science using Pandas. Without chaining, a standard multi-step data cleaning routine on a dataframe typically looks like this:
```python
# Traditional, step-by-step Pandas approach
import pandas as pd

df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(" ", "_")
df = df.dropna(subset=["id"])
df = df.drop_duplicates()
```
As we will see shortly, method chaining lets us construct a unified pipeline in which the dataframe operations are wrapped in parentheses. On top of that, we no longer need intermediate variables holding non-final dataframes, which makes the code cleaner and less prone to bugs. And Pyjanitor makes this process seamless.
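Even before bringing Pyjanitor in, the traditional snippet above can already be rewritten as a single Pandas chain wrapped in parentheses. Here is a minimal sketch using a small in-memory CSV as a stand-in for a real file (the column names and values are hypothetical sample data):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for "data.csv" (hypothetical sample data)
raw = io.StringIO("Customer ID,Full Name\n1,Alice\n1,Alice\n2,Bob\n,Carol\n")

cleaned = (
    pd.read_csv(raw)
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # normalize names
    .dropna(subset=["customer_id"])  # drop rows missing an id
    .drop_duplicates()               # remove exact duplicate rows
)
print(cleaned)
```

One statement, no intermediate `df` reassignments: the row without an id and the duplicate row are both gone by the time `cleaned` is created.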
# Entering Pyjanitor: Application Example
Pandas itself offers native support for method chaining to some extent. However, some of its essential functionality was not designed strictly with this pattern in mind. This is a core motivation behind Pyjanitor, which is based on a nearly namesake R package: janitor.
In essence, Pyjanitor can be framed as an extension for Pandas that brings a collection of custom data-cleaning routines in a method-chaining-friendly fashion. Examples of its application programming interface (API) method names include clean_names(), rename_column(), and remove_empty(). These intuitive names take code expressiveness to a whole new level. Besides, Pyjanitor relies entirely on free, open-source tools and runs seamlessly in cloud and notebook environments such as Google Colab.
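To get an intuition for what a method like clean_names() does, here is a rough, pandas-only approximation (this is an illustrative sketch, not Pyjanitor's actual implementation, and the helper name is made up):

```python
import pandas as pd

def clean_names_approx(df: pd.DataFrame) -> pd.DataFrame:
    """Rough, pandas-only approximation of what Pyjanitor's clean_names()
    does to column labels: lowercase them and replace spaces with underscores."""
    return df.rename(columns=lambda c: c.lower().replace(" ", "_"))

df = pd.DataFrame({"First Name": [1], "Last_Name": [2]})
print(clean_names_approx(df).columns.tolist())  # ['first_name', 'last_name']
```

The real clean_names() handles many more cases (special characters, duplicated names, and so on), but the spirit is the same: standardize column labels in one chainable step.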
Let’s fully understand how method chaining in Pyjanitor is applied, through an example in which we first create a small, synthetic dataset that looks intentionally messy, and put it into a Pandas DataFrame object.
IMPORTANT: to avoid common, yet somewhat dreadful errors caused by incompatible library versions, make sure you have the latest available versions of both Pandas and Pyjanitor by running `!pip install --upgrade pyjanitor pandas` first.
```python
import numpy as np
import pandas as pd
import janitor  # registers Pyjanitor's methods on pandas DataFrames

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    ' Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")
```
Now we define a Pyjanitor method chain that applies a series of processing steps to both the column names and the data itself:
```python
cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE they get mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names=['age'],               # CAUTION: after clean_names(), the name is lowercase: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)

print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)
```
The above code is self-explanatory, with inline comments explaining each method called at every step of the chain.
This is the output of our example, which compares the original messy data with the cleaned version:
```
--- Messy Original Data ---
  First Name   Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0      Alice       Smith  25.0    1998-01-01       50000        NaN
1        Bob       Jones   NaN    1995-05-05       60000        NaN
2    Charlie       Brown  30.0    1993-08-08       70000        NaN
3      Alice       Smith  25.0    1998-01-01       50000        NaN
4        NaN         Doe  40.0    1983-12-12       80000        NaN

--- Cleaned Pyjanitor Data ---
  first_name_  _last_name   age date_of_birth  salary  salary_k
0       Alice       Smith  25.0    1998-01-01   50000      50.0
1         Bob       Jones  27.5    1995-05-05   60000      60.0
2     Charlie       Brown  30.0    1993-08-08   70000      70.0
4         NaN         Doe  40.0    1983-12-12   80000      80.0
```
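When a cleaning step has no ready-made Pyjanitor method, it can still be slotted into the same chain using Pandas' own .pipe(), which passes the dataframe through any custom function. Here is a minimal sketch with a made-up helper and sample data:

```python
import pandas as pd

def drop_low_salaries(df: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Hypothetical custom step: keep only rows at or above a salary threshold."""
    return df[df["salary"] >= threshold]

df = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [50000, 80000]})

result = (
    df
    .pipe(drop_low_salaries, threshold=60000)  # plug custom logic into the chain
    .assign(salary_k=lambda d: d["salary"] / 1000)
)
print(result)
```

This keeps the left-to-right reading order of the pipeline even for logic that lives in your own functions.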
# Wrapping Up
Throughout this article, we have learned how to use the Pyjanitor library to apply method chaining and simplify otherwise arduous data cleaning processes. This makes the code cleaner, more expressive, and, in a manner of speaking, self-documenting, so that other developers or your future self can read the pipeline and easily understand the journey from raw to ready dataset.
Great job!
Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.

