Filter Data in a Pandas DataFrame

Introduction

Pandas, a popular Python library for data manipulation and analysis, provides several ways to filter the rows of a DataFrame. Filtering is a fundamental operation on large datasets because it lets you focus on the subset of rows that meets your criteria. In this guide, we’ll cover the query method, isin, and filtering on null (NaN) values.

Prerequisites

Before starting, make sure you have the following in place:

  • Visual Studio Code with Jupyter extension to run the notebook
  • Python 3.9, pandas library
  • CSV data file sample

You can generate a sample CSV file with the tool at https://extendsclass.com/csv-generator.html

Basic Filtering

  • Read CSV file into a Pandas DataFrame object
  • Using the query Method
  • Filtering with isin
  • Filtering Null (NaN) Values

Read CSV file into a Pandas DataFrame object

Use the read_csv() function to read the CSV file, passing names to set the column labels of the DataFrame:

import pandas as pd
student_cols = [
    'id','firstname','lastname','email','email2','profession'
]
students = pd.read_csv(
    'data/myFile0.csv',
    names=student_cols
)
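
To follow along without a generated file, the same call can be tried on an in-memory CSV. This is only a sketch: it assumes the file has no header row, and the sample rows below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical rows standing in for data/myFile0.csv (no header row).
csv_text = (
    "100,Ann,Mike,ann@example.com,ann2@example.com,doctor\n"
    "101,Bob,Lee,bob@example.com,bob2@example.com,worker\n"
    "102,Cara,Diaz,cara@example.com,cara2@example.com,\n"
)

student_cols = ['id', 'firstname', 'lastname', 'email', 'email2', 'profession']

# With names= given and no header= argument, the first line is read as data.
students = pd.read_csv(io.StringIO(csv_text), names=student_cols)

print(students.shape)
print(students.dtypes)
```

If your file does start with a header row, passing header=0 together with names replaces that row instead of reading it as data.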

Using the query Method

The query method allows you to express conditions as strings, providing a more concise and readable syntax:

students.query('profession == "doctor"')
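
The value does not have to be hard-coded in the string. As a small sketch (the sample data and the variable name wanted are made up here), query can read a Python variable when its name is prefixed with @:

```python
import pandas as pd

# Tiny illustrative DataFrame, not the CSV from this guide.
students = pd.DataFrame({
    'firstname': ['Ann', 'Bob', 'Cara'],
    'profession': ['doctor', 'worker', 'doctor'],
})

wanted = 'doctor'  # hypothetical local variable

# @wanted refers to the Python variable defined above.
doctors = students.query('profession == @wanted')
print(doctors)
```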

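Inside a query string you can combine conditions with and / or (the symbols & and | work as well). Comparing a column to a sequence of values with == matches rows whose value is any of them: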

students.query('profession == "doctor" and lastname == "Mike"')
students.query('profession == "doctor" or profession == "worker"')
students.query('profession == ("doctor", "worker")')
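
For comparison, the first two queries can also be written as boolean masks, where & and | are the only way to combine conditions. The following is a minimal sketch on made-up sample data:

```python
import pandas as pd

# Tiny illustrative DataFrame, not the CSV from this guide.
students = pd.DataFrame({
    'lastname': ['Mike', 'Lee', 'Mike', 'Diaz'],
    'profession': ['doctor', 'worker', 'worker', 'firefighter'],
})

# Each comparison yields a boolean Series with one value per row.
is_doctor = students['profession'] == 'doctor'
is_worker = students['profession'] == 'worker'
is_mike = students['lastname'] == 'Mike'

both = students[is_doctor & is_mike]      # AND: doctor named Mike
either = students[is_doctor | is_worker]  # OR: doctor or worker

print(both)
print(either)
```

When the comparisons are written inline instead of stored in variables, each one needs its own parentheses, because & and | bind more tightly than ==.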

Filtering with isin

The isin method is useful when you want to filter rows based on a list of values:

profession_list = ['firefighter']
filtered_df = students[students['profession'].isin(profession_list)]
print(filtered_df)
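
The same mask can be inverted with ~ to keep every row whose value is not in the list. A short sketch, using made-up sample data and a hypothetical excluded list:

```python
import pandas as pd

# Tiny illustrative DataFrame, not the CSV from this guide.
students = pd.DataFrame({
    'firstname': ['Ann', 'Bob', 'Cara'],
    'profession': ['doctor', 'firefighter', 'worker'],
})

excluded = ['firefighter', 'worker']  # hypothetical list of values to drop

# ~ flips the boolean mask, so only rows NOT in the list remain.
remaining = students[~students['profession'].isin(excluded)]
print(remaining)
```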

Filtering Null (NaN) Values

You can use isnull() to select rows where a value is missing, or notnull() to keep rows where it is present. The following keeps only the students whose profession is filled in:

filtered_df = students[students['profession'].notnull()]
print(filtered_df)
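
The opposite selection, and the dropna shortcut, look like this. It is a sketch on made-up data in which one profession is missing:

```python
import pandas as pd

# Tiny illustrative DataFrame with one missing profession.
students = pd.DataFrame({
    'firstname': ['Ann', 'Bob', 'Cara'],
    'profession': ['doctor', None, 'worker'],
})

# Rows where profession is missing.
missing = students[students['profession'].isnull()]

# dropna(subset=...) removes rows that are missing a value in the listed columns.
present = students.dropna(subset=['profession'])

print(missing)
print(present)
```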

Conclusion

Filtering data is a crucial skill when working with Pandas DataFrames. Whether you need to select rows based on simple conditions or complex queries, Pandas provides a versatile set of tools to handle your data effectively.

Experiment with these techniques on your own datasets to gain a deeper understanding of how to filter data in Pandas DataFrames. As you become more comfortable with these methods, you’ll be better equipped to extract valuable insights from your data. Thank you for reading the DevopsRoles page!
