4. Data Manipulation with Pandas

The Pandas library is one of the most powerful and widely used tools in the Python ecosystem for data manipulation and analysis. The name "Pandas" is derived from "panel data", an econometrics term for data sets that include observations of the same individuals over multiple time periods. Developed by Wes McKinney, the library was designed to make working with "relational" or "labeled" data easier. This is fundamental in Machine Learning and Deep Learning, because it allows large data sets to be cleaned, transformed and analyzed efficiently.

Introduction to Pandas Objects

Pandas has two main data structures: Series and DataFrames. A Series is a one-dimensional array capable of storing any type of data with axis labels, known as indices. A DataFrame is a two-dimensional structure, a kind of table, which is essentially a collection of Series with a common index.
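As a minimal sketch of these two structures, the example below builds a Series and a DataFrame by hand. The fruit names, prices and stock counts are invented purely for illustration.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])

# A DataFrame: several columns that share the same row index.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 0.75], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)

# Selecting a single column of a DataFrame gives back a Series.
stock = fruit["stock"]

print(prices["banana"])      # value stored under the label "banana"
print(type(stock).__name__)  # class name of the selected column
print(fruit.shape)           # (number of rows, number of columns)
```

Because every column shares one index, a row label such as "banana" refers to the same observation in each column.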

Installation and Import

Before you start working with Pandas, you need to install it using the pip package manager:

pip install pandas

After installation, import Pandas, conventionally under the alias pd:

import pandas as pd

Data Loading

One of the first tasks when working with Pandas is to load data for analysis. Pandas can read a variety of sources, including CSV, Excel, JSON and HTML files, as well as SQL databases. For example, to load a CSV file, we use the read_csv function:

df = pd.read_csv('path_to_your_file.csv')
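To try read_csv without a file on disk, the sketch below passes it an in-memory text buffer, which it accepts in place of a path. The CSV content (names, ages, cities) is hypothetical sample data.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file on disk.
csv_text = "name,age,city\nAna,34,Lisbon\nBen,28,Porto\nCara,41,Faro\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # rows and columns that were parsed
print(list(df.columns))  # column names taken from the header line
```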

Data Exploration

Once the data is loaded into a DataFrame, we can start exploring it using methods such as head(), which displays the first rows of the DataFrame, and tail(), which shows the last rows. Methods such as describe() provide a statistical summary of the numeric columns.
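The following sketch shows these exploration methods on a small, made-up DataFrame. The column names sensor and reading are assumptions chosen for the example.

```python
import pandas as pd

# Invented data: eight readings, one per sensor.
df = pd.DataFrame(
    {"sensor": list("abcdefgh"), "reading": [1, 2, 3, 4, 5, 6, 7, 8]}
)

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # last 3 rows
summary = df.describe()  # count, mean, std, min, quartiles, max

print(first_rows)
print(last_rows)
print(summary)
```

By default describe() summarizes only the numeric columns when the DataFrame mixes numeric and text data, so the text column sensor does not appear in the summary here.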

Selection and Indexing

Selecting and indexing data is crucial for data manipulation. Pandas offers several ways to select a subset of data from a DataFrame. We can select specific columns using square bracket notation:

specific_series = df['column_name']

To select rows, we can use the loc indexer for label-based selections, or iloc for integer position-based selections.
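The difference between label-based and position-based selection is easiest to see side by side. The sketch below uses an invented DataFrame whose index holds hypothetical person names.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [34, 28, 41], "city": ["Lisbon", "Porto", "Faro"]},
    index=["ana", "ben", "cara"],
)

by_label = df.loc["ben"]   # row whose index label is "ben"
by_position = df.iloc[0]   # first row, whatever its label

# Label slices include both endpoints; position slices exclude the stop.
label_slice = df.loc["ana":"ben", ["age"]]
position_slice = df.iloc[0:2]

# loc also accepts a boolean mask together with a column label.
cities_over_30 = df.loc[df["age"] > 30, "city"]

print(by_label)
print(cities_over_30)
```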

Data Cleansing

Cleansing data is an important part of the Machine Learning preparation process. With Pandas, we can handle missing values using methods like dropna(), which removes rows or columns with missing values, and fillna(), which fills those values with a specified value. Additionally, we can remove duplicates with drop_duplicates().
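A minimal sketch of these three cleaning methods follows. The data is invented, and filling missing scores with 0.0 is only an illustrative choice; the right fill value depends on the data set.

```python
import pandas as pd

# Invented data with one duplicated row and one missing score.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Cara"],
        "score": [90.0, 80.0, 80.0, float("nan")],
    }
)

deduped = df.drop_duplicates()           # keeps the first copy of each row
dropped = deduped.dropna()               # removes rows with any missing value
filled = deduped.fillna({"score": 0.0})  # fills missing scores with 0.0

print(deduped)
print(dropped)
print(filled)
```

Each method returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept.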

Data Transformation

Transforming data is another common operation. We can add or remove columns, apply functions to entire rows or columns, and perform grouping operations. The apply() method is particularly useful for applying a function to a column:

df['new_column'] = df['existing_column'].apply(some_function)
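As a concrete instance of this pattern, the sketch below applies a hypothetical to_fahrenheit function to a column of invented temperatures. For simple arithmetic like this, operating on the whole column at once gives the same result and is usually faster.

```python
import pandas as pd


def to_fahrenheit(celsius):
    """Convert one temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


df = pd.DataFrame({"celsius": [0.0, 25.0, 100.0]})

# apply() calls the function once for each value in the column.
df["fahrenheit"] = df["celsius"].apply(to_fahrenheit)

# Equivalent vectorized form: arithmetic on the whole column at once.
df["fahrenheit_vectorized"] = df["celsius"] * 9 / 5 + 32

print(df)
```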

For grouping operations, the groupby() method is essential, allowing you to group data and apply aggregation functions such as sum(), mean() and count().
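A short sketch of grouping and aggregation, using an invented sales table with hypothetical region and amount columns:

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "amount": [100, 50, 200, 150, 300],
    }
)

# One aggregation: total amount per region.
total_by_region = sales.groupby("region")["amount"].sum()

# Several aggregations at once with agg().
stats_by_region = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])

print(total_by_region)
print(stats_by_region)
```

The group keys (here the region names) become the index of the result, so individual groups can be looked up with loc.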

Data Joining

In many cases, we need to combine data from different sources. Pandas offers several functions for this, such as concat() to concatenate DataFrames, and merge() to perform SQL database-style join operations.
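The sketch below stacks two invented order tables with concat() and then joins the result to a hypothetical customers table with merge(). All table and column names are assumptions made for the example.

```python
import pandas as pd

orders_q1 = pd.DataFrame({"customer_id": [1, 2], "amount": [20, 35]})
orders_q2 = pd.DataFrame({"customer_id": [2, 3], "amount": [15, 50]})

# concat() stacks rows; ignore_index=True renumbers the result from 0.
orders = pd.concat([orders_q1, orders_q2], ignore_index=True)

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ana", "Ben"]})

# Inner join: only orders whose customer_id exists in both tables.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left join: every order is kept; unmatched names become missing values.
left = orders.merge(customers, on="customer_id", how="left")

print(inner)
print(left)
```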

Data Visualization

Pandas also supports data visualization directly from DataFrames by integrating with libraries like Matplotlib, which must be installed for plotting to work. We can draw line charts, bar charts, histograms and many other chart types directly from a DataFrame or one of its columns:

df['column'].plot(kind='hist')
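A slightly fuller sketch of the histogram call is shown here. It assumes Matplotlib is installed, uses invented values, and selects the non-interactive "Agg" backend so that it also runs on machines without a display. The image is written to an in-memory buffer rather than to a file.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened

import pandas as pd

df = pd.DataFrame({"column": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

# plot() returns a Matplotlib Axes object that can be customized further.
ax = df["column"].plot(kind="hist", bins=5)
ax.set_title("Distribution of column")
ax.set_xlabel("value")

# Render the figure as PNG bytes in memory.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

In an interactive session or notebook the backend line can be dropped, and the chart is displayed instead of being saved.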

Data Export

After manipulating and analyzing the data, we often need to export the results. Pandas can export DataFrames to a variety of formats, including CSV, Excel and JSON, using methods such as to_csv(), to_excel() and to_json().
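As a small sketch of exporting, the example below serializes an invented DataFrame to CSV and JSON text. When no path is passed, to_csv() and to_json() return the result as a string; passing a file name instead (for example a hypothetical 'results.csv') writes it to disk.

```python
import json

import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben"], "score": [90, 80]})

# index=False leaves the row index out of the output.
csv_text = df.to_csv(index=False)

# orient="records" produces a list with one JSON object per row.
json_text = df.to_json(orient="records")
records = json.loads(json_text)

print(csv_text)
print(records)
```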

Conclusion

In summary, Pandas is an extremely versatile and powerful library for data manipulation in Python, and it plays a crucial role in preparing data for Machine Learning and Deep Learning. With its wide range of functionality, from data loading and cleaning to transformation and visualization, Pandas is an indispensable tool for any data scientist, data engineer or machine learning practitioner.

As you delve deeper into Machine Learning and Deep Learning with Python, the ability to manipulate data with Pandas will become increasingly valuable, allowing you to focus on the more complex and interesting aspects of data modeling, while leaving the heavy lifting of data manipulation to this powerful library.
