Machine Learning and Deep Learning with Python


4. Data Manipulation with Pandas

The Pandas library is one of the most widely used tools in the Python ecosystem for data manipulation and analysis. The name "Pandas" is derived from "panel data", an econometrics term for data sets that include observations over time for the same individuals. Developed by Wes McKinney, the library was designed to make working with "relational" or "labeled" data easy. This is fundamental to Machine Learning and Deep Learning workflows, because it allows large data sets to be cleaned, transformed and analyzed efficiently.

Introduction to Pandas Objects

Pandas has two main data structures: Series and DataFrame. A Series is a one-dimensional labeled array capable of storing any type of data; its axis labels are collectively known as the index. A DataFrame is a two-dimensional, table-like structure, essentially a collection of Series (its columns) that share a common index.
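As a minimal sketch of the two structures, the example below builds a Series and a DataFrame by hand. The fruit names and numbers are made up for illustration and do not come from any real data set.

```python
import pandas as pd

# A Series: one-dimensional values plus an index (labels chosen for illustration).
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])

# A DataFrame: several columns that share the same row index.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 0.75], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)

# Selecting a single column of a DataFrame gives back a Series.
price_column = fruit["price"]

print(type(price_column).__name__)  # Series
print(prices["banana"])             # 2.0
print(fruit.shape)                  # (3, 2)
```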

Installation and Import

Before you start working with Pandas, you need to install it using the pip package manager:

pip install pandas

After installation, import Pandas, conventionally under the alias pd:

import pandas as pd

Data Loading

One of the first tasks when working with Pandas is to load data for analysis. Pandas supports reading a variety of formats, including CSV, Excel, JSON, HTML tables, and SQL query results. For example, to load a CSV file, we use the read_csv function:

df = pd.read_csv('path_to_your_file.csv')
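To try read_csv without a file on disk, the sketch below feeds it an in-memory text buffer, since read_csv accepts file-like objects as well as paths. The column names and rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually live in a file.
csv_text = """name,age,city
Ana,34,Lisbon
Bruno,28,Porto
Carla,41,Faro
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)          # (3, 3)
print(list(people.columns))  # ['name', 'age', 'city']
```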

Data Exploration

Once the data is loaded into a DataFrame, we can start exploring it using methods such as head(), which displays the first rows of the DataFrame (five by default), and tail(), which shows the last rows. The describe() method provides a statistical summary of the numeric columns.
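A short sketch of these exploration methods on a small, made-up table of scores (the column names are assumptions for the example):

```python
import pandas as pd

scores = pd.DataFrame({
    "student": ["a", "b", "c", "d", "e", "f"],
    "score": [55, 70, 85, 90, 65, 75],
})

first_rows = scores.head(2)  # first 2 rows (the default would be 5)
last_rows = scores.tail(2)   # last 2 rows
summary = scores.describe()  # count, mean, std, min, quartiles, max

print(first_rows)
print(last_rows)
print(summary)
```

By default describe() only summarizes numeric columns, so the text column "student" does not appear in the summary.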

Selection and Indexing

Selecting and indexing data is crucial for data manipulation. Pandas offers several ways to select a subset of data from a DataFrame. We can select specific columns using square bracket notation:

specific_series = df['column_name']

To select rows, we can use the loc indexer for label-based selection, or iloc for integer position-based selection.
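The difference between the two is easiest to see on a DataFrame whose index is not just 0, 1, 2. The sketch below uses a made-up inventory table; the labels and values are illustrative only.

```python
import pandas as pd

inventory = pd.DataFrame(
    {"category": ["tool", "tool", "toy"], "quantity": [5, 12, 7]},
    index=["hammer", "wrench", "robot"],
)

by_label = inventory.loc["wrench"]             # row whose index label is "wrench"
by_position = inventory.iloc[0]                # first row, whatever its label is
cell = inventory.loc["robot", "category"]      # one value by row and column label
label_slice = inventory.loc["hammer":"wrench"] # label slices include the end label
position_slice = inventory.iloc[0:1]           # position slices exclude the end

print(by_label["quantity"])  # 12
print(cell)                  # toy
print(len(label_slice), len(position_slice))  # 2 1
```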

Data Cleansing

Cleansing data is an important part of preparing data for Machine Learning. With Pandas, we can handle missing values using methods like dropna(), which removes rows or columns with missing values, and fillna(), which replaces them with a specified value. Additionally, we can remove duplicate rows with drop_duplicates().
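The following sketch applies the three methods to a small, made-up table with one missing value and one duplicated row. Each call returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "sensor": ["a", "b", "b", "c"],
    "reading": [1.0, 2.0, 2.0, np.nan],
})

without_missing = raw.dropna()         # drops the row for sensor "c"
filled = raw.fillna({"reading": 0.0})  # replaces NaN in "reading" with 0.0
deduplicated = raw.drop_duplicates()   # drops the second ("b", 2.0) row

print(len(raw), len(without_missing), len(deduplicated))  # 4 3 3
```

Filling with 0.0 is only an assumption for the example; whether to drop, fill with a constant, or fill with something like the column mean depends on the data and the model.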

Data Transformation

Transforming data is another common operation. We can add or remove columns, apply functions to entire rows or columns, and perform grouping operations. The apply() method is particularly useful for applying a function to a column:

df['new_column'] = df['existing_column'].apply(some_function)
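As a concrete, runnable version of this pattern, the sketch below defines a hypothetical helper, to_fahrenheit, and applies it to a made-up temperature column.

```python
import pandas as pd

def to_fahrenheit(celsius):
    """Hypothetical helper: convert one Celsius value to Fahrenheit."""
    return celsius * 9 / 5 + 32

weather = pd.DataFrame({"temp_c": [0.0, 25.0, 100.0]})

# apply() calls the function once per element of the column.
weather["temp_f"] = weather["temp_c"].apply(to_fahrenheit)

# For simple arithmetic, the vectorized form gives the same result.
weather["temp_f_vectorized"] = weather["temp_c"] * 9 / 5 + 32

print(list(weather["temp_f"]))  # [32.0, 77.0, 212.0]
```

apply() is most useful when the logic cannot be written as column arithmetic; for plain formulas like this one, the vectorized form is usually faster.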

For grouping operations, the groupby() method is essential, allowing you to group data and apply aggregation functions such as sum(), mean() and count().
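A minimal groupby() sketch, assuming a made-up sales table with a "region" column to group on and an "amount" column to aggregate:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [100, 50, 200, 150, 300],
})

# One aggregation: total amount per region.
totals = sales.groupby("region")["amount"].sum()

# Several aggregations at once, one output column per function.
stats = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])

print(totals["north"])               # 600
print(stats.loc["south", "mean"])    # 100.0
```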

Data Joining

In many cases, we need to combine data from different sources. Pandas offers several functions for this, such as concat() to concatenate DataFrames, and merge() to perform SQL database-style join operations.
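The sketch below shows both operations on made-up tables: concat() stacks two order tables with the same columns, and merge() joins the result to a customer table on a shared key. The table and column names are assumptions for the example.

```python
import pandas as pd

orders_q1 = pd.DataFrame({"customer_id": [1, 2], "total": [20.0, 35.0]})
orders_q2 = pd.DataFrame({"customer_id": [3], "total": [12.5]})
customers = pd.DataFrame(
    {"customer_id": [1, 2, 4], "name": ["Ana", "Bruno", "Dora"]}
)

# Stack rows; ignore_index=True renumbers the result 0..n-1.
orders = pd.concat([orders_q1, orders_q2], ignore_index=True)

# Inner join: only customer_ids present in both tables survive.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left join: every order is kept; unmatched names become NaN.
left = orders.merge(customers, on="customer_id", how="left")

print(len(orders), len(inner), len(left))  # 3 2 3
```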

Data Visualization

Pandas also supports data visualization directly from DataFrames and Series, integrating with plotting libraries such as Matplotlib (the default backend, which needs to be installed separately). We can draw line plots, bar charts, histograms and many other chart types:

df['column'].plot(kind='hist')

Data Export

After manipulating and analyzing the data, we often need to export the results. Pandas can export DataFrames to a variety of formats, such as CSV, Excel and JSON, using methods such as to_csv(), to_excel() and to_json().
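As a small sketch, to_csv() and to_json() return a string when no path is given, which makes it easy to inspect the output before writing a file. The table contents are made up, and to_excel() is left out because it depends on an additional Excel engine package being installed.

```python
import io
import pandas as pd

results = pd.DataFrame({"model": ["a", "b"], "score": [0.5, 0.75]})

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = results.to_csv(index=False)

# orient="records" produces a JSON list with one object per row.
json_text = results.to_json(orient="records")

# Reading the CSV text back gives a table with the same shape and columns.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
```

Passing index=False keeps the row index out of the file, which is usually what you want when the index is just 0, 1, 2.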

Conclusion

In summary, Pandas is a versatile and powerful library for data manipulation in Python, and it plays a crucial role in preparing data for Machine Learning and Deep Learning. With functionality ranging from data loading and cleaning to transformation and visualization, Pandas is an indispensable tool for any data scientist, data engineer or machine learning engineer.

As you delve deeper into Machine Learning and Deep Learning with Python, the ability to manipulate data with Pandas will become increasingly valuable, allowing you to focus on the more complex and interesting aspects of data modeling, while leaving the heavy lifting of data manipulation to this powerful library.

Now answer the exercise about the content:

What is the main purpose of the Pandas library in Python?

Pandas is primarily used to manipulate and analyze data efficiently. It provides tools for handling and transforming large data sets and is essential for data cleansing, transformation, and analysis. Its main structures, Series and DataFrames, facilitate these operations and are integral to the data preparation process in Machine Learning and Deep Learning.
