5.2 Exploratory Data Analysis with Matplotlib and Seaborn: Initial Data Loading and Inspection
Exploratory data analysis (EDA) is a fundamental step in the pipeline of any Machine Learning or Deep Learning project. It helps data scientists understand the structure, patterns, and possible anomalies present in the data. Data visualization libraries such as Matplotlib and Seaborn are essential tools for carrying out this task effectively. In this section, we cover loading a dataset and performing an initial inspection using these Python libraries.
Data Loading
The first step in EDA is to load the dataset that will be analyzed. In Python, one of the most widely used libraries for data manipulation is Pandas, which can load data from various sources, such as CSV files, Excel spreadsheets, and SQL databases. Suppose we have a CSV file containing our data of interest:
import pandas as pd
# Loading the dataset
df = pd.read_csv('path_to_your_file.csv')
# Viewing the first rows of the DataFrame
print(df.head())
With the head() method, we can get a quick look at the first few rows of our DataFrame, which gives us an initial idea of the data structure.
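To try the loading step without a file on disk, the sketch below feeds read_csv an in-memory string. The CSV content and its column names (id, age, city, income) are invented for illustration and are not part of any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """id,age,city,income
1,34,Lisbon,42000
2,29,Porto,
3,41,Lisbon,58000
4,,Braga,39000
"""

# read_csv accepts any file-like object, so StringIO behaves like a file path
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))  # first three rows instead of the default five
print(df.tail(2))  # last two rows
```

Empty fields in the CSV are read as missing values (NaN), which is why the age and income columns end up with a floating-point type here.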
Initial Data Inspection
After loading, it is important to perform an initial inspection to understand the basic characteristics of the dataset:
- Dimensions: Check the number of rows and columns using df.shape.
- Data Types: Identify the data type of each column with df.dtypes.
- General Information: Get a summary with df.info(), which shows data types and the count of non-null values.
- Statistical Description: Use df.describe() to obtain a statistical summary of numeric columns, including count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median).
- Unique Values: Check the number of unique values in each column with df.nunique().
- Missing Values: Count the missing values in each column with df.isnull().sum().
This basic information is essential to begin to understand the dataset and plan the next steps in the analysis.
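The inspection calls listed above can be sketched on a small DataFrame. The data and column names below are hypothetical and chosen only so that each call has something to report, including a couple of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value in each numeric column
df = pd.DataFrame({
    "age": [34, 29, 41, np.nan, 29],
    "city": ["Lisbon", "Porto", "Lisbon", "Braga", "Porto"],
    "income": [42000.0, np.nan, 58000.0, 39000.0, 45000.0],
})

n_rows, n_cols = df.shape      # (rows, columns)
print(n_rows, n_cols)

print(df.dtypes)               # data type of each column
df.info()                      # dtypes plus non-null counts (prints directly)

summary = df.describe()        # numeric columns only by default
print(summary)

unique_counts = df.nunique()   # distinct values per column, NaN excluded
print(unique_counts)

missing = df.isnull().sum()    # missing values per column
print(missing)
```

Note that df.info() prints its report itself and returns None, so it is called without print().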
Visualization with Matplotlib and Seaborn
With an initial understanding of the data, we can begin visualization. Matplotlib is a low-level plotting library that offers great control over graphical elements, while Seaborn is built on top of Matplotlib and provides a high-level interface for drawing more attractive and informative statistical plots.
One of the first visualizations we can make is a histogram, which helps us understand the distribution of a numerical variable:
import matplotlib.pyplot as plt
import seaborn as sns
# Configuring the graphics style
sns.set_theme(style="whitegrid")
# Histogram with Matplotlib
plt.hist(df['your_numeric_column'])
plt.title('Distribution of your_numeric_column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Histogram with Seaborn
sns.histplot(df['your_numeric_column'], kde=True)
plt.title('Distribution of your_numeric_column with KDE')
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
The kde parameter in the Seaborn histogram adds a Kernel Density Estimate (KDE) line, which is a way to estimate the probability density function of the variable.
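To make the idea of a kernel density estimate concrete, here is a minimal NumPy sketch that places a Gaussian kernel on every sample and averages them. The helper name gaussian_kde_1d and the fixed bandwidth of 0.4 are assumptions made for this illustration; this is not necessarily how Seaborn computes its KDE line, and Seaborn selects a bandwidth automatically.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernels.sum(axis=1) / (len(samples) * bandwidth)


# Hypothetical sample drawn from a standard normal distribution
rng = np.random.default_rng(42)
values = rng.normal(loc=0.0, scale=1.0, size=500)

grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(values, grid, bandwidth=0.4)

# A probability density should integrate to approximately 1
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth follows the individual samples more closely and gives a bumpier curve, while a larger one smooths over detail.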
Other useful visualizations include:
- Bar Charts: For categorical variables, showing the frequency or average of a numerical variable by category.
- Boxplots: To visualize the distribution of a numerical variable and identify outliers.
- Scatter Plots: To explore the relationship between two numerical variables.
- Correlation Charts: To evaluate the correlation between numerical variables.
- Line Charts: For time series data, visualizing the trend of one or more variables over time.
For example, to create a scatterplot with Seaborn, we can use:
sns.scatterplot(x='variable_x', y='variable_y', data=df)
plt.title('Relationship between variable_x and variable_y')
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.show()
In summary, Exploratory Data Analysis is a critical component in the development of Machine Learning and Deep Learning models. Initial data loading and inspection provide a basic understanding of the dataset, while visualizations with Matplotlib and Seaborn allow for a more detailed and intuitive analysis of the characteristics and relationships present in the data. These steps help ensure data quality and suitability before moving on to modeling and applying machine learning algorithms.