
Machine Learning and Deep Learning with Python


Supervised Learning Principles: Datasets: Training and Testing

Chapter 23


7.2 Principles of Supervised Learning: Datasets: Training and Testing

Supervised learning is one of the fundamental pillars of machine learning: an algorithm learns from labeled examples to make predictions or decisions. The success of training a supervised model depends heavily on the quality of the data and on how it is divided into training and test sets. Let's explore these concepts in more detail.

What is Supervised Learning?

In supervised learning, we work with a set of data that includes inputs (features or characteristics) and desired outputs (labels or true values). The goal is to build a model that can learn the relationship between inputs and outputs from these labeled examples, so that it can predict the output for new, previously unseen data.
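To make this concrete, here is a minimal sketch of the fit-then-predict cycle using scikit-learn's LogisticRegression (assumed available); the tiny dataset is made up for illustration:

```python
# A model is fitted on labeled examples, then predicts labels for
# inputs it has never seen.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Inputs (one feature each) and their labels: values below 5 are class 0.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)  # learn the relationship between inputs and labels

# Predict labels for two new, previously unseen inputs.
print(model.predict([[2.5], [8.5]]))
```

On data this clearly separated, the model predicts class 0 for 2.5 and class 1 for 8.5.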

Training and Testing Datasets

To effectively train and evaluate a model, we divide the dataset into two distinct groups: a training set and a test set. The training set is used to teach the model, while the test set is used to evaluate its performance and generalization to unseen data.

Training Set

The training set is the largest subset of the dataset and is used to tune the parameters of the machine learning model. During the training phase, the algorithm tries to learn patterns in the training data that can be generalized to new data. This set typically comprises between 60% and 80% of the total dataset, though the exact proportion depends on the size of the dataset and the complexity of the problem.

Test Set

The test set, on the other hand, is a separate subset that is not used during training. It is used exclusively to evaluate the model's performance after training. The test set provides an unbiased estimate of the model's performance on unseen data. It generally represents between 20% and 40% of the total dataset.
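As a quick sketch, the split described above can be done with scikit-learn's train_test_split (assumed available); the 80/20 ratio used here is one common choice within the ranges mentioned:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 examples with 3 features each, and binary labels.
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

# Hold out 20% of the data for testing; random_state makes the
# shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)  # (80, 3)
print(X_test.shape)   # (20, 3)
```

The test subset is kept aside and only used after training, exactly as described above.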


Division of Data

Dividing data into training and testing sets must be done carefully to ensure that both represent the general distribution of the data well. Improper division can lead to a model that does not generalize well, known as overfitting (when the model learns too much detail and noise from the training set) or underfitting (when the model is too simple and does not learn the structure of the data).

Division Techniques

There are several techniques for dividing data, the simplest being random division. However, more sophisticated methods such as cross-validation are often used to ensure that each observation in the data set has a chance to appear in the training and test sets. K-fold cross-validation is a common example, where the dataset is divided into K subsets of approximately the same size, and the model is trained and tested K times, each time with a different subset as the test set.
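The K-fold procedure described above can be sketched with scikit-learn's KFold (assumed available); with K=5, each fold serves as the test set exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 examples, 2 features each

# Split the 10 examples into K=5 folds of equal size.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # In every round: 8 examples train the model, 2 test it.
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Averaging the model's score over the K rounds gives a more stable performance estimate than a single random split.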

Importance of Representativeness

It is crucial that the training and testing sets are representative of the overall distribution of the data. This means they must contain a similar mix of examples from all classes or outputs. In some cases, it may be necessary to stratify the split, ensuring that the proportion of classes in each set is the same as the proportion in the full data set.
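A stratified split is a sketch away in scikit-learn: passing the labels to the `stratify` parameter of train_test_split (assumed available) preserves the class proportions in both subsets. The 90/10 class mix below is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90 examples of class 0 and 10 of class 1: a 90/10 mix.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print((y_train == 1).sum())  # 8 minority examples out of 80
print((y_test == 1).sum())   # 2 minority examples out of 20
</n```

Without stratification, a rare class could by chance end up almost entirely in one subset, skewing both training and evaluation.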

Challenges with Imbalanced Data

When we deal with imbalanced datasets, where some classes are much more frequent than others, the training/testing split becomes more challenging. In these cases, special techniques such as oversampling, undersampling, or the generation of synthetic data may be necessary to ensure that the model is not biased in favor of the most frequent classes.
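Random oversampling, the simplest of these techniques, can be sketched in plain NumPy: minority-class examples are resampled with replacement until the classes are balanced. (Dedicated libraries such as imbalanced-learn offer this and more sophisticated synthetic-data methods like SMOTE.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy dataset: 10 examples of class 0, only 2 of class 1.
X = np.arange(12).reshape(12, 1)
y = np.array([0] * 10 + [1] * 2)

minority = np.flatnonzero(y == 1)
n_needed = (y == 0).sum() - len(minority)  # how many copies to add

# Draw extra minority indices with replacement and append them.
extra = rng.choice(minority, size=n_needed, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print((y_bal == 0).sum(), (y_bal == 1).sum())  # 10 10
```

Note that oversampling should be applied only to the training set, never to the test set, so that the evaluation still reflects the real class distribution.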

Conclusion

Training and testing datasets are fundamental in supervised learning. A good split between training and testing is essential for developing models that not only fit well to training data, but also generalize well to new data. By applying data splitting techniques and considering class representativeness and balance, we can build robust and reliable machine learning models.

In summary, understanding and carefully applying supervised learning principles and data splitting techniques are crucial to the success of any machine learning and deep learning project with Python.


