7.2 Principles of Supervised Learning: Training and Testing Datasets
Supervised learning is one of the fundamental pillars of machine learning: an algorithm learns from labeled examples in order to make predictions or decisions. The training process of a supervised model depends heavily on the quality of the data and on how it is divided into training and test sets. Let's explore these concepts in more detail.
What is Supervised Learning?
In supervised learning, we work with a set of data that includes inputs (features or characteristics) and desired outputs (labels or true values). The goal is to build a model that can learn the relationship between inputs and outputs from these labeled examples, so that it can predict the output for new, previously unseen data.
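As an illustration, the sketch below pairs a small set of hypothetical features X with labels y and fits a classifier. It assumes the scikit-learn library, which the text does not name; the data and the choice of model are purely illustrative.

    # Minimal supervised-learning sketch (assumes scikit-learn is installed).
    from sklearn.linear_model import LogisticRegression

    X = [[0.5, 1.2], [1.0, 0.8], [3.1, 2.9], [2.8, 3.3]]  # inputs (features), hypothetical
    y = [0, 0, 1, 1]                                      # desired outputs (labels)

    model = LogisticRegression()
    model.fit(X, y)                      # learn the input-output relationship
    print(model.predict([[2.9, 3.0]]))   # predict the label of new, unseen data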
Training and Testing Datasets
To effectively train and evaluate a model, we divide the dataset into two disjoint subsets: a training set and a test set. The training set is used to teach the model, while the test set is used to evaluate its performance and its generalization to unseen data.
Training Set
The training set is the largest subset of the dataset and is used to fit the parameters of the machine learning model. During the training phase, the algorithm tries to learn patterns in the training data that generalize to new data. This set typically comprises between 60% and 80% of the total dataset, although the exact proportion depends on the size of the dataset and the complexity of the problem.
Test Set
The test set, on the other hand, is a separate subset that is not used during training. It is reserved exclusively for evaluating the model after training, providing an unbiased estimate of its performance on unseen data. It generally represents between 20% and 40% of the total dataset.
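In practice, this split is often a single function call. The sketch below uses scikit-learn's train_test_split on a synthetic dataset (both the library and the data are assumptions, not taken from the text) to produce an 80/20 split, consistent with the proportions above.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset: 100 examples with 4 features each.
    X = np.random.rand(100, 4)
    y = np.random.randint(0, 2, size=100)

    # Hold out 20% for testing; random_state makes the split reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)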
Division of Data
Dividing the data into training and test sets must be done carefully so that both reflect the overall distribution of the data. An improper split can produce a model that does not generalize well. Two related failure modes are overfitting (the model learns the detail and noise of the training set too closely) and underfitting (the model is too simple and fails to capture the structure of the data).
Division Techniques
There are several techniques for dividing data, the simplest being a random split. More sophisticated methods, such as cross-validation, are often used to ensure that every observation serves in both the training and test roles. K-fold cross-validation is a common example: the dataset is divided into K subsets (folds) of approximately equal size, and the model is trained and tested K times, each time holding out a different fold as the test set.
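A minimal K-fold sketch, again assuming scikit-learn and synthetic data (neither is specified in the text): five folds, with the model retrained from scratch on each iteration and the accuracy averaged across the five held-out folds.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(100, 4)             # hypothetical features
    y = np.random.randint(0, 2, size=100)  # hypothetical binary labels

    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression()
        model.fit(X[train_idx], y[train_idx])                 # train on K-1 folds
        scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold
    print(sum(scores) / len(scores))       # average accuracy over the K runs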
Importance of Representativeness
It is crucial that the training and testing sets are representative of the overall distribution of the data. This means they must contain a similar mix of examples from all classes or outputs. In some cases, it may be necessary to stratify the split, ensuring that the proportion of classes in each set is the same as the proportion in the full data set.
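In scikit-learn (an assumed tool here, as above), stratification is the stratify argument of train_test_split. The sketch below uses a hypothetical 90/10 class mix to show that both subsets keep that proportion.

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 4)
    y = np.array([0] * 90 + [1] * 10)   # hypothetical 90/10 class mix

    # stratify=y preserves the 90/10 proportion in both subsets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    print(np.bincount(y_train), np.bincount(y_test))  # [72 8] [18 2]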
Challenges with Imbalanced Data
When we deal with imbalanced datasets, where some classes are much more frequent than others, the training/test split becomes more challenging. In these cases, special techniques such as oversampling, undersampling, or the generation of synthetic data may be necessary to ensure that the model is not biased in favor of the most frequent classes.
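One simple option from that list is random oversampling of the minority class. The sketch below does it with scikit-learn's resample utility on the same hypothetical 90/10 data; libraries such as imbalanced-learn provide synthetic-data methods (e.g. SMOTE), but those are beyond this sketch.

    import numpy as np
    from sklearn.utils import resample

    X = np.random.rand(100, 4)
    y = np.array([0] * 90 + [1] * 10)   # hypothetical 90/10 imbalance

    X_min, y_min = X[y == 1], y[y == 1]
    # Randomly re-draw minority examples (with replacement) up to majority size.
    X_up, y_up = resample(X_min, y_min, replace=True, n_samples=90, random_state=42)

    X_bal = np.vstack([X[y == 0], X_up])
    y_bal = np.concatenate([y[y == 0], y_up])
    print(np.bincount(y_bal))           # [90 90]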
Conclusion
Training and testing datasets are fundamental in supervised learning. A good split between training and testing is essential for developing models that not only fit the training data well, but also generalize to new data. By applying sound data-splitting techniques and considering class representativeness and balance, we can build robust and reliable machine learning models.
In summary, understanding and carefully applying supervised learning principles and data-splitting techniques are crucial to the success of any machine learning and deep learning project with Python.