What “Unsupervised” Means (and What It Does)
Unsupervised learning is a family of machine learning methods that look for patterns in data when you do not have target labels (no “correct answer” column). Instead of learning to predict a known outcome, the model tries to discover structure: groups of similar items, directions of variation, unusual points, or compact representations of the data.
Think of it as organizing a messy drawer without a list of categories. You pour everything out, compare items by their properties (size, shape, material), and naturally end up with piles. No one told you “these are socks” and “these are batteries”; you inferred the groupings from similarities.
In practice, unsupervised learning is often used for:
- Clustering: finding groups of similar items (customers, documents, products, images).
- Dimensionality reduction: compressing many features into fewer “summary” features for visualization or downstream tasks.
- Anomaly detection: spotting items that do not fit typical patterns (fraud, sensor faults).
- Topic discovery in text: finding themes without predefined categories.
Because there are no labels, the output is not “right” or “wrong” in the same way as a labeled prediction. The value comes from whether the discovered structure is useful, stable, and meaningful for your goal.
When Unsupervised Learning Is a Good Fit
Unsupervised learning is especially useful when:
- You have lots of data but no labels (labels are expensive, slow, or ambiguous).
- You want to explore and understand a dataset before deciding what to build.
- You suspect there are natural segments (customer types, usage patterns) but you do not know how many or what defines them.
- You need to reduce complexity (too many features) to visualize or to speed up later modeling.
- You want to detect rare or unusual behavior without having examples of every possible “bad” case.
It is less appropriate when you need a specific, measurable prediction target immediately (for example, “will this customer churn next month?”). In that case, unsupervised methods can still help (e.g., create segments or features), but they are not the final step.
Core Idea: Similarity Depends on Representation
Unsupervised learning relies heavily on how you represent each item and how you measure similarity. If you cluster customers using only “number of purchases,” you may get different groups than if you include “average order value,” “product categories,” and “time between purchases.”
Common similarity notions include:
- Distance in feature space (e.g., Euclidean distance): items are similar if their feature values are close.
- Cosine similarity: items are similar if they point in a similar direction (common in text embeddings).
- Density: items are similar if they lie in the same dense region of the data.
Practical implication: before running an unsupervised algorithm, you usually spend time choosing features, scaling numeric values, and deciding how to encode categories. Small choices here can change the discovered structure dramatically.
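As a minimal sketch of how these choices interact, the snippet below uses made-up numbers for two hypothetical customers. It computes Euclidean distance and cosine similarity on raw features, then shows how standardizing each feature changes what "close" means.

```python
import numpy as np

# Two hypothetical customers described by [number_of_purchases, total_spend].
# The spend column is on a much larger scale than the purchase count.
a = np.array([5.0, 2000.0])
b = np.array([50.0, 2100.0])

# Euclidean distance: dominated by the large-scale "total_spend" feature.
euclidean = np.linalg.norm(a - b)

# Cosine similarity: compares direction rather than magnitude.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance (raw features): {euclidean:.1f}")
print(f"Cosine similarity  (raw features): {cosine:.3f}")

# After standardizing each feature (mean 0, variance 1 across a small sample),
# both features contribute comparably to the distance.
sample = np.array([[5.0, 2000.0], [50.0, 2100.0], [10.0, 300.0], [2.0, 50.0]])
scaled = (sample - sample.mean(axis=0)) / sample.std(axis=0)
print(f"Euclidean distance (scaled):       {np.linalg.norm(scaled[0] - scaled[1]):.2f}")
```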
Clustering: Finding Groups of Similar Items
Clustering is the most common unsupervised task. The model assigns each item to a group (cluster) so that items in the same cluster are more similar to each other than to items in other clusters.
Common Clustering Approaches (Intuition Only)
- K-means: tries to place K “centers” so that each point is assigned to the nearest center. Works best when clusters are roughly round and similar in size.
- Hierarchical clustering: builds a tree of clusters by repeatedly merging similar items (or splitting). Useful when you want clusters at multiple levels (broad segments and sub-segments).
- Density-based clustering (e.g., DBSCAN-like ideas): finds clusters as dense regions separated by sparse regions; can discover irregular shapes and can label outliers as noise.
- Model-based clustering: assumes data comes from a mixture of underlying distributions; can provide “soft” membership (probabilities) rather than hard assignments.
You do not need to memorize algorithm names to use clustering effectively. What matters is matching the method to the shape of your data and your goal: do you want a fixed number of segments, a hierarchy, or the ability to find outliers?
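To make "match the method to the shape of your data" concrete, here is a hedged sketch using scikit-learn's synthetic two-moons data, where a round-cluster method (K-means) and a density-based method (DBSCAN) give different answers. The parameter values (eps, min_samples) are illustrative, not recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: dense but irregularly shaped groups.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# K-means assumes roughly round clusters of similar size.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups dense regions and marks sparse points as noise (label -1).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-means cluster sizes:", {int(k): int((kmeans_labels == k).sum()) for k in set(kmeans_labels)})
print("DBSCAN cluster sizes: ", {int(k): int((dbscan_labels == k).sum()) for k in set(dbscan_labels)})
```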
Practical Step-by-Step: Customer Segmentation with Clustering
Below is a practical workflow you can apply to many clustering problems. The example is customer segmentation for an online store, but the same steps work for documents, devices, or users.
Step 1: Define the purpose of the clusters
Clustering is not useful by itself; it becomes useful when it supports a decision. Decide what you want to do with the segments. Examples:
- Tailor marketing messages and offers.
- Identify high-value customers vs. bargain seekers.
- Detect customers with unusual purchasing patterns that might indicate fraud or account sharing.
Your purpose affects which features you choose and how you evaluate the clusters.
Step 2: Choose items and time window
Decide what each row represents and over what period. For example:
- One row per customer, summarizing the last 90 days.
- Or one row per customer-month to capture seasonality.
Be explicit. Clustering “customers” can mean lifetime behavior, recent behavior, or a blend—each leads to different segments.
Step 3: Build features that reflect behavior
Good clustering features are often aggregated behavior metrics. Example feature set:
- Recency: days since last purchase.
- Frequency: number of orders in the window.
- Monetary: total spend or average order value.
- Discount rate: fraction of orders using a coupon.
- Category mix: percent of spend in major categories (encoded as numeric proportions).
- Return rate: fraction of items returned.
Avoid including identifiers (customer ID) or features that leak the answer to a later decision in a misleading way (for example, including “received VIP offer” if you later use clusters to decide who should receive VIP offers).
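As a sketch of Step 3, assume a hypothetical orders table with customer_id, order_date, order_value, and used_coupon columns (the names are invented for illustration). The aggregation into one row per customer might look like this:

```python
import pandas as pd

# Hypothetical raw orders table; column names are assumptions for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2024-02-10", "2024-02-20", "2024-03-15", "2024-01-20"]),
    "order_value": [120.0, 80.0, 25.0, 30.0, 20.0, 400.0],
    "used_coupon": [0, 1, 1, 1, 0, 0],
})

snapshot_date = pd.Timestamp("2024-03-31")

# One row per customer: recency, frequency, monetary value, discount rate.
features = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (snapshot_date - d.max()).days),
    frequency=("order_date", "count"),
    total_spend=("order_value", "sum"),
    avg_order_value=("order_value", "mean"),
    discount_rate=("used_coupon", "mean"),
)
print(features)
```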
Step 4: Prepare the data (scaling and encoding)
Clustering is sensitive to scale. If “total spend” ranges from 0 to 10,000 while “return rate” ranges from 0 to 1, spend may dominate distance calculations unless you scale.
- Scale numeric features so they are comparable (common approach: standardization to mean 0, variance 1).
- Handle skew (e.g., spend is often heavy-tailed): consider log transforms so extreme values do not overwhelm the clustering.
- Encode categories (e.g., region, device type) if you include them; ensure the encoding makes sense for similarity.
- Handle missing values consistently (impute or remove, depending on context).
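A minimal preprocessing sketch, assuming the per-customer features DataFrame built in Step 3; the log transform on spend is one common way to tame heavy tails before standardizing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assume `features` is the per-customer DataFrame from Step 3.
prepared = features.copy()

# Log-transform heavy-tailed spend so a few big spenders do not dominate distances.
prepared["total_spend"] = np.log1p(prepared["total_spend"])
prepared["avg_order_value"] = np.log1p(prepared["avg_order_value"])

# Impute missing values (here: with the column median) before scaling.
prepared = prepared.fillna(prepared.median())

# Standardize all numeric features to mean 0, variance 1.
X = StandardScaler().fit_transform(prepared)
```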
Step 5: Pick a clustering method and choose key settings
For a first pass, many teams start with a simple method (like K-means) because it is fast and easy to interpret. The main setting is the number of clusters, K. If you do not know K in advance, try several values and compare the results.
If you expect irregular cluster shapes or many outliers, a density-based approach may be more appropriate. If you need a hierarchy (segments and subsegments), hierarchical clustering is a natural fit.
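For the K-means first pass, one common sketch is to fit several values of K and compare a quantitative score such as the silhouette (higher is better), alongside the within-cluster inertia. This assumes X is the scaled feature matrix from Step 4, with many more rows than clusters.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assume X is the scaled per-customer feature matrix from Step 4.
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"K={k}: silhouette={score:.3f}, inertia={model.inertia_:.1f}")
```

Scores like these are a guide, not a verdict; the next step checks whether the clusters are stable and interpretable.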
Step 6: Evaluate cluster quality (without labels)
Because there is no ground truth, evaluation is about usefulness and stability. Common checks:
- Separation vs. compactness: are points within a cluster similar, and are clusters distinct? (There are quantitative scores for this, but also simple visual checks.)
- Stability: if you rerun the algorithm with different random seeds or on a slightly different sample, do you get similar clusters?
- Size balance: do you get one giant cluster and several tiny ones? That may be fine, but often signals that features or settings need adjustment.
- Business interpretability: can you describe each cluster in plain language using a few key features?
A practical technique is to create a “cluster profile table”: for each cluster, compute average recency, frequency, spend, discount rate, etc. Then name clusters based on dominant traits (e.g., “Frequent small-basket buyers,” “High-spend occasional buyers,” “Discount-driven shoppers”).
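A sketch of that profile table, assuming the unscaled features DataFrame from Step 3 and the cluster labels you settled on in Step 5:

```python
# Assume `features` is the per-customer DataFrame and `labels` holds the
# chosen clustering's assignments (one label per row of `features`).
with_clusters = features.assign(cluster=labels)

sizes = with_clusters["cluster"].value_counts().sort_index()
profile = with_clusters.groupby("cluster").agg(["mean", "median"])

print(sizes)    # how many customers fall into each cluster
print(profile)  # average/median recency, frequency, spend, etc. per cluster
```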
Step 7: Use clusters in an action loop
Clusters become valuable when you connect them to actions and measure outcomes. Examples:
- Send different onboarding emails to different clusters and compare engagement.
- Adjust recommendation strategies per cluster (new arrivals vs. replenishment items).
- Flag a cluster with unusually high return rates for policy review.
Even though clustering is unsupervised, your use of clusters can be evaluated with standard metrics (conversion rate, retention, support tickets) after you deploy an intervention.
Dimensionality Reduction: Seeing and Compressing Structure
Dimensionality reduction methods transform data with many features into a smaller set of features while preserving important structure. This is useful for:
- Visualization: plotting high-dimensional data in 2D or 3D to see patterns.
- Noise reduction: removing minor variations that distract from major trends.
- Feature compression: creating compact representations that can help later tasks (including clustering).
Two broad styles are common:
- Linear methods: find directions that capture the most variation (often easier to interpret).
- Nonlinear methods: preserve local neighborhoods and complex shapes (often better for visualization, sometimes less stable and harder to interpret).
Practical Step-by-Step: Visualizing Products with Reduced Dimensions
Imagine you have product data with dozens of numeric features (price, weight, ratings, shipping time, return rate, category proportions, etc.). You want to see whether products naturally form groups.
- Step 1: Scale features so one unit means roughly the same “importance” across features.
- Step 2: Apply a dimensionality reduction method to produce 2D coordinates for each product.
- Step 3: Plot the products on a scatter plot and color them by an existing attribute (like category) to see whether the discovered structure aligns with known groupings.
- Step 4: If you also ran clustering, color by cluster ID to see whether clusters correspond to visible regions.
- Step 5: Inspect points that sit far from others; they may be niche products, data errors, or genuinely unique items.
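A hedged sketch of these steps using PCA as the (linear) reduction method. It assumes a hypothetical products DataFrame of numeric features plus a category column that is used only for coloring, not for fitting.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assume `products` is a DataFrame of numeric features plus a "category" column.
numeric = products.drop(columns=["category"])

X_scaled = StandardScaler().fit_transform(numeric)       # Step 1: scale
coords = PCA(n_components=2).fit_transform(X_scaled)     # Step 2: reduce to 2D

codes, names = pd.factorize(products["category"])        # Step 3: color by a known attribute
plt.scatter(coords[:, 0], coords[:, 1], c=codes, cmap="tab10", s=15)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Products projected to 2D (PCA)")
plt.show()
```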
Important caution: a 2D plot is a projection. Points that look close in 2D may not be close in the original space, and vice versa. Use visualization as a diagnostic tool, not as proof.
Anomaly Detection: Finding the “Doesn’t Belong” Cases
Anomaly detection aims to identify rare items that differ significantly from the majority. In unsupervised settings, you often do not have labeled examples of anomalies, so the model learns what “normal” looks like and flags deviations.
Common real-world examples:
- Payments: a transaction with an unusual amount, location, and timing compared to a user’s typical pattern.
- Manufacturing: a sensor reading pattern that deviates from normal machine behavior.
- IT operations: a server with unusual CPU/memory/network patterns.
Practical Step-by-Step: Simple Unsupervised Anomaly Workflow
- Step 1: Define “normal” scope. Decide whether “normal” is global (across all users) or personalized (per user/device). Personalized baselines often reduce false alarms.
- Step 2: Choose features that reflect behavior. For transactions: amount, merchant type, time of day, distance from typical location, device fingerprint signals.
- Step 3: Train a model of normality. This might be a density estimate, a distance-to-neighbors score, or a reconstruction-based approach (compress and reconstruct; large reconstruction error suggests anomaly).
- Step 4: Set a threshold. Decide how many alerts you can handle. Thresholds are often chosen to flag the top X% most unusual events.
- Step 5: Validate with investigation feedback. Even if you do not have labels initially, you can collect outcomes from human review to refine features and thresholds.
Because anomalies are rare, the biggest practical challenge is managing false positives. Unsupervised anomaly detection is often best used as a prioritization tool (“review these first”) rather than an automatic blocker.
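One hedged way to implement this workflow is with an isolation-forest scorer, thresholded to flag the most unusual 1% of events for review. The random data below is a stand-in for real per-transaction features (amount, hour of day, distance from typical location, and so on).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for a numeric matrix of per-transaction features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))

# Step 3: fit a model of "normal" behavior (no labels needed).
model = IsolationForest(n_estimators=200, random_state=0).fit(X)

# Higher score = more normal; lower (more negative) = more anomalous.
scores = model.score_samples(X)

# Step 4: flag the most unusual 1% for human review.
threshold = np.quantile(scores, 0.01)
flagged = np.where(scores <= threshold)[0]
print(f"Flagged {len(flagged)} of {len(X)} events for review")
```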
How to Interpret and Name Discovered Structure
Unsupervised outputs can feel abstract: cluster IDs like 0, 1, 2 or components like “Dimension 1.” Interpretation turns these into something usable.
Cluster interpretation checklist
- Profile clusters with summary statistics (means/medians, percentiles).
- Find distinguishing features: which features differ most between clusters?
- Use representative examples: pick a few typical items from each cluster and review them manually.
- Check edge cases: items near cluster boundaries may reveal overlap or missing features.
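One simple sketch for "which features differ most": compare each cluster's mean to the overall mean, in units of the overall standard deviation. This assumes the features DataFrame and cluster labels from the segmentation walkthrough.

```python
# Assume `features` (unscaled, per-customer) and `labels` (cluster assignments).
overall_mean = features.mean()
overall_std = features.std()

cluster_means = features.assign(cluster=labels).groupby("cluster").mean()
relative_shift = (cluster_means - overall_mean) / overall_std

# Features with the largest absolute shift best distinguish a cluster
# from the average customer.
for cluster_id, row in relative_shift.iterrows():
    top = row.abs().sort_values(ascending=False).head(3)
    print(f"Cluster {cluster_id}: most distinctive features -> {list(top.index)}")
```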
Dimensionality reduction interpretation checklist
- Relate dimensions to original features when possible (especially for linear methods).
- Color plots by known attributes (category, region, device) to see what the structure corresponds to.
- Test stability: do you see similar patterns if you rerun with a different sample?
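For a linear method such as PCA, one hedged way to relate dimensions to original features is to inspect the component loadings, reusing the scaled matrix and column names from the visualization sketch above:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Assume X_scaled and `numeric` (the original feature columns) from the PCA example.
pca = PCA(n_components=2).fit(X_scaled)

loadings = pd.DataFrame(
    pca.components_.T,
    index=numeric.columns,
    columns=["Component 1", "Component 2"],
)
# Features with large absolute loadings drive that component.
print(loadings.sort_values("Component 1", key=abs, ascending=False))
print("Explained variance ratio:", pca.explained_variance_ratio_)
```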
A useful habit is to write a short “data story” for each discovered group: who/what it contains, how it differs, and what action it suggests. If you cannot write that story, the clustering may not be aligned with your purpose, or the features may not capture the relevant behavior.
Common Pitfalls and How to Avoid Them
Pitfall 1: Treating clusters as “truth”
Clusters are a modeling choice, not a fact of nature. Different algorithms or feature choices can produce different groupings. Avoid overconfidence by checking stability and by validating usefulness through downstream outcomes.
Pitfall 2: Letting one feature dominate
If features are not scaled or are heavily skewed, clustering may mostly reflect one variable (often “spend” or “activity”). Use scaling, transformations, and sanity checks (e.g., see which features differ most between clusters).
Pitfall 3: Choosing the number of clusters based only on a score
Quantitative scores can help, but you also need interpretability and actionability. Five clusters that you can explain and use may beat twelve clusters with a slightly better score.
Pitfall 4: Clustering mixed data types without care
Numeric, categorical, text, and time series data require different similarity measures and preprocessing. If you mix them naively, the results can be misleading. Consider separate representations (e.g., embeddings for text) and ensure the distance metric matches the data type.
Pitfall 5: Ignoring time
Many datasets are dynamic. A customer can move between segments. If you cluster on lifetime aggregates, you may miss recent changes. Consider clustering on rolling windows or building “trajectory” features (trend up/down).
Mini Case Studies (Concrete Examples)
Case 1: Grouping support tickets by theme
You have thousands of customer support messages and want to find common issues without manually labeling them.
- Convert each message into a numeric representation (for example, a vector embedding).
- Cluster the vectors to group similar messages.
- For each cluster, review a handful of messages to assign a human-readable theme (e.g., “password reset,” “billing confusion,” “shipping delay”).
- Use cluster sizes and trends over time to prioritize product fixes.
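A hedged sketch of this pipeline using a simple TF-IDF representation (embeddings from a language model would follow the same shape); the example messages are invented.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "I can't reset my password",
    "Password reset email never arrived",
    "I was charged twice this month",
    "Why is there an extra charge on my bill?",
    "My order still hasn't shipped",
    "Package delayed for two weeks",
]

# Step 1: numeric representation of each message (here: TF-IDF vectors).
vectors = TfidfVectorizer(stop_words="english").fit_transform(messages)

# Step 2: cluster similar messages.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Step 3: review a few messages per cluster to assign a human-readable theme.
for cluster_id in sorted(set(labels)):
    examples = [m for m, l in zip(messages, labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {examples[:2]}")
```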
Case 2: Discovering usage patterns in an app
You track events like sessions per week, features used, time spent, and time-of-day activity.
- Aggregate events into user-level features for the last 30 days.
- Cluster users to find patterns such as “daily quick checkers,” “weekend power users,” and “new users exploring features.”
- Design in-app guidance tailored to each pattern (e.g., shortcuts for power users, tutorials for explorers).
Case 3: Detecting unusual sensor behavior
A factory collects sensor readings from machines. Failures are rare and not always labeled.
- Define normal operating windows and extract features (mean, variance, frequency-domain summaries, correlations between sensors).
- Train an unsupervised anomaly detector to score new windows.
- Alert maintenance when scores exceed a threshold, and record technician feedback to refine the system.
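A minimal sketch of the windowing and scoring steps, with synthetic readings standing in for one real sensor stream:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
signal = rng.normal(0.0, 1.0, size=60_000)      # stand-in for one sensor's readings

# Step 1: split the stream into fixed-size windows and summarize each one.
window = 600
windows = signal[: len(signal) // window * window].reshape(-1, window)
window_features = np.column_stack([
    windows.mean(axis=1),                           # level
    windows.std(axis=1),                            # variability
    np.abs(np.diff(windows, axis=1)).mean(axis=1),  # roughness
])

# Step 2: score windows with an unsupervised detector; lowest scores are most unusual.
scores = IsolationForest(random_state=0).fit(window_features).score_samples(window_features)

# Step 3: route the most unusual windows to a technician and record the feedback.
alert_indices = np.argsort(scores)[:5]
print("Windows to review first:", sorted(alert_indices.tolist()))
```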
A Simple Pseudocode Template You Can Reuse
```
# Unsupervised learning workflow template (conceptual)
1) Define goal (segment, visualize, detect anomalies)
2) Choose unit of analysis (row = customer, device, document, time window)
3) Create features that reflect the goal
4) Preprocess (handle missing values, scale/transform, encode categories)
5) Run method (clustering / dimensionality reduction / anomaly scoring)
6) Evaluate without labels (stability, separation, interpretability, usefulness)
7) Interpret results (profiles, examples, names)
8) Connect to action (experiments, monitoring, human review)
9) Iterate (features and settings)
```