In the context of data science and machine learning, cardinality refers to the uniqueness of data values contained in a particular column (attribute) of a dataset. Understanding cardinality is essential for feature engineering, data preprocessing, and selecting appropriate machine learning algorithms.
Cardinality can be categorized into three types:
- High Cardinality: columns with a large number of unique values. Examples: Social Security numbers, email addresses, user IDs.
- Low Cardinality: columns with a small number of unique values. Examples: Boolean fields (True/False), days of the week, months of the year.
- Moderate Cardinality: columns whose number of unique values falls between the two extremes. Examples: zip codes, country names, product categories.
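In pandas, a column's cardinality is simply its count of distinct values, which nunique() reports directly. A small illustration (the column names and values are made up for the example):

import pandas as pd

df = pd.DataFrame({
    'is_active':   [True, False, True, True],         # low cardinality
    'day_of_week': ['Mon', 'Tue', 'Mon', 'Fri'],      # low cardinality
    'user_id':     ['u001', 'u002', 'u003', 'u004'],  # high cardinality (unique per row)
})

# Number of distinct values per column = its cardinality
print(df.nunique())
# is_active      2
# day_of_week    3
# user_id        4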
Impact on Memory and Storage:
- High cardinality columns can significantly increase memory and storage requirements and can lead to performance issues if not handled appropriately.
- Low cardinality columns are easier to manage, are often more efficient in terms of memory and storage, and are well suited to categorical encoding techniques like one-hot encoding.
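A quick sketch with synthetic data shows why this matters: one-hot encoding a 7-value column adds only 7 columns, while the same treatment of a column with roughly 10,000 unique values multiplies the table width accordingly (the sizes here are arbitrary, chosen only for illustration):

import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)

# Low cardinality: 7 unique values; high cardinality: up to 10,000 unique values
low = pd.Series(rng.choice(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], size=n))
high = pd.Series(rng.integers(0, 10_000, size=n).astype(str))

# One-hot encoding produces one column per distinct value of the original column
print(pd.get_dummies(low).shape)   # (100000, 7)
print(pd.get_dummies(high).shape)  # (100000, ~10000) -- nearly every value appears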
Impact on Machine Learning Models:
- High cardinality: some models struggle with high cardinality features due to the curse of dimensionality; techniques like feature hashing or embeddings (covered below) are often used to manage them.
- Low cardinality: easier to handle, effectively managed with traditional encoding techniques, and less likely to cause performance degradation.
Several techniques are commonly used to manage high cardinality features.

Feature Hashing:
- Maps a large number of categories to a smaller, fixed number of buckets using a hash function.
- Reduces dimensionality but may introduce collisions (different values mapped to the same bucket).
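A minimal feature hashing sketch using scikit-learn's FeatureHasher (the bucket count n_features=8 and the sample IDs are arbitrary choices for illustration):

from sklearn.feature_extraction import FeatureHasher

# High cardinality categorical values (e.g., user IDs)
user_ids = ['user_10492', 'user_88310', 'user_00417', 'user_10492']

# Hash each value into one of 8 buckets; distinct values can collide
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[uid] for uid in user_ids])

print(hashed.toarray())  # shape (4, 8); repeated values land in the same bucket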
Embeddings:
- Convert categorical variables into continuous vector spaces of dense, learned vectors.
- Commonly used in deep learning models, especially for handling high cardinality features; see the PyTorch example below.
- Example: word embeddings in natural language processing (NLP).
Frequency Encoding:
- Encodes categories based on their frequency in the dataset.
- Useful for high cardinality categorical variables where certain categories are more common.
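A minimal frequency encoding sketch in pandas, mapping each category to its occurrence count (value_counts(normalize=True) gives relative frequencies instead; the city column is made up for the example):

import pandas as pd

df = pd.DataFrame({'city': ['London', 'Paris', 'London', 'Tokyo', 'London', 'Paris']})

# Replace each category with how often it occurs in the column
df['city_freq'] = df['city'].map(df['city'].value_counts())

print(df)
#      city  city_freq
# 0  London          3
# 1   Paris          2
# 2  London          3
# 3   Tokyo          1
# 4  London          3
# 5   Paris          2

The two examples below show encoding at each end of the cardinality spectrum in runnable form. First, one-hot encoding a low cardinality column with scikit-learn: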
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample data
data = {'day_of_week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']}
df = pd.DataFrame(data)
# One-Hot Encoding
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(df[['day_of_week']])
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['day_of_week']))
print(encoded_df)
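And an embedding example in PyTorch, where integer user IDs index learned 3-dimensional vectors: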
import torch
import torch.nn as nn
# Sample high cardinality data (user IDs)
user_ids = torch.tensor([1, 2, 3, 4, 5])
# Embedding layer: each ID indexes a learned 3-dimensional vector;
# num_embeddings must be larger than the largest ID
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)  # 10 users, 3-dim embeddings
embedded_user_ids = embedding(user_ids)
print(embedded_user_ids)
Understanding and appropriately handling cardinality is crucial in the preprocessing stage of machine learning projects. The choice of techniques for managing cardinality can significantly impact the performance and efficiency of the models. By leveraging methods like feature hashing, embeddings, and frequency encoding, we can effectively handle high cardinality features and improve our machine learning pipelines.