Clustering#

Creating composite variables from multiple survey questions.

Overview#

Clustering combines information from multiple related questions into a single derived variable. This is useful for:

  • Simplifying analysis - Reduce dimensionality by grouping related questions

  • Capturing multidimensional concepts - A single concept (e.g., “career stage”) may require multiple questions

  • Improving statistical power - Combining questions increases information per variable

  • Handling missing data - Can use partial information when some responses are missing

Clustering in Titanite#

Defining Cluster Rules#

Clusters are defined in your survey plugin:

from titanite.core import ClusterRule

def get_cluster_rules(self) -> list[ClusterRule]:
    return [
        ClusterRule(
            name="gender_cluster",
            description="Combined gender identity and expression",
            source_columns=["q01", "q02"],
            aggregation_func="combine",
        ),
    ]

ClusterRule Parameters#

  • name - Output column name

  • description - Human-readable description

  • source_columns - Input columns to combine

  • aggregation_func - How to combine values

Aggregation Methods#

1. Combine (String Concatenation)#

Combines values from multiple columns into a single value:

ClusterRule(
    name="gender_cluster",
    source_columns=["q01", "q02"],  # gender_identity, gender_expression
    aggregation_func="combine",
)

Example:

q01 (Identity) | q02 (Expression) | Result (gender_cluster)
-------------- | ---------------- | -----------------------
Man            | Masculine        | Man - Masculine
Woman          | Feminine         | Woman - Feminine
Non-binary     | Masculine        | Non-binary - Masculine

Use case: Capturing multiple dimensions of gender identity in a single variable.
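The example rows suggest that combine joins the source values with " - ". A minimal sketch of that behavior in plain Python follows. The helper name combine_values and the separator are assumptions inferred from the example, not Titanite's actual implementation.

```python
def combine_values(values, sep=" - "):
    """Join the answers from several source columns into one label.

    Hypothetical helper: the " - " separator is inferred from the example
    rows, not taken from Titanite's source.
    """
    return sep.join(str(v) for v in values)


row = {"q01": "Non-binary", "q02": "Masculine"}
label = combine_values([row["q01"], row["q02"]])
print(label)  # Non-binary - Masculine
```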

2. Majority Vote#

Selects the most common value across source columns:

ClusterRule(
    name="diversity_focus",
    source_columns=["q10", "q11", "q12"],  # Multiple diversity questions
    aggregation_func="majority_vote",
)

Example:

q10   | q11   | q12   | Result (majority_vote)
----- | ----- | ----- | ----------------------
Yes   | Yes   | No    | Yes
Yes   | No    | No    | No
Maybe | Maybe | Maybe | Maybe

Use case: Determining overall sentiment when multiple related yes/no questions exist.

Handling ties:

  • If no value has a clear majority, the first column's value is used

  • If all values are different, the first column's value is used
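The selection and tie rules above can be sketched in plain Python with collections.Counter. The function name majority_vote is reused here only for illustration. This is a hypothetical stand-in that assumes "most common value" means the value with the highest count, not Titanite's actual code.

```python
from collections import Counter


def majority_vote(values):
    """Return the most common value; fall back to the first value on ties.

    Hypothetical helper mirroring the tie rules listed above, not
    Titanite's actual implementation.
    """
    counts = Counter(values)
    top_count = max(counts.values())
    winners = [v for v, c in counts.items() if c == top_count]
    # A unique winner is returned; otherwise use the first column's value.
    return winners[0] if len(winners) == 1 else values[0]


print(majority_vote(["Yes", "Yes", "No"]))    # Yes
print(majority_vote(["Yes", "No", "No"]))     # No
print(majority_vote(["Yes", "No", "Maybe"]))  # Yes (all different)
```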

3. Concatenate#

Similar to combine but with different formatting:

ClusterRule(
    name="career_path",
    source_columns=["q05", "q06", "q07"],
    aggregation_func="concatenate",
)

Example:

q05      | q06     | q07     | Result
-------- | ------- | ------- | ----------------------------
Academia | Physics | 5 years | Academia - Physics - 5 years

Use case: Creating detailed categorical variable from multiple dimensions.

ICRC2023 Example: Gender Ratio Clustering#

The ICRC2023 survey combines gender identity and gender expression, which are questions q13 and q14 in that survey (they correspond to q01 and q02 in the generic examples above):

ClusterRule(
    name="q13q14_clustered",
    description="Gender identity and expression clustering",
    source_columns=["q13", "q14"],  # (or q01, q02)
    aggregation_func="combine",
)

This creates categories like:

  • “Man - Masculine”

  • “Woman - Feminine”

  • “Non-binary - Masculine”

  • “Transgender Man - Masculine”

  • etc.

Statistical Analysis#

Once created, q13q14_clustered is treated as a categorical variable:

# Chi-square tests including clustered variable
poetry run ti chi2

# Cross-tabulations with clustered variable
poetry run ti crosstabs

# Visualizations
poetry run ti hbars

Creating Effective Clusters#

2. Document the Purpose#

ClusterRule(
    name="career_stage",
    description="Combines years of experience and seniority level",
    source_columns=["experience_years", "job_level"],
    aggregation_func="combine",
)

3. Consider Interpretability#

Will the output be easy to interpret?

# Clear output
ClusterRule(
    name="gender_cluster",
    source_columns=["q01", "q02"],
    aggregation_func="combine",
    # Output: "Man - Masculine", "Woman - Feminine", etc.
)

# Unclear output
ClusterRule(
    name="derived_factor",
    source_columns=["q05", "q06", "q07", "q08", "q09"],
    aggregation_func="majority_vote",
    # Output: Too many dimensions, hard to interpret
)

4. Handle Missing Values#

Specify how to handle rows where source columns have missing values:

ClusterRule(
    name="gender_cluster",
    source_columns=["q01", "q02"],
    aggregation_func="combine",
    handle_missing="drop",  # Drop if any value missing
    # or "keep" - include NaN in output
    # or "skip_column" - skip missing columns
)
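To make the three modes concrete, here is a small plain-Python sketch. The exact semantics are an assumption based on the comments above: "drop" leaves the row out, "keep" records a missing cluster value, and "skip_column" joins whatever answers are present. The helper combine_with_missing is hypothetical and uses None to represent a missing answer.

```python
def combine_with_missing(rows, columns, handle_missing="drop", sep=" - "):
    """Combine columns per row under one of three missing-value modes.

    Assumed semantics (illustrative only):
      "drop"        - leave the row out of the result entirely
      "keep"        - keep the row, with None as its cluster value
      "skip_column" - join whatever answers are present
    """
    results = []
    for row in rows:
        values = [row.get(col) for col in columns]
        present = [v for v in values if v is not None]
        if len(present) == len(values):
            results.append(sep.join(present))
        elif handle_missing == "drop":
            continue
        elif handle_missing == "keep":
            results.append(None)
        elif handle_missing == "skip_column":
            # If every answer is missing there is nothing to join.
            results.append(sep.join(present) if present else None)
        else:
            raise ValueError(f"unknown handle_missing mode: {handle_missing!r}")
    return results


rows = [
    {"q01": "Man", "q02": "Masculine"},
    {"q01": "Woman", "q02": None},
]
print(combine_with_missing(rows, ["q01", "q02"], "drop"))
print(combine_with_missing(rows, ["q01", "q02"], "keep"))
print(combine_with_missing(rows, ["q01", "q02"], "skip_column"))
```

Under these assumptions, "skip_column" keeps the most rows but mixes full labels such as "Man - Masculine" with partial ones such as "Woman", so the resulting categories need to be read with that in mind.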

Cluster Analysis Workflow#

Step 1: Define Clusters in Plugin#

# plugins/your_survey/schema.py
def get_cluster_rules(self) -> list[ClusterRule]:
    return [
        ClusterRule(
            name="experience_cluster",
            source_columns=["years_exp", "current_level"],
            aggregation_func="combine",
        ),
    ]

Step 2: Prepare Data#

poetry run ti prepare data.csv --plugin plugins.your_survey.YourSchema

Check output columns:

head -n 1 prepared_data.csv
# The header row should include the "experience_cluster" column
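The presence of the new column can also be checked from Python by reading the CSV header. This is a minimal sketch using only the standard library, assuming the prepared output is a comma-separated file with a header row. csv_columns and has_column are hypothetical helpers, not part of Titanite.

```python
import csv


def csv_columns(path):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def has_column(path, name):
    """Report whether the CSV header contains the given column name."""
    return name in csv_columns(path)
```

For example, has_column("prepared_data.csv", "experience_cluster") should return True once the cluster rule has been applied.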

Step 3: Analyze Clusters#

poetry run ti chi2           # Test associations with clusters
poetry run ti crosstabs      # Cross-tabulations
poetry run ti hbars          # Visualizations

Step 4: Interpret Results#

Examine how the clustered variable associates with other variables:

poetry run ti p005 experience_cluster --save

This shows which variables are significantly associated with the cluster.

Advanced Clustering#

Conditional Clustering#

Apply different rules based on conditions:

def get_cluster_rules(self) -> list[ClusterRule]:
    return [
        ClusterRule(
            name="gender_cluster",
            source_columns=["q01", "q02"],
            aggregation_func="combine",
            condition=lambda row: row["q01"] != "prefer_not_to_answer",
            # Only cluster if q01 was answered
        ),
    ]
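The effect of a row-level condition can be sketched without Titanite. The helper apply_conditional below is hypothetical, and it assumes that rows failing the condition receive a missing value (None) instead of a cluster label. The real behavior may differ.

```python
def apply_conditional(rows, columns, condition, sep=" - "):
    """Combine columns only for rows where condition(row) is true.

    Assumption for illustration: rows failing the condition get None.
    """
    return [
        sep.join(row[col] for col in columns) if condition(row) else None
        for row in rows
    ]


def q01_answered(row):
    """Same predicate as the lambda in the rule above."""
    return row["q01"] != "prefer_not_to_answer"


rows = [
    {"q01": "Woman", "q02": "Feminine"},
    {"q01": "prefer_not_to_answer", "q02": "Masculine"},
]
clustered = apply_conditional(rows, ["q01", "q02"], q01_answered)
print(clustered)  # ['Woman - Feminine', None]
```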

Weighted Clustering#

Weight some columns more heavily:

ClusterRule(
    name="importance_score",
    source_columns=["q10", "q11"],
    aggregation_func="weighted_combine",
    weights=[0.7, 0.3],  # q10 counts 70%, q11 counts 30%
)
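The text does not say how the weights are applied. One plausible reading, for numeric answers such as ratings, is a weighted average. The sketch below implements only that reading. The function body is an assumption and not Titanite's implementation of weighted_combine.

```python
def weighted_combine(values, weights):
    """Weighted average of numeric answers (an assumed interpretation)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# q10 rated 4 and q11 rated 2, weighted 70% / 30%
score = weighted_combine([4, 2], [0.7, 0.3])
print(round(score, 2))  # 3.4
```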

Hierarchical Clustering#

Create clusters based on other clusters:

def get_cluster_rules(self) -> list[ClusterRule]:
    return [
        # First level
        ClusterRule(
            name="gender_cluster",
            source_columns=["q01", "q02"],
            aggregation_func="combine",
        ),
        # Second level - uses first cluster
        ClusterRule(
            name="gender_by_field",
            source_columns=["gender_cluster", "q05"],
            aggregation_func="combine",
        ),
    ]
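A second-level rule can only work if the first-level column already exists when it runs, so rule order matters. The following sketch illustrates that dependency with plain dictionaries. The helper apply_rules_in_order and the (name, source_columns) tuples are hypothetical simplifications, and sequential evaluation is an assumption about how such rules would need to be processed.

```python
def apply_rules_in_order(row, rules, sep=" - "):
    """Apply (name, source_columns) rules sequentially to one row.

    Each result is written back before the next rule runs, so a later
    rule can list an earlier cluster as one of its source columns.
    """
    out = dict(row)
    for name, sources in rules:
        out[name] = sep.join(out[col] for col in sources)
    return out


rules = [
    ("gender_cluster", ["q01", "q02"]),
    ("gender_by_field", ["gender_cluster", "q05"]),
]
result = apply_rules_in_order(
    {"q01": "Woman", "q02": "Feminine", "q05": "Physics"}, rules
)
print(result["gender_by_field"])  # Woman - Feminine - Physics
```

Reversing the two rules in this sketch raises a KeyError, because gender_cluster does not exist yet when gender_by_field is computed.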

Best Practices#

1. Start Simple#

Begin with straightforward combinations before complex hierarchies:

# Good - Start here
ClusterRule(
    source_columns=["q01", "q02"],
    aggregation_func="combine",
)

# More complex - Add later
ClusterRule(
    source_columns=["gender_cluster", "q05", "q06"],
    aggregation_func="majority_vote",
)

2. Validate Results#

After creating clusters, verify they make sense:

# View unique cluster values
poetry run ti config --choices
# Look for "your_cluster" column

# Check distribution
poetry run ti hbars --save
# Visualize cluster distribution
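A quick numeric check of the distribution can complement the plots. This sketch counts cluster labels with collections.Counter. The helper cluster_distribution and the sample labels are hypothetical and only illustrate the idea.

```python
from collections import Counter


def cluster_distribution(values):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Made-up labels for illustration only
labels = (
    ["Man - Masculine"] * 3
    + ["Woman - Feminine"] * 2
    + ["Non-binary - Masculine"]
)
for category, count, share in cluster_distribution(labels):
    print(f"{category}: {count} ({share:.0%})")
```

Categories with very small counts are candidates for merging before running significance tests.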

3. Test Independence#

Check that the clustered variable adds information and does not merely repeat its source columns:

# Run chi-square tests on the source columns against other variables
poetry run ti chi2
# Compare results before and after clustering

4. Document Decisions#

Include rationale in schema:

ClusterRule(
    name="gender_cluster",
    description="Combines gender identity (q01) and expression (q02) "
                "as they are interdependent aspects of gender",
    source_columns=["q01", "q02"],
    aggregation_func="combine",
)

Common Issues#

Too Many Categories#

Combining many questions creates too many output categories:

Problem:

ClusterRule(
    source_columns=["q01", "q02", "q03"],  # 5 × 5 × 5 = 125 categories
    aggregation_func="combine",
)

Solution: Use majority_vote instead of combine, which limits the output to the answer values that already occur in the source columns, or restrict combine to two source columns.
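The growth in categories is multiplicative: the upper bound for combine is the product of the number of answer options per source column. A short sketch of that arithmetic follows, with a helper for counting the combinations that actually occur in the data. Both function names are hypothetical.

```python
from math import prod


def max_categories(levels_per_column):
    """Upper bound on combined categories: product of options per column."""
    return prod(levels_per_column)


def observed_categories(rows, columns):
    """Number of distinct answer combinations that actually occur."""
    return len({tuple(row[col] for col in columns) for row in rows})


print(max_categories([5, 5, 5]))  # 125
print(max_categories([5, 5]))     # 25
```

The observed count is usually lower than the upper bound, but many of the combinations that do occur will contain only a handful of respondents.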

Missing Value Explosion#

A missing value in any source column affects the entire cluster:

Problem: If q01 or q02 is missing, the entire cluster value is missing.

Solution: Handle gracefully:

ClusterRule(
    source_columns=["q01", "q02"],
    aggregation_func="combine",
    handle_missing="skip_column",  # Ignore missing columns
)

Imbalanced Categories#

Some cluster combinations have many cases, others have few:

Problem: Small cell sizes reduce statistical power.

Solution:

  • Combine rare categories

  • Use cell suppression (n < 5)

  • Report effect size, not just p-values
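Two of these remedies can be sketched in plain Python: merging rare categories into a single label, and computing Cramér's V as an effect size for a contingency table. Both helpers are hypothetical illustrations. The threshold of 5 follows the cell-size rule above, and the "Other" label is an assumption. cramers_v assumes every row and column total is nonzero and that the table has at least two rows and two columns.

```python
from collections import Counter
from math import sqrt


def collapse_rare(values, min_count=5, other="Other"):
    """Relabel categories seen fewer than min_count times as `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k))


merged = collapse_rare(["A"] * 6 + ["B"] * 2)
print(Counter(merged))               # Counter({'A': 6, 'Other': 2})
print(cramers_v([[10, 0], [0, 10]]))  # 1.0 (perfect association)
print(cramers_v([[5, 5], [5, 5]]))    # 0.0 (no association)
```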

See Also#