Plugin Development
Guide for creating custom survey plugins for titanite.
Overview
Plugins allow you to add support for new surveys without modifying the core framework. Each plugin implements the SurveySchema abstract base class to define survey-specific rules.
Quick Start
1. Create Plugin Structure
mkdir -p plugins/your_survey
touch plugins/your_survey/__init__.py
touch plugins/your_survey/schema.py
2. Implement Schema
# plugins/your_survey/schema.py
from titanite.core import SurveySchema, SplitColumnRule, ClusterRule, BinRule


class YourSurveySchema(SurveySchema):
    # Define question categories
    categorical_headers = ["q01", "q02", "q03"]
    numerical_headers = ["q10", "q13"]
    free_text_columns = ["q15", "q16"]

    def get_replace_rules(self) -> dict[str, dict]:
        """Value standardization rules."""
        return {}

    def get_split_rules(self) -> list[SplitColumnRule]:
        """Column splitting rules."""
        return []

    def get_cluster_rules(self) -> list[ClusterRule]:
        """Derived cluster columns."""
        return []

    def get_bin_rules(self) -> list[BinRule]:
        """Numerical binning rules."""
        return []
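As a quick sanity check on the skeleton above, the following sketch verifies that a schema object exposes the three header lists and four rule methods named in this guide, and that no column is listed in more than one category. `StandInSchema` and `check_schema_contract` are hypothetical names introduced here for illustration and are not part of titanite; the no-overlap rule is an assumption, not something the framework is known to enforce.

```python
# Sketch only: a duck-typed contract check. Not part of titanite.

HEADER_ATTRS = ("categorical_headers", "numerical_headers", "free_text_columns")
RULE_METHODS = {
    "get_replace_rules": dict,
    "get_split_rules": list,
    "get_cluster_rules": list,
    "get_bin_rules": list,
}


class StandInSchema:
    """Hypothetical stand-in mirroring the skeleton above; swap in your plugin class."""

    categorical_headers = ["q01", "q02", "q03"]
    numerical_headers = ["q10", "q13"]
    free_text_columns = ["q15", "q16"]

    def get_replace_rules(self):
        return {}

    def get_split_rules(self):
        return []

    def get_cluster_rules(self):
        return []

    def get_bin_rules(self):
        return []


def check_schema_contract(schema):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    owner = {}  # column name -> header attribute that first listed it
    for attr in HEADER_ATTRS:
        columns = getattr(schema, attr, None)
        if not isinstance(columns, list):
            problems.append(f"{attr} should be a list")
            continue
        for column in columns:
            if column in owner:
                problems.append(f"{column!r} is listed in both {owner[column]} and {attr}")
            else:
                owner[column] = attr
    for name, expected_type in RULE_METHODS.items():
        method = getattr(schema, name, None)
        if not callable(method):
            problems.append(f"{name}() is missing")
        elif not isinstance(method(), expected_type):
            problems.append(f"{name}() should return a {expected_type.__name__}")
    return problems


problems = check_schema_contract(StandInSchema())
print(problems or "schema looks consistent")
```

Passing an instance of your real plugin class instead of the stand-in gives a fast check before running the full pipeline.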
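3. Use Plugin
The class is defined in schema.py, so re-export it from plugins/your_survey/__init__.py (from .schema import YourSurveySchema). The package-level import used in the unit tests later in this guide requires this, and the path below follows the same package.ClassName form.
poetry run ti prepare data.csv --plugin plugins.your_survey.YourSurveySchema
Detailed Implementation
Categorical and Numerical Headers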
Define which columns are categorical (discrete) vs numerical (continuous):
class YourSurveySchema(SurveySchema):
    # Categorical: Used for chi-square tests, heatmaps
    categorical_headers = [
        "q01",           # Gender
        "q02",           # Age group
        "q03_regional",  # Geographic region
    ]

    # Numerical: Used for correlation, box plots
    numerical_headers = [
        "q10",  # Likert scale
        "q13",  # Ratio question
    ]

    # Free-text: Stored separately for privacy
    free_text_columns = [
        "q15",  # Comments
        "q16",  # Suggestions
    ]
Value Replacement Rules
Standardize response values for consistency:
def get_replace_rules(self) -> dict[str, dict]:
    """
    Define value replacements for each column.

    Format: {
        'column_name': {
            'original_value': 'standardized_value',
            ...
        }
    }
    """
    return {
        "q01": {
            "Male": "male",
            "M": "male",
            "male": "male",
            "Female": "female",
            "F": "female",
            "female": "female",
            "Other": "other",
        },
        "q02": {
            "Yes": "yes",
            "No": "no",
            "N/A": "na",
        },
    }
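To make the effect of the nested mapping concrete, here is a minimal sketch that applies rules of this shape to plain dictionaries standing in for survey rows. `apply_replace_rules` is a hypothetical helper, not titanite's implementation, and it assumes that values without a rule pass through unchanged.

```python
def apply_replace_rules(rows, rules):
    """Return new rows with each column's values mapped through its rule table.

    Assumption: a value with no matching rule is kept as-is.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, mapping in rules.items():
            if column in new_row:
                value = new_row[column]
                new_row[column] = mapping.get(value, value)
        cleaned.append(new_row)
    return cleaned


rules = {"q01": {"Male": "male", "M": "male", "F": "female"}}
rows = [
    {"q01": "M", "q10": 4},
    {"q01": "F", "q10": 2},
    {"q01": "Nonbinary", "q10": 5},  # no rule: kept unchanged
]
cleaned = apply_replace_rules(rows, rules)
print([row["q01"] for row in cleaned])  # ['male', 'female', 'Nonbinary']
```

Values that survive unchanged, such as "Nonbinary" here, are the ones worth reviewing when deciding whether a rule is missing.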
Split Rules
Split composite columns (e.g., geographic regions):
from titanite.core import SplitColumnRule


def get_split_rules(self) -> list[SplitColumnRule]:
    """
    Define how to split composite columns.

    Example: "United States - California" →
        "country": "United States",
        "region": "California"
    """
    return [
        SplitColumnRule(
            source_column="q03",                   # Input column
            split_char=" - ",                      # Delimiter
            target_columns=["country", "region"],  # Output columns
        ),
        SplitColumnRule(
            source_column="q04",
            split_char="|",
            target_columns=["department", "subdept"],
        ),
    ]
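The splitting behaviour described above can be sketched with `str.split`. `SplitRule` is a hypothetical stand-in that copies the three field names from the `SplitColumnRule` calls; how titanite treats a value with too few or too many parts is not documented here, so padding with `None` and keeping extra delimiters in the last part are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SplitRule:
    """Hypothetical stand-in with the same field names as SplitColumnRule."""

    source_column: str
    split_char: str
    target_columns: List[str]


def split_value(value: str, rule: SplitRule) -> Dict[str, Optional[str]]:
    """Split one cell into the rule's target columns."""
    # Limit the number of splits so extra delimiters stay in the last part.
    parts = value.split(rule.split_char, len(rule.target_columns) - 1)
    parts = [part.strip() for part in parts]
    # Assumption: pad with None when the delimiter is absent.
    parts += [None] * (len(rule.target_columns) - len(parts))
    return dict(zip(rule.target_columns, parts))


rule = SplitRule(source_column="q03", split_char=" - ", target_columns=["country", "region"])
print(split_value("United States - California", rule))
print(split_value("Japan", rule))
```

Checking a value without the delimiter, such as "Japan", is a cheap way to see how a rule behaves on incomplete answers.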
Cluster Rules
Create derived columns combining multiple questions:
from titanite.core import ClusterRule


def get_cluster_rules(self) -> list[ClusterRule]:
    """
    Define derived columns combining related questions.

    Example: Combine gender identity (q01) and gender expression (q02)
    into a single clustered column.
    """
    return [
        ClusterRule(
            name="gender_cluster",
            description="Combined gender identity and expression",
            source_columns=["q01", "q02"],
            aggregation_func="combine",  # or "majority_vote", "concatenate"
        ),
        ClusterRule(
            name="experience_level",
            description="Career experience derived from multiple factors",
            source_columns=["q05", "q06", "q07"],
            aggregation_func="majority_vote",
        ),
    ]
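This guide names the aggregation functions but does not define them. The sketch below shows one plausible reading of "majority_vote" and "concatenate" on a single row; the function bodies, the "_" separator, and the `AGGREGATORS` registry are all assumptions for illustration, and "combine" is left out because its meaning is not stated.

```python
from collections import Counter


def majority_vote(values):
    """Return the most frequent non-missing value, or None if all are missing."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def concatenate(values, separator="_"):
    """Join the non-missing values in column order (the separator is an assumption)."""
    return separator.join(str(v) for v in values if v is not None)


# Hypothetical registry keyed by the aggregation_func strings used above.
AGGREGATORS = {"majority_vote": majority_vote, "concatenate": concatenate}


def cluster_row(row, source_columns, aggregation_func):
    """Compute one derived value from the source columns of a single row."""
    values = [row.get(column) for column in source_columns]
    return AGGREGATORS[aggregation_func](values)


row = {"q05": "senior", "q06": "junior", "q07": "senior"}
print(cluster_row(row, ["q05", "q06", "q07"], "majority_vote"))  # senior
print(cluster_row(row, ["q05", "q06"], "concatenate"))  # senior_junior
```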
Bin Rules
Convert numerical data to categorical bins:
from titanite.core import BinRule


def get_bin_rules(self) -> list[BinRule]:
    """
    Define numerical binning rules.

    Example: Convert numerical age to age groups.
    """
    return [
        BinRule(
            source_column="age",
            target_column="age_group",
            bins=[0, 18, 30, 40, 50, 65, 100],
            labels=["<18", "18-30", "30-40", "40-50", "50-65", "65+"],
        ),
        BinRule(
            source_column="experience_years",
            target_column="experience_level",
            bins=[0, 2, 5, 10, 20],
            labels=["Entry", "Junior", "Senior", "Principal"],
        ),
    ]
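A bin rule needs exactly one label per interval, so `len(labels)` must equal `len(bins) - 1`. The sketch below checks that and assigns a label using right-closed intervals, which is the default convention of `pandas.cut`; whether titanite uses that convention is an assumption, and `assign_bin` is a hypothetical helper.

```python
from bisect import bisect_left


def assign_bin(value, bins, labels):
    """Return the label of the interval (bins[i], bins[i + 1]] containing value.

    Values at or below the first edge, above the last edge, or missing
    get None (assumed behaviour, mirroring the pandas.cut defaults).
    """
    if len(labels) != len(bins) - 1:
        raise ValueError("need exactly one label per interval")
    if value is None:
        return None
    index = bisect_left(bins, value) - 1
    if 0 <= index < len(labels):
        return labels[index]
    return None


age_bins = [0, 18, 30, 40, 50, 65, 100]
age_labels = ["<18", "18-30", "30-40", "40-50", "50-65", "65+"]

for age in (17, 18, 19, 100, 101):
    print(age, assign_bin(age, age_bins, age_labels))
```

Under this convention an age of exactly 18 lands in "<18", and anyone older than 100 gets no label, so it is worth checking the edge values of your own bins against the labels you chose.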
Configuration
Create a config.toml for your survey in sandbox/:
[questions]
q01 = "Gender identity"
q02 = "Age group"
q03 = "Geographic location"
q04 = "Organization"
q05 = "Years of experience"
[categorical_headers]
default = ["q01", "q02", "q03", "q03_regional"]
[numerical_headers]
default = ["q05"]
Testing Your Plugin
Unit Test
# tests/test_your_survey_schema.py
import pandas as pd
import pytest

# Requires plugins/your_survey/__init__.py to re-export the class from .schema
from plugins.your_survey import YourSurveySchema

# ASSUMPTION: the original guide does not show where SurveyProcessor is
# imported from; adjust this path to match your titanite version.
from titanite.core import SurveyProcessor


def test_schema_initialization():
    schema = YourSurveySchema()
    assert schema.categorical_headers
    assert schema.numerical_headers


def test_replace_rules():
    schema = YourSurveySchema()
    rules = schema.get_replace_rules()
    assert "q01" in rules
    assert rules["q01"]["Male"] == "male"


def test_split_rules():
    schema = YourSurveySchema()
    rules = schema.get_split_rules()
    assert len(rules) > 0
    assert rules[0].target_columns


def test_with_real_data():
    """Test plugin with actual survey data."""
    schema = YourSurveySchema()
    df = pd.read_csv("data/test_data/your_survey.csv")

    # Run processing pipeline
    processor = SurveyProcessor(schema)
    result = processor.process(df)

    # Processing should keep one output row per input row
    assert result is not None
    assert len(result) == len(df)
Integration Test
# Test full pipeline with real data
cd sandbox
poetry run ti prepare ../data/raw_data/your_survey.csv \
--plugin plugins.your_survey.YourSurveySchema
Best Practices
1. Value Standardization
Always convert values to lowercase and a consistent format:
def get_replace_rules(self) -> dict[str, dict]:
    return {
        "q01": {
            "YES": "yes",
            "Yes": "yes",
            "yes": "yes",
        },
    }
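Listing every capitalization by hand gets tedious as the number of answers grows. A small helper can generate the variants from the canonical lowercase values; `build_case_rules` is a hypothetical name, and the set of variants it covers (lower, upper, capitalized, title case) is a choice made for this sketch.

```python
def build_case_rules(canonical_values):
    """Map common capitalizations of each value to its lowercase form."""
    rules = {}
    for value in canonical_values:
        lower = value.lower()
        for variant in (lower, lower.upper(), lower.capitalize(), lower.title()):
            rules[variant] = lower
    return rules


yes_no_rules = build_case_rules(["yes", "no"])
print(yes_no_rules)
# {'yes': 'yes', 'YES': 'yes', 'Yes': 'yes', 'no': 'no', 'NO': 'no', 'No': 'no'}
```

The result can be returned directly as the mapping for a column, for example `{"q01": build_case_rules(["yes", "no"])}`.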
2. Documentation
Document each rule clearly:
def get_split_rules(self) -> list[SplitColumnRule]:
    """
    Split geographic data into components.

    Input format: "Country - Region - Subregion"
    Output columns: ["country", "region", "subregion"]
    """
    return [...]
3. Handle Edge Cases
Consider missing or invalid values:
def get_replace_rules(self) -> dict[str, dict]:
    return {
        "q01": {
            "": "missing",
            "N/A": "missing",
            "Unknown": "missing",
            "Prefer not to answer": "prefer_not_to_answer",
        },
    }
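String keys such as "" and "N/A" only match if those strings reach the replacement step. With `pandas.read_csv` defaults, empty cells and "N/A" are parsed as NaN unless `keep_default_na=False` is passed, so it can help to normalize missing markers first. The sketch below is one way to do that; `normalize_missing` and its token set are hypothetical and not part of titanite.

```python
# Lowercased markers treated as missing; taken from the rules above.
MISSING_TOKENS = {"", "n/a", "unknown"}


def normalize_missing(value, missing_label="missing"):
    """Collapse None, NaN, and known missing markers into one label."""
    if value is None:
        return missing_label
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return missing_label
    text = str(value).strip()
    if text.lower() in MISSING_TOKENS:
        return missing_label
    return text


for raw in (None, float("nan"), "  N/A ", "Unknown", "Prefer not to answer"):
    print(repr(raw), "->", normalize_missing(raw))
```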
4. Keep Rules Maintainable
Group related rules together:
def get_replace_rules(self) -> dict[str, dict]:
    gender_rules = {"M": "male", "F": "female"}
    age_rules = {"<18": "under_18", "18-25": "18_25"}

    return {
        "q01": gender_rules,
        "q02": age_rules,
    }
5. Test with Sample Data
Always test with representative sample data:
poetry run ti prepare sample_data.csv \
    --plugin plugins.your_survey.YourSurveySchema
Verify the output:
All categorical values are in config.toml choices
No unexpected missing values
Split columns have correct structure
Binned columns have correct labels
Common Issues and Solutions
Plugin Not Found
Ensure the module path is correct:
# Correct format: plugins.package.ClassName
poetry run ti prepare data.csv --plugin plugins.your_survey.YourSurveySchema
# NOT: plugins/your_survey/YourSurveySchema
# NOT: your_survey.YourSurveySchema
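The reason the dotted form matters becomes clearer when looking at how such a path is typically resolved: everything before the last dot is imported as a module, and the final component is looked up as an attribute. The sketch below shows that general pattern with `importlib`; it is not titanite's actual loader, and `resolve_dotted_path` is a hypothetical name.

```python
import importlib


def resolve_dotted_path(path):
    """Import 'package.module.Name' and return the object called Name."""
    module_path, dot, attr_name = path.rpartition(".")
    if not dot:
        raise ValueError(f"expected 'package.ClassName', got {path!r}")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr_name)
    except AttributeError:
        raise ImportError(
            f"module {module_path!r} has no attribute {attr_name!r}"
        ) from None


# Demonstrated with a standard-library class so the example runs anywhere.
cls = resolve_dotted_path("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

Under this pattern a slash-separated path has no dot to split on, and a path that omits the leading package fails at import time unless that package happens to be on the import path.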
Import Errors
Add the plugin to pyproject.toml if needed. Entry-point values use the module:object form, with a colon before the class name:
[tool.poetry.plugins."titanite.surveys"]
your_survey = "plugins.your_survey:YourSurveySchema"
Data Type Mismatches
Ensure categorical values match config.toml:
[categorical_choices]
q01 = ["male", "female", "other"] # Lowercase
Missing Replacements
Check that all observed values have replacement rules:
# Run validation to find unmapped values
poetry run ti config --questions
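Independently of the CLI, unmapped values can be found by comparing the observed responses with the keys and targets of a rule table. This is a minimal sketch with a hypothetical helper, `find_unmapped`; it does not describe what the command above prints.

```python
from collections import Counter


def find_unmapped(observed, mapping):
    """Count observed values that are neither a rule key nor a rule target."""
    known = set(mapping) | set(mapping.values())
    return dict(Counter(value for value in observed if value not in known))


q01_rules = {"Male": "male", "M": "male", "Female": "female", "F": "female"}
responses = ["Male", "M", "m", "male", "Nonbinary", "m"]

print(find_unmapped(responses, q01_rules))  # {'m': 2, 'Nonbinary': 1}
```

Each reported value needs either a new replacement rule or a deliberate decision to leave it as-is.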
Example: ICRC2023Schema
See plugins/icrc2023/ICRC2023Schema for a complete reference implementation covering:
Complex geographic splitting (UN geoscheme)
Multiple clustering rules
Sentiment analysis integration
Free-text response handling
Next Steps
Review the ICRC2023Schema reference implementation
Create your survey plugin following the quick start
Write unit and integration tests
Document your schema in CLAUDE.md
Submit a PR with your plugin
For questions, see Architecture for design details or Testing for test guidelines.