Architecture#
Overview of titanite’s design and core components.
Core Framework#
Titanite is a pluggable survey processing framework with a modular architecture designed to support multiple surveys without modifying the core.
Design Principles#
Pluggable - Survey-specific logic is isolated in plugins
Extensible - New surveys can be added by implementing a schema interface
Secure - Built-in privacy protection and data handling
Configurable - Flexible configuration system for different survey types
Component Overview#
Core Framework (titanite/core/)#
Base Classes:
SurveySchema - Abstract base class defining survey-specific rules
SurveyProcessor - 8-step processing pipeline for data transformation
Data Classes:
SplitColumnRule - Defines how to split composite columns (e.g., geographic regions)
ClusterRule - Defines derived cluster columns built from multiple questions
BinRule - Defines binning rules for numerical columns
Security:
SecureDataHandler - Privacy protection (load, suppress, anonymize)
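To make the relationship between the base class and the rule data classes concrete, here is a minimal sketch. The class names come from the list above, but every field name, type, and return annotation is an assumption made for illustration; the actual definitions in titanite/core/ may differ.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitColumnRule:
    """Split one composite column into several (field names are assumed)."""
    source: str
    targets: tuple[str, ...]
    separator: str = " - "


@dataclass(frozen=True)
class ClusterRule:
    """Derive one column from several questions (field names are assumed)."""
    sources: tuple[str, ...]
    target: str


@dataclass(frozen=True)
class BinRule:
    """Bin a numerical column using explicit edges (field names are assumed)."""
    source: str
    target: str
    edges: tuple[float, ...]
    labels: tuple[str, ...]


class SurveySchema(ABC):
    """Hypothetical shape of the abstract base class each plugin implements."""

    categorical_headers: tuple[str, ...] = ()
    numerical_headers: tuple[str, ...] = ()
    free_text_columns: tuple[str, ...] = ()

    @abstractmethod
    def get_replace_rules(self) -> dict[str, dict[str, str]]: ...

    @abstractmethod
    def get_split_rules(self) -> list[SplitColumnRule]: ...

    @abstractmethod
    def get_cluster_rules(self) -> list[ClusterRule]: ...

    @abstractmethod
    def get_bin_rules(self) -> list[BinRule]: ...
```

Because the four rule methods are abstract, a plugin that forgets one of them fails at instantiation time rather than midway through processing.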
Plugin System (plugins/)#
Each survey gets its own plugin implementing SurveySchema:
Example: ICRC2023 Survey
plugins/icrc2023/
├── __init__.py
└── ICRC2023Schema
    ├── categorical_headers
    ├── numerical_headers
    ├── free_text_columns
    └── Methods:
        ├── get_replace_rules()
        ├── get_split_rules()
        ├── get_cluster_rules()
        └── get_bin_rules()
CLI Interface (titanite/cli.py)#
Main entry point via ti command. Commands are organized by function:
Configuration:
config
Data Preparation:
prepare
Analysis:
chi2, p005, crosstabs, crosstab
Visualization:
response, hbars, hbar
Text Analysis:
comments
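The command grouping above could be wired up with click roughly as follows. This is only a sketch: the command names config and prepare come from the list, while the group function name, the csv_path argument, and the placeholder bodies are assumptions rather than the real contents of titanite/cli.py.

```python
import click


@click.group()
def cli() -> None:
    """Hypothetical top-level group standing in for the ti command."""


@cli.command()
def config() -> None:
    """Show the active configuration (placeholder body)."""
    click.echo("config: placeholder")


@cli.command()
@click.argument("csv_path")  # assumed argument; the real signature is not documented here
def prepare(csv_path: str) -> None:
    """Prepare raw survey data (placeholder body)."""
    click.echo(f"prepare: {csv_path}")


if __name__ == "__main__":
    cli()
```

Each additional command (for example chi2 or hbars) would be registered on the same group with another @cli.command() decorator.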
Data Processing Pipeline#
The SurveyProcessor implements an 8-step pipeline:
1. Timestamp Normalization
↓
2. Response Counting
↓
3. Value Replacement (standardization)
↓
4. Column Splitting (geographic/composite)
↓
5. Clustering (derived columns)
↓
6. Binning (numerical → categorical)
↓
7. Data Validation
↓
8. Output Generation
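One common way to implement a linear pipeline like this is to treat each step as a function from DataFrame to DataFrame and apply them in order. The sketch below illustrates that pattern only; run_pipeline and the two toy steps are hypothetical names, not SurveyProcessor's actual methods.

```python
from typing import Callable

import pandas as pd

# A step takes a DataFrame and returns a new DataFrame.
Step = Callable[[pd.DataFrame], pd.DataFrame]


def run_pipeline(df: pd.DataFrame, steps: list[Step]) -> pd.DataFrame:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        df = step(df)
    return df


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Toy step: remove rows in which every value is missing."""
    return df.dropna(how="all").reset_index(drop=True)


def add_response_id(df: pd.DataFrame) -> pd.DataFrame:
    """Toy step: number the remaining responses starting at 1."""
    return df.assign(response_id=range(1, len(df) + 1))


raw = pd.DataFrame({"q01": ["a", None, "b"]})
processed = run_pipeline(raw, [drop_empty_rows, add_response_id])
```

Keeping steps as independent functions makes the order explicit and lets each step be unit tested in isolation.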
Step Details#
1. Timestamp Normalization
Standardize timestamp formats
Handle timezone conversions
Create response timeline data
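As an illustration of this step, the following sketch converts timestamps with mixed UTC offsets to UTC and derives a per-day response count. The column name timestamp and both function names are assumptions.

```python
import pandas as pd


def normalize_timestamps(df: pd.DataFrame, column: str = "timestamp") -> pd.DataFrame:
    """Parse timestamps and convert them to UTC; unparseable values become NaT."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], utc=True, errors="coerce")
    return out


def response_timeline(df: pd.DataFrame, column: str = "timestamp") -> pd.Series:
    """Count responses per UTC day (missing timestamps are ignored)."""
    return df[column].dt.floor("D").value_counts().sort_index()


raw = pd.DataFrame(
    {
        "timestamp": [
            "2023-07-26 10:00:00+09:00",
            "2023-07-26 03:00:00+00:00",
            "not a date",
        ]
    }
)
normalized = normalize_timestamps(raw)
timeline = response_timeline(normalized)
```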
2. Response Counting
Track total responses per variable
Identify missing/invalid responses
3. Value Replacement
Standardize response values (plugin-specific)
Map synonyms to canonical forms
Handle encoding issues
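A minimal sketch of value replacement, assuming the rules arrive as a mapping of column name to a synonym dictionary. That shape is a guess at what get_replace_rules() returns, and the example values are invented.

```python
import pandas as pd


def apply_replace_rules(
    df: pd.DataFrame, rules: dict[str, dict[str, str]]
) -> pd.DataFrame:
    """Strip surrounding whitespace, then map synonyms to canonical values."""
    out = df.copy()
    for column, mapping in rules.items():
        cleaned = out[column].str.strip()
        out[column] = cleaned.replace(mapping)
    return out


raw = pd.DataFrame({"q01": [" Male", "M", "Female"]})
rules = {"q01": {"M": "Male", "F": "Female"}}  # hypothetical plugin rules
standardized = apply_replace_rules(raw, rules)
```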
4. Column Splitting
Split composite columns (e.g., “Country - Region” → “country”, “region”, “subregion”)
Apply geographic schema (UN geoscheme)
Create derived location columns
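The splitting itself can be sketched with pandas string methods. The function name, the " - " separator default, and the target column names below are assumptions; mapping a country onto the UN geoscheme would need an additional lookup table that is not shown here.

```python
import pandas as pd


def split_column(
    df: pd.DataFrame, source: str, targets: list[str], separator: str = " - "
) -> pd.DataFrame:
    """Split a composite column into one new column per target name."""
    parts = df[source].str.split(separator, n=len(targets) - 1, expand=True)
    out = df.copy()
    for i, name in enumerate(targets):
        # If no row contains enough parts, the column is filled with missing values.
        out[name] = parts[i].str.strip() if i in parts.columns else pd.NA
    return out


raw = pd.DataFrame({"q03": ["Japan - Asia", "France - Europe"]})
split = split_column(raw, "q03", ["q03_country", "q03_regional"])
```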
5. Clustering
Create derived columns combining multiple questions
Example: combine q13 and q14 gender ratio questions into q13q14_clustered
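How the two questions are combined is not specified here, so the sketch below uses the simplest possible rule as a stand-in: joining the per-question labels into one combined category. The function name, the separator, and the example labels are all hypothetical.

```python
import pandas as pd


def cluster_columns(
    df: pd.DataFrame, sources: list[str], target: str, sep: str = "|"
) -> pd.DataFrame:
    """Create a derived column by joining the values of several source columns."""
    out = df.copy()
    # astype(str) also turns missing values into the literal string "nan".
    out[target] = df[sources].astype(str).agg(sep.join, axis=1)
    return out


raw = pd.DataFrame({"q13": ["low", "high"], "q14": ["low", "low"]})
clustered = cluster_columns(raw, ["q13", "q14"], "q13q14_clustered")
```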
6. Binning
Convert numerical variables to categorical bins
Apply equal-width or equal-frequency binning
Create interpretable categories
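The difference between the two binning strategies is easiest to see on skewed data. This sketch uses pandas' cut (equal-width) and qcut (equal-frequency); the wrapper function names and the sample values are illustrative only.

```python
import pandas as pd


def bin_equal_width(values: pd.Series, n_bins: int, labels: list[str]) -> pd.Series:
    """Split the value range into n_bins intervals of equal width."""
    return pd.cut(values, bins=n_bins, labels=labels)


def bin_equal_frequency(values: pd.Series, n_bins: int, labels: list[str]) -> pd.Series:
    """Split the values into n_bins groups holding roughly equal counts."""
    return pd.qcut(values, q=n_bins, labels=labels)


values = pd.Series([0, 1, 2, 3, 4, 100])  # one large outlier
by_width = bin_equal_width(values, 2, ["low", "high"])
by_frequency = bin_equal_frequency(values, 2, ["low", "high"])
```

With equal-width bins the outlier ends up alone in the upper bin, while equal-frequency bins split the six values three and three.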
7. Data Validation
Check for missing values
Validate categorical values against schema
Flag outliers and anomalies
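A sketch of the first two checks, assuming the valid choices for each categorical column are available as a set. The function name and the report format are hypothetical, and outlier detection is left out.

```python
import pandas as pd


def validate_categories(
    df: pd.DataFrame, allowed: dict[str, set[str]]
) -> list[dict]:
    """Report missing counts and values outside the allowed choices per column."""
    report = []
    for column, choices in allowed.items():
        values = df[column]
        report.append(
            {
                "column": column,
                "missing": int(values.isna().sum()),
                "invalid": sorted(set(values.dropna()) - choices),
            }
        )
    return report


raw = pd.DataFrame({"q01": ["Male", "Female", None, "Mael"]})
report = validate_categories(raw, {"q01": {"Male", "Female"}})
```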
8. Output Generation
Write processed datasets
Generate metadata
Create data quality reports
File Structure#
titanite/
├── __init__.py
├── cli.py # CLI entry point
├── config.py # Configuration management
├── core/
│ ├── __init__.py
│ ├── schema.py # SurveySchema base class
│ ├── processor.py # SurveyProcessor pipeline
│ └── security.py # SecureDataHandler
├── analysis.py # Statistical analysis functions
└── preprocess.py # Preprocessing utilities
plugins/
├── __init__.py
└── icrc2023/
    ├── __init__.py
    └── ICRC2023Schema # Survey-specific implementation
tests/
├── test_core.py
├── test_schema.py
├── test_processor.py
└── test_icrc2023_schema.py
sandbox/
├── config.toml # Survey configuration
└── app.py # Streamlit dashboard
data/
├── raw_data/ # Raw survey CSV files
└── test_data/ # Test datasets
Data Structure#
Categorical Variables#
Questions treated as discrete categories:
q01, q02, q03 (demographic data)
q03_regional, q03_subregional (geographic splits)
Derived: q01_clustered, q13q14_clustered
Each categorical variable has defined valid choices in config.toml.
Numerical Variables#
Questions treated as continuous values:
q10, q13 (ratio/scale questions)
Sentiment scores (derived from text analysis)
Free-text Columns#
Questions with written responses:
q15-q22 (participant comments and feedback)
Stored separately for privacy protection.
Key Workflows#
Adding a New Survey#
Create plugin: plugins/your_survey/
Implement SurveySchema with rule methods
Define categorical_headers, numerical_headers, free_text_columns
Implement rule methods:
get_replace_rules() - value standardization
get_split_rules() - column splitting
get_cluster_rules() - derived columns
get_bin_rules() - numerical binning
Register plugin in pyproject.toml entry points
Add configuration to sandbox/config.toml
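Registering a plugin through entry points means the core can find it without importing it by name. The sketch below shows how such discovery generally works with the standard library; the group name titanite.plugins and the function name are assumptions, so check pyproject.toml for the group that is actually used.

```python
from importlib.metadata import entry_points

PLUGIN_GROUP = "titanite.plugins"  # hypothetical entry point group name


def discover_schemas(group: str = PLUGIN_GROUP) -> dict:
    """Load every object registered under the given entry point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


schemas = discover_schemas()
```

If no installed package registers anything under the group, the result is simply an empty dictionary.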
Custom Analysis#
Create analysis function in titanite/analysis.py
Add CLI command in titanite/cli.py
Use SurveyProcessor for data preparation
Implement visualization with Altair/Matplotlib
Add tests to tests/
Testing Strategy#
Unit Tests - Test individual components (schema, rules)
Integration Tests - Test full pipeline with real data
Plugin Tests - Test survey-specific implementations
Run with:
poetry run pytest tests/ -v
See Testing for a comprehensive guide.
Security Considerations#
Privacy Protection#
Free-text responses stored separately
Cell suppression for small groups (n < 5)
Anonymization support via SecureDataHandler
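Cell suppression with the n < 5 threshold can be sketched as follows. This illustrates the technique only; the function name is hypothetical and SecureDataHandler's real interface is not shown in this document.

```python
import pandas as pd


def suppress_small_cells(counts: pd.Series, threshold: int = 5) -> pd.Series:
    """Replace counts below the threshold with a missing value."""
    return counts.where(counts >= threshold)


counts = pd.Series({"Group A": 12, "Group B": 3, "Group C": 5})
suppressed = suppress_small_cells(counts)
```

Suppressed cells become missing values rather than zeros, so a reader cannot mistake a hidden count for a real one.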
Data Validation#
Input validation at system boundaries
Type checking in dataclasses
Configuration validation
Access Control#
Configuration-based access rules
Plugin isolation
No credentials in code (use environment variables)
Dependencies#
Core dependencies:
pandas - Data manipulation
numpy - Numerical computing
scipy - Statistical analysis
altair - Visualization
click - CLI framework
pydantic - Data validation
pyarrow - Data I/O
Development dependencies:
pytest - Testing framework
sphinx - Documentation
ruff - Code linting/formatting
pre-commit - Git hooks
See pyproject.toml for complete dependency list.
Performance Considerations#
Data Loading - Uses pandas for efficient I/O
Processing - Vectorized operations where possible
Memory - Streaming large files to avoid loading entire dataset
Caching - Configuration caching for repeated CLI calls
Visualization Stack#
Altair - Interactive statistical visualizations
Matplotlib - Static plots and fine-grained control
Streamlit - Interactive dashboard (sandbox/app.py)
Visualization types:
Categorical vs Categorical → Heatmaps (mark_rect)
Categorical vs Numerical → Box plots (mark_boxplot)
Numerical vs Numerical → Scatter plots (mark_point)
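The mapping from variable types to chart types can be expressed as a small lookup. The sketch below returns the name of the Altair mark method; the function name and the kind labels "categorical" and "numerical" are assumptions for illustration.

```python
# Lookup from a sorted pair of variable kinds to the Altair mark method name.
MARKS = {
    ("categorical", "categorical"): "mark_rect",
    ("categorical", "numerical"): "mark_boxplot",
    ("numerical", "numerical"): "mark_point",
}


def choose_mark(x_kind: str, y_kind: str) -> str:
    """Return the mark method name for a pair of variable kinds, in any order."""
    key = tuple(sorted((x_kind, y_kind)))
    return MARKS[key]


mark = choose_mark("numerical", "categorical")
```

With Altair installed, the returned name could be applied as getattr(alt.Chart(df), mark)() before encoding the two variables.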