The AltScore data architecture follows a medallion (Bronze → Silver → Gold) pattern over a cloud object store, with PostgreSQL as the operational store for scores, consent, and audit records, and Redis as a caching layer. No PII crosses the Bronze → Silver boundary in raw form.
DAS-ALTSCORE-001 · Architecture Charters3://lake/bronze/dist_{id}/year/month/s3://lake/silver/retailer_{uuid}/year/month/s3://lake/gold/scores/year/month/scores table for API servingBronze stores raw ERP data exactly as received. PII fields (GST, phone, address) are present. Access is restricted to Data Engineering via IAM. No ML or API service has access to Bronze.
| Column | Type | PII | Description |
|---|---|---|---|
invoice_id | STRING | No | Original ERP invoice ID (source system) |
distributor_id | STRING | No | AltScore distributor UUID |
retailer_gst | STRING | YES | GSTIN — pseudonymized in Silver layer |
retailer_name | STRING | YES | Business name — dropped in Silver layer |
retailer_phone | STRING | YES | Phone number — pseudonymized in Silver layer |
invoice_date | DATE | No | Invoice creation date |
due_date | DATE | No | Payment due date per credit terms |
amount_inr | DECIMAL(15,2) | No | Invoice gross amount in INR |
erp_type | STRING | No | tally_prime | sap_b1 | csv |
ingested_at | TIMESTAMP | No | Ingest timestamp (UTC) |
batch_id | STRING | No | Source batch_id from ERP Ingestion Service |
_row_hash | STRING | No | SHA-256 of all fields — tamper detection |
| Column | Type | PII | Description |
|---|---|---|---|
payment_id | STRING | No | Original ERP payment ID |
invoice_id | STRING | No | Links to bronze_invoices.invoice_id |
distributor_id | STRING | No | AltScore distributor UUID |
retailer_gst | STRING | YES | GSTIN — pseudonymized in Silver layer |
paid_date | DATE | No | Date payment was received |
amount_paid_inr | DECIMAL(15,2) | No | Amount actually paid (may be partial) |
payment_method | STRING | No | NEFT | RTGS | IMPS | cheque | cash |
ingested_at | TIMESTAMP | No | Ingest timestamp (UTC) |
_row_hash | STRING | No | SHA-256 — tamper detection |
Silver contains pseudonymized, normalized transaction records. No PII is present. The retailer_gst → retailer_uuid mapping lives in a separate access-restricted identity store. ML and API services interact with Silver data only through the Gold layer — never directly.
| Column | Type | Description |
|---|---|---|
retailer_uuid | STRING (UUID) | Stable pseudonymous identifier (replaces GST) |
distributor_uuid | STRING (UUID) | Pseudonymous distributor identifier |
invoice_uuid | STRING (UUID) | Pseudonymous invoice ID (deduped) |
invoice_date | DATE | Invoice creation date |
due_date | DATE | Payment due date |
amount_inr | DECIMAL(15,2) | Invoice amount (outlier-capped at p99 per distributor) |
paid_date | DATE (nullable) | Actual payment date; NULL if unpaid |
amount_paid_inr | DECIMAL(15,2) | Actual amount paid (0 if unpaid) |
payment_delay_days | INTEGER | paid_date - due_date (negative = early, positive = late, NULL = unpaid) |
is_partial_payment | BOOLEAN | True if amount_paid < amount × 0.95 |
is_return | BOOLEAN | True if invoice is a credit note / return |
data_source | STRING | tally_prime | sap_b1 | csv |
silver_created_at | TIMESTAMP | Silver pipeline run timestamp |
| Column | Type | Description |
|---|---|---|
retailer_uuid | UUID PRIMARY KEY | Stable pseudonymous identifier (exposed to API) |
gst_hash | BYTEA | HMAC-SHA256 of GSTIN (allows lookup without storing raw GSTIN) |
phone_hash | BYTEA | HMAC-SHA256 of phone number |
distributor_uuid | UUID | Primary distributor UUID |
consent_id | UUID | FK to consent_records.consent_id |
created_at | TIMESTAMP | First time retailer seen in system |
erasure_requested_at | TIMESTAMP (nullable) | Set when DPDP Right-to-Erasure filed (30-day SLA timer) |
| Feature Column | Type | Family | Description |
|---|---|---|---|
retailer_uuid | UUID | Key | Stable pseudonymous retailer identifier |
feature_month | DATE | Key | Month for which features are computed (YYYY-MM-01) |
avg_payment_delay_days | FLOAT | Payment | Mean payment_delay_days over trailing 6 months |
p90_payment_delay_days | FLOAT | Payment | 90th percentile payment delay |
zero_delay_rate | FLOAT | Payment | Fraction of invoices paid on or before due date |
partial_payment_freq | FLOAT | Payment | Fraction of invoices with partial payment |
partial_payment_ratio | FLOAT | Payment | Median (amount_paid / amount) for partial payments |
max_consecutive_delays | INTEGER | Payment | Longest consecutive delay streak (invoices) |
monthly_order_frequency | FLOAT | Order | Avg invoices per month, trailing 6 months |
order_gap_median_days | FLOAT | Order | Median days between consecutive invoices |
order_gap_cv | FLOAT | Order | Coefficient of variation of inter-order gaps |
return_rate_pct | FLOAT | Order | % of total invoiced value returned |
return_freq_pct | FLOAT | Order | % of invoices with any return |
gmv_trend_slope | FLOAT | Trajectory | OLS slope of monthly GMV (INR/month) |
gmv_3m_vs_12m_ratio | FLOAT | Trajectory | Recent 3-month avg GMV / trailing 12-month avg GMV |
gmv_yoy_growth | FLOAT (nullable) | Trajectory | Year-over-year GMV growth rate (requires 12+ months) |
avg_monthly_gmv_log | FLOAT | Trajectory | log(avg_monthly_gmv) — normalised for model |
seasonal_amplitude | FLOAT | Seasonal | Peak season / trough season GMV ratio |
seasonal_recovery_rate | FLOAT | Seasonal | GMV recovery speed post-seasonal trough (0–1) |
distributor_tenure_months | INTEGER | Relationship | Months since first invoice with any distributor |
distributor_count | INTEGER | Relationship | Number of distinct distributors in history |
data_completeness_score | FLOAT | Quality | Weighted completeness 0–100 (GR-5) |
data_freshness_days | INTEGER | Quality | Days since most recent ERP record |
is_sparse_retailer | BOOLEAN | Routing | True → Croston's routing; False → XGBoost primary |
| Column | Type | Description |
|---|---|---|
score_id | UUID PRIMARY KEY | Immutable score record identifier |
retailer_uuid | UUID | FK to identity_map.retailer_uuid |
score_month | DATE | Month score is valid for |
altscore | INTEGER | 300–900 integer score |
risk_band | CHAR(1) | A | B | C | D |
probability_of_default | FLOAT | Raw XGBoost output, calibrated to PD (0.0–1.0) |
recommended_limit_inr | DECIMAL(15,0) | Guardrail-capped limit (30% of 6-month GMV) |
reason_codes | JSONB | Array of {code, direction, rank, label_en, label_hi} |
low_confidence | BOOLEAN | True if completeness 50–69% or freshness >14 days |
anomaly_flag | BOOLEAN | True if Isolation Forest flagged gaming signals |
score_version | STRING | MLflow model version (e.g., altscore-xgb-v1-prod) |
scored_at | TIMESTAMP | Scoring engine run timestamp (UTC) |
feature_snapshot_hash | BYTEA | SHA-256 of feature vector — audit / reproducibility |
PostgreSQL (Amazon RDS, Multi-AZ) is the operational database. Six logical databases are segregated by function. Row-Level Security (RLS) enforces per-lender data isolation on the scores table. All tables use UUIDs as primary keys (no sequential integers exposed to API).
consent_records — signed consent artefacts (immutable)consent_scope — lender_id + consent_id + purposeconsent_revocations — revocation events + timestampsidentity_map — GST_hash → retailer_uuid (access-isolated)scores — current scores per retailer (RLS enforced)score_history — monthly score snapshots (12-month ring)score_blocks — records where scoring was blocked + reasonfeature_store — Gold layer feature cache per retaileraudit_log — append-only, no UPDATE/DELETE (DB trigger enforced)audit_log partitioned monthly; old partitions archived to S3 WORMdistributors — onboarded distributors + cert metadatalenders — NBFC partner records + API key metadatagrievances — grievance cases with SLA timersmodel_registry_pointers — active model version per env| Cache Key Pattern | Value | TTL | Invalidation |
|---|---|---|---|
score:{retailer_uuid}:{lender_id} | Full score JSON (serialized) | 4 hours | On score.computed Kafka event |
consent:{retailer_uuid}:{lender_id} | active | revoked | none | 60 seconds | On consent.revoked Kafka event (immediate purge) |
ratelimit:{lender_id}:{minute_bucket} | Request counter | 61 seconds | Natural TTL expiry |
session:{token_jti} | JWT validation metadata | JWT exp - now | On JWT expiry or explicit logout |
blocklist:{token_jti} | revoked | JWT exp - now | On admin token revocation |
features:{retailer_uuid} | Latest Gold feature vector | 24 hours | On pipeline.gold.ready Kafka event |
Redis Cluster mode is used (3 shards × 2 replicas). All keys are prefixed with environment (prod:, staging:) to prevent cross-environment contamination. Consent cache TTL is intentionally short (60s) so revocations propagate quickly. Score cache TTL (4h) is longer since scores change only on pipeline runs.
dbt Core manages all SQL transformations from Bronze → Silver → Gold. Models are organized in three layers following dbt best practice: sources (Bronze), staging (Silver normalization), intermediate (feature computation), marts (Gold layer, ML-ready). All models are fully documented and tested via dbt schema tests.
| Layer | dbt Model | Input | Output | Key Transformations |
|---|---|---|---|---|
| Sources | src_bronze_invoices | S3 Bronze Parquet | Raw staging ref | None — source definition only |
| Staging | stg_transactions | Bronze invoices + payments | Silver transactions | GST→UUID lookup, schema normalisation, outlier cap |
| Staging | stg_returns | Bronze invoices (credit notes) | Silver returns | Flag returns, link to original invoice |
| Intermediate | int_payment_features | stg_transactions | Payment family features | avg_delay, p90_delay, zero_delay_rate, partial_payment_freq |
| Intermediate | int_order_features | stg_transactions, stg_returns | Order family features | monthly_freq, gap_cv, return_rate, gmv_per_order_cv |
| Intermediate | int_trajectory_features | stg_transactions | Trajectory family features | OLS GMV slope, yoy_growth, 3m_vs_12m_ratio |
| Intermediate | int_seasonal_features | stg_transactions | Seasonal family features | seasonal_amplitude, recovery_rate, festival windows |
| Intermediate | int_relationship_features | stg_transactions | Relationship family features | distributor_tenure, distributor_count, credit_term_utilization |
| Intermediate | int_quality_features | All staging models | Data quality dimensions | completeness_score (GR-5 formula), freshness_days |
| Mart | mart_retailer_features | All int_ models | gold_retailer_features | Union all 40+ features; routing flag (sparse/XGBoost) |
| Mart | mart_scores | Scoring Engine output | gold_scores | Final score + reason codes; written by Scoring Engine (not dbt) |
Every dbt model records its source tables, transformations applied, and test results. OpenLineage integration emits lineage events to Marquez for cross-pipeline tracking. Any score can be traced back to its source Bronze batch_id.
All dbt and Spark transformations are deterministic — same input always produces same output. Feature snapshots are SHA-256 hashed at scoring time, enabling exact reproduction of any historical score for audit or dispute resolution.
Every gold_scores record stores: feature_snapshot_hash, score_version (MLflow model ID), scored_at, and the Airflow DAG run_id. A full lineage chain exists from retailer score → features → Bronze records → ERP batch.
| Data Store | Active Retention | Archive Retention | Deletion Trigger | Deletion Method |
|---|---|---|---|---|
| Bronze (S3 WORM) | 2 years | 5 years (Glacier) | Automatic lifecycle rule | S3 lifecycle → Glacier; CMK deletion after 7yr |
| Silver (S3) | Consent duration + 6 months | None | Consent revocation | Spark delete + S3 object delete; UUID nulled |
| Gold scores (S3) | Consent duration + 12 months | None | Consent revocation | Parquet partition delete |
| PostgreSQL scores | 36 months rolling | Partitioned archive | Monthly partition drop | Table partition detach + drop |
| Consent records | Forever (never deleted) | N/A | Never | DPDP requirement: consent records immutable |
| Audit log | 3 years hot | 4 years (S3 WORM) | Never deleted | Partition archive to S3; 7-year total retention |
| Identity map | Active consent | Erasure hold | Consent revocation + 30-day erasure SLA | Row deletion + UUID tombstone retained |
| Redis cache | TTL per key (60s–24h) | None | TTL expiry or event-driven invalidation | Automatic TTL |
| Model artifacts | Indefinite (model lineage) | N/A | Never deleted | WORM; old versions marked deprecated in MLflow |
DPDP Act Right-to-Erasure (Article 13): on retailer request, all Silver and Gold records are deleted within 30 days. Bronze records are pseudonymized (CMK deleted so Bronze becomes unreadable). Consent records and audit logs are retained (legal hold exemption under DPDP Act Section 8(7)).