AltScore — Architecture Decision Records

Index: Decision Registry

Each ADR captures a key architectural decision, the context that drove it, alternatives that were seriously considered, and the consequences — both positive and negative — of the chosen path. ADRs are immutable once accepted; superseding decisions create a new ADR that references the old one.

ADR-001: XGBoost as primary credit scoring model
ADR-002: AWS ap-south-1 as primary cloud with Azure DR
ADR-003: Medallion data lake (Bronze/Silver/Gold) over OLTP
ADR-004: WhatsApp as primary retailer consent channel
ADR-005: UUID pseudonymization over tokenization or encryption
ADR-006: PostgreSQL with RLS for score and consent storage
ADR-007: JWT RS256 for NBFC authentication over OAuth 2.0 full flow
ADR-008: Kafka for async inter-service events over direct HTTP
ADR-009: SHAP TreeExplainer for mandatory reason codes
ADR-010: Compile-time guardrail constants over runtime configuration

ADR-001

XGBoost as Primary Credit Scoring Model

May 2026 · Authors: ML Lead, Head of Risk

Accepted

Context & Decision

We need a classification model that predicts probability of default (PD) for informal Indian retailers from ERP transaction features. The model must be: explainable (RBI FLDG mandates reason codes), robust on tabular/sparse data, trainable on a limited initial dataset (<100K labelled examples), and operationally simple enough for a small ML team to maintain.

Decision: Use XGBoost (gradient-boosted trees) as the primary model, with SHAP TreeExplainer for feature attribution, and Croston's Method as a sub-model for sparse/seasonal retailers.
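As an illustration of the sub-model named in this decision, here is a minimal sketch of Croston's Method in plain Python. The function name, the default smoothing factor `alpha = 0.1`, and the choice to initialise from the first observed demand are assumptions made for the example; the ADR does not specify how the production sub-model is parameterised or how retailers are routed to it.

```python
from typing import Optional, Sequence


def croston_forecast(demand: Sequence[float], alpha: float = 0.1) -> Optional[float]:
    """Forecast the per-period demand rate of an intermittent series.

    Croston's Method smooths two quantities separately: the size of each
    non-zero demand and the number of periods between non-zero demands.
    The forecast is smoothed size divided by smoothed interval.
    Returns None when the series contains no non-zero demand.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    size: Optional[float] = None      # smoothed non-zero demand size
    interval: Optional[float] = None  # smoothed periods between demands
    periods_since = 1                 # periods elapsed since the last demand

    for value in demand:
        if value > 0:
            if size is None:
                # Initialise both estimates from the first observed demand.
                size, interval = float(value), float(periods_since)
            else:
                size += alpha * (value - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1

    if size is None:
        return None
    return size / interval
```

A retailer who orders 6 units every third month would be forecast at roughly 2 units per month, which is the kind of sparse pattern a monthly tree-model feature window can misread as inactivity.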

Alternatives Considered

Logistic Regression: Highly explainable, but insufficient capacity for non-linear payment behaviour patterns (e.g., seasonal spikes). Discarded.
Neural Network (LSTM/Transformer): Strong sequence modelling for time-series, but requires 10× more labelled data, black-box by default, and operationally complex. Not viable for v1 with limited outcome labels.
LightGBM: Near-equivalent performance to XGBoost on tabular data. SHAP support equally strong. Not chosen because XGBoost has more community credit-risk precedent in India and lender familiarity.
CatBoost: Strong on categorical features, but our features are predominantly numeric. No material advantage.

Consequences

Positive: SHAP TreeExplainer gives exact feature attributions (not approximations), satisfying the RBI FLDG reason-code mandate. XGBoost is battle-tested in credit scoring globally. Handles missing values natively. Retraining is fast (<30 min on full dataset).

Negative: Doesn't natively model temporal sequences; we compensate with engineered time-series features. Croston's sub-model adds routing complexity. Neural approaches may outperform in Phase 3+ as outcome labels accumulate.

ADR-002

AWS ap-south-1 as Primary Cloud with Azure Central India DR

May 2026 · Authors: CTO, CISO

Accepted

Context & Decision

The DPDP Act 2023 requires all personal data related to Indian citizens to be stored within India. We need a cloud provider with India-region infrastructure, mature managed ML/data services (Kafka, Spark, Kubernetes), and strong support for Customer Managed Keys (CMK). We also need a DR strategy that avoids single-cloud vendor lock-in.

Decision: AWS ap-south-1 (Mumbai) as primary. Azure Central India (Pune) as passive DR. All data — at rest, in transit, in computation — remains within India geographic boundaries at all times.

Alternatives Considered

GCP Mumbai (asia-south1): Comparable ML tooling (Vertex AI), but GCP's Kafka equivalent (Pub/Sub) has different semantics than MSK and would require significant data pipeline rework. AWS MSK experience in the team is stronger.
Azure only (Central India): Azure ADLS Gen2 is excellent for data lake but Azure's ML tooling (AzureML) is more opinionated and expensive at our scale. Chosen as DR rather than primary.
On-premises / colocation: Compliant with DPDP but prohibitively expensive for a startup and eliminates auto-scaling for batch ML workloads. Discarded.
AWS primary + AWS DR (second India AZ): Simpler operationally but doesn't protect against AWS-wide India region outage. Multi-cloud DR chosen for resilience.

Consequences

Positive: Full DPDP data localisation compliance. AWS MSK, EMR, EKS, RDS, S3 are mature and well-understood by the team. Multi-cloud DR reduces catastrophic vendor risk.

Negative: Operating two cloud environments increases infrastructure complexity and cost (~15% overhead for DR replication). Team must maintain Azure familiarity even if rarely used. DNS failover to Azure involves manual steps (quarterly drill required).

ADR-003

Medallion Data Lake (Bronze/Silver/Gold) over Single OLTP Database

May 2026 · Authors: Data Engineering Lead, DPO

Accepted

Context & Decision

We receive batch ERP data from thousands of distributors covering millions of transaction records. We need to: (1) preserve raw data immutably for audit, (2) pseudonymize PII before ML processing, (3) compute 40+ features per retailer per month efficiently, and (4) serve scores with low latency. These concerns pull in different directions: a single OLTP store cannot efficiently satisfy all four.

Decision: Three-tier medallion lake on S3 (Bronze → Silver → Gold), with dbt + Spark for transformations, and PostgreSQL as the operational serving layer for the Score API.

Alternatives Considered

Single PostgreSQL database: Familiar and simple, but: (a) WORM immutability for Bronze requires hacky triggers, not native; (b) Spark-scale feature computation on PostgreSQL is impractical; (c) Per-distributor CMK encryption is not supported natively. Discarded.
Apache Hudi / Delta Lake: Excellent ACID on top of S3 with CDC support. Added complexity (schema evolution, compaction) not justified at v1 scale. Revisit if real-time scoring is needed in Phase 5.
Snowflake: Strong for analytics but stores data outside India unless specifically configured (no dedicated India region at time of decision). DPDP non-compliant. Discarded.
MongoDB Atlas: Good for flexible schemas but poor for columnar ML feature computation. Not suited to our analytical workload.

Consequences

Positive: Bronze WORM provides genuine immutability for audit. Silver cleanly separates PII-containing from PII-free data, a hard DPDP boundary. Gold serves fast feature reads for ML. S3 + Parquet scales to billions of records cheaply.

Negative: Three-layer pipeline adds latency (Bronze → Gold takes up to 4 hours). Engineers need Spark + dbt + SQL skills. Data lineage requires explicit tooling (OpenLineage). Score API cannot query Bronze directly; it must wait for the Gold layer.

ADR-004

WhatsApp as Primary Retailer Consent Channel

May 2026 · Authors: Head of Product, DPO

Accepted

Context & Decision

DPDP Act 2023 requires free, specific, informed, and unambiguous consent from retailers before processing their personal data. Consent must be demonstrably obtained and revocable at any time. We need a channel that: (1) reaches 12M+ informal retailers reliably, (2) supports vernacular (Hindi) communication, (3) creates a documented, timestamped consent record, and (4) allows retailers to exercise rights (revocation, grievance) without a smartphone app or web browser.

Decision: WhatsApp Business API as the primary consent channel, with SMS as fallback. The entire 5-step consent flow (Invite → Notice → Decision → Record → Confirmation) runs via WhatsApp messages.
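The five steps can be pictured as a strictly ordered state machine in which each step is logged with the message identifier that carried it. The sketch below is illustrative only: the class and field names are hypothetical, "HAAN" is an assumed grant keyword (this ADR only mentions the "NAHI" reply), and recording a refusal through the same Record and Confirmation steps is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class Step(Enum):
    INVITE = 1
    NOTICE = 2
    DECISION = 3
    RECORD = 4
    CONFIRMATION = 5


FLOW = list(Step)  # Enum iteration preserves definition order

GRANT_REPLY = "HAAN"   # hypothetical "yes" keyword
REFUSE_REPLY = "NAHI"  # "no" keyword mentioned in this ADR


@dataclass
class ConsentFlow:
    retailer_uuid: str
    # Each entry: (step, message id of the carrying message, UTC timestamp).
    log: List[Tuple[Step, str, datetime]] = field(default_factory=list)
    granted: Optional[bool] = None

    def next_step(self) -> Optional[Step]:
        """Return the step that must happen next, or None when finished."""
        return FLOW[len(self.log)] if len(self.log) < len(FLOW) else None

    def advance(self, step: Step, message_id: str, reply: Optional[str] = None) -> None:
        """Record one step; steps cannot be skipped or reordered."""
        if step is not self.next_step():
            raise ValueError(f"expected {self.next_step()}, got {step}")
        if step is Step.DECISION:
            answer = (reply or "").strip().upper()
            # Only an explicit yes or no counts; anything else is rejected.
            if answer not in (GRANT_REPLY, REFUSE_REPLY):
                raise ValueError("reply must be an explicit yes or no")
            self.granted = answer == GRANT_REPLY
        self.log.append((step, message_id, datetime.now(timezone.utc)))
```

Rejecting ambiguous replies at the Decision step reflects the "unambiguous" consent requirement quoted in the context above; a real flow would presumably re-prompt the retailer instead of raising.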

Alternatives Considered

Web portal consent: Requires smartphone with browser and data connection. ~40% of target retailers are feature-phone users or low-data users. Accessibility gap unacceptable. Discarded.
IVR (phone call): High reach but: (a) no written record of consent content shown, (b) difficult to satisfy DPDP "informed" requirement for complex purpose statements, (c) high per-call cost at scale. Discarded.
SMS-only: SMS has ~99% reach but 160-char limit makes it impossible to convey full notice (purpose, rights, consequences) in a single message. Works as fallback for consent confirmation; not sufficient as primary for initial notice.
Distributor-mediated (paper) consent: Scalable initially but creates DPDP risk — AltScore cannot independently verify the consent was genuinely informed. Distributor incentive misalignment is a fraud vector. Rejected on governance grounds.

Consequences

Positive: WhatsApp reaches 500M+ India users including feature phone users (via basic WhatsApp). Supports rich Hindi text. Webhook delivery provides timestamped record of every message. Revocation via WhatsApp "NAHI" reply is intuitive for retailers.

Negative: Meta (WhatsApp) is a third-party sub-processor; a DPA is required and under review. WhatsApp API downtime blocks consent operations (SMS fallback mitigates). Meta policy changes (e.g., template restrictions) could impact consent flow. WhatsApp message_id serves as consent proof, so auditors must trust Meta's delivery timestamps.

ADR-005

UUID Pseudonymization over Tokenization or Field-Level Encryption

May 2026 · Authors: DPO, Data Engineering Lead

Accepted

Context & Decision

We must prevent PII (GST numbers, phone numbers) from reaching the ML pipeline while still being able to: (a) link records across ERP batches to the same retailer, (b) exercise DPDP rights (erasure, portability) by retailer, and (c) resolve retailer identity for the Score API lookup endpoint.

Decision: Replace GST and phone with a stable, random UUID at the Bronze → Silver boundary. The GST → UUID mapping lives in a physically separate, access-isolated identity store. The ML pipeline operates exclusively on UUIDs — it has zero access to the identity store.
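A minimal sketch of the Bronze → Silver substitution, assuming an in-memory dictionary in place of the real, access-isolated identity store. The class, function, and field names (`IdentityStore`, `to_silver`, `gstin`, `phone`, `retailer_uuid`) are hypothetical and chosen only for illustration.

```python
import uuid
from typing import Dict, Optional

PII_FIELDS = ("gstin", "phone")  # hypothetical Bronze column names


class IdentityStore:
    """Stand-in for the isolated GST -> UUID mapping (here just a dict)."""

    def __init__(self) -> None:
        self._by_gstin: Dict[str, str] = {}

    def uuid_for(self, gstin: str) -> str:
        """Return the stable UUID for a GSTIN, minting a random one on first sight."""
        key = gstin.strip().upper()
        if key not in self._by_gstin:
            self._by_gstin[key] = str(uuid.uuid4())
        return self._by_gstin[key]

    def erase(self, gstin: str) -> Optional[str]:
        """Drop the mapping and return the UUID whose Silver/Gold rows must go."""
        return self._by_gstin.pop(gstin.strip().upper(), None)


def to_silver(bronze_row: dict, store: IdentityStore) -> dict:
    """Copy a Bronze row without PII fields and attach the retailer UUID."""
    silver = {k: v for k, v in bronze_row.items() if k not in PII_FIELDS}
    silver["retailer_uuid"] = store.uuid_for(bronze_row["gstin"])
    return silver
```

Because the UUID is random and not derived from the GSTIN, nothing in a Silver row can be reversed to an identity without the store, while two batches mentioning the same GSTIN still link to one retailer.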

Alternatives Considered

Field-level encryption (FLE): Encrypt GST in place, decrypt on access. Problem: encrypted ciphertext is still PII under DPDP (it's the same data, protected). ML pipeline would need decryption keys — widening the trust boundary. Discarded.
Tokenization (format-preserving): Replace GST with a synthetic GSTIN-format token. Preserves format for downstream compatibility but complicates identity resolution and doesn't offer stronger privacy guarantees than UUID. More complex to implement. Discarded.
Hashing (SHA-256 of GST): One-way, but GST space is small enough (<5 crore unique GSTINs) to be rainbow-table attacked. Not sufficient pseudonymization under DPDP. Discarded as sole mechanism (used only for lookup, not as stable identifier).
Per-record random token (no stable mapping): Maximum privacy but breaks cross-batch record linkage. Retailers would appear as new entities on every sync. Discarded — linkage is fundamental to feature computation.

Consequences

Positive: ML pipeline has zero knowledge of real retailer identities; even a full Silver/Gold lake breach reveals no PII. DPDP erasure is operationally clean: delete UUID from identity store, delete Silver/Gold records by UUID. Under DPDP, a stable UUID counts as pseudonymization (not anonymization), so we retain the ability to re-identify for rights requests.

Negative: Identity store becomes a crown jewel: its compromise re-links UUIDs to real GST numbers. Must be physically isolated, access-logged, and separately encrypted. Cross-distributor deduplication (same retailer, two distributors) requires an identity store lookup, which adds latency to the dedup pipeline.

ADR-006

PostgreSQL with Row-Level Security for Score and Consent Storage

May 2026 · Authors: CTO, API Engineering Lead

Accepted

Context & Decision

The Score API serves up to 50 concurrent NBFC lenders, each of whom must only see scores for retailers who have consented to that specific lender. Isolation must be enforced even if application code has a bug — we need defense-in-depth at the database layer. We also need ACID transactions for consent records (a consent grant and its audit record must be written atomically).

Decision: PostgreSQL with Row-Level Security (RLS) policies that enforce per-lender score isolation at the database engine level. Application code sets the app.current_lender_id session variable; the RLS policy filters accordingly. No lender can retrieve another's consented retailers even via SQL injection.
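The mechanism can be sketched as a policy definition plus a small helper that sets the session variable for one transaction. This is a sketch under assumptions: the table name `scores`, the column `lender_id`, the policy name, and the helper `lender_transaction` are hypothetical, and the `%s` placeholder style assumes a psycopg-like DB-API driver. `current_setting` and `set_config` are standard PostgreSQL functions.

```python
from contextlib import contextmanager

# Hypothetical DDL; table, column, and policy names are illustrative.
RLS_SETUP_SQL = """
ALTER TABLE scores ENABLE ROW LEVEL SECURITY;
ALTER TABLE scores FORCE ROW LEVEL SECURITY;
CREATE POLICY lender_isolation ON scores
    USING (lender_id = current_setting('app.current_lender_id')::uuid);
"""

SET_LENDER_SQL = "SELECT set_config('app.current_lender_id', %s, true)"


@contextmanager
def lender_transaction(conn, lender_id: str):
    """Yield a cursor whose queries are filtered to one lender.

    The third argument of set_config (is_local = true) limits the setting
    to the current transaction, so a pooled connection does not carry one
    lender's identity over to the next request.
    """
    cur = conn.cursor()
    try:
        cur.execute(SET_LENDER_SQL, (lender_id,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

In this sketch the one-argument form of `current_setting` raises an error when the variable was never set, so a session that forgets to set it fails loudly; whether production behaves the same way depends on how the real policy and database roles are defined.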

Alternatives Considered

Application-layer filtering only: Simpler, but a single SQL injection or code bug exposes cross-lender data. Not acceptable given the sensitivity of credit scores. Discarded — defense-in-depth requires database-level enforcement.
Separate database per lender: Maximum isolation but operationally unsustainable at 50 lenders (50× schema migrations, connection pools, backup jobs). Discarded.
MongoDB with tenant field filtering: No native RLS equivalent. Application-layer filtering only. Discarded for same reason as option 1.
CockroachDB: Distributed SQL, strong consistency, but no India-region AWS deployment at time of decision. RLS support less mature. Discarded.

Consequences

Positive: Cross-lender isolation enforced at the database engine, not just in application code. PostgreSQL JSONB column handles reason_codes[] without a separate table. ACID transactions guarantee consent + audit atomicity. Amazon RDS Multi-AZ provides HA with automated failover.

Negative: RLS adds ~5–10% query overhead on scores table. Every connection must set the session variable correctly; if a service misses it, all rows are visible to that session (mitigated by a CI test that verifies RLS enforcement). Schema migrations on large tables (score_history) require careful planning with pg_repack.

ADR-007

JWT RS256 for NBFC API Authentication over Full OAuth 2.0 Flow

May 2026 · Authors: API Engineering Lead, CISO

Accepted

Context & Decision

NBFC partners need to authenticate machine-to-machine API calls at high frequency (up to 100 req/min). Authentication must be stateless (no session lookup on every call), cryptographically strong, and operable by NBFC engineering teams without complex OAuth flows. Admin users need a separate, higher-assurance channel.

Decision: JWT RS256 for the NBFC API. Lenders obtain a 1-hour JWT via a client_credentials exchange and present it on every request. FIDO2 hardware key + SAML 2.0 SSO for human admin users. These are intentionally separate trust tiers.

Alternatives Considered

API Key only (static secret in header): Simple to implement but: (a) no expiry without explicit revocation, (b) key rotation requires coordination, (c) no claims-based authorization. Insufficient for financial API. Discarded as sole mechanism (used as second factor alongside JWT).
Full OAuth 2.0 Authorization Code flow: Best practice for user-delegated access but adds redirect-based flow complexity inappropriate for M2M API calls. NBFCs would need to operate a redirect URI. Over-engineering for our use case.
mTLS for NBFC API: Strong cryptographic identity but certificate management for 50 NBFC partners (each with their own cert rotation schedule) is operationally complex. mTLS is reserved for ERP connectors (fewer partners, known infrastructure).
HMAC-signed requests (AWS SigV4 style): Strong but requires NBFC teams to implement signing logic. Higher integration friction than JWT which has universal library support. Discarded.

Consequences

Positive: JWT is stateless: the API gateway verifies the signature without a database round-trip. RS256 asymmetric signing means NBFC partners never see the private key. 1-hour TTL limits blast radius of token compromise. JWKS endpoint enables seamless key rotation.

Negative: JWT cannot be revoked before expiry (1-hour window for compromised tokens). Mitigated by: blocklist in Redis for known-compromised JTIs, short TTL. NBFC teams must implement token refresh logic. jti claim uniqueness must be validated to prevent replay attacks.
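The expiry and blocklist mitigations can be sketched as a claims check that runs after the gateway has already verified the RS256 signature with a JWT library (signature verification is deliberately not shown). The function name, the rejection strings, and the use of a plain Python set in place of the Redis blocklist are assumptions for illustration.

```python
import time
from typing import Mapping, Optional, Set

MAX_TTL_SECONDS = 3600  # the 1-hour token lifetime from this ADR


def check_claims(
    claims: Mapping[str, object],
    revoked_jtis: Set[str],
    now: Optional[float] = None,
) -> Optional[str]:
    """Return None if the verified claims are acceptable, else a reason.

    `revoked_jtis` stands in for the blocklist of known-compromised
    token IDs; in production this lookup would hit Redis.
    """
    now = time.time() if now is None else now
    jti, iat, exp = claims.get("jti"), claims.get("iat"), claims.get("exp")

    if not isinstance(jti, str) or not jti:
        return "missing_jti"
    if not isinstance(iat, (int, float)) or not isinstance(exp, (int, float)):
        return "missing_timestamps"
    if exp <= now:
        return "expired"
    if exp - iat > MAX_TTL_SECONDS:
        # A token claiming a longer life than policy allows is rejected.
        return "ttl_too_long"
    if jti in revoked_jtis:
        return "revoked"
    return None
```

Only the blocklist lookup touches shared state, so every other check stays a local computation on the gateway.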

ADR-008

Kafka for Async Inter-Service Events over Synchronous HTTP Callbacks

May 2026 · Authors: Platform Engineering Lead, CTO

Accepted

Context & Decision

Several critical side effects must happen asynchronously without blocking API responses: audit logging, consent revocation propagation, score cache invalidation, notification triggering, and pipeline triggering. These cannot be synchronous (audit write cannot block score API response) but also cannot be lost (audit records are legally required).

Decision: Apache Kafka (Amazon MSK) as the durable event bus for all async inter-service communication. Avro schema registry for type-safe events. Outbox pattern for at-least-once delivery guarantees. Dead-letter queues for failed events.
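To make the outbox pattern and its at-least-once consequence concrete, the sketch below uses an in-memory SQLite database in place of PostgreSQL and a plain callable in place of a Kafka producer. All table, topic, and function names are hypothetical. The point is the shape: the state change and the event are committed in one transaction, a relay publishes unpublished rows, and the consumer deduplicates by event ID.

```python
import json
import sqlite3
import uuid
from typing import Callable, List, Set, Tuple


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE consent (retailer_uuid TEXT, lender_id TEXT, status TEXT);
        CREATE TABLE outbox (event_id TEXT PRIMARY KEY, topic TEXT,
                             payload TEXT, published INTEGER DEFAULT 0);
        """
    )
    return db


def revoke_consent(db: sqlite3.Connection, retailer_uuid: str, lender_id: str) -> str:
    """Write the revocation and its event atomically; return the event id."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"retailer_uuid": retailer_uuid, "lender_id": lender_id})
    with db:  # one transaction: both rows commit or neither does
        db.execute("INSERT INTO consent VALUES (?, ?, 'revoked')",
                   (retailer_uuid, lender_id))
        db.execute("INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
                   (event_id, "consent.revoked", payload))
    return event_id


def relay_outbox(db: sqlite3.Connection,
                 publish: Callable[[str, str, str], None]) -> int:
    """Publish pending events; a row is marked only after publish returns."""
    rows = db.execute(
        "SELECT event_id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, event_id, payload)  # if this raises, the row is retried later
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                       (event_id,))
    return len(rows)


class AuditConsumer:
    """Idempotent consumer: a redelivered event id is ignored."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.records: List[Tuple[str, dict]] = []

    def handle(self, topic: str, event_id: str, payload: str) -> None:
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        self.records.append((topic, json.loads(payload)))
```

If the relay crashes after publishing but before marking the row, the event is published again on the next run; that duplicate is why the consumer must be idempotent.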

Alternatives Considered

Synchronous HTTP callbacks (service-to-service): Simple but creates tight coupling — a slow Audit Service blocks the Score API. Cascading failures. Discarded for side-effects that should not block the critical path.
AWS SQS + SNS: Simpler operational model than Kafka but: (a) max 256KB message size constrains audit payloads, (b) no guaranteed ordering within a partition (needed for consent events), (c) no native consumer group replay for debugging. Discarded.
RabbitMQ: Good for task queues but lacks Kafka's durable log semantics — cannot replay events after consumer failure without explicit design. For audit records that must never be lost, Kafka's append-only log is a better fit.
Redis Pub/Sub: Low latency but no persistence — messages lost if consumer is down. Unacceptable for audit events. Discarded (Redis is used for cache, not durable events).

Consequences

Positive: Kafka's durable log means audit events are never lost even if Audit Service is down (replayed on recovery). Consumer groups enable independent scaling. Event replay is invaluable for debugging and DR scenarios. Consent revocation propagates to all consumers within seconds.

Negative: Kafka (MSK) adds operational complexity and cost (~$400/month for 3-broker cluster). Team needs Kafka expertise. At-least-once delivery requires idempotent consumers everywhere. Schema evolution (Avro) requires careful backward compatibility discipline.

ADR-009

SHAP TreeExplainer for Mandatory Reason Codes

May 2026 · Authors: ML Lead, Head of Risk

Accepted

Context & Decision

RBI FLDG Guidelines require that every credit score delivered to an NBFC partner be accompanied by human-readable reason codes explaining the key positive and negative factors. DPDP Act further requires that automated decision-making be explainable to the data subject (the retailer). We need a method that: (a) produces accurate, stable feature attributions, (b) works natively with XGBoost, (c) can be translated into categorical reason codes, and (d) runs within our latency budget.

Decision: SHAP (SHapley Additive exPlanations) TreeExplainer for XGBoost. Top-N SHAP values are mapped to a fixed 24-code taxonomy with Hindi translations. A score is blocked (not returned) if SHAP computation fails.

Alternatives Considered

LIME (Local Interpretable Model-agnostic Explanations): Model-agnostic and flexible, but produces approximations (not exact attributions). For tree models, SHAP TreeExplainer gives exact Shapley values — strictly more accurate. LIME also has higher variance across runs (non-deterministic). Discarded.
Feature importance (global, permutation-based): Produces population-level importance, not individual-level attribution. Cannot explain why this retailer got this specific score. Does not satisfy FLDG per-request reason code requirement. Discarded.
Rule extraction (decision tree approximation): Approximate the XGBoost model with a shallow decision tree, then read the decision path as "reasons." Loses accuracy of XGBoost significantly. Discarded.
Manual rule-based reason codes: Hardcoded thresholds (e.g., "if avg_delay > 5 → PAYMENT_DELAY_HIGH"). Simple but: (a) doesn't reflect actual model attribution, (b) reason codes may contradict score if model weighting differs from thresholds. Discarded.

Consequences

Positive: SHAP TreeExplainer is exact (not approximate) for tree models. Consistent with model logic: reasons always reflect actual feature contributions. Satisfies both RBI FLDG and DPDP explainability requirements. Enables retailer-level dispute resolution (we can show exactly which features drove a low score).

Negative: SHAP adds ~50–200ms to scoring time per retailer (acceptable within p95 budget). SHAP numeric values must be converted to categorical codes, so the translation taxonomy must be maintained and versioned. Exposing SHAP values directly would enable gaming, so only categorical codes are returned in the API (numeric values are internal only).

ADR-010

Compile-Time Guardrail Constants over Runtime Configuration

May 2026 · Authors: Head of Risk, CTO, DPO

Accepted

Context & Decision

Guardrail thresholds (30% GMV affordability ceiling, AIR ≥ 0.80, PSI > 0.20 model refresh) protect retailers from exploitative lending and protect the platform from regulatory sanctions. If these thresholds are stored in a runtime config (database, environment variable, feature flag), they can be changed by a malicious insider, a misconfigured deploy, or a vendor compromise without audit or governance.

Decision: All guardrail thresholds are compile-time constants in the Scoring Engine codebase. They appear in the source code, are reviewed in code review, and can only change via a full deployment with Guardrail Committee sign-off. They are never read from environment variables, databases, or feature flag systems at runtime.
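A sketch of what source-level guardrail constants could look like. Python has no true compile-time constants, so module-level `Final` values stand in for them here; the constant and function names are hypothetical. The sketch also assumes that AIR means adverse impact ratio (approval rate of a protected group divided by that of a reference group), that the 30% ceiling caps a credit limit as a share of GMV, and that PSI uses its usual definition, the sum over bins of (actual − expected) · ln(actual / expected).

```python
import math
from typing import Final, Sequence

# Thresholds from this ADR, fixed in source and never read from env, DB, or flags.
MAX_LIMIT_SHARE_OF_GMV: Final = 0.30
MIN_ADVERSE_IMPACT_RATIO: Final = 0.80
PSI_REFRESH_THRESHOLD: Final = 0.20


def cap_credit_limit(proposed_limit: float, gmv: float) -> float:
    """Clamp a proposed limit to the affordability ceiling."""
    return min(proposed_limit, MAX_LIMIT_SHARE_OF_GMV * gmv)


def passes_fairness(protected_rate: float, reference_rate: float) -> bool:
    """True when the adverse impact ratio meets the minimum."""
    return protected_rate / reference_rate >= MIN_ADVERSE_IMPACT_RATIO


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float]) -> float:
    """PSI over matching bins; every proportion must be strictly positive."""
    if len(expected) != len(actual):
        raise ValueError("bin counts differ")
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def needs_model_refresh(psi: float) -> bool:
    """True when score drift exceeds the refresh threshold."""
    return psi > PSI_REFRESH_THRESHOLD
```

`Final` is enforced by static type checkers, not by the interpreter, so in a Python codebase the tamper-resistance would come from code review and the deployment process described in this ADR, not from the language.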

Alternatives Considered

Database-stored thresholds (runtime configurable): Flexible — operations team can tune thresholds without a deployment. But: a single DB write bypasses all governance. Insider threat or SQL injection could silently remove all credit limits. Discarded — flexibility is the exact risk we're trying to eliminate.
Feature flags (LaunchDarkly / GrowthBook): Convenient for A/B testing product features, but feature flag changes bypass code review and deployment governance. A flag flip made as a legitimate config change is indistinguishable from one that violates policy. Discarded for safety-critical thresholds.
Signed configuration files (HSM-signed JSON): Would require runtime HSM signature verification on every score — adds latency and HSM availability dependency. Complexity not justified when compile-time constants achieve the same tamper-resistance more simply. Discarded.
Separate policy service with access controls: Could provide auditability via an API, but introduces another service that could fail, be compromised, or be misconfigured. Adding runtime network dependency to a safety-critical constraint is a fragility. Discarded.

Consequences

Positive: Guardrail thresholds are visible in code review; any change is subject to the full PR + Committee approval process. No runtime bypass possible. Scoring Engine always uses the same thresholds that were tested and approved. Tamper evidence is trivial: git history shows every change with author and timestamp.

Negative: Threshold changes require a full deployment (30–60 min) rather than a config update (seconds). In a genuine emergency (e.g., model producing dangerously high limits due to a bug), response time is slower. Mitigated by: model freeze capability is a separate, fast operation; we can freeze scoring without changing thresholds.