Each ADR captures a key architectural decision, the context that drove it, alternatives that were seriously considered, and the consequences — both positive and negative — of the chosen path. ADRs are immutable once accepted; superseding decisions create a new ADR that references the old one.
We need a classification model that predicts probability of default (PD) for informal Indian retailers from ERP transaction features. The model must be: explainable (RBI FLDG mandates reason codes), robust on tabular/sparse data, trainable on a limited initial dataset (<100K labelled examples), and operationally simple enough for a small ML team to maintain.
Decision: Use XGBoost (gradient-boosted trees) as the primary model, with SHAP TreeExplainer for feature attribution, and Croston's Method as a sub-model for sparse/seasonal retailers.
Positive: SHAP TreeExplainer gives exact feature attributions (not approximations), satisfying the RBI FLDG reason-code mandate. XGBoost is battle-tested in credit scoring globally. It handles missing values natively. Retraining is fast (<30 min on the full dataset). Negative: XGBoost doesn't natively model temporal sequences, so we compensate with engineered time-series features. Croston's sub-model adds routing complexity. Neural approaches may outperform in Phase 3+ as outcome labels accumulate.
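To make the Croston sub-model concrete, here is a minimal sketch of Croston's method in plain Python. The smoothing constant `alpha`, the `route_model` helper, and its `zero_share_threshold` are illustrative assumptions; this ADR does not state how retailers are actually routed or how the Croston output feeds the PD model.

```python
from typing import Sequence


def croston_forecast(demand: Sequence[float], alpha: float = 0.1) -> float:
    """Croston's per-period forecast for an intermittent demand series.

    Nonzero demand sizes and the gaps between them are smoothed separately
    with simple exponential smoothing; the forecast is size / interval.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    size = None        # smoothed nonzero demand size
    interval = None    # smoothed number of periods between nonzero demands
    periods_since = 1  # periods elapsed since the last nonzero demand

    for value in demand:
        if value > 0:
            if size is None:
                # Initialise both estimates from the first nonzero observation.
                size, interval = float(value), float(periods_since)
            else:
                size += alpha * (value - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1

    if size is None:
        return 0.0  # no demand observed at all
    return size / interval


def route_model(monthly_orders: Sequence[float],
                zero_share_threshold: float = 0.5) -> str:
    """Hypothetical router: send mostly-empty histories to the Croston sub-model."""
    if not monthly_orders:
        return "croston"
    zero_share = sum(1 for v in monthly_orders if v == 0) / len(monthly_orders)
    return "croston" if zero_share >= zero_share_threshold else "xgboost"
```

A retailer who orders 6 units every third month gets a forecast of 2 units per month, and that same history is routed to the Croston path under the assumed threshold.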
The DPDP Act 2023 requires all personal data related to Indian citizens to be stored within India. We need a cloud provider with India-region infrastructure, mature managed ML/data services (Kafka, Spark, Kubernetes), and strong support for Customer Managed Keys (CMK). We also need a DR strategy that avoids single-cloud vendor lock-in.
Decision: AWS ap-south-1 (Mumbai) as primary. Azure Central India (Pune) as passive DR. All data — at rest, in transit, in computation — remains within India geographic boundaries at all times.
Positive: Full DPDP data localisation compliance. AWS MSK, EMR, EKS, RDS, S3 are mature and well-understood by the team. Multi-cloud DR reduces catastrophic vendor risk. Negative: Operating two cloud environments increases infrastructure complexity and cost (~15% overhead for DR replication). Team must maintain Azure familiarity even if rarely used. DNS failover to Azure involves manual steps (quarterly drill required).
We receive batch ERP data from thousands of distributors, covering millions of transaction records. We need to: (1) preserve raw data immutably for audit, (2) pseudonymize PII before ML processing, (3) compute 40+ features per retailer per month efficiently, and (4) serve scores with low latency. These concerns pull in different directions: a single OLTP store cannot efficiently satisfy all four.
Decision: Three-tier medallion lake on S3 (Bronze → Silver → Gold), with dbt + Spark for transformations, and PostgreSQL as the operational serving layer for the Score API.
Positive: Bronze WORM provides genuine immutability for audit. Silver cleanly separates PII-containing from PII-free data — a hard DPDP boundary. Gold serves fast feature reads for ML. S3 + Parquet scales to billions of records cheaply. Negative: Three-layer pipeline adds latency (Bronze → Gold takes up to 4 hours). Engineers need Spark + dbt + SQL skills. Data lineage requires explicit tooling (OpenLineage). Score API cannot query Bronze directly — must wait for Gold layer.
DPDP Act 2023 requires free, specific, informed, and unambiguous consent from retailers before processing their personal data. Consent must be demonstrably obtained and revocable at any time. We need a channel that: (1) reaches 12M+ informal retailers reliably, (2) supports vernacular (Hindi) communication, (3) creates a documented, timestamped consent record, and (4) allows retailers to exercise rights (revocation, grievance) without a smartphone app or web browser.
Decision: WhatsApp Business API as the primary consent channel, with SMS as fallback. The entire 5-step consent flow (Invite → Notice → Decision → Record → Confirmation) runs via WhatsApp messages.
Positive: WhatsApp reaches 500M+ users in India, including feature phone users (via basic WhatsApp). It supports rich Hindi text. Webhook delivery provides a timestamped record of every message. Revocation via a WhatsApp "NAHI" reply is intuitive for retailers. Negative: Meta (WhatsApp) is a third-party sub-processor, so a DPA is required (currently under review). WhatsApp API downtime blocks consent operations (SMS fallback mitigates). Meta policy changes (e.g., template restrictions) could impact the consent flow. The WhatsApp message_id serves as consent proof, so auditors must trust Meta's delivery timestamps.
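The 5-step flow and the "NAHI" revocation can be pictured as a small state machine. The sketch below is illustrative only: the `ConsentSession` class, its status names, and the `wamid.*` message ids are assumptions, and the webhook integration with the WhatsApp Business API is left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple

# Step names taken from the decision above; order is enforced.
FLOW = ("INVITE", "NOTICE", "DECISION", "RECORD", "CONFIRMATION")


@dataclass
class ConsentSession:
    retailer_uuid: str
    status: str = "PENDING"  # PENDING -> GRANTED -> REVOKED, or PENDING -> DECLINED
    # Each event is (step, message_id, UTC timestamp): the documented record.
    events: List[Tuple[str, str, datetime]] = field(default_factory=list)

    def _log(self, step: str, message_id: str) -> None:
        self.events.append((step, message_id, datetime.now(timezone.utc)))

    def advance(self, step: str, message_id: str) -> None:
        """Record the next flow step; reject anything out of order."""
        if self.status != "PENDING":
            raise ValueError(f"flow is closed (status={self.status})")
        done = sum(1 for s, _, _ in self.events if s in FLOW)
        expected = FLOW[done]
        if step != expected:
            raise ValueError(f"expected {expected}, got {step}")
        self._log(step, message_id)
        if step == "CONFIRMATION":
            self.status = "GRANTED"

    def handle_reply(self, text: str, message_id: str) -> None:
        """Treat a 'NAHI' reply as a decline (before grant) or a revocation (after)."""
        if text.strip().upper() != "NAHI":
            return
        self._log("NAHI", message_id)
        if self.status == "GRANTED":
            self.status = "REVOKED"
        elif self.status == "PENDING":
            self.status = "DECLINED"
```

Keeping the message id and timestamp on every event is what lets the log double as the consent record; a production version would persist these events rather than hold them in memory.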
We must prevent PII (GST numbers, phone numbers) from reaching the ML pipeline while still being able to: (a) link records across ERP batches to the same retailer, (b) exercise DPDP rights (erasure, portability) by retailer, and (c) resolve retailer identity for the Score API lookup endpoint.
Decision: Replace GST and phone with a stable, random UUID at the Bronze → Silver boundary. The GST → UUID mapping lives in a physically separate, access-isolated identity store. The ML pipeline operates exclusively on UUIDs — it has zero access to the identity store.
Positive: The ML pipeline has zero knowledge of real retailer identities, so even a full Silver/Gold lake breach reveals no PII. DPDP erasure is operationally clean: delete the UUID from the identity store, then delete Silver/Gold records by UUID. A stable UUID counts as pseudonymization (not anonymization) under DPDP, so we retain the ability to re-identify retailers for rights requests. Negative: The identity store becomes a crown jewel: its compromise re-links UUIDs to real GST numbers. It must be physically isolated, access-logged, and separately encrypted. Cross-distributor deduplication (same retailer, two distributors) requires an identity store lookup, which adds latency to the dedup pipeline.
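As a rough illustration of the Bronze → Silver boundary, the sketch below swaps the GST number for a random UUID and strips PII fields from the row. The `IdentityStore` class, the field names `gst` and `phone`, and the in-memory dictionary are assumptions for illustration; the real identity store is a separate, access-isolated system.

```python
import uuid
from typing import Dict, Optional

PII_FIELDS = {"gst", "phone"}  # assumed column names for the PII named above


class IdentityStore:
    """Stand-in for the isolated GST -> UUID mapping (in memory for this sketch)."""

    def __init__(self) -> None:
        self._gst_to_uuid: Dict[str, str] = {}

    @staticmethod
    def _key(gst: str) -> str:
        return gst.strip().upper()

    def pseudonymize(self, gst: str) -> str:
        """Return the stable UUID for this GST number, minting one on first sight."""
        key = self._key(gst)
        if key not in self._gst_to_uuid:
            # Random (version 4) UUID: not derivable from the GST number.
            self._gst_to_uuid[key] = str(uuid.uuid4())
        return self._gst_to_uuid[key]

    def erase(self, gst: str) -> Optional[str]:
        """Drop the mapping and return the UUID so downstream rows can be deleted."""
        return self._gst_to_uuid.pop(self._key(gst), None)


def to_silver(bronze_row: dict, store: IdentityStore) -> dict:
    """Copy a Bronze row without PII fields and attach the retailer UUID."""
    silver = {k: v for k, v in bronze_row.items() if k not in PII_FIELDS}
    silver["retailer_uuid"] = store.pseudonymize(bronze_row["gst"])
    return silver
```

Because the UUID is random rather than a hash of the GST number, the mapping cannot be rebuilt from lake data alone, and erasing the mapping severs the link for good.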
The Score API serves up to 50 concurrent NBFC lenders, each of whom must only see scores for retailers who have consented to that specific lender. Isolation must be enforced even if application code has a bug — we need defense-in-depth at the database layer. We also need ACID transactions for consent records (a consent grant and its audit record must be written atomically).
Decision: PostgreSQL with Row-Level Security (RLS) policies that enforce per-lender score isolation at the database engine level. Application code sets the app.current_lender_id session variable; the RLS policy filters accordingly. No lender can retrieve another's consented retailers even via SQL injection.
Positive: Cross-lender isolation is enforced at the database engine, not just in application code. A PostgreSQL JSONB column handles reason_codes[] without a separate table. ACID transactions guarantee consent + audit atomicity. Amazon RDS Multi-AZ provides HA with automated failover. Negative: RLS adds ~5–10% query overhead on the scores table. Every connection must set the session variable correctly; if a service misses it, all rows are visible to that session (mitigated by a CI test that verifies RLS enforcement). Schema migrations on large tables (score_history) require careful planning with pg_repack.
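One possible shape for the policy and the per-request session setup is sketched below as SQL strings held in Python. The column name `lender_id`, the policy name, and the psycopg-style `%s` placeholder are assumptions. This variant is written to fail closed (an unset variable matches no rows), which is a choice made for the sketch; the production policy may behave differently.

```python
from typing import List, Tuple

# DDL run once by a migration. FORCE makes the policy apply to the table owner
# too; superusers and roles with BYPASSRLS are still exempt.
RLS_SETUP_SQL = """
ALTER TABLE scores ENABLE ROW LEVEL SECURITY;
ALTER TABLE scores FORCE ROW LEVEL SECURITY;
CREATE POLICY lender_isolation ON scores
    USING (lender_id::text = current_setting('app.current_lender_id', true));
"""

# set_config(..., true) scopes the value to the current transaction, so a
# pooled connection cannot leak one lender's id into the next request.
SET_LENDER_SQL = "SELECT set_config('app.current_lender_id', %s, true)"

Statement = Tuple[str, tuple]


def scoped_statements(lender_id: str, sql: str, params: tuple = ()) -> List[Statement]:
    """Return the statements to execute, in order, inside ONE transaction."""
    if not lender_id:
        raise ValueError("lender_id is required before querying scores")
    return [(SET_LENDER_SQL, (lender_id,)), (sql, params)]
```

With `current_setting(..., true)` a missing variable yields NULL instead of an error, and the comparison then matches no rows. The lender id is always passed as a bound parameter, never interpolated into the SQL text.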
NBFC partners need to authenticate machine-to-machine API calls at high frequency (up to 100 req/min). Authentication must be stateless (no session lookup on every call), cryptographically strong, and operable by NBFC engineering teams without complex OAuth flows. Admin users need a separate, higher-assurance channel.
Decision: JWT RS256 for the NBFC API: lenders obtain a 1-hour JWT via a client_credentials exchange and present it on every request. FIDO2 hardware key + SAML 2.0 SSO for human admin users. These are intentionally separate trust tiers.
Positive: JWT is stateless, so the API gateway verifies the signature without a database round-trip. RS256 asymmetric signing means NBFC partners never see the private key. The 1-hour TTL limits the blast radius of a token compromise. A JWKS endpoint enables seamless key rotation. Negative: A JWT cannot be revoked before expiry, leaving a window of up to 1 hour for compromised tokens; this is mitigated by the short TTL and a Redis blocklist of known-compromised JTIs. NBFC teams must implement token refresh logic. jti claim uniqueness must be validated to prevent replay attacks.
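The claim checks that sit behind signature verification might look roughly like this. The sketch assumes the RS256 signature has already been verified by a JWT library against the JWKS keys, and it uses a plain Python set where the real system would query the Redis blocklist; `validate_claims` and `TokenRejected` are hypothetical names.

```python
import time
from typing import Optional, Set

MAX_TOKEN_TTL_SECONDS = 3600  # the 1-hour TTL from the decision above


class TokenRejected(Exception):
    """Raised when an already signature-verified token fails a claim check."""


def validate_claims(claims: dict, blocklist: Set[str],
                    now: Optional[float] = None) -> None:
    """Check expiry, TTL cap, and the compromised-JTI blocklist.

    Signature verification is assumed to have happened before this call.
    """
    now = time.time() if now is None else now
    jti, iat, exp = claims.get("jti"), claims.get("iat"), claims.get("exp")

    if not jti or iat is None or exp is None:
        raise TokenRejected("missing jti, iat, or exp claim")
    if now >= exp:
        raise TokenRejected("token expired")
    if exp - iat > MAX_TOKEN_TTL_SECONDS:
        raise TokenRejected("token lifetime exceeds the 1-hour policy")
    if jti in blocklist:
        raise TokenRejected("token has been revoked")
```

Only the blocklist lookup touches shared state, and blocklist entries can expire together with the tokens they refer to, so the set never needs to hold more than one hour of revocations.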
Several critical side effects must happen asynchronously without blocking API responses: audit logging, consent revocation propagation, score cache invalidation, notification triggering, and pipeline triggering. These cannot be synchronous (audit write cannot block score API response) but also cannot be lost (audit records are legally required).
Decision: Apache Kafka (Amazon MSK) as the durable event bus for all async inter-service communication. Avro schema registry for type-safe events. Outbox pattern for at-least-once delivery guarantees. Dead-letter queues for failed events.
Positive: Kafka's durable log means audit events are never lost even if Audit Service is down (replayed on recovery). Consumer groups enable independent scaling. Event replay is invaluable for debugging and DR scenarios. Consent revocation propagates to all consumers within seconds. Negative: Kafka (MSK) adds operational complexity and cost (~$400/month for 3-broker cluster). Team needs Kafka expertise. At-least-once delivery requires idempotent consumers everywhere. Schema evolution (Avro) requires careful backward compatibility discipline.
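The outbox pattern and the idempotent-consumer requirement can be sketched with SQLite standing in for PostgreSQL and a callback standing in for the Kafka producer. The table layout, the topic name `consent.revoked`, and the function names are assumptions for illustration. For brevity the consumer's `processed` table lives in the same database, whereas a real consumer would own its own store.

```python
import json
import sqlite3
import uuid
from typing import Callable


def init(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE consent (retailer_uuid TEXT, lender_id TEXT, status TEXT);
        CREATE TABLE outbox (event_id TEXT PRIMARY KEY, topic TEXT,
                             payload TEXT, published INTEGER DEFAULT 0);
        CREATE TABLE processed (event_id TEXT PRIMARY KEY);
    """)


def revoke_consent(conn: sqlite3.Connection, retailer_uuid: str, lender_id: str) -> str:
    """Write the state change and its event in ONE transaction (the outbox)."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"retailer_uuid": retailer_uuid, "lender_id": lender_id})
    with conn:  # commits both writes together, or rolls both back
        conn.execute("UPDATE consent SET status = 'REVOKED' "
                     "WHERE retailer_uuid = ? AND lender_id = ?",
                     (retailer_uuid, lender_id))
        conn.execute("INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
                     (event_id, "consent.revoked", payload))
    return event_id


def relay(conn: sqlite3.Connection, publish: Callable[[str, str, str], None]) -> int:
    """Publish unpublished events. A crash between publish and the UPDATE
    causes a re-send on the next run, hence at-least-once delivery."""
    rows = conn.execute(
        "SELECT event_id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, topic, payload in rows:
        publish(topic, event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                         (event_id,))
    return len(rows)


def handle_once(conn: sqlite3.Connection, event_id: str,
                handler: Callable[[], None]) -> bool:
    """Idempotent consumer: run the handler only the first time an id is seen."""
    try:
        with conn:
            conn.execute("INSERT INTO processed (event_id) VALUES (?)", (event_id,))
            handler()
    except sqlite3.IntegrityError:
        return False  # duplicate delivery, already handled
    return True
```

The primary key on `processed.event_id` does the deduplication: a redelivered event fails the insert, the transaction rolls back, and the handler never runs a second time.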
RBI FLDG Guidelines require that every credit score delivered to an NBFC partner be accompanied by human-readable reason codes explaining the key positive and negative factors. DPDP Act further requires that automated decision-making be explainable to the data subject (the retailer). We need a method that: (a) produces accurate, stable feature attributions, (b) works natively with XGBoost, (c) can be translated into categorical reason codes, and (d) runs within our latency budget.
Decision: SHAP (SHapley Additive exPlanations) TreeExplainer for XGBoost. Top-N SHAP values are mapped to a fixed 24-code taxonomy with Hindi translations. A score is blocked (not returned) if SHAP computation fails.
Positive: SHAP TreeExplainer is exact (not approximate) for tree models. Consistent with model logic — reasons always reflect actual feature contributions. Satisfies both RBI FLDG and DPDP explainability requirements. Enables retailer-level dispute resolution (we can show exactly which features drove a low score). Negative: SHAP adds ~50–200ms to scoring time per retailer (acceptable within p95 budget). SHAP numeric values must be converted to categorical codes — translation taxonomy must be maintained and versioned. Exposing SHAP values directly would enable gaming — so only categorical codes are returned in the API (numeric values are internal only).
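A minimal sketch of the SHAP-to-reason-code step follows. It takes per-feature SHAP values as a plain dictionary (in practice these would come from TreeExplainer) and assumes positive values push the predicted PD up. The feature names and the `RC..` codes are hypothetical; the real 24-code taxonomy and its Hindi translations are not reproduced here.

```python
from typing import Callable, List, Mapping, Optional

# Hypothetical excerpt of the feature -> reason-code taxonomy.
FEATURE_TO_CODE = {
    "avg_payment_delay_days": "RC01",
    "order_frequency_trend": "RC02",
    "gmv_volatility": "RC03",
}


def top_reason_codes(shap_values: Mapping[str, float], n: int = 3) -> List[dict]:
    """Map the n largest-magnitude SHAP values to categorical codes.

    Numeric SHAP values stay internal: only code and direction are emitted.
    """
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    codes: List[dict] = []
    for feature, value in ranked:
        code = FEATURE_TO_CODE.get(feature)
        if code is None:
            continue  # feature has no entry in the taxonomy
        # Assumption: a positive SHAP value raises PD, so it counts against the retailer.
        direction = "NEGATIVE" if value > 0 else "POSITIVE"
        codes.append({"code": code, "direction": direction})
        if len(codes) == n:
            break
    return codes


def score_response(score: float,
                   compute_shap: Callable[[], Mapping[str, float]]) -> Optional[dict]:
    """Return the API payload, or None (score blocked) if explanation fails."""
    try:
        shap_values = compute_shap()
    except Exception:
        return None  # never return a score without its reasons
    codes = top_reason_codes(shap_values)
    if not codes:
        return None
    return {"score": score, "reason_codes": codes}
```

Blocking on any explanation failure keeps the "no score without reasons" rule in one place, and the response never carries raw SHAP numbers that could be used to game the model.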
Guardrail thresholds (30% GMV affordability ceiling, AIR ≥ 0.80, PSI > 0.20 model refresh) protect retailers from exploitative lending and protect the platform from regulatory sanctions. If these thresholds are stored in a runtime config (database, environment variable, feature flag), they can be changed by a malicious insider, a misconfigured deploy, or a vendor compromise without audit or governance.
Decision: All guardrail thresholds are compile-time constants in the Scoring Engine codebase. They appear in the source code, are reviewed in code review, and can only change via a full deployment with Guardrail Committee sign-off. They are never read from environment variables, databases, or feature flag systems at runtime.
Positive: Guardrail thresholds are visible in code review, so any change is subject to the full PR + Committee approval process. No runtime bypass is possible. The Scoring Engine always uses the same thresholds that were tested and approved. Tamper evidence is trivial: git history shows every change with author and timestamp. Negative: Threshold changes require a full deployment (30–60 min) rather than a config update (seconds). In a genuine emergency (e.g., the model producing dangerously high limits due to a bug), response time is slower. This is mitigated by a separate, fast model-freeze capability: we can freeze scoring without changing thresholds.
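In Python terms, hard-coded guardrails could look like the sketch below. Several things here are assumptions: that the 30% ceiling applies to monthly GMV, that AIR stands for adverse impact ratio (protected-group approval rate divided by reference-group approval rate), and all function names. Python has no true compile-time constants, so `Final` only lets a type checker flag reassignment; the real protection is that nothing reads these values from the environment, a database, or a flag service.

```python
import math
from typing import Final, Sequence

# Source-level constants: changing them requires a reviewed code change.
MAX_LIMIT_SHARE_OF_GMV: Final = 0.30    # affordability ceiling
MIN_ADVERSE_IMPACT_RATIO: Final = 0.80  # fairness floor (AIR)
PSI_REFRESH_THRESHOLD: Final = 0.20     # drift level that triggers model refresh


def cap_credit_limit(proposed_limit: float, monthly_gmv: float) -> float:
    """Clamp a proposed limit to the affordability ceiling."""
    return min(proposed_limit, MAX_LIMIT_SHARE_OF_GMV * monthly_gmv)


def passes_air(protected_rate: float, reference_rate: float) -> bool:
    """True if the approval-rate ratio meets the AIR floor."""
    if reference_rate <= 0:
        return False
    return protected_rate / reference_rate >= MIN_ADVERSE_IMPACT_RATIO


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index over matching bins of proportions."""
    if len(expected) != len(actual):
        raise ValueError("bin counts differ")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def needs_refresh(psi_value: float) -> bool:
    return psi_value > PSI_REFRESH_THRESHOLD
```

For example, a proposed limit of 50,000 against a monthly GMV of 100,000 is capped at 30,000, and a score distribution shifting from 50/50 to 80/20 across two bins exceeds the PSI refresh threshold.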