← Hub
AltScore Platform · Data
Data Architecture
& Schema
DOC-ID: DAS-ALTSCORE-001
Version: 1.0 · May 2026
Classification: INTERNAL — Data Engineering & ML
Owner: Data Engineering, AltScore Technologies
3-TierMedallion ArchitectureBronze (raw) → Silver (pseudonymized features) → Gold (scores)
40+Engineered FeaturesComputed in Silver layer via dbt + Spark; fully deterministic
7 yearsAudit & Consent RetentionDPDP Act 2023 compliance; S3 WORM Object Lock for immutable records
RLSPostgreSQL Row-Level SecurityPer-lender data isolation enforced at the database layer
01 —

Data Architecture Overview

The AltScore data architecture follows a medallion (Bronze → Silver → Gold) pattern over a cloud object store, with PostgreSQL as the operational store for scores, consent, and audit records, and Redis as a caching layer. No PII crosses the Bronze → Silver boundary in raw form.

DAS-ALTSCORE-001 · Architecture Charter
ERP SOURCE
Tally / SAP / CSV
Raw invoices
payments
returns
GST numbers
BRONZE LAYER
Raw Immutable
WORM S3
Per-distributor CMK
Append-only
Schema validated
PSEUDONYMIZE
GST → UUID
Identity resolver
ECDSA signed
Isolated store
One-way mapping
SILVER LAYER
Clean Features
No PII
dbt normalized
Outlier-capped
Retailer UUID only
GOLD LAYER
Scores + Features
ML-ready features
XGBoost scores
SHAP codes
PD estimates
OPERATIONAL
PostgreSQL + Redis
Score store RLS
Consent records
Audit log WORM
Redis cache TTL 4h
02 —

Medallion Lake Design

Bronze
Raw · Immutable · PII Present
  • Append-only; never modified post-write
  • S3 Object Lock WORM (compliance mode)
  • Encrypted with per-distributor Customer Managed Key (CMK)
  • Partitioned: s3://lake/bronze/dist_{id}/year/month/
  • Retains raw GST, phone, invoice data as received
  • Schema validation at write time; quarantine on fail
  • Retention: 2 years active + 5 years archive (Glacier)
  • Access: Data Engineering only (IAM scoped)
Silver
Pseudonymized · Normalized · No PII
  • GST and phone replaced with stable retailer_uuid
  • Schema normalized across Tally / SAP / CSV sources
  • Outlier capping applied (e.g., delay days capped at 365)
  • Partitioned: s3://lake/silver/retailer_{uuid}/year/month/
  • dbt source layer models run here
  • Encrypted with shared Silver KMS key (no per-distributor CMK)
  • Retention: consent duration + 6 months, then delete
  • Access: ML Engineering + Data Engineering (IAM scoped)
Gold
Features · Scores · ML-Ready
  • 40+ computed features per retailer per month
  • AltScore value, risk_band, PD, recommended_limit
  • SHAP reason codes (categorical, ranked)
  • Partitioned: s3://lake/gold/scores/year/month/
  • dbt mart layer models run here
  • Mirrored to PostgreSQL scores table for API serving
  • Retention: consent duration + 12 months
  • Access: Scoring Engine + Score API (read only via RLS)

S3 Bucket Structure

# Bucket: altscore-datalake-prod (ap-south-1, versioning ON)
bronze/
  dist_{distributor_id}/
    invoices/year={y}/month={m}/invoices_{batch_id}.parquet
    payments/year={y}/month={m}/payments_{batch_id}.parquet
    quarantine/quarantined_{batch_id}.json # malformed records
silver/
  transactions/year={y}/month={m}/retailer_uuid={uuid}/transactions.parquet
  identity_map/ # GST→UUID resolver (separate, access-restricted)
gold/
  features/year={y}/month={m}/features.parquet # 40+ features per retailer
  scores/year={y}/month={m}/scores.parquet # AltScore + PD + reason codes
audit/
  year={y}/month={m}/day={d}/audit_{timestamp}.parquet # WORM 7yr
mlflow/
  artifacts/models/altscore-xgb-v{n}/model.pkl + sha256.txt # WORM
03 —

Bronze Layer Schema

Bronze stores raw ERP data exactly as received. PII fields (GST, phone, address) are present. Access is restricted to Data Engineering via IAM. No ML or API service has access to Bronze.

bronze_invoicesParquet · S3 Bronze · Partitioned by distributor_id + year + month
ColumnTypePIIDescription
invoice_idSTRINGNoOriginal ERP invoice ID (source system)
distributor_idSTRINGNoAltScore distributor UUID
retailer_gstSTRINGYESGSTIN — pseudonymized in Silver layer
retailer_nameSTRINGYESBusiness name — dropped in Silver layer
retailer_phoneSTRINGYESPhone number — pseudonymized in Silver layer
invoice_dateDATENoInvoice creation date
due_dateDATENoPayment due date per credit terms
amount_inrDECIMAL(15,2)NoInvoice gross amount in INR
erp_typeSTRINGNotally_prime | sap_b1 | csv
ingested_atTIMESTAMPNoIngest timestamp (UTC)
batch_idSTRINGNoSource batch_id from ERP Ingestion Service
_row_hashSTRINGNoSHA-256 of all fields — tamper detection
bronze_paymentsParquet · S3 Bronze · Partitioned by distributor_id + year + month
ColumnTypePIIDescription
payment_idSTRINGNoOriginal ERP payment ID
invoice_idSTRINGNoLinks to bronze_invoices.invoice_id
distributor_idSTRINGNoAltScore distributor UUID
retailer_gstSTRINGYESGSTIN — pseudonymized in Silver layer
paid_dateDATENoDate payment was received
amount_paid_inrDECIMAL(15,2)NoAmount actually paid (may be partial)
payment_methodSTRINGNoNEFT | RTGS | IMPS | cheque | cash
ingested_atTIMESTAMPNoIngest timestamp (UTC)
_row_hashSTRINGNoSHA-256 — tamper detection
04 —

Silver Layer Schema

Silver contains pseudonymized, normalized transaction records. No PII is present. The retailer_gstretailer_uuid mapping lives in a separate access-restricted identity store. ML and API services interact with Silver data only through the Gold layer — never directly.

silver_transactionsParquet · S3 Silver · Partitioned by retailer_uuid + year + month
ColumnTypeDescription
retailer_uuidSTRING (UUID)Stable pseudonymous identifier (replaces GST)
distributor_uuidSTRING (UUID)Pseudonymous distributor identifier
invoice_uuidSTRING (UUID)Pseudonymous invoice ID (deduped)
invoice_dateDATEInvoice creation date
due_dateDATEPayment due date
amount_inrDECIMAL(15,2)Invoice amount (outlier-capped at p99 per distributor)
paid_dateDATE (nullable)Actual payment date; NULL if unpaid
amount_paid_inrDECIMAL(15,2)Actual amount paid (0 if unpaid)
payment_delay_daysINTEGERpaid_date - due_date (negative = early, positive = late, NULL = unpaid)
is_partial_paymentBOOLEANTrue if amount_paid < amount × 0.95
is_returnBOOLEANTrue if invoice is a credit note / return
data_sourceSTRINGtally_prime | sap_b1 | csv
silver_created_atTIMESTAMPSilver pipeline run timestamp
identity_mapPostgreSQL (isolated schema) · Access-restricted · Never in ML pipeline
ColumnTypeDescription
retailer_uuidUUID PRIMARY KEYStable pseudonymous identifier (exposed to API)
gst_hashBYTEAHMAC-SHA256 of GSTIN (allows lookup without storing raw GSTIN)
phone_hashBYTEAHMAC-SHA256 of phone number
distributor_uuidUUIDPrimary distributor UUID
consent_idUUIDFK to consent_records.consent_id
created_atTIMESTAMPFirst time retailer seen in system
erasure_requested_atTIMESTAMP (nullable)Set when DPDP Right-to-Erasure filed (30-day SLA timer)
05 —

Gold Layer Schema

gold_retailer_featuresParquet · S3 Gold + PostgreSQL feature_store · ML-ready
Feature ColumnTypeFamilyDescription
retailer_uuidUUIDKeyStable pseudonymous retailer identifier
feature_monthDATEKeyMonth for which features are computed (YYYY-MM-01)
avg_payment_delay_daysFLOATPaymentMean payment_delay_days over trailing 6 months
p90_payment_delay_daysFLOATPayment90th percentile payment delay
zero_delay_rateFLOATPaymentFraction of invoices paid on or before due date
partial_payment_freqFLOATPaymentFraction of invoices with partial payment
partial_payment_ratioFLOATPaymentMedian (amount_paid / amount) for partial payments
max_consecutive_delaysINTEGERPaymentLongest consecutive delay streak (invoices)
monthly_order_frequencyFLOATOrderAvg invoices per month, trailing 6 months
order_gap_median_daysFLOATOrderMedian days between consecutive invoices
order_gap_cvFLOATOrderCoefficient of variation of inter-order gaps
return_rate_pctFLOATOrder% of total invoiced value returned
return_freq_pctFLOATOrder% of invoices with any return
gmv_trend_slopeFLOATTrajectoryOLS slope of monthly GMV (INR/month)
gmv_3m_vs_12m_ratioFLOATTrajectoryRecent 3-month avg GMV / trailing 12-month avg GMV
gmv_yoy_growthFLOAT (nullable)TrajectoryYear-over-year GMV growth rate (requires 12+ months)
avg_monthly_gmv_logFLOATTrajectorylog(avg_monthly_gmv) — normalised for model
seasonal_amplitudeFLOATSeasonalPeak season / trough season GMV ratio
seasonal_recovery_rateFLOATSeasonalGMV recovery speed post-seasonal trough (0–1)
distributor_tenure_monthsINTEGERRelationshipMonths since first invoice with any distributor
distributor_countINTEGERRelationshipNumber of distinct distributors in history
data_completeness_scoreFLOATQualityWeighted completeness 0–100 (GR-5)
data_freshness_daysINTEGERQualityDays since most recent ERP record
is_sparse_retailerBOOLEANRoutingTrue → Croston's routing; False → XGBoost primary
gold_scoresParquet · S3 Gold + PostgreSQL scores table · Served by Score API
ColumnTypeDescription
score_idUUID PRIMARY KEYImmutable score record identifier
retailer_uuidUUIDFK to identity_map.retailer_uuid
score_monthDATEMonth score is valid for
altscoreINTEGER300–900 integer score
risk_bandCHAR(1)A | B | C | D
probability_of_defaultFLOATRaw XGBoost output, calibrated to PD (0.0–1.0)
recommended_limit_inrDECIMAL(15,0)Guardrail-capped limit (30% of 6-month GMV)
reason_codesJSONBArray of {code, direction, rank, label_en, label_hi}
low_confidenceBOOLEANTrue if completeness 50–69% or freshness >14 days
anomaly_flagBOOLEANTrue if Isolation Forest flagged gaming signals
score_versionSTRINGMLflow model version (e.g., altscore-xgb-v1-prod)
scored_atTIMESTAMPScoring engine run timestamp (UTC)
feature_snapshot_hashBYTEASHA-256 of feature vector — audit / reproducibility
06 —

PostgreSQL Schema Design

PostgreSQL (Amazon RDS, Multi-AZ) is the operational database. Six logical databases are segregated by function. Row-Level Security (RLS) enforces per-lender data isolation on the scores table. All tables use UUIDs as primary keys (no sequential integers exposed to API).

Database: consent_db

Tables

  • consent_records — signed consent artefacts (immutable)
  • consent_scope — lender_id + consent_id + purpose
  • consent_revocations — revocation events + timestamps
  • identity_map — GST_hash → retailer_uuid (access-isolated)
Database: score_db

Tables

  • scores — current scores per retailer (RLS enforced)
  • score_history — monthly score snapshots (12-month ring)
  • score_blocks — records where scoring was blocked + reason
  • feature_store — Gold layer feature cache per retailer
Database: audit_db

Tables

  • audit_log — append-only, no UPDATE/DELETE (DB trigger enforced)
  • audit_log partitioned monthly; old partitions archived to S3 WORM
  • Indexes on: event_type, retailer_uuid, lender_id, timestamp
Database: admin_db

Tables

  • distributors — onboarded distributors + cert metadata
  • lenders — NBFC partner records + API key metadata
  • grievances — grievance cases with SLA timers
  • model_registry_pointers — active model version per env

Row-Level Security (RLS) on scores table

-- Enable RLS on scores table
ALTER TABLE score_db.scores ENABLE ROW LEVEL SECURITY;

-- Policy: lender can only see scores for retailers who consented to them
CREATE POLICY lender_score_isolation ON score_db.scores
  USING (
    retailer_uuid IN (
      SELECT retailer_uuid FROM consent_db.consent_scope
      WHERE lender_id = current_setting('app.current_lender_id')
        AND status = 'active'
    )
  );

-- Set per-request lender context before any query
SET LOCAL app.current_lender_id = 'lender_shriram';
SELECT * FROM score_db.scores WHERE retailer_uuid = 'ret_7f3a...';
07 —

Redis Cache Strategy

Cache Key PatternValueTTLInvalidation
score:{retailer_uuid}:{lender_id}Full score JSON (serialized)4 hoursOn score.computed Kafka event
consent:{retailer_uuid}:{lender_id}active | revoked | none60 secondsOn consent.revoked Kafka event (immediate purge)
ratelimit:{lender_id}:{minute_bucket}Request counter61 secondsNatural TTL expiry
session:{token_jti}JWT validation metadataJWT exp - nowOn JWT expiry or explicit logout
blocklist:{token_jti}revokedJWT exp - nowOn admin token revocation
features:{retailer_uuid}Latest Gold feature vector24 hoursOn pipeline.gold.ready Kafka event

Redis Cluster mode is used (3 shards × 2 replicas). All keys are prefixed with environment (prod:, staging:) to prevent cross-environment contamination. Consent cache TTL is intentionally short (60s) so revocations propagate quickly. Score cache TTL (4h) is longer since scores change only on pipeline runs.

08 —

dbt Model Map

dbt Core manages all SQL transformations from Bronze → Silver → Gold. Models are organized in three layers following dbt best practice: sources (Bronze), staging (Silver normalization), intermediate (feature computation), marts (Gold layer, ML-ready). All models are fully documented and tested via dbt schema tests.

Layerdbt ModelInputOutputKey Transformations
Sourcessrc_bronze_invoicesS3 Bronze ParquetRaw staging refNone — source definition only
Stagingstg_transactionsBronze invoices + paymentsSilver transactionsGST→UUID lookup, schema normalisation, outlier cap
Stagingstg_returnsBronze invoices (credit notes)Silver returnsFlag returns, link to original invoice
Intermediateint_payment_featuresstg_transactionsPayment family featuresavg_delay, p90_delay, zero_delay_rate, partial_payment_freq
Intermediateint_order_featuresstg_transactions, stg_returnsOrder family featuresmonthly_freq, gap_cv, return_rate, gmv_per_order_cv
Intermediateint_trajectory_featuresstg_transactionsTrajectory family featuresOLS GMV slope, yoy_growth, 3m_vs_12m_ratio
Intermediateint_seasonal_featuresstg_transactionsSeasonal family featuresseasonal_amplitude, recovery_rate, festival windows
Intermediateint_relationship_featuresstg_transactionsRelationship family featuresdistributor_tenure, distributor_count, credit_term_utilization
Intermediateint_quality_featuresAll staging modelsData quality dimensionscompleteness_score (GR-5 formula), freshness_days
Martmart_retailer_featuresAll int_ modelsgold_retailer_featuresUnion all 40+ features; routing flag (sparse/XGBoost)
Martmart_scoresScoring Engine outputgold_scoresFinal score + reason codes; written by Scoring Engine (not dbt)
09 —

Data Lineage & Reproducibility

Lineage Tracking

dbt Docs + OpenLineage

Every dbt model records its source tables, transformations applied, and test results. OpenLineage integration emits lineage events to Marquez for cross-pipeline tracking. Any score can be traced back to its source Bronze batch_id.

Reproducibility

Deterministic Transformations

All dbt and Spark transformations are deterministic — same input always produces same output. Feature snapshots are SHA-256 hashed at scoring time, enabling exact reproduction of any historical score for audit or dispute resolution.

Audit Trail

Score Provenance

Every gold_scores record stores: feature_snapshot_hash, score_version (MLflow model ID), scored_at, and the Airflow DAG run_id. A full lineage chain exists from retailer score → features → Bronze records → ERP batch.

10 —

Data Retention & Lifecycle

Data StoreActive RetentionArchive RetentionDeletion TriggerDeletion Method
Bronze (S3 WORM)2 years5 years (Glacier)Automatic lifecycle ruleS3 lifecycle → Glacier; CMK deletion after 7yr
Silver (S3)Consent duration + 6 monthsNoneConsent revocationSpark delete + S3 object delete; UUID nulled
Gold scores (S3)Consent duration + 12 monthsNoneConsent revocationParquet partition delete
PostgreSQL scores36 months rollingPartitioned archiveMonthly partition dropTable partition detach + drop
Consent recordsForever (never deleted)N/ANeverDPDP requirement: consent records immutable
Audit log3 years hot4 years (S3 WORM)Never deletedPartition archive to S3; 7-year total retention
Identity mapActive consentErasure holdConsent revocation + 30-day erasure SLARow deletion + UUID tombstone retained
Redis cacheTTL per key (60s–24h)NoneTTL expiry or event-driven invalidationAutomatic TTL
Model artifactsIndefinite (model lineage)N/ANever deletedWORM; old versions marked deprecated in MLflow

DPDP Act Right-to-Erasure (Article 13): on retailer request, all Silver and Gold records are deleted within 30 days. Bronze records are pseudonymized (CMK deleted so Bronze becomes unreadable). Consent records and audit logs are retained (legal hold exemption under DPDP Act Section 8(7)).