AltScore — Deployment Architecture

01 —

Overview & Cloud Strategy

AltScore deploys on AWS ap-south-1 (Mumbai) as the primary cloud region, with Azure Central India as a disaster-recovery target. All data — in transit, at rest, and in computation — remains within India's geographic boundaries to comply with the DPDP Act 2023 data localisation requirements.

DA-ALTSCORE-001 · Deployment Charter

The infrastructure is entirely containerised (Docker) and orchestrated via Amazon EKS. No pet servers — every compute resource is defined in code (Terraform) and reproduced identically across environments. The deployment pipeline enforces immutable image tags, environment-specific secrets, and mandatory security scanning before any image reaches production.

Primary Region

AWS ap-south-1

Mumbai. All production workloads. Multi-AZ active-active for API services. EKS, RDS PostgreSQL, ElastiCache Redis, S3, MSK Kafka, CloudWatch.

DR Region

Azure Central India

Pune. Passive standby for data lake and PostgreSQL (continuous replication). Activated on primary region failure. RTO 4h, RPO 1h.

IaC

Terraform + Helm

All infrastructure defined as code. Terraform manages cloud resources. Helm charts manage Kubernetes workloads. No manual console changes in production.

02 —

Cloud Topology

Internet / External

NBFC Lenders: Score API consumers

ERP Connector SDK: Distributor sites

WhatsApp Business API: Meta Cloud

Distributor Dashboard: Web browser

↕ AWS Shield Advanced · CloudFront · WAF · TLS 1.3 termination ↕

DMZ — Public Subnet (ap-south-1a/b/c)

AWS ALB: Layer 7 load balancer · TLS offload

API Gateway (Kong): JWT verify · rate limit · routing

NAT Gateway: Outbound-only for private subnets

VPN Gateway: mTLS ERP connector tunnels

↕ Internal VPC routing — no direct internet access ↕

Application Zone — Private Subnet (EKS cluster)

Score API pods: 3–20 replicas · HPA

Consent Service pods: 3–10 replicas · HPA

ERP Ingestion pods: 2–8 replicas · HPA

Notification pods: 2–6 replicas · HPA

Admin Service pods: 2 replicas · fixed

Audit Service pods: 2–4 replicas · HPA

↕ VPC peering / private endpoints — no public egress ↕

Data & ML Zone — Isolated Private Subnet

Amazon EMR / Spark: Feature pipeline · Bronze→Gold

Apache Airflow (MWAA): DAG orchestration

Scoring Engine pods: ML inference · GPU-optional

MLflow on EKS: Model registry · artifact store

↕ AWS PrivateLink / VPC endpoints ↕

Managed Storage (AWS Managed Services)

Amazon S3: Data Lake · Object Lock (WORM)

Amazon RDS PostgreSQL: Multi-AZ · Encrypted · RLS

Amazon ElastiCache Redis: Cluster mode · TLS · AUTH

Amazon MSK (Kafka): 3-broker · SASL · TLS

AWS KMS: CMKs · HSM-backed · per-distributor

03 —

Kubernetes Cluster Design

The production cluster runs on Amazon EKS v1.29+ across three availability zones (ap-south-1a, 1b, 1c). Three node groups separate compute concerns and enable independent scaling policies.

Node Group	Instance Type	Min/Max Nodes	Workloads	Taints
api-nodes	c6i.xlarge (4 vCPU, 8GB)	3 / 20	Score API, Consent, ERP Ingestion, Notification, Audit, Admin	workload=api:NoSchedule
ml-nodes	m6i.2xlarge (8 vCPU, 32GB)	2 / 10	Scoring Engine, MLflow, Feature Pipeline workers	workload=ml:NoSchedule
system-nodes	t3.medium (2 vCPU, 4GB)	3 / 3	Cluster DNS, metrics-server, cluster-autoscaler, cert-manager, Istio control plane	node-role=system:NoSchedule
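The taints in the table keep each workload class on its own node group: a pod can only land on a tainted node if it carries a matching toleration. The following is a minimal sketch of that matching rule, simplified to the `Equal` operator; the function names and dictionary layout are illustrative and not taken from the AltScore manifests.

```python
from typing import Dict, List

# Taints copied from the node group table above.
NODE_TAINTS: Dict[str, Dict[str, str]] = {
    "api-nodes": {"key": "workload", "value": "api", "effect": "NoSchedule"},
    "ml-nodes": {"key": "workload", "value": "ml", "effect": "NoSchedule"},
    "system-nodes": {"key": "node-role", "value": "system", "effect": "NoSchedule"},
}


def tolerates(tolerations: List[Dict[str, str]], taint: Dict[str, str]) -> bool:
    """Return True if any toleration matches the taint (Equal operator only).

    A toleration without an explicit effect matches every effect.
    """
    return any(
        t.get("key") == taint["key"]
        and t.get("value") == taint["value"]
        and t.get("effect", taint["effect"]) == taint["effect"]
        for t in tolerations
    )


def schedulable_groups(tolerations: List[Dict[str, str]]) -> List[str]:
    """List the node groups whose taint the given tolerations satisfy."""
    return sorted(
        group for group, taint in NODE_TAINTS.items() if tolerates(tolerations, taint)
    )


# Hypothetical toleration for a Score API pod.
score_api_tolerations = [{"key": "workload", "value": "api", "effect": "NoSchedule"}]
print(schedulable_groups(score_api_tolerations))
```

A toleration only permits scheduling onto a tainted node. Steering the pod to that node group still needs a nodeSelector or node affinity rule.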

Namespace Strategy

Namespace	Services	Network Policy	Resource Quota
`altscore-api`	Score API, Consent, ERP Ingestion, Notification, Admin	Deny all except from altscore-gateway + internal cross-namespace on port 8080	4 CPU / 8Gi RAM max
`altscore-ml`	Scoring Engine, MLflow, Guardrail Engine	Deny all external; allow only altscore-api inbound; S3/RDS via PrivateLink only	16 CPU / 64Gi RAM max
`altscore-data`	Airflow workers, dbt runner, pipeline utilities	No external inbound; EMR outbound via VPC peering; S3 via PrivateLink	8 CPU / 32Gi RAM max
`altscore-gateway`	Kong API Gateway, ingress controller	Accept inbound from ALB; egress to altscore-api only	2 CPU / 4Gi RAM max
`altscore-monitoring`	Prometheus, Grafana, AlertManager	Read-only scrape from all namespaces; no write access	2 CPU / 4Gi RAM max
`istio-system`	Istio control plane, sidecar injector	Managed by Istio; enforces mTLS between all pods	2 CPU / 4Gi RAM max
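As an illustration of the `altscore-api` row, the sketch below builds a Kubernetes NetworkPolicy manifest as a Python dictionary that admits ingress only from the `altscore-gateway` namespace on TCP 8080. The policy name is hypothetical, and the "internal cross-namespace" allowance is left out because the table does not say which namespaces it covers.

```python
import json


def api_ingress_policy() -> dict:
    """NetworkPolicy for altscore-api: gateway-only ingress on TCP 8080."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": "allow-gateway-only",  # hypothetical name
            "namespace": "altscore-api",
        },
        "spec": {
            # An empty podSelector selects every pod in the namespace, so any
            # ingress not listed below is denied for all of them.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                # Label that Kubernetes sets on every namespace.
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": "altscore-gateway"
                                }
                            }
                        }
                    ],
                    "ports": [{"protocol": "TCP", "port": 8080}],
                }
            ],
        },
    }


policy = api_ingress_policy()
print(json.dumps(policy, indent=2))
```

The JSON output is accepted by `kubectl apply -f` in the same way as YAML, so a manifest generated this way could live in the `altscore/k8s-config` repository alongside hand-written ones.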

Pod Security & mTLS

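Istio service mesh is deployed cluster-wide in STRICT mTLS mode, so all pod-to-pod communication is mutually authenticated and encrypted. Pod Security Admission (PodSecurityPolicy was removed in Kubernetes 1.25 and is not available on the v1.29+ cluster) enforces no privilege escalation, non-root user, and no host network/PID; a read-only root filesystem, which the Pod Security Standards do not cover, is required separately in each workload's securityContext. Secrets are mounted from AWS Secrets Manager via the Secrets Store CSI Driver and are never exposed as environment variables or ConfigMaps.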

04 —

Network Security Architecture

VPC Design

Subnet Type	CIDR	Routing
Public (DMZ)	10.0.0.0/24 per AZ	Internet Gateway; ALB, NAT GW only
Private App	10.0.10.0/23 per AZ	NAT GW for egress only; no direct inbound
Private Data/ML	10.0.20.0/23 per AZ	No NAT; PrivateLink only; no internet
DB Subnet Group	10.0.30.0/24 per AZ	No routing to internet; RDS/ElastiCache/MSK
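The table gives one CIDR per tier with the note "per AZ". One way to read that, assumed here, is that each tier takes three consecutive blocks of the stated size, one per availability zone. The sketch below expands the table under that assumption and checks that no two subnets overlap; the allocation order is illustrative.

```python
import ipaddress
from itertools import combinations
from typing import Dict, Tuple

AZS = ["ap-south-1a", "ap-south-1b", "ap-south-1c"]

# First block of each tier, as listed in the VPC design table.
TIERS = {
    "public": "10.0.0.0/24",
    "private-app": "10.0.10.0/23",
    "private-data-ml": "10.0.20.0/23",
    "db": "10.0.30.0/24",
}


def allocate() -> Dict[Tuple[str, str], ipaddress.IPv4Network]:
    """Assign consecutive blocks of the tier's size to each AZ (assumption)."""
    plan = {}
    for tier, first in TIERS.items():
        base = ipaddress.ip_network(first)
        for i, az in enumerate(AZS):
            start = base.network_address + i * base.num_addresses
            plan[(tier, az)] = ipaddress.ip_network(f"{start}/{base.prefixlen}")
    return plan


def has_overlap(plan: Dict[Tuple[str, str], ipaddress.IPv4Network]) -> bool:
    """True if any two subnets in the plan share addresses."""
    return any(a.overlaps(b) for a, b in combinations(plan.values(), 2))


plan = allocate()
for (tier, az), net in plan.items():
    print(f"{tier:16} {az:12} {net}")
print("overlap:", has_overlap(plan))
```

Under this reading the twelve subnets are disjoint and all sit inside 10.0.0.0/16, although the VPC CIDR itself is not stated in the source.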

Security Groups

Group	Inbound	Outbound
sg-alb	443 from 0.0.0.0/0	8080 to sg-gateway
sg-gateway	8080 from sg-alb	8080 to sg-api-pods
sg-api-pods	8080 from sg-gateway, sg-erp-vpn	8080 to sg-ml-pods, 5432 to sg-rds, 6379 to sg-redis, 9092 to sg-kafka
sg-ml-pods	8080 from sg-api-pods only	5432 to sg-rds, S3 via endpoint
sg-rds	5432 from sg-api-pods, sg-ml-pods only	None
sg-erp-vpn	443 from distributor IP allowlist	8080 to sg-api-pods
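The security group rules form a directed graph of permitted flows, which makes it easy to check claims such as "the database is reachable from the internet only through the ALB, gateway, and API pods". The sketch below encodes every flow mentioned in the table as an edge and runs a breadth-first search; `internet` and `distributor-allowlist` are labels introduced here for the external sources.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Permitted flows (source -> destinations) as read from the table.
FLOWS: Dict[str, Set[str]] = {
    "internet": {"sg-alb"},
    "distributor-allowlist": {"sg-erp-vpn"},
    "sg-alb": {"sg-gateway"},
    "sg-gateway": {"sg-api-pods"},
    "sg-erp-vpn": {"sg-api-pods"},
    "sg-api-pods": {"sg-ml-pods", "sg-rds", "sg-redis", "sg-kafka"},
    "sg-ml-pods": {"sg-rds"},
}


def shortest_path(src: str, dst: str) -> Optional[List[str]]:
    """Breadth-first search over FLOWS; returns None if dst is unreachable."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(FLOWS.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path("internet", "sg-rds"))
print(shortest_path("sg-rds", "internet"))
```

The first call shows that any route from the internet to `sg-rds` passes through three intermediate groups, and the second shows that the database group has no outbound route at all.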

TLS & Certificate Management

All external TLS is terminated at the ALB with ACM-managed certificates (auto-renewed). Internal pod-to-pod TLS is managed by Istio using SPIFFE/SPIRE identities; workload certificates rotate automatically every 24 hours. ERP connector mTLS uses per-distributor client certificates issued by AltScore's internal CA (cert-manager on K8s), rotated every 90 days. All certificates use SHA-256 signatures with RSA-4096 or ECDSA P-256 keys.

05 —

CI/CD Pipeline

Every service follows an identical pipeline enforced by GitHub Actions. Merges to main automatically deploy to staging; promotion to production requires manual approval from two engineers. No direct production access — all changes go through the pipeline.

1 · Commit

PR raised; branch protection enforced
Linting (ruff, black)
Unit tests (pytest ≥80% coverage)
SAST scan (Bandit, Semgrep)
Dependency audit (pip-audit)

2 · Build

Docker build (multi-stage)
Image scan: Trivy (CRITICAL/HIGH = fail)
SRI hash generated
Image tagged with git SHA (no latest)
Push to ECR (private, encrypted)
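The "no latest" rule in the build stage can be checked mechanically before a deploy. The following is a small sketch of such a check, assuming tags are lowercase hexadecimal git SHAs of 7 to 40 characters; the function name and the example registry host are hypothetical, and digest references (`name@sha256:...`) are deliberately not handled.

```python
import re

# A tag is accepted only if it looks like a (possibly abbreviated) git SHA.
_SHA_TAG = re.compile(r"[0-9a-f]{7,40}")


def is_sha_pinned(image_ref: str) -> bool:
    """True if the image reference ends in a tag that looks like a git SHA."""
    _name, sep, tag = image_ref.rpartition(":")
    # No colon means no tag; a slash after the last colon means the colon
    # belonged to a registry port (e.g. localhost:5000/app), not a tag.
    if not sep or "/" in tag:
        return False
    return _SHA_TAG.fullmatch(tag) is not None


refs = [
    "registry.example/score-api:latest",
    "registry.example/score-api:" + "a1b2c3d4" * 5,
    "localhost:5000/score-api",
]
for ref in refs:
    print(ref, is_sha_pinned(ref))
```

A check like this could run as a pipeline step against the rendered Helm values, failing the job when any image reference is mutable.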

3 · Staging Deploy

Auto-deploy on merge to main
Helm upgrade with SHA-pinned image
Smoke tests run against staging
Integration tests: consent, score, ingest
Load test: 50 concurrent lenders

4 · Approval Gate

Two-engineer approval required
Security sign-off for infra changes
DPO sign-off for data pipeline changes
Change record auto-created
Rollback plan documented

5 · Production

Helm upgrade (rolling, zero-downtime)
Readiness gates: health checks pass
Canary: 10% traffic → 100% over 15min
Auto-rollback on error rate >1%
Slack + PagerDuty notify on deploy
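The canary and auto-rollback rules in the production stage can be sketched as a weight schedule plus a guard on the observed error rate. The source only states the end points (10% to 100% over 15 minutes) and the 1% rollback threshold; the linear ramp and the number of steps below are assumptions.

```python
from typing import List, Tuple


def canary_schedule(
    start_pct: int = 10, end_pct: int = 100, duration_min: int = 15, steps: int = 3
) -> List[Tuple[float, int]]:
    """Return (minute, traffic %) pairs ramping linearly from start to end."""
    return [
        (
            duration_min * i / steps,
            round(start_pct + (end_pct - start_pct) * i / steps),
        )
        for i in range(steps + 1)
    ]


def next_weight(
    schedule: List[Tuple[float, int]],
    elapsed_min: float,
    error_rate: float,
    threshold: float = 0.01,
) -> int:
    """Traffic % the canary should receive now; 0 means roll back."""
    if error_rate > threshold:
        return 0
    weight = schedule[0][1]
    for minute, pct in schedule:
        if elapsed_min >= minute:
            weight = pct
    return weight


schedule = canary_schedule()
print(schedule)
print(next_weight(schedule, elapsed_min=7, error_rate=0.002))
print(next_weight(schedule, elapsed_min=7, error_rate=0.02))
```

With three steps the canary holds 10%, 40%, and 70% for five minutes each before taking all traffic, and any reading above the 1% threshold returns a weight of zero.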

ML Model Deployment

ML model promotion follows a separate path governed by the Guardrail Committee process (GRD-ALTSCORE-001, Section 10). Model artifacts are stored in MLflow on S3 with SHA-256 hash verification. The Scoring Engine pulls the active model at startup — it never bakes model weights into a container image. Model promotion is triggered via the Admin Service (two-person sign-off), which updates a model pointer in the database; the Scoring Engine reloads the new artifact within 60 seconds without a container restart.
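The hash check and pointer-driven reload described above might look like the following sketch. The pointer layout (`version`, `path`, `sha256`) and the class name are assumptions made for illustration; the source does not describe the database schema or the Scoring Engine internals.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Compare the computed digest with the expected one in constant time."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())


class ModelHolder:
    """Tracks the active model and switches only after the hash verifies."""

    def __init__(self) -> None:
        self.version: Optional[str] = None
        self.path: Optional[Path] = None

    def maybe_reload(self, pointer: dict) -> bool:
        """Apply a model pointer; returns True if the active model changed."""
        if pointer["version"] == self.version:
            return False
        artifact = Path(pointer["path"])
        if not verify_artifact(artifact, pointer["sha256"]):
            # Keep serving the current model if the new artifact is corrupt.
            raise ValueError(f"hash mismatch for model {pointer['version']}")
        self.version, self.path = pointer["version"], artifact
        return True
```

Calling `maybe_reload` from a loop that polls the pointer at an interval shorter than 60 seconds would meet the reload window stated above without restarting the container.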

GitOps Repository Structure

Repository	Contents	Deployment Target
`altscore/platform`	Terraform modules: VPC, EKS, RDS, ElastiCache, MSK, S3, KMS, IAM	AWS infrastructure
`altscore/services`	Service source code + Dockerfile + Helm chart per service	EKS via Helm
`altscore/ml`	Training code, feature engineering, model evaluation notebooks	MLflow registry
`altscore/data-pipelines`	Airflow DAGs, dbt models, Spark jobs, schema definitions	MWAA + EMR
`altscore/k8s-config`	Helm values per environment, NetworkPolicy, RBAC, PodSecurity manifests	EKS via ArgoCD

06 —

Environment Strategy

Three fully isolated environments — Development, Staging, Production — each in separate AWS accounts (AWS Organizations). No shared resources; no prod credentials accessible from dev or staging. Staging mirrors production configuration exactly (same instance types, same security policies) to prevent "works in staging" surprises.

Development (dev)

AWS account: dedicated dev account
EKS: single-AZ, smaller instances (t3.large)
RDS: single instance, non-HA
Data: synthetic retailer data only (no real ERP)
No real NBFC partner credentials
Feature flags: all experimental features ON
Cost optimised: cluster auto-scales to zero overnight
Access: all engineers (role-based)

Staging (staging)

AWS account: dedicated staging account
EKS: multi-AZ, production-identical instance types
RDS: Multi-AZ, same config as prod
Data: anonymised production-shaped dataset
NBFC partner sandbox credentials only
Integration tests + load tests run here
Used for NBFC partner UAT
Access: engineers + DPO + NBFC sandbox users

Production (prod)

AWS account: dedicated prod account
EKS: multi-AZ, full HA, HPA enabled
RDS: Multi-AZ, automated backups, PITR
Data: real retailer data under DPDP consent
Live NBFC partner credentials
Two-engineer approval required for any change
No developer SSH/kubectl exec access by default
Access: break-glass via FIDO2 + audit trail

Environment Promotion Flow

Code flows dev → staging → production only. No hot-patches to production without staging validation except for P1 incidents (break-glass procedure with post-incident review required). Environment-specific secrets are managed in AWS Secrets Manager per account; no secrets in code repositories.

07 —

Data Storage Deployment

Store	Service	Config	Encryption	Backup / Retention
PostgreSQL	Amazon RDS PostgreSQL 15	db.r6g.large, Multi-AZ, 500GB gp3	AES-256 (AWS KMS CMK)	Automated daily snapshots; 35-day PITR; 7-year audit log partition on S3
Redis	Amazon ElastiCache Redis 7	cache.r6g.large, cluster mode (3 shards × 2 replicas)	TLS in-transit; AES-256 at-rest	No persistent backup (cache); reconstructed from DB on failure
Kafka	Amazon MSK (Apache Kafka 3.5)	3 brokers, kafka.m5.large, 1TB per broker	TLS in-transit; EBS AES-256 at-rest	Topic-specific retention (7 days to indefinite per event catalogue)
Data Lake	Amazon S3 + Object Lock	Standard + Intelligent-Tiering	SSE-KMS per-distributor CMK; S3 Object Lock (WORM) for audit + Bronze	Versioning enabled; Bronze WORM 7yr; Silver 2yr+5yr archive; Gold 3yr
Secrets	AWS Secrets Manager + KMS	HSM-backed CMK; per-secret rotation	AES-256; HSM-backed for API signing keys	Secret versions retained 30 days post-rotation
Container Registry	Amazon ECR (private)	Per-service repository; image scanning on push	AES-256 at-rest	Untagged images purged after 90 days; tagged retained indefinitely

08 —

Observability & Monitoring

Metrics

Prometheus + Grafana

Prometheus scrapes all pods via ServiceMonitor CRDs. Grafana dashboards cover Score API latency and error rate, consent check SLA, pipeline run times, guardrail trigger rates, and ML drift (PSI). Alert rules fire when p95 latency exceeds its threshold and on error-rate spikes.

Logging

Structured JSON → AWS CloudWatch

All services log structured JSON (no plaintext). Fluentd DaemonSet ships logs to CloudWatch Logs. Log groups per service, per environment. 90-day hot retention; 7-year cold on S3. PII fields redacted at log emission.
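Redaction at log emission can be handled in the log formatter, so PII never reaches Fluentd or CloudWatch. The sketch below uses the standard `logging` module; the list of PII field names and the `fields` key passed through `extra` are assumptions, since the source does not name the fields.

```python
import json
import logging

# Assumed PII field names; the real list would come from the data catalogue.
PII_FIELDS = {"phone", "email", "pan", "aadhaar"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, masking PII fields."""

    def format(self, record: logging.LogRecord) -> str:
        fields = getattr(record, "fields", {})
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "fields": {
                key: "[REDACTED]" if key in PII_FIELDS else value
                for key, value in fields.items()
            },
        }
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("score-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields travel in `extra`; the phone number is masked on output.
logger.info(
    "score served",
    extra={"fields": {"retailer_id": "R-1", "phone": "9999999999"}},
)
```

Masking by field name only protects structured fields. PII embedded in the free-text message would need a separate check.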

Tracing

OpenTelemetry + AWS X-Ray

Distributed tracing across Score API → Consent → Scoring Engine → Audit. Trace IDs propagated via W3C headers. Sampling: 1% baseline, 100% for error traces. Used for latency root-cause analysis.
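The sampling policy can be sketched as a function of the W3C `traceparent` header and whether the request failed. This is an illustration rather than the OpenTelemetry SDK's sampler: it parses the header by hand and makes the 1% decision from the low 64 bits of the trace ID, so every service reaches the same verdict for a given trace. Keeping all error traces implies the decision is made after the outcome is known (tail-based), which is an assumption about how the 100% rule is applied.

```python
import re
from typing import Dict, Optional

# traceparent = version "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header: str) -> Optional[Dict[str, str]]:
    """Parse a W3C traceparent header; returns None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def keep_trace(traceparent: str, had_error: bool, ratio: float = 0.01) -> bool:
    """Keep every error trace, plus a deterministic `ratio` share of the rest."""
    parsed = parse_traceparent(traceparent)
    if parsed is None:
        return False
    if had_error:
        return True
    low_64_bits = int(parsed["trace_id"][16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(keep_trace(header, had_error=False))
print(keep_trace(header, had_error=True))
```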

Key Alerts

Alert	Condition	Severity	Response
Score API error rate high	5xx rate > 1% over 5 minutes	P1	PagerDuty; auto-rollback if post-deploy
Consent Service unreachable	Health check fails for > 30s	P1	PagerDuty; circuit breaker activates
Score API latency degraded	p95 > 2s for 5 minutes	P2	Slack alert; on-call investigation
Pipeline SLA breach	Gold layer not updated in >6h	P2	Slack + PagerDuty; pipeline engineer notified
Guardrail triggered	Any GR-1 through GR-6 escalation	P2/P3	Risk team Slack channel; case auto-created
Certificate expiry	Any cert < 14 days remaining	P3	Slack alert; cert-manager auto-renews
Kafka consumer lag	altscore.audit consumer lag > 10K messages	P3	Audit Service scaling investigation
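In production these conditions would be expressed as Prometheus alert rules. To make the two Score API thresholds concrete, the sketch below evaluates them over one five-minute window of samples in plain Python. The function names are illustrative, and the nearest-rank percentile is one of several common p95 definitions.

```python
from typing import List, Sequence, Tuple


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def evaluate_window(
    status_codes: Sequence[int], latencies_s: Sequence[float]
) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs that fire for one 5-minute window."""
    alerts = []
    if status_codes:
        error_rate = sum(1 for code in status_codes if code >= 500) / len(status_codes)
        if error_rate > 0.01:  # 5xx rate > 1%
            alerts.append(("Score API error rate high", "P1"))
    if latencies_s and p95(latencies_s) > 2.0:  # p95 > 2s
        alerts.append(("Score API latency degraded", "P2"))
    return alerts


# 3 errors in 200 requests is 1.5%, which crosses the P1 threshold.
codes = [200] * 197 + [503] * 3
latencies = [0.2] * 200
print(evaluate_window(codes, latencies))
```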

09 —

Disaster Recovery & Backup

RTO / RPO Targets

Recovery Objectives

Score API RTO: 4 hours (AWS region failure)
Score API RPO: 1 hour (max data loss)
Consent Service RTO: 4 hours
Data Lake RPO: 24 hours (S3 cross-region replication)
Model rollback RTO: <10 minutes (same region)
PostgreSQL PITR: any point in last 35 days

DR Architecture

Active-Passive to Azure Central India

PostgreSQL: continuous logical replication to Azure Database for PostgreSQL
S3 Data Lake: cross-cloud replication to Azure Blob Storage (ADLS Gen2); native S3 Cross-Region Replication targets only other S3 buckets, so this runs as a separate copy job
Kafka: MSK topic mirroring to Azure Event Hubs
EKS workloads: Terraform templates ready; deploy-from-scratch in <2h
DNS failover: Route 53 health checks → CNAME switch to Azure ALB
DR drill: quarterly test (RTO/RPO verified)

Backup Schedule

Resource	Backup Frequency	Retention	Restore Tested
RDS PostgreSQL	Continuous (PITR) + daily snapshot	35 days PITR; 7 years snapshots for audit tables	Monthly
S3 Bronze layer	Object Lock WORM (no backup needed)	7 years	Quarterly DR drill
S3 Silver/Gold layers	Versioning + cross-region replication	2yr active + 5yr archive	Quarterly DR drill
MLflow model artifacts	WORM on S3 + ECR tag	Indefinite (model lineage)	On every model rollback test
Kafka topics	MSK MirrorMaker 2 (continuous)	Per topic retention policy	Semi-annual
Secrets Manager	Automatic version history	30 days per version	Monthly

10 —

Capacity Planning

Workload	Phase 01 (Pilot)	Phase 03 (10 Districts)	Phase 05 (National)	Scaling Mechanism
Active retailers	10,000	250,000	2,000,000	Data lake partitioning by distributor
NBFC lenders	3	15	50	Kong rate-limit config; RLS policy per lender
Score API req/day	5,000	150,000	2,000,000	HPA on Score API pods (CPU + RPS)
ERP batches/day	30 distributors	500 distributors	5,000 distributors	Spark cluster auto-scales (EMR managed)
Nightly scoring run	10K retailers / 10min	250K / 2h	2M / 6h	ML node group auto-scale; EMR spot instances
Kafka throughput	~100 events/s	~2,000 events/s	~20,000 events/s	MSK partition increase; consumer group scale-out
PostgreSQL storage	50 GB	500 GB	5 TB	RDS storage auto-scaling; table partitioning by month
S3 data lake	100 GB	5 TB	100 TB	S3 Intelligent-Tiering; Glacier archive for Bronze >2yr
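The table's daily and nightly figures translate into sustained rates that are easier to size against. The short calculation below derives average Score API requests per second and nightly scoring throughput per minute from the numbers above; these are averages only, and peak load is not given in the source.

```python
# (retailers scored, minutes allowed) per phase, from the nightly scoring row.
NIGHTLY_RUN = {
    "Phase 01": (10_000, 10),
    "Phase 03": (250_000, 120),
    "Phase 05": (2_000_000, 360),
}

# Score API requests per day, from the table.
API_REQ_PER_DAY = {
    "Phase 01": 5_000,
    "Phase 03": 150_000,
    "Phase 05": 2_000_000,
}

SECONDS_PER_DAY = 86_400


def retailers_per_minute(retailers: int, minutes: int) -> float:
    """Sustained scoring rate needed to finish the run in the stated time."""
    return retailers / minutes


def average_rps(requests_per_day: int) -> float:
    """Mean request rate if traffic were spread evenly over 24 hours."""
    return requests_per_day / SECONDS_PER_DAY


for phase in NIGHTLY_RUN:
    rate = retailers_per_minute(*NIGHTLY_RUN[phase])
    rps = average_rps(API_REQ_PER_DAY[phase])
    print(f"{phase}: {rate:,.0f} retailers/min, {rps:.2f} avg req/s")
```

The nightly run has to sustain roughly 1,000, 2,100, and 5,600 retailers per minute across the three phases, so scoring throughput must grow about 5.6× while the retailer count grows 200×, the difference being absorbed by the longer batch window.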

All capacity estimates assume a 2× headroom buffer over peak measured load. The cluster autoscaler and HPA are configured to scale before hitting 70% utilization on any node group, ensuring buffer for traffic spikes without service degradation.
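The 70% figure can be read as the HPA utilisation target. Kubernetes documents the core scaling rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); the sketch below applies it with the Score API bounds of 3 to 20 replicas from the topology section. Treating 70% as the target value is an assumption, and the real controller also applies a tolerance band and stabilisation windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 0.70,
    min_replicas: int = 3,
    max_replicas: int = 20,
) -> int:
    """HPA core formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# 5 pods at 90% CPU against a 70% target.
print(desired_replicas(5, 0.90))
# 18 pods at 95% CPU would ask for more than the 20-replica ceiling.
print(desired_replicas(18, 0.95))
```

Once the ceiling of 20 replicas is hit, further load has to be absorbed by the 2× headroom or by raising the maximum, which is why the replica bounds and the capacity table need to be revisited together at each phase.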