AltScore Platform · Architecture
Deployment Architecture
DOC-ID: DA-ALTSCORE-001
Version: 1.0 · May 2026
Classification: INTERNAL — Engineering & DevOps
Owner: Platform Engineering, AltScore Technologies
Data Residency: India-Only. AWS ap-south-1 (Mumbai) primary · Azure Central India DR · No cross-border data transfer
Container Orchestration: K8s + EKS. Amazon EKS · 3 node groups · Auto-scaling · Namespace-per-service
Environment Strategy: 3 Envs. Development · Staging (prod-mirror) · Production · Isolated VPCs
Score API SLA: 99.9%. RTO: 4h · RPO: 1h · Multi-AZ active-active
01 —

Overview & Cloud Strategy

AltScore deploys on AWS ap-south-1 (Mumbai) as the primary cloud region, with Azure Central India as a disaster-recovery target. All data — in transit, at rest, and in computation — remains within India's geographic boundaries to comply with the DPDP Act 2023 data localisation requirements.

DA-ALTSCORE-001 · Deployment Charter

The infrastructure is entirely containerised (Docker) and orchestrated via Amazon EKS. No pet servers — every compute resource is defined in code (Terraform) and reproduced identically across environments. The deployment pipeline enforces immutable image tags, environment-specific secrets, and mandatory security scanning before any image reaches production.
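
The immutable-tag rule lends itself to a mechanical check before any deploy. The following is a minimal sketch, not the pipeline's actual gate: the function name and the example repository path are hypothetical, and it assumes tags are full 40-character git commit SHAs (the charter does not say whether short SHAs are also accepted).

```python
import re

# Assumption: images are tagged with the full 40-character lowercase git SHA.
_SHA_TAG = re.compile(r"^[0-9a-f]{40}$")


def is_sha_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends in a full git-SHA tag.

    Untagged references, mutable tags such as "latest", and digest
    references ("@sha256:...") are all rejected by this sketch.
    """
    # Look only at the last path segment so that a registry port
    # ("localhost:5000/app") is not mistaken for a tag separator.
    name = image_ref.rsplit("/", 1)[-1]
    if ":" not in name:
        return False  # an untagged reference resolves to "latest"
    tag = name.rsplit(":", 1)[1]
    return bool(_SHA_TAG.match(tag))
```

A deploy step could call `is_sha_pinned` on every image in the rendered Helm values and fail the job on the first `False`.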

Primary Region

AWS ap-south-1

Mumbai. All production workloads. Multi-AZ active-active for API services. EKS, RDS PostgreSQL, ElastiCache Redis, S3, MSK Kafka, CloudWatch.

DR Region

Azure Central India

Pune. Passive standby for data lake and PostgreSQL (continuous replication). Activated on primary region failure. RTO 4h, RPO 1h.

IaC

Terraform + Helm

All infrastructure defined as code. Terraform manages cloud resources. Helm charts manage Kubernetes workloads. No manual console changes in production.

02 —

Cloud Topology

Internet / External
  • NBFC Lenders: Score API consumers
  • ERP Connector SDK: Distributor sites
  • WhatsApp Business API: Meta Cloud
  • Distributor Dashboard: Web browser
↕ AWS Shield Advanced · CloudFront · WAF · TLS 1.3 termination ↕
DMZ: Public Subnet (ap-south-1a/b/c)
  • AWS ALB: Layer 7 load balancer · TLS offload
  • API Gateway (Kong): JWT verify · rate limit · routing
  • NAT Gateway: Outbound-only for private subnets
  • VPN Gateway: mTLS ERP connector tunnels
↕ Internal VPC routing (no direct internet access) ↕
Application Zone: Private Subnet (EKS cluster)
  • Score API pods: 3–20 replicas · HPA
  • Consent Service pods: 3–10 replicas · HPA
  • ERP Ingestion pods: 2–8 replicas · HPA
  • Notification pods: 2–6 replicas · HPA
  • Admin Service pods: 2 replicas · fixed
  • Audit Service pods: 2–4 replicas · HPA
↕ VPC peering / private endpoints (no public egress) ↕
Data & ML Zone: Isolated Private Subnet
  • Amazon EMR / Spark: Feature pipeline · Bronze→Gold
  • Apache Airflow (MWAA): DAG orchestration
  • Scoring Engine pods: ML inference · GPU-optional
  • MLflow on EKS: Model registry · artifact store
↕ AWS PrivateLink / VPC endpoints ↕
Managed Storage (AWS Managed Services)
  • Amazon S3: Data Lake · Object Lock (WORM)
  • Amazon RDS PostgreSQL: Multi-AZ · Encrypted · RLS
  • Amazon ElastiCache Redis: Cluster mode · TLS · AUTH
  • Amazon MSK (Kafka): 3-broker · SASL · TLS
  • AWS KMS: CMKs · HSM-backed · per-distributor
03 —

Kubernetes Cluster Design

The production cluster runs on Amazon EKS with Kubernetes v1.29 or later, spread across three availability zones (ap-south-1a, 1b, 1c). Three node groups separate compute concerns and enable independent scaling policies.

| Node Group | Instance Type | Min/Max Nodes | Workloads | Taints |
| --- | --- | --- | --- | --- |
| api-nodes | c6i.xlarge (4 vCPU, 8 GB) | 3 / 20 | Score API, Consent, ERP Ingestion, Notification, Audit, Admin | workload=api:NoSchedule |
| ml-nodes | m6i.2xlarge (8 vCPU, 32 GB) | 2 / 10 | Scoring Engine, MLflow, Feature Pipeline workers | workload=ml:NoSchedule |
| system-nodes | t3.medium (2 vCPU, 4 GB) | 3 / 3 | Cluster DNS, metrics-server, cluster-autoscaler, cert-manager, Istio control plane | node-role=system:NoSchedule |
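
The taints in the table keep a pod off a node group unless the pod carries a matching toleration. The sketch below is a simplified model of that matching rule, written for illustration only; the class and function names are hypothetical, and it considers only the `NoSchedule` effect used in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rule."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with "Exists" tolerates every taint.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(tolerations, taints) -> bool:
    """True when every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect == "NoSchedule"
    )
```

Note that a toleration only permits scheduling onto a tainted node; steering a workload to its intended node group additionally requires a node selector or node affinity.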

Namespace Strategy

| Namespace | Services | Network Policy | Resource Quota |
| --- | --- | --- | --- |
| altscore-api | Score API, Consent, ERP Ingestion, Notification, Admin | Deny all except from altscore-gateway + internal cross-namespace on port 8080 | 4 CPU / 8Gi RAM max |
| altscore-ml | Scoring Engine, MLflow, Guardrail Engine | Deny all external; allow only altscore-api inbound; S3/RDS via PrivateLink only | 16 CPU / 64Gi RAM max |
| altscore-data | Airflow workers, dbt runner, pipeline utilities | No external inbound; EMR outbound via VPC peering; S3 via PrivateLink | 8 CPU / 32Gi RAM max |
| altscore-gateway | Kong API Gateway, ingress controller | Accept inbound from ALB; egress to altscore-api only | 2 CPU / 4Gi RAM max |
| altscore-monitoring | Prometheus, Grafana, AlertManager | Read-only scrape from all namespaces; no write access | 2 CPU / 4Gi RAM max |
| istio-system | Istio control plane, sidecar injector | Managed by Istio; enforces mTLS between all pods | 2 CPU / 4Gi RAM max |

Pod Security & mTLS

Istio service mesh is deployed cluster-wide in STRICT mTLS mode: all pod-to-pod communication is mutually authenticated and encrypted. Pod Security Admission enforces no privilege escalation, non-root user, and no host network/PID (PodSecurityPolicy was removed in Kubernetes 1.25, so it is not available on EKS v1.29+). The read-only root filesystem requirement is not part of the built-in Pod Security Standards and therefore needs an additional admission control. Secrets are mounted as files via the AWS Secrets Manager CSI driver, never passed as environment variables or ConfigMaps.
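
The pod-level rules above can be expressed as a small manifest check. This is a sketch under stated assumptions rather than the cluster's actual admission logic: the function name is hypothetical, the input is a pod `spec` already parsed into a dictionary, and init containers are not inspected. The field names (`hostNetwork`, `hostPID`, `allowPrivilegeEscalation`, `readOnlyRootFilesystem`, `runAsNonRoot`) are standard Kubernetes PodSpec fields.

```python
def pod_security_violations(pod_spec: dict) -> list:
    """Return human-readable violations of the pod rules listed above."""
    violations = []
    if pod_spec.get("hostNetwork"):
        violations.append("hostNetwork must be disabled")
    if pod_spec.get("hostPID"):
        violations.append("hostPID must be disabled")

    pod_ctx = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        # An unset field counts as a violation: the rule requires an explicit value.
        if ctx.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if ctx.get("readOnlyRootFilesystem") is not True:
            violations.append(f"{name}: readOnlyRootFilesystem must be true")
        # runAsNonRoot may be set on the container or inherited from the pod.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if non_root is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
    return violations
```

A check like this could run in CI against rendered Helm templates so that non-compliant manifests fail before they reach the cluster's admission controls.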

04 —

Network Security Architecture

VPC Design

| Subnet Type | CIDR | Routing |
| --- | --- | --- |
| Public (DMZ) | 10.0.0.0/24 per AZ | Internet Gateway; ALB, NAT GW only |
| Private App | 10.0.10.0/23 per AZ | NAT GW for egress only; no direct inbound |
| Private Data/ML | 10.0.20.0/23 per AZ | No NAT; PrivateLink only; no internet |
| DB Subnet Group | 10.0.30.0/24 per AZ | No routing to internet; RDS/ElastiCache/MSK |
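
A quick way to sanity-check a subnet plan like this is to confirm that the blocks do not overlap and to see how many addresses each one leaves for workloads. The sketch below uses Python's standard `ipaddress` module; the dictionary keys are hypothetical labels, and the CIDRs are taken from the table as the blocks of a single AZ (the table does not list the per-AZ allocations). AWS reserves five IPv4 addresses in every subnet, which the helper subtracts.

```python
import ipaddress
from itertools import combinations

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5

# Assumption: one AZ's blocks, transcribed from the table above.
SUBNETS = {
    "public-dmz": "10.0.0.0/24",
    "private-app": "10.0.10.0/23",
    "private-data-ml": "10.0.20.0/23",
    "db": "10.0.30.0/24",
}


def usable_hosts(cidr: str) -> int:
    """Addresses left for workloads after AWS's per-subnet reservation."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def overlapping_pairs(subnets: dict) -> list:
    """Return every pair of subnet names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])
    ]
```

Under these assumptions each /24 leaves 251 usable addresses and each /23 leaves 507, which matters for the application subnets because every EKS pod consumes a VPC address when the default VPC CNI is used.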

Security Groups

| Group | Inbound | Outbound |
| --- | --- | --- |
| sg-alb | 443 from 0.0.0.0/0 | 8080 to sg-gateway |
| sg-gateway | 8080 from sg-alb | 8080 to sg-api-pods |
| sg-api-pods | 8080 from sg-gateway | 5432 to sg-rds, 6379 to sg-redis, 9092 to sg-kafka |
| sg-ml-pods | 8080 from sg-api-pods only | 5432 to sg-rds, S3 via endpoint |
| sg-rds | 5432 from sg-api-pods, sg-ml-pods only | None |
| sg-erp-vpn | 443 from distributor IP allowlist | 8080 to sg-api-pods |
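
The layering in the security group table can be illustrated by treating the outbound column as a directed graph and counting hops. This is a rough sketch only: the `OUTBOUND` mapping is a hand transcription of that one column (the S3 endpoint is omitted because it is not a security group), and the function name is hypothetical.

```python
from collections import deque
from typing import Optional

# Outbound rules transcribed from the table: group -> [(destination, port)].
OUTBOUND = {
    "sg-alb": [("sg-gateway", 8080)],
    "sg-gateway": [("sg-api-pods", 8080)],
    "sg-api-pods": [("sg-rds", 5432), ("sg-redis", 6379), ("sg-kafka", 9092)],
    "sg-ml-pods": [("sg-rds", 5432)],
    "sg-rds": [],
    "sg-erp-vpn": [("sg-api-pods", 8080)],
}


def hops_between(src: str, dst: str) -> Optional[int]:
    """Breadth-first search: fewest outbound hops from src to dst, or None."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        group, hops = queue.popleft()
        if group == dst:
            return hops
        for nxt, _port in OUTBOUND.get(group, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None
```

With this transcription the internet-facing `sg-alb` reaches `sg-rds` only through the gateway and the API pods (three hops), and `sg-rds` can initiate no connections at all.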

TLS & Certificate Management

All external TLS is terminated at the ALB with ACM-managed certificates (auto-renewed). Internal pod-to-pod TLS is managed by Istio using SPIFFE/SPIRE identities; these certificates rotate automatically every 24 hours. ERP connector mTLS uses per-distributor client certificates issued by AltScore's internal CA (cert-manager on K8s), rotated every 90 days. All certificates use SHA-256 signatures with RSA-4096 or ECDSA P-256 keys.

05 —

CI/CD Pipeline

Every service follows an identical pipeline enforced by GitHub Actions. Merges to main automatically deploy to staging; promotion to production requires manual approval from two engineers. No direct production access — all changes go through the pipeline.

1 · Commit
  • PR raised; branch protection enforced
  • Linting (ruff, black)
  • Unit tests (pytest ≥80% coverage)
  • SAST scan (Bandit, Semgrep)
  • Dependency audit (pip-audit)
2 · Build
  • Docker build (multi-stage)
  • Image scan: Trivy (CRITICAL/HIGH = fail)
  • SRI hash generated
  • Image tagged with git SHA (no latest)
  • Push to ECR (private, encrypted)
3 · Staging Deploy
  • Auto-deploy on merge to main
  • Helm upgrade with SHA-pinned image
  • Smoke tests run against staging
  • Integration tests: consent, score, ingest
  • Load test: 50 concurrent lenders
4 · Approval Gate
  • Two-engineer approval required
  • Security sign-off for infra changes
  • DPO sign-off for data pipeline changes
  • Change record auto-created
  • Rollback plan documented
5 · Production
  • Helm upgrade (rolling, zero-downtime)
  • Readiness gates: health checks pass
  • Canary: 10% traffic → 100% over 15min
  • Auto-rollback on error rate >1%
  • Slack + PagerDuty notify on deploy
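
The production stage's canary ramp and rollback trigger reduce to two small calculations. The sketch below is illustrative only: the function names are hypothetical, and it assumes the shift from 10% to 100% is linear over the 15 minutes, which the pipeline description does not specify.

```python
def canary_weight(elapsed_min: float, start_pct: float = 10.0,
                  ramp_min: float = 15.0) -> float:
    """Percentage of traffic on the new version after elapsed_min minutes.

    Assumption: a linear ramp from start_pct to 100% over ramp_min.
    """
    if elapsed_min <= 0:
        return start_pct
    if elapsed_min >= ramp_min:
        return 100.0
    return start_pct + (100.0 - start_pct) * elapsed_min / ramp_min


def should_rollback(errors: int, requests: int,
                    threshold: float = 0.01) -> bool:
    """True when the observed error rate is strictly above the 1% threshold."""
    if requests == 0:
        return False  # no traffic yet, nothing to judge
    return errors / requests > threshold
```

For example, halfway through the ramp `canary_weight(7.5)` gives 55.0, and 11 errors in 1,000 requests (1.1%) would trigger a rollback while exactly 10 in 1,000 would not.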

ML Model Deployment

ML model promotion follows a separate path governed by the Guardrail Committee process (GRD-ALTSCORE-001, Section 10). Model artifacts are stored in MLflow on S3 with SHA-256 hash verification. The Scoring Engine pulls the active model at startup — it never bakes model weights into a container image. Model promotion is triggered via the Admin Service (two-person sign-off), which updates a model pointer in the database; the Scoring Engine reloads the new artifact within 60 seconds without a container restart.
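
The pull-verify-swap behaviour described above might look roughly like the following. This is a sketch, not the Scoring Engine's code: `ModelHolder`, `maybe_reload`, and the `fetch` callable are hypothetical names, and the polling loop that would call `maybe_reload` (at an interval short enough to meet the 60-second target) is left out.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of an artifact."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Compare the artifact's digest with the recorded one in constant time."""
    return hmac.compare_digest(sha256_hex(data), expected_sha256.lower())


class ModelHolder:
    """Holds the active model and swaps it when the pointer changes."""

    def __init__(self):
        self.version = None
        self.model_bytes = None

    def maybe_reload(self, pointer_version, fetch, expected_sha256) -> bool:
        """Reload if the pointer moved; return True when a swap happened.

        fetch is a hypothetical callable: version -> artifact bytes.
        """
        if pointer_version == self.version:
            return False
        data = fetch(pointer_version)
        if not verify_artifact(data, expected_sha256):
            # Keep serving the current model rather than load a bad artifact.
            raise ValueError(f"hash mismatch for model {pointer_version}")
        self.model_bytes, self.version = data, pointer_version
        return True
```

Verifying before swapping means a corrupted or tampered artifact never replaces the model that is currently serving traffic.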

GitOps Repository Structure

| Repository | Contents | Deployment Target |
| --- | --- | --- |
| altscore/platform | Terraform modules: VPC, EKS, RDS, ElastiCache, MSK, S3, KMS, IAM | AWS infrastructure |
| altscore/services | Service source code + Dockerfile + Helm chart per service | EKS via Helm |
| altscore/ml | Training code, feature engineering, model evaluation notebooks | MLflow registry |
| altscore/data-pipelines | Airflow DAGs, dbt models, Spark jobs, schema definitions | MWAA + EMR |
| altscore/k8s-config | Helm values per environment, NetworkPolicy, RBAC, PodSecurity manifests | EKS via ArgoCD |
06 —

Environment Strategy

The three environments (Development, Staging, Production) are fully isolated, each in its own AWS account under AWS Organizations. No resources are shared, and no prod credentials are accessible from dev or staging. Staging mirrors production configuration exactly (same instance types, same security policies) to prevent "works in staging" surprises.

Development (dev)
  • AWS account: dedicated dev account
  • EKS: single-AZ, smaller instances (t3.large)
  • RDS: single instance, non-HA
  • Data: synthetic retailer data only (no real ERP)
  • No real NBFC partner credentials
  • Feature flags: all experimental features ON
  • Cost optimised: cluster auto-scales to zero overnight
  • Access: all engineers (role-based)
Staging (staging)
  • AWS account: dedicated staging account
  • EKS: multi-AZ, production-identical instance types
  • RDS: Multi-AZ, same config as prod
  • Data: anonymised production-shaped dataset
  • NBFC partner sandbox credentials only
  • Integration tests + load tests run here
  • Used for NBFC partner UAT
  • Access: engineers + DPO + NBFC sandbox users
Production (prod)
  • AWS account: dedicated prod account
  • EKS: multi-AZ, full HA, HPA enabled
  • RDS: Multi-AZ, automated backups, PITR
  • Data: real retailer data under DPDP consent
  • Live NBFC partner credentials
  • Two-engineer approval required for any change
  • No developer SSH/kubectl exec access by default
  • Access: break-glass via FIDO2 + audit trail

Environment Promotion Flow

Code flows dev → staging → production only. No hot-patches to production without staging validation except for P1 incidents (break-glass procedure with post-incident review required). Environment-specific secrets are managed in AWS Secrets Manager per account; no secrets in code repositories.

07 —

Data Storage Deployment

| Store | Service | Config | Encryption | Backup / Retention |
| --- | --- | --- | --- | --- |
| PostgreSQL | Amazon RDS PostgreSQL 15 | db.r6g.large, Multi-AZ, 500GB gp3 | AES-256 (AWS KMS CMK) | Automated daily snapshots; 35-day PITR; 7-year audit log partition on S3 |
| Redis | Amazon ElastiCache Redis 7 | cache.r6g.large, cluster mode (3 shards × 2 replicas) | TLS in-transit; AES-256 at-rest | No persistent backup (cache); reconstructed from DB on failure |
| Kafka | Amazon MSK 3.5 | 3 broker, kafka.m5.large, 1TB per broker | TLS in-transit; EBS AES-256 at-rest | Topic-specific retention (7 days to indefinite per event catalogue) |
| Data Lake | Amazon S3 + Object Lock | Standard + Intelligent-Tiering | SSE-KMS per-distributor CMK; S3 Object Lock (WORM) for audit + Bronze | Versioning enabled; Bronze WORM 7yr; Silver 2yr+5yr archive; Gold 3yr |
| Secrets | AWS Secrets Manager + KMS | HSM-backed CMK; per-secret rotation | AES-256; HSM-backed for API signing keys | Secret versions retained 30 days post-rotation |
| Container Registry | Amazon ECR (private) | Per-service repository; image scanning on push | AES-256 at-rest | Untagged images purged after 90 days; tagged retained indefinitely |
08 —

Observability & Monitoring

Metrics

Prometheus + Grafana

Prometheus scrapes all pods via ServiceMonitor CRDs. Grafana dashboards cover Score API latency and error rate, consent check SLA, pipeline run times, guardrail trigger rates, and ML drift (PSI). Alert rules fire when p95 latency exceeds its threshold or the error rate spikes.
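
For readers unfamiliar with the drift metric, the Population Stability Index compares a binned baseline distribution with the current one: PSI = Σ (actual − expected) · ln(actual / expected) over all bins. The sketch below is a generic implementation of that standard formula, not AltScore's drift job; the function name and the `eps` floor for empty bins are assumptions.

```python
import math


def psi(expected, actual, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    expected and actual are per-bin proportions (each summing to 1).
    eps is an assumed floor that keeps empty bins from producing log(0).
    """
    if len(expected) != len(actual):
        raise ValueError("distributions must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions give a PSI of 0, and the value grows as the score or feature distribution moves away from the baseline; each term is non-negative, so shifts in different bins cannot cancel out.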

Logging

Structured JSON → AWS CloudWatch

All services log structured JSON (no plaintext). Fluentd DaemonSet ships logs to CloudWatch Logs. Log groups per service, per environment. 90-day hot retention; 7-year cold on S3. PII fields redacted at log emission.
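
Redaction at emission time can be handled in the log formatter itself, so that PII never reaches Fluentd or CloudWatch. The following is a minimal sketch using Python's standard `logging` module; the formatter name, the `fields` convention, and the list of PII keys are all hypothetical.

```python
import json
import logging

# Hypothetical list of keys treated as PII.
PII_FIELDS = frozenset({"phone", "email", "pan", "gstin"})


class RedactingJsonFormatter(logging.Formatter):
    """Render each record as one JSON object, masking PII values."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields are passed via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in PII_FIELDS else value
        return json.dumps(payload, sort_keys=True)
```

A service would attach the formatter to its stream handler and log with `logger.info("score requested", extra={"fields": {"retailer_id": "r-1", "phone": "..."}})`; only the masked value is written.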

Tracing

OpenTelemetry + AWS X-Ray

Distributed tracing across Score API → Consent → Scoring Engine → Audit. Trace IDs propagated via W3C headers. Sampling: 1% baseline, 100% for error traces. Used for latency root-cause analysis.
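
The W3C `traceparent` header has the form `version-traceid-parentid-flags`, and the sampling policy above can be made deterministic by deriving the decision from the trace ID. The sketch below is illustrative, not the deployed OpenTelemetry configuration: the function names are hypothetical, the parser accepts only the four-field layout, and it assumes the error status is known when the keep/drop decision is made.

```python
import re

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Parse a W3C traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def keep_trace(trace_id: str, is_error: bool, baseline: float = 0.01) -> bool:
    """Keep every error trace, and a fixed fraction of the rest.

    The low 64 bits of the trace ID act as a uniform bucket, so every
    service that sees the same trace ID reaches the same decision.
    """
    if is_error:
        return True
    bucket = int(trace_id[-16:], 16)
    return bucket < baseline * 2**64
```

Because the decision depends only on the trace ID, Score API, Consent, Scoring Engine, and Audit would all keep or drop the same traces without coordinating.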

Key Alerts

| Alert | Condition | Severity | Response |
| --- | --- | --- | --- |
| Score API error rate high | 5xx rate > 1% over 5 minutes | P1 | PagerDuty; auto-rollback if post-deploy |
| Consent Service unreachable | Health check fails for > 30s | P1 | PagerDuty; circuit breaker activates |
| Score API latency degraded | p95 > 2s for 5 minutes | P2 | Slack alert; on-call investigation |
| Pipeline SLA breach | Gold layer not updated in >6h | P2 | Slack + PagerDuty; pipeline engineer notified |
| Guardrail triggered | Any GR-1 through GR-6 escalation | P2/P3 | Risk team Slack channel; case auto-created |
| Certificate expiry | Any cert < 14 days remaining | P3 | Slack alert; cert-manager auto-renews |
| Kafka consumer lag | altscore.audit consumer lag > 10K messages | P3 | Audit Service scaling investigation |
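
To make the latency condition concrete, the sketch below evaluates "p95 > 2s" over one window of request latencies using the nearest-rank percentile. It is an illustration only: in the deployed stack this would be a Prometheus alert rule, the function names are hypothetical, and the requirement that the condition hold for five minutes is not modelled.

```python
def percentile(values, pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def latency_degraded(latencies_s, threshold_s: float = 2.0) -> bool:
    """True when the window's p95 latency exceeds the threshold."""
    return percentile(latencies_s, 95) > threshold_s
```

With 100 samples the p95 is the 95th smallest value, so six slow requests out of 100 trip the condition while five do not.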
09 —

Disaster Recovery & Backup

RTO / RPO Targets

Recovery Objectives

  • Score API RTO: 4 hours (AWS region failure)
  • Score API RPO: 1 hour (max data loss)
  • Consent Service RTO: 4 hours
  • Data Lake RPO: 24 hours (bounded by data lake replication to the DR target)
  • Model rollback RTO: <10 minutes (same region)
  • PostgreSQL PITR: any point in last 35 days
DR Architecture

Active-Passive to Azure Central India

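  • PostgreSQL: continuous replication to Azure Database for PostgreSQL (in practice logical replication, as RDS does not expose physical streaming replication to external targets)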
  • S3 Data Lake: replicated to Azure Blob Storage (ADLS Gen2) by a cross-cloud copy process; native S3 Cross-Region Replication only targets other S3 buckets
  • Kafka: MSK topic mirroring to Azure Event Hubs
  • EKS workloads: Terraform templates ready; deploy-from-scratch in <2h
  • DNS failover: Route 53 health checks → CNAME switch to the Azure load balancer endpoint
  • DR drill: quarterly test (RTO/RPO verified)
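
A DR drill or a monitoring job can test the RPO targets by comparing replication lag with the objective for each workload. The sketch below is a hypothetical helper: the dictionary keys and function name are illustrative, and the timestamp of the last replicated change is assumed to come from whatever replication tooling is in place.

```python
from datetime import datetime, timedelta, timezone

# RPO targets taken from the recovery objectives above.
RPO = {
    "score-api": timedelta(hours=1),
    "data-lake": timedelta(hours=24),
}


def rpo_breached(workload: str, last_replicated_at: datetime,
                 now: datetime) -> bool:
    """True when replication lag exceeds the workload's RPO target."""
    return now - last_replicated_at > RPO[workload]
```

For example, a replica that is 90 minutes behind breaches the Score API target but is well within the data lake's 24-hour objective.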

Backup Schedule

| Resource | Backup Frequency | Retention | Restore Tested |
| --- | --- | --- | --- |
| RDS PostgreSQL | Continuous (PITR) + daily snapshot | 35 days PITR; 7 years snapshots for audit tables | Monthly |
| S3 Bronze layer | Object Lock WORM (no backup needed) | 7 years | Quarterly DR drill |
| S3 Silver/Gold layers | Versioning + cross-region replication | 2yr active + 5yr archive | Quarterly DR drill |
| MLflow model artifacts | WORM on S3 + ECR tag | Indefinite (model lineage) | On every model rollback test |
| Kafka topics | MSK MirrorMaker 2 (continuous) | Per topic retention policy | Semi-annual |
| Secrets Manager | Automatic version history | 30 days per version | Monthly |
10 —

Capacity Planning

| Workload | Phase 01 (Pilot) | Phase 03 (10 Districts) | Phase 05 (National) | Scaling Mechanism |
| --- | --- | --- | --- | --- |
| Active retailers | 10,000 | 250,000 | 2,000,000 | Data lake partitioning by distributor |
| NBFC lenders | 3 | 15 | 50 | Kong rate-limit config; RLS policy per lender |
| Score API req/day | 5,000 | 150,000 | 2,000,000 | HPA on Score API pods (CPU + RPS) |
| ERP batches/day | 30 distributors | 500 distributors | 5,000 distributors | Spark cluster auto-scales (EMR managed) |
| Nightly scoring run | 10K retailers / 10min | 250K / 2h | 2M / 6h | ML node group auto-scale; EMR spot instances |
| Kafka throughput | ~100 events/s | ~2,000 events/s | ~20,000 events/s | MSK partition increase; consumer group scale-out |
| PostgreSQL storage | 50 GB | 500 GB | 5 TB | RDS storage auto-scaling; table partitioning by month |
| S3 data lake | 100 GB | 5 TB | 100 TB | S3 Intelligent-Tiering; Glacier archive for Bronze >2yr |

All capacity estimates assume a 2× headroom buffer over peak measured load. The cluster autoscaler and HPA are configured to scale before hitting 70% utilization on any node group, ensuring buffer for traffic spikes without service degradation.
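
The headroom and utilisation rules translate into a simple replica estimate. The arithmetic below is a worked sketch, not a sizing commitment: the function name is hypothetical, and the peak-to-average ratio (10×) and per-pod throughput (50 requests/s) used in the example are assumed figures, not measurements from this document.

```python
import math

SECONDS_PER_DAY = 86_400


def replicas_needed(req_per_day: int, peak_to_avg: float, per_pod_rps: float,
                    headroom: float = 2.0, target_util: float = 0.70) -> int:
    """Pods needed to serve peak load with 2x headroom at 70% utilisation."""
    avg_rps = req_per_day / SECONDS_PER_DAY
    provisioned_rps = avg_rps * peak_to_avg * headroom
    return math.ceil(provisioned_rps / (per_pod_rps * target_util))
```

Under those assumptions, Phase 05's 2,000,000 requests/day averages about 23 requests/s, which becomes roughly 463 requests/s after the peak factor and 2× headroom; at 35 effective requests/s per pod that is 14 replicas, inside the Score API's 3–20 range. Phase 01's 5,000 requests/day needs only one pod by this arithmetic, so the HPA minimum of 3 governs.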