AltScore deploys on AWS ap-south-1 (Mumbai) as the primary cloud region, with Azure Central India as a disaster-recovery target. All data — in transit, at rest, and in computation — remains within India's geographic boundaries to comply with the DPDP Act 2023 data localisation requirements.
DA-ALTSCORE-001 · Deployment Charter
The infrastructure is entirely containerised (Docker) and orchestrated via Amazon EKS. There are no pet servers: every compute resource is defined in code (Terraform) and reproduced identically across environments. The deployment pipeline enforces immutable image tags, environment-specific secrets, and mandatory security scanning before any image reaches production.
Primary region (AWS ap-south-1, Mumbai): all production workloads. Multi-AZ active-active for API services. EKS, RDS PostgreSQL, ElastiCache Redis, S3, MSK Kafka, CloudWatch.
DR region (Azure Central India, Pune): passive standby for the data lake and PostgreSQL (continuous replication). Activated on primary region failure. RTO 4h, RPO 1h.
All infrastructure defined as code. Terraform manages cloud resources. Helm charts manage Kubernetes workloads. No manual console changes in production.
The production cluster runs on Amazon EKS (Kubernetes v1.29+) across three availability zones (ap-south-1a, ap-south-1b, ap-south-1c). Three node groups separate compute concerns and enable independent scaling policies.
| Node Group | Instance Type | Min/Max Nodes | Workloads | Taints |
|---|---|---|---|---|
| api-nodes | c6i.xlarge (4 vCPU, 8GB) | 3 / 20 | Score API, Consent, ERP Ingestion, Notification, Audit, Admin | workload=api:NoSchedule |
| ml-nodes | m6i.2xlarge (8 vCPU, 32GB) | 2 / 10 | Scoring Engine, MLflow, Feature Pipeline workers | workload=ml:NoSchedule |
| system-nodes | t3.medium (2 vCPU, 4GB) | 3 / 3 | Cluster DNS, metrics-server, cluster-autoscaler, cert-manager, Istio control plane | node-role=system:NoSchedule |
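The taints in the table keep workloads off node groups they do not explicitly tolerate. The sketch below models that matching in simplified form: it handles only exact key/value/effect matches (real Kubernetes tolerations also support the `Exists` operator), and the class and function names are illustrative, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str
    value: str
    effect: str = "NoSchedule"


# Taints copied from the node group table.
NODE_GROUP_TAINTS = {
    "api-nodes": [Taint("workload", "api")],
    "ml-nodes": [Taint("workload", "ml")],
    "system-nodes": [Taint("node-role", "system")],
}


def can_schedule(tolerations, node_group):
    """Return True if every taint on the node group is tolerated."""
    return all(
        any(
            t.key == tol.key and t.value == tol.value and t.effect == tol.effect
            for tol in tolerations
        )
        for t in NODE_GROUP_TAINTS[node_group]
    )


def eligible_groups(tolerations):
    """List the node groups a pod with these tolerations may land on."""
    return sorted(g for g in NODE_GROUP_TAINTS if can_schedule(tolerations, g))


scoring_engine = [Toleration("workload", "ml")]
print(eligible_groups(scoring_engine))
print(eligible_groups([]))  # a pod with no tolerations fits none of the groups
```

A toleration only permits scheduling onto a tainted group; pinning a workload to its group would additionally need a node selector or affinity rule, which this sketch does not model.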
| Namespace | Services | Network Policy | Resource Quota |
|---|---|---|---|
| altscore-api | Score API, Consent, ERP Ingestion, Notification, Admin | Deny all except from altscore-gateway + internal cross-namespace on port 8080 | 4 CPU / 8Gi RAM max |
| altscore-ml | Scoring Engine, MLflow, Guardrail Engine | Deny all external; allow only altscore-api inbound; S3/RDS via PrivateLink only | 16 CPU / 64Gi RAM max |
| altscore-data | Airflow workers, dbt runner, pipeline utilities | No external inbound; EMR outbound via VPC peering; S3 via PrivateLink | 8 CPU / 32Gi RAM max |
| altscore-gateway | Kong API Gateway, ingress controller | Accept inbound from ALB; egress to altscore-api only | 2 CPU / 4Gi RAM max |
| altscore-monitoring | Prometheus, Grafana, AlertManager | Read-only scrape from all namespaces; no write access | 2 CPU / 4Gi RAM max |
| istio-system | Istio control plane, sidecar injector | Managed by Istio; enforces mTLS between all pods | 2 CPU / 4Gi RAM max |
The Istio service mesh is deployed cluster-wide in STRICT mTLS mode, so all pod-to-pod communication is mutually authenticated and encrypted. Pod security is enforced at admission (Pod Security Admission on this cluster, since PodSecurityPolicy was removed in Kubernetes 1.25 and is unavailable on v1.29+): no privilege escalation, read-only root filesystem, non-root user, no host network/PID. Secrets are injected via the AWS Secrets Manager provider for the Secrets Store CSI Driver, never as environment variables or ConfigMaps.
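The four pod hardening rules can be expressed as a check over a pod spec. The following is a minimal sketch, assuming the spec is available as a plain dictionary using the standard Kubernetes field names; the function name is hypothetical, and it inspects only container-level `securityContext` (a pod-level `runAsNonRoot` would need extra handling).

```python
def pod_security_violations(pod_spec: dict) -> list[str]:
    """Return human-readable violations of the four hardening rules."""
    violations = []

    # Host namespaces must not be shared with the pod.
    if pod_spec.get("hostNetwork", False):
        violations.append("hostNetwork must be false")
    if pod_spec.get("hostPID", False):
        violations.append("hostPID must be false")

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        # Each setting must be stated explicitly; a missing field counts as a violation.
        if ctx.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if ctx.get("readOnlyRootFilesystem") is not True:
            violations.append(f"{name}: readOnlyRootFilesystem must be true")
        if ctx.get("runAsNonRoot") is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
    return violations


compliant = {
    "containers": [
        {
            "name": "score-api",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "readOnlyRootFilesystem": True,
                "runAsNonRoot": True,
            },
        }
    ]
}
print(pod_security_violations(compliant))
```

A check like this could run in CI against rendered Helm manifests so that violations surface before the admission controller rejects the pod.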
| Subnet Type | CIDR | Routing |
|---|---|---|
| Public (DMZ) | 10.0.0.0/24 per AZ | Internet Gateway; ALB, NAT GW only |
| Private App | 10.0.10.0/23 per AZ | NAT GW for egress only; no direct inbound |
| Private Data/ML | 10.0.20.0/23 per AZ | No NAT; PrivateLink only; no internet |
| DB Subnet Group | 10.0.30.0/24 per AZ | No routing to internet; RDS/ElastiCache/MSK |
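A quick way to sanity-check a subnet plan like this one is to confirm that no two tiers overlap. The sketch below uses Python's standard `ipaddress` module and treats the CIDRs exactly as listed in the table; how each block is repeated or subdivided per AZ is not specified here, so that part is left out.

```python
import ipaddress
from itertools import combinations

# CIDRs transcribed from the subnet table.
SUBNETS = {
    "Public (DMZ)": "10.0.0.0/24",
    "Private App": "10.0.10.0/23",
    "Private Data/ML": "10.0.20.0/23",
    "DB Subnet Group": "10.0.30.0/24",
}


def overlapping_pairs(subnets):
    """Return every pair of subnet names whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


def address_counts(subnets):
    """Return the raw number of addresses in each block."""
    return {
        name: ipaddress.ip_network(cidr).num_addresses
        for name, cidr in subnets.items()
    }


print(overlapping_pairs(SUBNETS))
print(address_counts(SUBNETS))
```

`ip_network` raises `ValueError` when a CIDR has host bits set, so the same script also catches misaligned blocks such as a `/23` starting on an odd third octet.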
| Group | Inbound | Outbound |
|---|---|---|
| sg-alb | 443 from 0.0.0.0/0 | 8080 to sg-gateway |
| sg-gateway | 8080 from sg-alb | 8080 to sg-api-pods |
| sg-api-pods | 8080 from sg-gateway | 5432 to sg-rds, 6379 to sg-redis, 9092 to sg-kafka |
| sg-ml-pods | 8080 from sg-api-pods only | 5432 to sg-rds, S3 via endpoint |
| sg-rds | 5432 from sg-api-pods, sg-ml-pods only | None |
| sg-erp-vpn | 443 from distributor IP allowlist | 8080 to sg-api-pods |
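The layering in the security group table can be checked by treating the Outbound column as a directed graph and asking which groups can reach which. This is an illustrative sketch only: it transcribes group-to-group outbound rules, ignores ports and the S3 endpoint entry, and the function name is hypothetical.

```python
from collections import deque

# Group-to-group edges transcribed from the Outbound column.
OUTBOUND = {
    "sg-alb": {"sg-gateway"},
    "sg-gateway": {"sg-api-pods"},
    "sg-api-pods": {"sg-rds", "sg-redis", "sg-kafka"},
    "sg-ml-pods": {"sg-rds"},
    "sg-rds": set(),
    "sg-erp-vpn": {"sg-api-pods"},
}


def shortest_path(src, dst):
    """Breadth-first search over outbound rules; None if dst is unreachable."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(OUTBOUND.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Internet-facing traffic reaches the database only through two intermediate tiers.
print(shortest_path("sg-alb", "sg-rds"))
# The database group has no outbound rules, so nothing is reachable from it.
print(shortest_path("sg-rds", "sg-alb"))
```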
All external TLS is terminated at the ALB with ACM-managed certificates (auto-renewed). Internal pod-to-pod TLS is managed by Istio using SPIFFE/SPIRE identities — certificates rotate every 24 hours automatically. ERP connector mTLS uses per-distributor client certificates issued by AltScore's internal CA (cert-manager on K8s), rotated every 90 days. All certificates are SHA-256/RSA-4096 or EC P-256.
Every service follows an identical pipeline enforced by GitHub Actions. Merges to main automatically deploy to staging; promotion to production requires manual approval from two engineers. No direct production access — all changes go through the pipeline.
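One pipeline gate implied here is rejecting mutable image references before deployment. The sketch below is an assumption-laden illustration: it accepts references pinned by `sha256` digest or tagged with a git commit SHA (a hypothetical tagging convention, not confirmed by this document), and the registry hostname is a placeholder.

```python
import re

# A reference pinned by content digest is immutable by construction.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")
# Hypothetical convention: tags are 7 to 40 character git commit SHAs.
SHA_TAG_RE = re.compile(r":[0-9a-f]{7,40}$")


def is_immutable_reference(image: str) -> bool:
    """Accept digest-pinned or commit-SHA-tagged images; reject everything else."""
    return bool(DIGEST_RE.search(image) or SHA_TAG_RE.search(image))


candidates = [
    "registry.example/altscore/score-api:3f2a9c1",
    "registry.example/altscore/score-api:latest",
    "registry.example/altscore/score-api",
    "registry.example/altscore/score-api@sha256:" + "a" * 64,
]
for image in candidates:
    print(image, is_immutable_reference(image))
```

Tag immutability can also be enforced on the registry side; a check like this one simply fails the build earlier and with a clearer message.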
ML model promotion follows a separate path governed by the Guardrail Committee process (GRD-ALTSCORE-001, Section 10). Model artifacts are stored in MLflow on S3 with SHA-256 hash verification. The Scoring Engine pulls the active model at startup; it never bakes model weights into a container image. Model promotion is triggered via the Admin Service (two-person sign-off), which updates a model pointer in the database; the Scoring Engine reloads the new artifact within 60 seconds without a container restart.
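The pointer-driven reload with hash verification could look roughly like the sketch below. Everything here is hypothetical scaffolding: `read_pointer` and `fetch_artifact` stand in for the database lookup and the S3 download, and the class name is illustrative.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


class ModelHolder:
    """Keeps the active model in memory and swaps it when the pointer changes."""

    def __init__(self, read_pointer, fetch_artifact):
        self._read_pointer = read_pointer      # returns (version, expected_sha256)
        self._fetch_artifact = fetch_artifact  # returns artifact bytes for a version
        self.version = None
        self.artifact = None

    def refresh(self) -> bool:
        """Reload if the pointer moved; return True when a new model was loaded."""
        version, expected = self._read_pointer()
        if version == self.version:
            return False
        data = self._fetch_artifact(version)
        if sha256_hex(data) != expected:
            # Keep serving the previous model rather than load a corrupt artifact.
            raise ValueError(f"hash mismatch for model {version}")
        self.version, self.artifact = version, data
        return True


# In-memory stand-ins for the database pointer and the artifact store.
store = {"v1": b"weights-1", "v2": b"weights-2"}
pointer = {"active": "v1"}
holder = ModelHolder(
    read_pointer=lambda: (pointer["active"], sha256_hex(store[pointer["active"]])),
    fetch_artifact=store.__getitem__,
)
print(holder.refresh(), holder.version)
```

Calling `refresh()` on a timer shorter than 60 seconds would satisfy the reload window described above without restarting the container.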
| Repository | Contents | Deployment Target |
|---|---|---|
| altscore/platform | Terraform modules: VPC, EKS, RDS, ElastiCache, MSK, S3, KMS, IAM | AWS infrastructure |
| altscore/services | Service source code + Dockerfile + Helm chart per service | EKS via Helm |
| altscore/ml | Training code, feature engineering, model evaluation notebooks | MLflow registry |
| altscore/data-pipelines | Airflow DAGs, dbt models, Spark jobs, schema definitions | MWAA + EMR |
| altscore/k8s-config | Helm values per environment, NetworkPolicy, RBAC, PodSecurity manifests | EKS via ArgoCD |
Three fully isolated environments — Development, Staging, Production — each in separate AWS accounts (AWS Organizations). No shared resources; no prod credentials accessible from dev or staging. Staging mirrors production configuration exactly (same instance types, same security policies) to prevent "works in staging" surprises.
Code flows dev → staging → production only. No hot-patches to production without staging validation except for P1 incidents (break-glass procedure with post-incident review required). Environment-specific secrets are managed in AWS Secrets Manager per account; no secrets in code repositories.
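The promotion rule is small enough to state as a function. This is a minimal sketch with hypothetical names; the break-glass flag stands in for whatever approval mechanism the P1 procedure actually uses.

```python
ORDER = ["dev", "staging", "production"]


def promotion_allowed(source: str, target: str, *, p1_break_glass: bool = False) -> bool:
    """Allow only single-step forward promotion, except break-glass into production."""
    if target == "production" and p1_break_glass:
        # P1 incidents may skip staging validation; a post-incident review is still required.
        return True
    return ORDER.index(target) - ORDER.index(source) == 1


print(promotion_allowed("dev", "staging"))
print(promotion_allowed("dev", "production"))
print(promotion_allowed("dev", "production", p1_break_glass=True))
```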
| Store | Service | Config | Encryption | Backup / Retention |
|---|---|---|---|---|
| PostgreSQL | Amazon RDS PostgreSQL 15 | db.r6g.large, Multi-AZ, 500GB gp3 | AES-256 (AWS KMS CMK) | Automated daily snapshots; 35-day PITR; 7-year audit log partition on S3 |
| Redis | Amazon ElastiCache Redis 7 | cache.r6g.large, cluster mode (3 shards × 2 replicas) | TLS in-transit; AES-256 at-rest | No persistent backup (cache); reconstructed from DB on failure |
| Kafka | Amazon MSK 3.5 | 3 broker, kafka.m5.large, 1TB per broker | TLS in-transit; EBS AES-256 at-rest | Topic-specific retention (7 days to indefinite per event catalogue) |
| Data Lake | Amazon S3 + Object Lock | Standard + Intelligent-Tiering | SSE-KMS per-distributor CMK; S3 Object Lock (WORM) for audit + Bronze | Versioning enabled; Bronze WORM 7yr; Silver 2yr+5yr archive; Gold 3yr |
| Secrets | AWS Secrets Manager + KMS | HSM-backed CMK; per-secret rotation | AES-256; HSM-backed for API signing keys | Secret versions retained 30 days post-rotation |
| Container Registry | Amazon ECR (private) | Per-service repository; image scanning on push | AES-256 at-rest | Untagged images purged after 90 days; tagged retained indefinitely |
Prometheus scrapes all pods via ServiceMonitor CRDs. Grafana dashboards cover Score API latency and error rate, consent check SLA, pipeline run times, guardrail trigger rates, and ML drift (PSI). Alert rules fire when p95 latency exceeds its threshold or the error rate spikes.
All services log structured JSON (no plaintext). Fluentd DaemonSet ships logs to CloudWatch Logs. Log groups per service, per environment. 90-day hot retention; 7-year cold on S3. PII fields redacted at log emission.
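Redaction at emission means the sensitive value never reaches the log shipper. A minimal sketch using Python's standard `logging` module follows; the PII field names and the `fields` attribute convention are assumptions made for illustration, not the platform's actual schema.

```python
import json
import logging

# Hypothetical set of field names treated as PII.
PII_FIELDS = {"phone", "pan", "email"}


def redact(fields: dict) -> dict:
    """Replace the value of any PII field with a fixed marker."""
    return {k: "[REDACTED]" if k in PII_FIELDS else v for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, redacting PII before serialisation."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields are passed via logger.info(..., extra={"fields": {...}}).
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("score-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("score computed", extra={"fields": {"retailer_id": "r-1", "phone": "9999999999"}})
```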
Distributed tracing covers the Score API → Consent → Scoring Engine → Audit path. Trace IDs are propagated via W3C Trace Context headers. Sampling: 1% baseline, 100% for error traces. Traces are used for latency root-cause analysis.
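The sampling policy can be sketched as a deterministic function of the trace ID, so that every service reaches the same keep-or-drop decision for a given trace. The hashing scheme below is an assumption for illustration; the `traceparent` layout (version, trace ID, parent ID, flags) follows the W3C Trace Context format.

```python
import re

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent ID, 2-hex flags.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_trace_id(traceparent: str):
    """Extract the trace ID from a traceparent header, or None if malformed."""
    match = TRACEPARENT_RE.match(traceparent)
    return match.group(2) if match else None


def should_keep(trace_id: str, is_error: bool, baseline: float = 0.01) -> bool:
    """Keep every error trace; keep a fixed fraction of the rest."""
    if is_error:
        return True
    # Map the last 32 bits of the trace ID onto [0, 1) and compare with the baseline.
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < baseline


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id = parse_trace_id(header)
print(trace_id, should_keep(trace_id, is_error=False), should_keep(trace_id, is_error=True))
```

Keeping all error traces implies the decision is made after the outcome is known (tail-based sampling), which typically happens in a collector rather than in each service.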
| Alert | Condition | Severity | Response |
|---|---|---|---|
| Score API error rate high | 5xx rate > 1% over 5 minutes | P1 | PagerDuty; auto-rollback if post-deploy |
| Consent Service unreachable | Health check fails for > 30s | P1 | PagerDuty; circuit breaker activates |
| Score API latency degraded | p95 > 2s for 5 minutes | P2 | Slack alert; on-call investigation |
| Pipeline SLA breach | Gold layer not updated in >6h | P2 | Slack + PagerDuty; pipeline engineer notified |
| Guardrail triggered | Any GR-1 through GR-6 escalation | P2/P3 | Risk team Slack channel; case auto-created |
| Certificate expiry | Any cert < 14 days remaining | P3 | Slack alert; cert-manager auto-renews |
| Kafka consumer lag | altscore.audit consumer lag > 10K messages | P3 | Audit Service scaling investigation |
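Several of the conditions in the table reduce to simple threshold comparisons. The sketch below evaluates four of them against a single metrics snapshot; the metric key names are hypothetical, and a real rule would also require the condition to hold for the stated duration rather than at one instant.

```python
# (alert name, severity, predicate over a metrics snapshot); thresholds from the table.
RULES = [
    (
        "Score API error rate high",
        "P1",
        lambda m: m["requests_5m"] > 0
        and m["errors_5xx_5m"] / m["requests_5m"] > 0.01,
    ),
    ("Score API latency degraded", "P2", lambda m: m["p95_latency_s_5m"] > 2.0),
    ("Certificate expiry", "P3", lambda m: m["min_cert_days_remaining"] < 14),
    ("Kafka consumer lag", "P3", lambda m: m["audit_consumer_lag"] > 10_000),
]


def firing(metrics: dict) -> list[tuple[str, str]]:
    """Return (name, severity) for every rule whose condition is met."""
    return [(name, severity) for name, severity, pred in RULES if pred(metrics)]


snapshot = {
    "requests_5m": 10_000,
    "errors_5xx_5m": 150,          # 1.5% error rate
    "p95_latency_s_5m": 0.8,
    "min_cert_days_remaining": 30,
    "audit_consumer_lag": 12_500,
}
print(firing(snapshot))
```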
| Resource | Backup Frequency | Retention | Restore Tested |
|---|---|---|---|
| RDS PostgreSQL | Continuous (PITR) + daily snapshot | 35 days PITR; 7 years snapshots for audit tables | Monthly |
| S3 Bronze layer | Object Lock WORM (no backup needed) | 7 years | Quarterly DR drill |
| S3 Silver/Gold layers | Versioning + cross-region replication | 2yr active + 5yr archive | Quarterly DR drill |
| MLflow model artifacts | WORM on S3 + ECR tag | Indefinite (model lineage) | On every model rollback test |
| Kafka topics | MSK MirrorMaker 2 (continuous) | Per topic retention policy | Semi-annual |
| Secrets Manager | Automatic version history | 30 days per version | Monthly |
| Workload | Phase 01 (Pilot) | Phase 03 (10 Districts) | Phase 05 (National) | Scaling Mechanism |
|---|---|---|---|---|
| Active retailers | 10,000 | 250,000 | 2,000,000 | Data lake partitioning by distributor |
| NBFC lenders | 3 | 15 | 50 | Kong rate-limit config; RLS policy per lender |
| Score API req/day | 5,000 | 150,000 | 2,000,000 | HPA on Score API pods (CPU + RPS) |
| ERP batches/day | 30 distributors | 500 distributors | 5,000 distributors | Spark cluster auto-scales (EMR managed) |
| Nightly scoring run | 10K retailers / 10min | 250K / 2h | 2M / 6h | ML node group auto-scale; EMR spot instances |
| Kafka throughput | ~100 events/s | ~2,000 events/s | ~20,000 events/s | MSK partition increase; consumer group scale-out |
| PostgreSQL storage | 50 GB | 500 GB | 5 TB | RDS storage auto-scaling; table partitioning by month |
| S3 data lake | 100 GB | 5 TB | 100 TB | S3 Intelligent-Tiering; Glacier archive for Bronze >2yr |
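The daily and nightly figures in the table are easier to compare once converted to per-second rates. The short calculation below does that conversion using only values from the table; note that these are averages, and peak rates will be higher.

```python
SECONDS_PER_DAY = 86_400

# Score API requests per day, by phase (from the table).
SCORE_API_REQ_PER_DAY = {"Phase 01": 5_000, "Phase 03": 150_000, "Phase 05": 2_000_000}

# Nightly scoring run: (retailers, allowed duration in seconds), by phase.
NIGHTLY_RUN = {
    "Phase 01": (10_000, 10 * 60),
    "Phase 03": (250_000, 2 * 3600),
    "Phase 05": (2_000_000, 6 * 3600),
}


def avg_rps(requests_per_day: int) -> float:
    """Average requests per second implied by a daily total."""
    return requests_per_day / SECONDS_PER_DAY


def scoring_rate(retailers: int, seconds: int) -> float:
    """Retailers that must be scored per second to finish within the window."""
    return retailers / seconds


for phase in SCORE_API_REQ_PER_DAY:
    api = avg_rps(SCORE_API_REQ_PER_DAY[phase])
    batch = scoring_rate(*NIGHTLY_RUN[phase])
    print(f"{phase}: API avg {api:.2f} req/s, nightly run {batch:.2f} retailers/s")
```

At Phase 05 the API averages roughly 23 requests per second, while the nightly run needs about 93 retailers scored per second, so the batch path rather than the API sets the larger sustained throughput requirement.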
All capacity estimates assume a 2× headroom buffer over peak measured load. The cluster autoscaler and HPA are configured to scale before hitting 70% utilization on any node group, ensuring buffer for traffic spikes without service degradation.
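The scale-out behaviour can be illustrated with the replica calculation used by the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas × currentMetric / targetMetric)`. The sketch below assumes the 70% figure is used as the utilisation target and omits the HPA's tolerance band and stabilisation windows; the replica bounds are placeholder defaults.

```python
import math


def desired_replicas(
    current: int,
    utilization_pct: int,
    target_pct: int = 70,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """HPA-style replica count, clamped to the configured bounds."""
    raw = math.ceil(current * utilization_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


def provisioned_capacity(peak_load: float, headroom: float = 2.0) -> float:
    """Capacity to provision given measured peak load and the headroom multiplier."""
    return peak_load * headroom


# Four pods running at 90% utilisation against a 70% target scale out to six.
print(desired_replicas(4, 90))
# Four pods at 35% scale in to two.
print(desired_replicas(4, 35))
```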