| scenario_id | scenario_name | scenario_type | what_this_demonstrates | finding_type | primary_tier | secondary_tier | action_category | specific_change | savings_monthly_usd | current_monthly_usd | projected_monthly_usd | sla_availability_preserved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | Chronic Underutilization | single_tier_negative | A compute fleet that has been systematically over-provisioned: utilization is far below healthy across all 14 days with no spikes, no business-hours variation, and no SLA pressure. The right action is straightforward rightsizing. | issue_found | compute | null | rightsizing | Right-size the compute fleet from t3.large × 8 to t3.medium × 4 (fixed, no scaling). Update the aws_launch_template instance_type from "t3.large" to "t3.medium" and reduce aws_instance count from 8 to 4. This yields ~62% monthly compute cost reduction (~$2,850/month savings) with no SLA impact. | 2,850 | 4,600 | 1,750 | true |
02 | Spiky Compute Load | single_tier_negative | A fleet that sits at a healthy baseline most of the time but is catastrophically under-provisioned during two predictable daily spike windows on weekdays. The fleet is fixed-capacity; scheduled scaling would absorb the spikes without affecting baseline cost much. | issue_found | compute | null | scaling_policy_change | Replace the fixed 6× m5.large fleet with a scheduled auto-scaling group: maintain a baseline of 3× m5.large during off-peak hours (weekdays 20:00–09:00 UTC and weekends all day), scale to 6× m5.large at 09:00 UTC on weekdays, and burst to 9× m5.large during the two predictable spike windows (weekdays 09:45–11:15 UTC an... | 2,496 | 5,200 | 2,704 | true |
03 | Over-provisioned Database | single_tier_negative | A database that has been sized for a workload it doesn't have. CPU, connections, and IO wait are all flat low across all 14 days. The fix is a one-step instance-class downgrade. | issue_found | database | null | rightsizing | Downsize the primary RDS instance from db.r6g.xlarge (4 vCPU, 32 GiB RAM) to db.r6g.large (2 vCPU, 16 GiB RAM) and enable storage auto-scaling on the 200 GB volume. This halves the database compute cost, saving approximately $1,680/month (50% reduction). Update the Terraform resource aws_db_instance.database_primary by... | 1,680 | 3,360 | 1,680 | true |
04 | Database Connection Bottleneck | single_tier_negative | A database under pressure from both a slow-query problem and a connection pool that's exhausted during business hours. Symptoms: connection counts push above pool limits during business hours, db_query_p95_latency_ms spikes during the same windows. | issue_found | database | null | query_cache_optimization | Optimize the top 5 slowest queries: (1) add index on users(email) for the email-lookup query (820ms p95), (2) add index on sessions(token, expires_at) for the session-token query (640ms p95), (3) add composite index on profiles(user_id) and ensure covering columns for the users-profiles join query (510ms p95), (4) add ... | -2,300 | 2,100 | 4,400 | true |
05 | Load Balancer Inefficiency | single_tier_negative | A fleet behind an Application Load Balancer using round-robin distribution. Cluster-level aggregates show a wide p50-to-p95 spread because some instances are persistently overloaded while others are persistently idle. The fix is a load-balancing algorithm change, not a capacity change. | issue_found | network | compute | load_balancer_reconfiguration | Reconfigure the ALB target group (app05-tg) load-balancing algorithm from round_robin to least_outstanding_requests. No capacity change is needed — the existing 8 × c5.xlarge fleet is sufficient. In Terraform, set `load_balancing_algorithm_type = "least_outstanding_requests"` on the `aws_lb_target_group.main` resource. | 0 | 3,800 | 3,800 | true |
06 | Healthy Application | healthy | An application where every tier is correctly sized, all metrics stay in healthy bands across all 14 days, no SLA pressure, no patterns of concern. The correct recommendation is no action. | no_issue_found | null | null | null | No changes recommended. Each tier was evaluated against its healthy operating bands and SLA targets (99.5% availability, P95 < 500ms). Compute CPU p95 peaks at 78% with comfortable headroom, memory stays under 72%, application P95 latency at 194.6ms max is well within the 500ms target. Database query P95 latency maxes ... | null | null | null | null |
07 | Cache Miss Cascade | cross_tier_negative | A cache tier that degrades (hit ratio drops below the healthy band) cascades into elevated database load and elevated application latency, even though the database and compute tiers themselves are correctly sized. The fix is in the cache layer, not in the downstream tiers — a sophisticated recommender must see past the... | issue_found | cache | database | cache_capacity_adjustment | Scale the Redis cache cluster from 3 to 6 cache.r6g.large nodes to relieve memory pressure (currently 88–95% used) and reduce evictions. Implement cache warming logic targeting the three hottest key patterns—rec:user:* (27% miss rate), rec:trending:* (30% miss rate), and rec:similar:* (38% miss rate)—and redesign these... | -700 | 5,800 | 6,500 | true |
08 | Database Bottleneck Impact | cross_tier_negative | A database with slow queries during business hours that cascade into elevated application latency on the compute tier. Compute itself is correctly sized; the problem is downstream. | issue_found | database | compute | query_cache_optimization | Optimize the top 6 slowest SQL queries and add 2 read replicas with read/write splitting. Specifically: (1) Add a composite index (user_id, id) on carts and (cart_id) on cart_items for the carts-by-user query (p95 820ms, 6.05M calls); (2) Add a composite index (warehouse_id, product_id) on inventory for the inventory-w... | -2,400 | 6,400 | 8,800 | true |
09 | Peak Hours Cost vs Reliability | cross_tier_negative | A high-criticality e-commerce platform with a clear bimodal weekday pattern: heavy use during peak hours, very light use off-peak and on weekends. Currently provisioned at peak capacity 24/7. Scheduled scaling would dramatically reduce off-peak cost without affecting peak SLA. | issue_found | compute | database | scaling_policy_change | Replace the fixed 20× m5.xlarge compute fleet with scheduled Auto Scaling: maintain 20 instances during peak windows (weekdays 09:00–11:00 UTC and 14:00–16:00 UTC), scale down to 7 instances (65% reduction) during off-peak hours (weekdays 16:00–09:00 UTC and all day weekends 00:00–23:59 UTC). Similarly, stop one databa... | 10,400 | 18,400 | 8,000 | true |
10 | Network Latency Impact | cross_tier_negative | A payment service whose external-provider integration uses basic cross-region VPC peering. Network latency to the provider spikes during business hours, cascading into elevated compute-tier application latency. Compute itself is correctly sized; the bottleneck is at the network boundary. | issue_found | network | compute | network_topology_change | Replace basic VPC peering for the payment-provider integration with AWS PrivateLink to eliminate cross-region latency spikes during business hours (weekdays 09:00–18:00 UTC). Add application-level retries with exponential backoff (initial 100ms, max 3 retries, 2× multiplier, jitter) to the payment-provider client. Do n... | -150 | 4,100 | 4,250 | true |
11 | Multi-Tier Over-provisioning | cross_tier_negative | An internal analytics platform that has been globally over-provisioned — all three tiers (compute, database, network) sit at low utilization across all 14 days with no spikes. The right action is comprehensive rightsizing across all tiers, not a single-tier adjustment. | issue_found | compute | database | rightsizing | Right-size compute from m5.2xlarge × 12 to m5.large × 6 and database from db.r6g.4xlarge to db.r6g.xlarge. Retain the existing ALB and network tier unchanged, as network throughput and latency are healthy. This coordinated multi-tier downsize yields ~$4,200/month savings while preserving the 99.7% SLA. | 4,200 | 11,800 | 7,600 | true |
12 | Healthy Compute, Problematic Database | mixed | A user profile service where compute is correctly sized and operating in healthy ranges, but the database is significantly over-provisioned and operating at very low utilization. The right action is downsize the database only — compute and read replicas should not change. | issue_found | database | null | rightsizing | Downsize the RDS primary instance from db.r6g.2xlarge to db.r6g.large. Leave compute tier (m5.large × 6 with target-tracking ASG min=6/max=10) completely unchanged — compute is correctly sized and operating in healthy ranges. Update the Terraform aws_db_instance.database_primary instance_class from "db.r6g.2xlarge" to ... | 1,400 | 7,200 | 5,800 | true |
13 | Compute Spike + Database Strain | cross_tier_negative | A search service where weekday peak-hour compute spikes drive database connection counts roughly 3x above baseline, exhausting the connection pool. Both tiers need attention: compute needs predictive scaling, database needs replicas and a larger pool. | issue_found | compute | database | scaling_policy_change | Replace step scaling with predictive auto-scaling on the compute ASG (trigger scale-out at cpu_p95 > 65%, raise max_size from 12 to 14), add 2 read replicas (db.r6g.xlarge) with read/write splitting to the database tier, and increase the connection pool from 150 to 300. Schedule predictive scaling to pre-warm capacity ... | -3,000 | 7,800 | 10,800 | true |
14 | Good Performance, High Cost | mixed | A checkout flow whose latency and error metrics are excellent — well inside SLA — but utilization is far below the healthy band on both compute and database. The system is over-provisioned to deliver performance that exceeds requirements. Right action is to rightsize while preserving the SLA buffer. | issue_found | compute | database | rightsizing | Right-size compute from m5.2xlarge × 12 to m5.large × 8 (reducing from 96 vCPUs / 384 GiB to 16 vCPUs / 64 GiB total) and downsize the database primary and replicas from db.r6g.4xlarge to db.r6g.xlarge. Retain both read replicas and the PrivateLink/ALB network tier unchanged to preserve the reliability posture for this... | 5,500 | 14,400 | 8,900 | true |
15 | Reliability Focused Over-provisioning | mixed | A payment-processing platform configured for 99.99% SLA via heavy over-provisioning across all tiers and multi-AZ redundancy. Achieves near-zero error rate at substantial cost. The right next step is a business-context question, not a technical rightsizing. | diagnostic_deferral | deferred | deferred | null | Before making any rightsizing changes, confirm with the business whether the 99.99% availability SLA target is contractually required or aspirational. The current infrastructure is heavily over-provisioned (CPU p95 at 34%, memory p95 at 42%, DB connections p95 at 60) to achieve near-zero error rates and comfortable lat... | null | null | null | null |
16 | Partial Optimization | single_tier_mild_negative | A reporting dashboard whose compute tier shows mild under-utilization (slightly below the healthy band) while database and network sit in healthy ranges. The right action is a targeted single-step compute adjustment — not aggressive multi-tier rightsizing. | issue_found | compute | null | rightsizing | Reduce the compute fleet from 4 × m5.large to 3 × m5.large by setting count = 3 in the aws_instance.compute resource. Keep the instance class unchanged (m5.large). Do not modify the database (db.r6g.large) or network (ALB with least_outstanding_requests) tiers, which are correctly sized. | 320 | 2,400 | 2,080 | true |
17 | Cross-Tier Performance Degradation | diagnostic_deferral | A core API platform whose latency rises simultaneously across all three tiers during peak hours, with no clear lead-lag relationship. CPU and connection counts are within normal ranges on all tiers — the problem is latency-distributed, not capacity-driven. Root cause is ambiguous from the observable signals alone. | diagnostic_deferral | deferred | deferred | null | Defer any scaling, rightsizing, or infrastructure changes until a full end-to-end distributed trace analysis is deployed and analyzed across the compute, database, and network tiers. The simultaneous latency rise across all three tiers with zero lead-lag (Pearson coefficients 0.963 and 0.975 at lag 0 minutes) indicates... | null | null | null | null |
18 | Mostly Healthy with Minor Inefficiency | mostly_healthy | An internal tool that is mostly correctly sized. Compute shows slightly low utilization (below the ideal but not below the threshold that justifies aggressive action). All other tiers are in healthy ranges. The right action is a minor compute adjustment, no other changes. | issue_found | compute | null | rightsizing | Reduce the compute fleet from t3.medium × 5 to t3.medium × 4 by updating `aws_instance.compute` count from 5 to 4 in main.tf. This is a minor refinement — the system is mostly well-optimized and only this single adjustment is warranted. No changes to database, cache, or network tiers are needed. | 110 | 1,900 | 1,790 | true |
Synthesized Cloud-Optimization Recommendations
18 scenarios that pair cloud telemetry with a hand-crafted optimization recommendation. Use them to train models or to evaluate AI agents.
Summary
Each scenario has multi-tier telemetry, a Terraform file describing the deployed infrastructure, and a gold-standard recommendation.
The dataset is built around a simple input-output mapping. The input is telemetry plus the infrastructure. The output is an optimization recommendation that says what to change and what the impact will be.
The dataset is synthesized. Telemetry was generated procedurally to match each scenario's narrative. Gold recommendations were hand-crafted and verified.
The dataset uses AWS vocabulary throughout. Instance types, service names, and field names match AWS. This makes the scenarios concrete instead of vendor-neutral.
Folder layout
    README.md                            # this file
    LICENSE                              # MIT
    EVAL.md                              # what eval.py checks
    eval.py                              # Floor sanity check (smoke test)
    sample_predictions.json              # worked example of submission shape
    scenarios_summary.jsonl              # one row per scenario (viewer table)
    scenarios/
      01/
        metadata.json                    # scenario summary + fixtures
        main.tf                          # Terraform for the infra
        compute_telemetry.json           # CPU, memory, latency
        database_telemetry.json          # query rates, pool stats
        cache_telemetry.json             # hit rate, eviction
        network_telemetry.json           # bandwidth, packet loss
        correlation_evidence.json        # cross-tier correlations
        handcrafted_recommendation.json  # the gold answer
      02/
        ...
Each scenario covers a different optimization situation. Some are single-tier (only compute is wrong). Some span tiers (a database problem that surfaces in compute). One is a no-action case (scenario 06). Two are diagnostic deferrals (scenarios 15 and 17); of those, scenario 15 asks for an SLA review instead of an infra change.
The summary table (scenarios_summary.jsonl)
The Hugging Face Dataset Viewer renders scenarios_summary.jsonl as a
browsable table. Each row is one scenario and includes the headline fields
from that scenario's metadata and gold recommendation.
The summary is for discovery only. The full inputs (telemetry, Terraform,
correlation evidence) live in scenarios/NN/. Always train or evaluate on
the full files, not on the summary.
Columns in the summary table:
| Column | Source |
|---|---|
| scenario_id | folder name |
| scenario_name | metadata.scenario_name |
| scenario_type | metadata.scenario_type |
| what_this_demonstrates | metadata.narrative.what_this_demonstrates |
| finding_type | gold.finding_type |
| primary_tier | gold.primary_tier |
| secondary_tier | gold.secondary_tier |
| action_category | gold.action_category |
| specific_change | gold.specific_change |
| savings_monthly_usd | gold.cost_impact.savings_monthly_usd |
| current_monthly_usd | gold.cost_impact.current_monthly_usd |
| projected_monthly_usd | gold.cost_impact.projected_monthly_usd |
Some scenarios have negative savings_monthly_usd. That is expected. For
those scenarios the right action increases cost to fix a performance or
reliability problem (for example, adding a read replica).
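The cost columns can be checked mechanically. The sketch below is a minimal example, not part of the shipped tooling: it parses JSONL text, lists the scenarios whose gold action raises cost, and tests the relationship `savings = current - projected`, which holds for every row shown in the viewer table. The function names (`parse_summary`, `cost_increasing`, `savings_consistent`) and the tolerance are hypothetical, and the sample rows keep only the cost fields.

```python
import json

# Three rows copied from the viewer table (only the cost fields are kept).
SAMPLE = "\n".join([
    '{"scenario_id": "01", "savings_monthly_usd": 2850, "current_monthly_usd": 4600, "projected_monthly_usd": 1750}',
    '{"scenario_id": "04", "savings_monthly_usd": -2300, "current_monthly_usd": 2100, "projected_monthly_usd": 4400}',
    '{"scenario_id": "06", "savings_monthly_usd": null, "current_monthly_usd": null, "projected_monthly_usd": null}',
])


def parse_summary(jsonl_text):
    """Parse JSONL text into a list of row dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def cost_increasing(rows):
    """Return the scenario_ids whose gold action raises monthly cost."""
    return [
        row["scenario_id"]
        for row in rows
        if row.get("savings_monthly_usd") is not None and row["savings_monthly_usd"] < 0
    ]


def savings_consistent(row, tol=0.01):
    """True when savings == current - projected, or when all three fields are null."""
    values = [
        row.get(key)
        for key in ("savings_monthly_usd", "current_monthly_usd", "projected_monthly_usd")
    ]
    if any(value is None for value in values):
        # No-action and deferral rows carry no cost figures at all.
        return all(value is None for value in values)
    savings, current, projected = values
    return abs((current - projected) - savings) <= tol


rows = parse_summary(SAMPLE)
print(cost_increasing(rows))
print(all(savings_consistent(row) for row in rows))
```

To run the same checks on the real file, pass the contents of scenarios_summary.jsonl to `parse_summary` instead of `SAMPLE`.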
Schema
Scenario inputs
Each scenarios/NN/ folder has these files.
| File | What it is |
|---|---|
| metadata.json | scenario name, narrative, fixtures |
| main.tf | Terraform for the deployed infra |
| compute_telemetry.json | per-window CPU, memory, latency |
| database_telemetry.json | per-window DB query rate, pool, slow queries |
| cache_telemetry.json | per-window hit rate, evictions |
| network_telemetry.json | per-window bandwidth, packet loss |
| correlation_evidence.json | cross-tier correlation pairs |
| handcrafted_recommendation.json | the gold answer |
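The files listed above can be read into an input bundle and a gold target with a few lines of Python. This is a minimal sketch under stated assumptions: `load_scenario` and the shape of the returned dict are hypothetical, the internal structure of each JSON file is not inspected, and a missing file is returned as `None` rather than raising.

```python
import json
from pathlib import Path

# File names taken from the table above.
TELEMETRY_FILES = (
    "compute_telemetry.json",
    "database_telemetry.json",
    "cache_telemetry.json",
    "network_telemetry.json",
)


def load_scenario(folder):
    """Return (inputs, gold) for one scenarios/NN/ folder.

    Assumption: files that are absent map to None instead of raising.
    """
    folder = Path(folder)

    def read_json(name):
        path = folder / name
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    terraform_path = folder / "main.tf"
    inputs = {
        "metadata": read_json("metadata.json"),
        # Terraform is kept as raw text; parsing HCL is out of scope here.
        "terraform": terraform_path.read_text(encoding="utf-8") if terraform_path.exists() else None,
        # Keys become "compute", "database", "cache", "network".
        "telemetry": {name.split("_")[0]: read_json(name) for name in TELEMETRY_FILES},
        "correlation_evidence": read_json("correlation_evidence.json"),
    }
    gold = read_json("handcrafted_recommendation.json")
    return inputs, gold
```

A call such as `inputs, gold = load_scenario("scenarios/01")` keeps the gold answer separate from the inputs, so it is harder to pass the target to a model by accident.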
Recommendation shape
    {
      "scenario_id": "01",
      "finding_type": "issue_found",
      "specific_change": "...",
      "primary_tier": "compute",
      "secondary_tier": null,
      "action_category": "rightsizing",
      "conclusion": { ... },
      "evidence": {
        "telemetry_observations": [ ... ],
        "infrastructure_context": [ ... ],
        "correlation_observations": [ ... ]
      },
      "reasoning": "...",
      "projected_state": { ... },
      "cost_impact": { ... },
      "risk_assessment": { ... }
    }
Allowed values
- finding_type: issue_found, no_issue_found, diagnostic_deferral, insufficient_data
- primary_tier: compute, database, cache, network, deferred, or null
- secondary_tier: same set as primary_tier
- action_category: rightsizing, scaling_policy_change, query_cache_optimization, cache_capacity_adjustment, pool_sizing, replica_adjustment, load_balancer_reconfiguration, network_topology_change, sla_review, or null

The deferred tier sentinel is used in diagnostic-deferral scenarios
where the agent explicitly cannot pick a tier yet (scenarios 15 and 17).
insufficient_data is reserved for future scenarios where the dataset
is too sparse to support any finding; no current scenario uses it.
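The allowed values above translate directly into a small validator. The sketch below is an illustration only and is not the logic in eval.py, which is not reproduced here; `enum_errors` and the constant names are hypothetical. JSON null is represented as Python `None`.

```python
# Allowed values copied from the lists above; None stands for JSON null.
FINDING_TYPES = ("issue_found", "no_issue_found", "diagnostic_deferral", "insufficient_data")
TIERS = ("compute", "database", "cache", "network", "deferred", None)
ACTION_CATEGORIES = (
    "rightsizing",
    "scaling_policy_change",
    "query_cache_optimization",
    "cache_capacity_adjustment",
    "pool_sizing",
    "replica_adjustment",
    "load_balancer_reconfiguration",
    "network_topology_change",
    "sla_review",
    None,
)

CHECKS = (
    ("finding_type", FINDING_TYPES),
    ("primary_tier", TIERS),
    ("secondary_tier", TIERS),
    ("action_category", ACTION_CATEGORIES),
)


def enum_errors(recommendation):
    """Return one message per enum field whose value is not allowed.

    A missing key is read as None, so it passes for the nullable fields
    and fails for finding_type.
    """
    errors = []
    for field, allowed in CHECKS:
        value = recommendation.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors
```

For example, a deferral record with `primary_tier` and `secondary_tier` set to `"deferred"` and a null `action_category` (the pattern in scenarios 15 and 17) yields an empty error list.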
Scenario coverage
| ID | Type | Description |
|---|---|---|
| 01 | single-tier | compute over-provisioned |
| 02 | single-tier | compute peak windows, needs scheduled scaling |
| 03 | single-tier | database over-provisioned |
| 04 | single-tier | slow queries plus exhausted pool |
| 05 | single-tier | ALB round-robin causing uneven CPU |
| 06 | no-action | all tiers healthy |
| 07 | cross-tier | cache hit ratio degraded, cascades to database |
| 08 | cross-tier | slow DB queries cascade to compute |
| 09 | cross-tier | weekday bimodal peaks, needs scheduled scaling |
| 10 | cross-tier | network latency cascades to compute |
| 11 | cross-tier | all three tiers over-provisioned |
| 12 | mixed | healthy compute, over-provisioned database |
| 13 | cross-tier | compute spike strains database |
| 14 | mixed | compute and database both over-provisioned |
| 15 | reliability | 99.99% SLA via over-provisioning |
| 16 | mild | partial compute optimization |
| 17 | deferral | all tiers rise in lockstep, need more diagnosis |
| 18 | mostly healthy | minor compute inefficiency |
How to use it
You can use this dataset two ways.
Train or fine-tune. Treat each scenario's telemetry, Terraform file,
correlation evidence, and metadata as input. Use that scenario's
handcrafted_recommendation.json as the target output.
Evaluate AI agents. Run your agent on the scenario inputs. Compare its output to the hand-crafted recommendation in that scenario's folder.
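For the training mode, one way to turn a scenario into a prompt/completion pair is sketched below. Everything here is an assumption rather than a prescribed format: `to_training_pair`, the `prompt` and `completion` keys, and the input dict layout are hypothetical. The optional `drop_narrative` flag reflects that metadata.narrative.what_this_demonstrates describes the right action in prose (see the summary table), so some users may prefer to withhold it from the model.

```python
import copy
import json


def to_training_pair(inputs, gold, drop_narrative=True):
    """Serialize one scenario into a prompt/completion pair.

    inputs: dict of scenario inputs (metadata, Terraform text, telemetry, ...).
    gold:   the parsed handcrafted_recommendation.json.
    """
    features = copy.deepcopy(inputs)  # never mutate the caller's data
    metadata = features.get("metadata")
    if drop_narrative and isinstance(metadata, dict):
        # The narrative states the intended answer in prose.
        metadata.pop("narrative", None)
    return {
        "prompt": json.dumps(features, sort_keys=True),
        "completion": json.dumps(gold, sort_keys=True),
    }
```

Sorting keys makes the serialized text stable across runs, which helps when deduplicating or caching examples.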
Quick sanity check
python eval.py --predictions sample_predictions.json
This runs the bundled Floor sanity check. It confirms your predictions
parse, have the required fields, and use allowed category values. It does
NOT score recommendation quality. See EVAL.md for what is checked.
Prediction shape
See sample_predictions.json for a worked example. Required fields per
prediction: scenario_id, finding_type, specific_change,
primary_tier, action_category. Optional but useful for deeper
scoring: secondary_tier, reasoning, evidence, projected_state,
cost_impact, risk_assessment.
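As an illustration of the required fields, the sketch below builds a minimal prediction for the healthy scenario (06) and checks that every required key is present. It tests key presence rather than truthiness because `primary_tier` and `action_category` may legitimately be null. `missing_required` is a hypothetical helper, not a function from eval.py, and the exact layout of sample_predictions.json is not assumed.

```python
# Required field names as listed above.
REQUIRED_FIELDS = (
    "scenario_id",
    "finding_type",
    "specific_change",
    "primary_tier",
    "action_category",
)


def missing_required(prediction):
    """Return the required keys that are absent from a prediction dict."""
    return [field for field in REQUIRED_FIELDS if field not in prediction]


# Minimal prediction for scenario 06; None stands for JSON null.
prediction_06 = {
    "scenario_id": "06",
    "finding_type": "no_issue_found",
    "specific_change": "No changes recommended.",
    "primary_tier": None,
    "action_category": None,
}

print(missing_required(prediction_06))
```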
How to score beyond the Floor check
The dataset ships gold answers and a Floor sanity check. It does not ship a quality scorer. Beyond the Floor check, the scoring method is up to you. Common options:
- Exact match on the enum fields (finding_type, primary_tier, action_category).
- Keyword or substring checks on specific_change.
- Semantic similarity on the prose fields.
- A custom rubric per scenario, comparing prediction fields against the matching handcrafted_recommendation.json.
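The first two options in the list above can be sketched in a few lines. This is one possible scorer under stated assumptions, not a scorer shipped with the dataset: `enum_exact_match` and `keyword_coverage` are hypothetical names, and the keyword list for each scenario is something you would choose yourself from the gold specific_change.

```python
ENUM_FIELDS = ("finding_type", "primary_tier", "action_category")


def enum_exact_match(prediction, gold):
    """Fraction of enum fields where the prediction equals the gold value.

    A field missing from the prediction counts as a miss, even when the
    gold value is null.
    """
    hits = sum(
        1
        for field in ENUM_FIELDS
        if field in prediction and prediction[field] == gold.get(field)
    )
    return hits / len(ENUM_FIELDS)


def keyword_coverage(prediction, keywords):
    """Fraction of keywords found (case-insensitively) in specific_change."""
    if not keywords:
        return 1.0
    text = (prediction.get("specific_change") or "").lower()
    return sum(1 for keyword in keywords if keyword.lower() in text) / len(keywords)


# Gold enum values for scenario 01, taken from the summary table.
gold_01 = {
    "finding_type": "issue_found",
    "primary_tier": "compute",
    "action_category": "rightsizing",
}
prediction = {
    "finding_type": "issue_found",
    "primary_tier": "compute",
    "action_category": "scaling_policy_change",
    "specific_change": "Move the fleet from t3.large to t3.medium.",
}

print(enum_exact_match(prediction, gold_01))
print(keyword_coverage(prediction, ["t3.large", "t3.medium"]))
```

Substring checks are brittle (a prediction can name the right instance type while recommending the wrong direction), so they are probably best combined with a rubric or semantic comparison.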
Intended uses
- Train or fine-tune a model that maps cloud telemetry to an optimization recommendation.
- Evaluate AI agents on cloud-optimization reasoning.
- Compare single-shot vs orchestrated agent designs.
License
MIT. See LICENSE.
Citation
@misc{synthesized_cloud_optimization_recommendations_2026,
title = {Synthesized Cloud-Optimization Recommendations},
author = {Alexander Meau},
year = {2026},
version = {1.0.0}
}