Dataset: ameau01/synthesized-cloud-optimization-recommendations

scenario_id	scenario_name	scenario_type	what_this_demonstrates	finding_type	primary_tier	secondary_tier	action_category	specific_change	savings_monthly_usd	current_monthly_usd	projected_monthly_usd	sla_availability_preserved
01	Chronic Underutilization	single_tier_negative	A compute fleet that has been systematically over-provisioned: utilization is far below healthy across all 14 days with no spikes, no business-hours variation, and no SLA pressure. The right action is straightforward rightsizing.	issue_found	compute	null	rightsizing	Right-size the compute fleet from t3.large × 8 to t3.medium × 4 (fixed, no scaling). Update the aws_launch_template instance_type from "t3.large" to "t3.medium" and reduce aws_instance count from 8 to 4. This yields ~62% monthly compute cost reduction (~$2,850/month savings) with no SLA impact.	2,850	4,600	1,750	true
02	Spiky Compute Load	single_tier_negative	A fleet that sits at a healthy baseline most of the time but is catastrophically under-provisioned during two predictable daily spike windows on weekdays. The fleet is fixed-capacity; scheduled scaling would absorb the spikes without affecting baseline cost much.	issue_found	compute	null	scaling_policy_change	Replace the fixed 6× m5.large fleet with a scheduled auto-scaling group: maintain a baseline of 3× m5.large during off-peak hours (weekdays 20:00–09:00 UTC and weekends all day), scale to 6× m5.large at 09:00 UTC on weekdays, and burst to 9× m5.large during the two predictable spike windows (weekdays 09:45–11:15 UTC an...	2,496	5,200	2,704	true
03	Over-provisioned Database	single_tier_negative	A database that has been sized for a workload it doesn't have. CPU, connections, and IO wait are all flat low across all 14 days. The fix is a one-step instance-class downgrade.	issue_found	database	null	rightsizing	Downsize the primary RDS instance from db.r6g.xlarge (4 vCPU, 32 GiB RAM) to db.r6g.large (2 vCPU, 16 GiB RAM) and enable storage auto-scaling on the 200 GB volume. This halves the database compute cost, saving approximately $1,680/month (50% reduction). Update the Terraform resource aws_db_instance.database_primary by...	1,680	3,360	1,680	true
04	Database Connection Bottleneck	single_tier_negative	A database under pressure from both a slow-query problem and a connection pool that's exhausted during business hours. Symptoms: connection counts push above pool limits during business hours, db_query_p95_latency_ms spikes during the same windows.	issue_found	database	null	query_cache_optimization	Optimize the top 5 slowest queries: (1) add index on users(email) for the email-lookup query (820ms p95), (2) add index on sessions(token, expires_at) for the session-token query (640ms p95), (3) add composite index on profiles(user_id) and ensure covering columns for the users-profiles join query (510ms p95), (4) add ...	-2,300	2,100	4,400	true
05	Load Balancer Inefficiency	single_tier_negative	A fleet behind an Application Load Balancer using round-robin distribution. Cluster-level aggregates show a wide p50-to-p95 spread because some instances are persistently overloaded while others are persistently idle. The fix is a load-balancing algorithm change, not a capacity change.	issue_found	network	compute	load_balancer_reconfiguration	Reconfigure the ALB target group (app05-tg) load-balancing algorithm from round_robin to least_outstanding_requests. No capacity change is needed — the existing 8 × c5.xlarge fleet is sufficient. In Terraform, set `load_balancing_algorithm_type = "least_outstanding_requests"` on the `aws_lb_target_group.main` resource.	0	3,800	3,800	true
06	Healthy Application	healthy	An application where every tier is correctly sized, all metrics stay in healthy bands across all 14 days, no SLA pressure, no patterns of concern. The correct recommendation is no action.	no_issue_found	null	null	null	No changes recommended. Each tier was evaluated against its healthy operating bands and SLA targets (99.5% availability, P95 < 500ms). Compute CPU p95 peaks at 78% with comfortable headroom, memory stays under 72%, application P95 latency at 194.6ms max is well within the 500ms target. Database query P95 latency maxes ...	null	null	null	null
07	Cache Miss Cascade	cross_tier_negative	A cache tier that degrades (hit ratio drops below the healthy band) cascades into elevated database load and elevated application latency, even though the database and compute tiers themselves are correctly sized. The fix is in the cache layer, not in the downstream tiers — a sophisticated recommender must see past the...	issue_found	cache	database	cache_capacity_adjustment	Scale the Redis cache cluster from 3 to 6 cache.r6g.large nodes to relieve memory pressure (currently 88–95% used) and reduce evictions. Implement cache warming logic targeting the three hottest key patterns—rec:user:* (27% miss rate), rec:trending:* (30% miss rate), and rec:similar:* (38% miss rate)—and redesign these...	-700	5,800	6,500	true
08	Database Bottleneck Impact	cross_tier_negative	A database with slow queries during business hours that cascade into elevated application latency on the compute tier. Compute itself is correctly sized; the problem is downstream.	issue_found	database	compute	query_cache_optimization	Optimize the top 6 slowest SQL queries and add 2 read replicas with read/write splitting. Specifically: (1) Add a composite index (user_id, id) on carts and (cart_id) on cart_items for the carts-by-user query (p95 820ms, 6.05M calls); (2) Add a composite index (warehouse_id, product_id) on inventory for the inventory-w...	-2,400	6,400	8,800	true
09	Peak Hours Cost vs Reliability	cross_tier_negative	A high-criticality e-commerce platform with a clear bimodal weekday pattern: heavy use during peak hours, very light use off-peak and on weekends. Currently provisioned at peak capacity 24/7. Scheduled scaling would dramatically reduce off-peak cost without affecting peak SLA.	issue_found	compute	database	scaling_policy_change	Replace the fixed 20× m5.xlarge compute fleet with scheduled Auto Scaling: maintain 20 instances during peak windows (weekdays 09:00–11:00 UTC and 14:00–16:00 UTC), scale down to 7 instances (65% reduction) during off-peak hours (weekdays 16:00–09:00 UTC and all day weekends 00:00–23:59 UTC). Similarly, stop one databa...	10,400	18,400	8,000	true
10	Network Latency Impact	cross_tier_negative	A payment service whose external-provider integration uses basic cross-region VPC peering. Network latency to the provider spikes during business hours, cascading into elevated compute-tier application latency. Compute itself is correctly sized; the bottleneck is at the network boundary.	issue_found	network	compute	network_topology_change	Replace basic VPC peering for the payment-provider integration with AWS PrivateLink to eliminate cross-region latency spikes during business hours (weekdays 09:00–18:00 UTC). Add application-level retries with exponential backoff (initial 100ms, max 3 retries, 2× multiplier, jitter) to the payment-provider client. Do n...	-150	4,100	4,250	true
11	Multi-Tier Over-provisioning	cross_tier_negative	An internal analytics platform that has been globally over-provisioned — all three tiers (compute, database, network) sit at low utilization across all 14 days with no spikes. The right action is comprehensive rightsizing across all tiers, not a single-tier adjustment.	issue_found	compute	database	rightsizing	Right-size compute from m5.2xlarge × 12 to m5.large × 6 and database from db.r6g.4xlarge to db.r6g.xlarge. Retain the existing ALB and network tier unchanged, as network throughput and latency are healthy. This coordinated multi-tier downsize yields ~$4,200/month savings while preserving the 99.7% SLA.	4,200	11,800	7,600	true
12	Healthy Compute, Problematic Database	mixed	A user profile service where compute is correctly sized and operating in healthy ranges, but the database is significantly over-provisioned and operating at very low utilization. The right action is downsize the database only — compute and read replicas should not change.	issue_found	database	null	rightsizing	Downsize the RDS primary instance from db.r6g.2xlarge to db.r6g.large. Leave compute tier (m5.large × 6 with target-tracking ASG min=6/max=10) completely unchanged — compute is correctly sized and operating in healthy ranges. Update the Terraform aws_db_instance.database_primary instance_class from "db.r6g.2xlarge" to ...	1,400	7,200	5,800	true
13	Compute Spike + Database Strain	cross_tier_negative	A search service where weekday peak-hour compute spikes drive database connection counts roughly 3x above baseline, exhausting the connection pool. Both tiers need attention: compute needs predictive scaling, database needs replicas and a larger pool.	issue_found	compute	database	scaling_policy_change	Replace step scaling with predictive auto-scaling on the compute ASG (trigger scale-out at cpu_p95 > 65%, raise max_size from 12 to 14), add 2 read replicas (db.r6g.xlarge) with read/write splitting to the database tier, and increase the connection pool from 150 to 300. Schedule predictive scaling to pre-warm capacity ...	-3,000	7,800	10,800	true
14	Good Performance, High Cost	mixed	A checkout flow whose latency and error metrics are excellent — well inside SLA — but utilization is far below the healthy band on both compute and database. The system is over-provisioned to deliver performance that exceeds requirements. Right action is to rightsize while preserving the SLA buffer.	issue_found	compute	database	rightsizing	Right-size compute from m5.2xlarge × 12 to m5.large × 8 (reducing from 96 vCPUs / 384 GiB to 16 vCPUs / 64 GiB total) and downsize the database primary and replicas from db.r6g.4xlarge to db.r6g.xlarge. Retain both read replicas and the PrivateLink/ALB network tier unchanged to preserve the reliability posture for this...	5,500	14,400	8,900	true
15	Reliability Focused Over-provisioning	mixed	A payment-processing platform configured for 99.99% SLA via heavy over-provisioning across all tiers and multi-AZ redundancy. Achieves near-zero error rate at substantial cost. The right next step is a business-context question, not a technical rightsizing.	diagnostic_deferral	deferred	deferred	null	Before making any rightsizing changes, confirm with the business whether the 99.99% availability SLA target is contractually required or aspirational. The current infrastructure is heavily over-provisioned (CPU p95 at 34%, memory p95 at 42%, DB connections p95 at 60) to achieve near-zero error rates and comfortable lat...	null	null	null	null
16	Partial Optimization	single_tier_mild_negative	A reporting dashboard whose compute tier shows mild under-utilization (slightly below the healthy band) while database and network sit in healthy ranges. The right action is a targeted single-step compute adjustment — not aggressive multi-tier rightsizing.	issue_found	compute	null	rightsizing	Reduce the compute fleet from 4 × m5.large to 3 × m5.large by setting count = 3 in the aws_instance.compute resource. Keep the instance class unchanged (m5.large). Do not modify the database (db.r6g.large) or network (ALB with least_outstanding_requests) tiers, which are correctly sized.	320	2,400	2,080	true
17	Cross-Tier Performance Degradation	diagnostic_deferral	A core API platform whose latency rises simultaneously across all three tiers during peak hours, with no clear lead-lag relationship. CPU and connection counts are within normal ranges on all tiers — the problem is latency-distributed, not capacity-driven. Root cause is ambiguous from the observable signals alone.	diagnostic_deferral	deferred	deferred	null	Defer any scaling, rightsizing, or infrastructure changes until a full end-to-end distributed trace analysis is deployed and analyzed across the compute, database, and network tiers. The simultaneous latency rise across all three tiers with zero lead-lag (Pearson coefficients 0.963 and 0.975 at lag 0 minutes) indicates...	null	null	null	null
18	Mostly Healthy with Minor Inefficiency	mostly_healthy	An internal tool that is mostly correctly sized. Compute shows slightly low utilization (below the ideal but not below the threshold that justifies aggressive action). All other tiers are in healthy ranges. The right action is a minor compute adjustment, no other changes.	issue_found	compute	null	rightsizing	Reduce the compute fleet from t3.medium × 5 to t3.medium × 4 by updating `aws_instance.compute` count from 5 to 4 in main.tf. This is a minor refinement — the system is mostly well-optimized and only this single adjustment is warranted. No changes to database, cache, or network tiers are needed.	110	1,900	1,790	true

Synthesized Cloud-Optimization Recommendations

18 scenarios that pair cloud telemetry with a hand-crafted optimization recommendation. Use them to train models or to evaluate AI agents.

Summary

Each scenario has multi-tier telemetry, a Terraform file describing the deployed infrastructure, and a gold-standard recommendation.

The dataset is built around a simple input-output mapping. The input is telemetry plus the infrastructure. The output is an optimization recommendation that says what to change and what the impact will be.

The dataset is synthesized. Telemetry was generated procedurally to match each scenario's narrative. Gold recommendations were hand-crafted and verified.

The dataset uses AWS vocabulary throughout. Instance types, service names, and field names match AWS. This makes the scenarios concrete instead of vendor-neutral.

Folder layout

README.md                                # this file
LICENSE                                  # MIT
EVAL.md                                  # what eval.py checks
eval.py                                  # Floor sanity check (smoke test)
sample_predictions.json                  # worked example of submission shape
scenarios_summary.jsonl                  # one row per scenario (viewer table)
scenarios/
  01/
    metadata.json                        # scenario summary + fixtures
    main.tf                              # Terraform for the infra
    compute_telemetry.json               # CPU, memory, latency
    database_telemetry.json              # query rates, pool stats
    cache_telemetry.json                 # hit rate, eviction
    network_telemetry.json               # bandwidth, packet loss
    correlation_evidence.json            # cross-tier correlations
    handcrafted_recommendation.json      # the gold answer
  02/
    ...

Each scenario covers a different optimization situation. Some are single-tier (only compute is wrong). Some span tiers (a database problem that surfaces in compute). One is a no-action case (06). Two are diagnostic-deferral cases (15 and 17); of these, scenario 15 asks for an SLA review instead of an infra change.

The summary table (`scenarios_summary.jsonl`)

The Hugging Face Dataset Viewer renders scenarios_summary.jsonl as a browsable table. Each row is one scenario and includes the headline fields from that scenario's metadata and gold recommendation.

The summary is for discovery only. The full inputs (telemetry, Terraform, correlation evidence) live in scenarios/NN/. Always train or evaluate on the full files, not on the summary.

Columns in the summary table:

Column	Source
`scenario_id`	folder name
`scenario_name`	metadata.scenario_name
`scenario_type`	metadata.scenario_type
`what_this_demonstrates`	metadata.narrative.what_this_demonstrates
`finding_type`	gold.finding_type
`primary_tier`	gold.primary_tier
`secondary_tier`	gold.secondary_tier
`action_category`	gold.action_category
`specific_change`	gold.specific_change
`savings_monthly_usd`	gold.cost_impact.savings_monthly_usd
`current_monthly_usd`	gold.cost_impact.current_monthly_usd
`projected_monthly_usd`	gold.cost_impact.projected_monthly_usd

Some scenarios have negative savings_monthly_usd. That is expected. For those scenarios the right action increases cost to fix a performance or reliability problem (for example, adding a read replica).
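
The relationship between the three cost columns can be checked directly. In the rows previewed on this card, savings_monthly_usd equals current_monthly_usd minus projected_monthly_usd, so a cost increase shows up as a negative value. The sketch below is a hypothetical helper (check_cost_fields is not part of the dataset) and uses values copied from scenarios 01, 04, and 06.

```python
# Hypothetical helper, not shipped with the dataset.
def check_cost_fields(rows):
    """Return scenario_ids whose savings differ from current minus projected."""
    mismatched = []
    for row in rows:
        savings = row.get("savings_monthly_usd")
        current = row.get("current_monthly_usd")
        projected = row.get("projected_monthly_usd")
        if savings is None or current is None or projected is None:
            continue  # no-action and deferral rows carry null cost fields
        if abs((current - projected) - savings) > 0.01:
            mismatched.append(row["scenario_id"])
    return mismatched


# Values copied from the summary rows for scenarios 01, 04, and 06.
rows = [
    {"scenario_id": "01", "savings_monthly_usd": 2850,
     "current_monthly_usd": 4600, "projected_monthly_usd": 1750},
    {"scenario_id": "04", "savings_monthly_usd": -2300,
     "current_monthly_usd": 2100, "projected_monthly_usd": 4400},
    {"scenario_id": "06", "savings_monthly_usd": None,
     "current_monthly_usd": None, "projected_monthly_usd": None},
]
print(check_cost_fields(rows))  # prints []
```

Scenario 04 passes the check with savings of -2,300: the projected cost (4,400) is higher than the current cost (2,100).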

Schema

Scenario inputs

Each scenarios/NN/ folder has these files.

File	What it is
`metadata.json`	scenario name, narrative, fixtures
`main.tf`	Terraform for the deployed infra
`compute_telemetry.json`	per-window CPU, memory, latency
`database_telemetry.json`	per-window DB query rate, pool, slow queries
`cache_telemetry.json`	per-window hit rate, evictions
`network_telemetry.json`	per-window bandwidth, packet loss
`correlation_evidence.json`	cross-tier correlation pairs
`handcrafted_recommendation.json`	the gold answer
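
A minimal sketch of loading one scenario folder, assuming every listed .json file parses as ordinary JSON and that main.tf is best kept as raw text. The function name load_scenario and the choice to skip missing files are assumptions for illustration; the internal structure of each file is not described here and is left untouched.

```python
import json
from pathlib import Path

# File names taken from the table above; the loader itself is a hypothetical helper.
JSON_FILES = [
    "metadata.json",
    "compute_telemetry.json",
    "database_telemetry.json",
    "cache_telemetry.json",
    "network_telemetry.json",
    "correlation_evidence.json",
    "handcrafted_recommendation.json",
]


def load_scenario(folder):
    """Load one scenarios/NN/ folder into a dict keyed by file stem."""
    folder = Path(folder)
    scenario = {}
    for name in JSON_FILES:
        path = folder / name
        if path.exists():  # assumption: skip a missing file rather than fail
            scenario[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    tf_path = folder / "main.tf"
    if tf_path.exists():
        # Terraform is kept as raw text; no HCL parsing is attempted.
        scenario["main_tf"] = tf_path.read_text(encoding="utf-8")
    return scenario
```

A typical call would be load_scenario("scenarios/01"), followed by popping the "handcrafted_recommendation" key so the gold answer stays separate from the model input.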

Recommendation shape

{
  "scenario_id": "01",
  "finding_type": "issue_found",
  "specific_change": "...",
  "primary_tier": "compute",
  "secondary_tier": null,
  "action_category": "rightsizing",
  "conclusion": { ... },
  "evidence": {
    "telemetry_observations": [ ... ],
    "infrastructure_context": [ ... ],
    "correlation_observations": [ ... ]
  },
  "reasoning": "...",
  "projected_state": { ... },
  "cost_impact": { ... },
  "risk_assessment": { ... }
}

Allowed values

finding_type: issue_found, no_issue_found, diagnostic_deferral, insufficient_data
primary_tier: compute, database, cache, network, deferred, or null
secondary_tier: same set as primary_tier
action_category: rightsizing, scaling_policy_change, query_cache_optimization, cache_capacity_adjustment, pool_sizing, replica_adjustment, load_balancer_reconfiguration, network_topology_change, sla_review, or null

The deferred tier sentinel is used in diagnostic-deferral scenarios where the agent explicitly cannot pick a tier yet (scenarios 15 and 17). insufficient_data is reserved for future scenarios where the dataset is too sparse to support any finding; no current scenario uses it.
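
The allowed values above can be expressed as sets and checked against a recommendation. This is a sketch under stated assumptions, not the logic in eval.py: the function name invalid_enum_fields is hypothetical, and JSON null is represented as Python None.

```python
# Allowed values copied from the lists above; None stands for JSON null.
FINDING_TYPES = {"issue_found", "no_issue_found", "diagnostic_deferral",
                 "insufficient_data"}
TIERS = {"compute", "database", "cache", "network", "deferred", None}
ACTION_CATEGORIES = {
    "rightsizing", "scaling_policy_change", "query_cache_optimization",
    "cache_capacity_adjustment", "pool_sizing", "replica_adjustment",
    "load_balancer_reconfiguration", "network_topology_change",
    "sla_review", None,
}


def invalid_enum_fields(rec):
    """Return the names of enum fields whose value is not allowed.

    A missing tier or action_category reads as None and is therefore
    accepted here; a missing finding_type is reported as invalid.
    """
    checks = {
        "finding_type": FINDING_TYPES,
        "primary_tier": TIERS,
        "secondary_tier": TIERS,
        "action_category": ACTION_CATEGORIES,
    }
    return [field for field, allowed in checks.items()
            if rec.get(field) not in allowed]


deferral = {"finding_type": "diagnostic_deferral", "primary_tier": "deferred",
            "secondary_tier": "deferred", "action_category": None}
print(invalid_enum_fields(deferral))  # prints []
```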

Scenario coverage

ID	Type	Description
01	single-tier	compute over-provisioned
02	single-tier	compute peak windows, needs scheduled scaling
03	single-tier	database over-provisioned
04	single-tier	slow queries plus exhausted pool
05	single-tier	ALB round-robin causing uneven CPU
06	no-action	all tiers healthy
07	cross-tier	cache hit ratio degraded, cascades to database
08	cross-tier	slow DB queries cascade to compute
09	cross-tier	weekday bimodal peaks, needs scheduled scaling
10	cross-tier	network latency cascades to compute
11	cross-tier	all three tiers over-provisioned
12	mixed	healthy compute, over-provisioned database
13	cross-tier	compute spike strains database
14	mixed	compute and database both over-provisioned
15	reliability	99.99% SLA via over-provisioning
16	mild	partial compute optimization
17	deferral	all tiers rise in lockstep, need more diagnosis
18	mostly healthy	minor compute inefficiency

How to use it

You can use this dataset two ways.

Train or fine-tune. Treat each scenario's telemetry, Terraform (main.tf), correlation evidence, and metadata as input. Use the handcrafted_recommendation.json as the target output.

Evaluate AI agents. Run your agent on the scenario inputs. Compare its output to the hand-crafted recommendation in that scenario's folder.

Quick sanity check

python eval.py --predictions sample_predictions.json

This runs the bundled Floor sanity check. It confirms your predictions parse, have the required fields, and use allowed category values. It does NOT score recommendation quality. See EVAL.md for what is checked.

Prediction shape

See sample_predictions.json for a worked example. Required fields per prediction: scenario_id, finding_type, specific_change, primary_tier, action_category. Optional but useful for deeper scoring: secondary_tier, reasoning, evidence, projected_state, cost_impact, risk_assessment.
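
As an illustration of the required fields, the sketch below builds one minimal prediction for scenario 06 and reports any missing keys. The helper missing_required is hypothetical, and treating a key with a null value as present is an assumption. How predictions are wrapped in the file (list, object keyed by scenario, or other) is not specified here, so follow sample_predictions.json for the container shape.

```python
import json

# Required field names copied from the paragraph above.
REQUIRED_FIELDS = ("scenario_id", "finding_type", "specific_change",
                   "primary_tier", "action_category")


def missing_required(prediction):
    """Return required keys absent from a prediction (null values count as present)."""
    return [field for field in REQUIRED_FIELDS if field not in prediction]


# Minimal prediction for the no-action scenario; None serializes to JSON null.
prediction = {
    "scenario_id": "06",
    "finding_type": "no_issue_found",
    "specific_change": "No changes recommended.",
    "primary_tier": None,
    "action_category": None,
}

print(missing_required(prediction))  # prints []
print(json.dumps(prediction, indent=2))
```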

How to score beyond the Floor check

The dataset ships gold answers and a Floor sanity check. It does not ship a quality scorer. Beyond the Floor check, the scoring method is up to you. Common options:

Exact match on the enum fields (finding_type, primary_tier, action_category).
Keyword or substring checks on specific_change.
Semantic similarity on the prose fields.
A custom rubric per scenario, comparing prediction fields against the matching handcrafted_recommendation.json.
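
The first two options can be sketched in a few lines. Both functions are hypothetical examples rather than a scorer shipped with the dataset, and the keyword list is something you would choose per scenario.

```python
ENUM_FIELDS = ("finding_type", "primary_tier", "action_category")


def enum_exact_match(prediction, gold):
    """Fraction of enum fields where prediction and gold agree.

    A field missing from both sides compares as None == None and counts
    as a match, which suits gold answers that use null.
    """
    matches = sum(prediction.get(f) == gold.get(f) for f in ENUM_FIELDS)
    return matches / len(ENUM_FIELDS)


def keyword_recall(prediction, keywords):
    """Fraction of keywords found (case-insensitive) in specific_change."""
    if not keywords:
        return 0.0
    text = (prediction.get("specific_change") or "").lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# Gold enum values for scenario 01, compared with a partly wrong prediction.
gold = {"finding_type": "issue_found", "primary_tier": "compute",
        "action_category": "rightsizing"}
pred = {"finding_type": "issue_found", "primary_tier": "compute",
        "action_category": "scaling_policy_change",
        "specific_change": "Switch to t3.medium instances"}

print(round(enum_exact_match(pred, gold), 2))        # prints 0.67
print(keyword_recall(pred, ["t3.medium", "count"]))  # prints 0.5
```

Substring checks are brittle against paraphrase, so they work best on identifiers such as instance types and Terraform attribute names rather than on free prose.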

Intended uses

Train or fine-tune a model that maps cloud telemetry to an optimization recommendation.
Evaluate AI agents on cloud-optimization reasoning.
Compare single-shot vs orchestrated agent designs.

License

MIT. See LICENSE.

Citation

@misc{synthesized_cloud_optimization_recommendations_2026,
  title = {Synthesized Cloud-Optimization Recommendations},
  author = {Alexander Meau},
  year = {2026},
  version = {1.0.0}
}
