Dataset viewer preview (scenarios_summary.jsonl)

Each record below lists its fields one per line, in this column order. Long values are truncated with "...".

Column                      Type     Range or cardinality
scenario_id                 string   length 2
scenario_name               string   length 18 to 38
scenario_type               string   7 classes
what_this_demonstrates      string   length 177 to 343
finding_type                string   3 classes
primary_tier                string   5 classes
secondary_tier              string   3 classes
action_category             string   6 classes
specific_change             string   length 288 to 1.06k
savings_monthly_usd         float64  -3,000 to 10.4k
current_monthly_usd         float64  1.9k to 18.4k
projected_monthly_usd       float64  1.68k to 10.8k
sla_availability_preserved  bool     1 class

01
Chronic Underutilization
single_tier_negative
A compute fleet that has been systematically over-provisioned: utilization is far below healthy across all 14 days with no spikes, no business-hours variation, and no SLA pressure. The right action is straightforward rightsizing.
issue_found
compute
null
rightsizing
Right-size the compute fleet from t3.large × 8 to t3.medium × 4 (fixed, no scaling). Update the aws_launch_template instance_type from "t3.large" to "t3.medium" and reduce aws_instance count from 8 to 4. This yields ~62% monthly compute cost reduction (~$2,850/month savings) with no SLA impact.
2,850
4,600
1,750
true
02
Spiky Compute Load
single_tier_negative
A fleet that sits at a healthy baseline most of the time but is catastrophically under-provisioned during two predictable daily spike windows on weekdays. The fleet is fixed-capacity; scheduled scaling would absorb the spikes without affecting baseline cost much.
issue_found
compute
null
scaling_policy_change
Replace the fixed 6× m5.large fleet with a scheduled auto-scaling group: maintain a baseline of 3× m5.large during off-peak hours (weekdays 20:00–09:00 UTC and weekends all day), scale to 6× m5.large at 09:00 UTC on weekdays, and burst to 9× m5.large during the two predictable spike windows (weekdays 09:45–11:15 UTC an...
2,496
5,200
2,704
true
03
Over-provisioned Database
single_tier_negative
A database that has been sized for a workload it doesn't have. CPU, connections, and IO wait are all flat low across all 14 days. The fix is a one-step instance-class downgrade.
issue_found
database
null
rightsizing
Downsize the primary RDS instance from db.r6g.xlarge (4 vCPU, 32 GiB RAM) to db.r6g.large (2 vCPU, 16 GiB RAM) and enable storage auto-scaling on the 200 GB volume. This halves the database compute cost, saving approximately $1,680/month (50% reduction). Update the Terraform resource aws_db_instance.database_primary by...
1,680
3,360
1,680
true
04
Database Connection Bottleneck
single_tier_negative
A database under pressure from both a slow-query problem and a connection pool that's exhausted during business hours. Symptoms: connection counts push above pool limits during business hours, db_query_p95_latency_ms spikes during the same windows.
issue_found
database
null
query_cache_optimization
Optimize the top 5 slowest queries: (1) add index on users(email) for the email-lookup query (820ms p95), (2) add index on sessions(token, expires_at) for the session-token query (640ms p95), (3) add composite index on profiles(user_id) and ensure covering columns for the users-profiles join query (510ms p95), (4) add ...
-2,300
2,100
4,400
true
05
Load Balancer Inefficiency
single_tier_negative
A fleet behind an Application Load Balancer using round-robin distribution. Cluster-level aggregates show a wide p50-to-p95 spread because some instances are persistently overloaded while others are persistently idle. The fix is a load-balancing algorithm change, not a capacity change.
issue_found
network
compute
load_balancer_reconfiguration
Reconfigure the ALB target group (app05-tg) load-balancing algorithm from round_robin to least_outstanding_requests. No capacity change is needed — the existing 8 × c5.xlarge fleet is sufficient. In Terraform, set `load_balancing_algorithm_type = "least_outstanding_requests"` on the `aws_lb_target_group.main` resource.
0
3,800
3,800
true
06
Healthy Application
healthy
An application where every tier is correctly sized, all metrics stay in healthy bands across all 14 days, no SLA pressure, no patterns of concern. The correct recommendation is no action.
no_issue_found
null
null
null
No changes recommended. Each tier was evaluated against its healthy operating bands and SLA targets (99.5% availability, P95 < 500ms). Compute CPU p95 peaks at 78% with comfortable headroom, memory stays under 72%, application P95 latency at 194.6ms max is well within the 500ms target. Database query P95 latency maxes ...
null
null
null
null
07
Cache Miss Cascade
cross_tier_negative
A cache tier that degrades (hit ratio drops below the healthy band) cascades into elevated database load and elevated application latency, even though the database and compute tiers themselves are correctly sized. The fix is in the cache layer, not in the downstream tiers — a sophisticated recommender must see past the...
issue_found
cache
database
cache_capacity_adjustment
Scale the Redis cache cluster from 3 to 6 cache.r6g.large nodes to relieve memory pressure (currently 88–95% used) and reduce evictions. Implement cache warming logic targeting the three hottest key patterns—rec:user:* (27% miss rate), rec:trending:* (30% miss rate), and rec:similar:* (38% miss rate)—and redesign these...
-700
5,800
6,500
true
08
Database Bottleneck Impact
cross_tier_negative
A database with slow queries during business hours that cascade into elevated application latency on the compute tier. Compute itself is correctly sized; the problem is downstream.
issue_found
database
compute
query_cache_optimization
Optimize the top 6 slowest SQL queries and add 2 read replicas with read/write splitting. Specifically: (1) Add a composite index (user_id, id) on carts and (cart_id) on cart_items for the carts-by-user query (p95 820ms, 6.05M calls); (2) Add a composite index (warehouse_id, product_id) on inventory for the inventory-w...
-2,400
6,400
8,800
true
09
Peak Hours Cost vs Reliability
cross_tier_negative
A high-criticality e-commerce platform with a clear bimodal weekday pattern: heavy use during peak hours, very light use off-peak and on weekends. Currently provisioned at peak capacity 24/7. Scheduled scaling would dramatically reduce off-peak cost without affecting peak SLA.
issue_found
compute
database
scaling_policy_change
Replace the fixed 20× m5.xlarge compute fleet with scheduled Auto Scaling: maintain 20 instances during peak windows (weekdays 09:00–11:00 UTC and 14:00–16:00 UTC), scale down to 7 instances (65% reduction) during off-peak hours (weekdays 16:00–09:00 UTC and all day weekends 00:00–23:59 UTC). Similarly, stop one databa...
10,400
18,400
8,000
true
10
Network Latency Impact
cross_tier_negative
A payment service whose external-provider integration uses basic cross-region VPC peering. Network latency to the provider spikes during business hours, cascading into elevated compute-tier application latency. Compute itself is correctly sized; the bottleneck is at the network boundary.
issue_found
network
compute
network_topology_change
Replace basic VPC peering for the payment-provider integration with AWS PrivateLink to eliminate cross-region latency spikes during business hours (weekdays 09:00–18:00 UTC). Add application-level retries with exponential backoff (initial 100ms, max 3 retries, 2× multiplier, jitter) to the payment-provider client. Do n...
-150
4,100
4,250
true
11
Multi-Tier Over-provisioning
cross_tier_negative
An internal analytics platform that has been globally over-provisioned — all three tiers (compute, database, network) sit at low utilization across all 14 days with no spikes. The right action is comprehensive rightsizing across all tiers, not a single-tier adjustment.
issue_found
compute
database
rightsizing
Right-size compute from m5.2xlarge × 12 to m5.large × 6 and database from db.r6g.4xlarge to db.r6g.xlarge. Retain the existing ALB and network tier unchanged, as network throughput and latency are healthy. This coordinated multi-tier downsize yields ~$4,200/month savings while preserving the 99.7% SLA.
4,200
11,800
7,600
true
12
Healthy Compute, Problematic Database
mixed
A user profile service where compute is correctly sized and operating in healthy ranges, but the database is significantly over-provisioned and operating at very low utilization. The right action is downsize the database only — compute and read replicas should not change.
issue_found
database
null
rightsizing
Downsize the RDS primary instance from db.r6g.2xlarge to db.r6g.large. Leave compute tier (m5.large × 6 with target-tracking ASG min=6/max=10) completely unchanged — compute is correctly sized and operating in healthy ranges. Update the Terraform aws_db_instance.database_primary instance_class from "db.r6g.2xlarge" to ...
1,400
7,200
5,800
true
13
Compute Spike + Database Strain
cross_tier_negative
A search service where weekday peak-hour compute spikes drive database connection counts roughly 3x above baseline, exhausting the connection pool. Both tiers need attention: compute needs predictive scaling, database needs replicas and a larger pool.
issue_found
compute
database
scaling_policy_change
Replace step scaling with predictive auto-scaling on the compute ASG (trigger scale-out at cpu_p95 > 65%, raise max_size from 12 to 14), add 2 read replicas (db.r6g.xlarge) with read/write splitting to the database tier, and increase the connection pool from 150 to 300. Schedule predictive scaling to pre-warm capacity ...
-3,000
7,800
10,800
true
14
Good Performance, High Cost
mixed
A checkout flow whose latency and error metrics are excellent — well inside SLA — but utilization is far below the healthy band on both compute and database. The system is over-provisioned to deliver performance that exceeds requirements. Right action is to rightsize while preserving the SLA buffer.
issue_found
compute
database
rightsizing
Right-size compute from m5.2xlarge × 12 to m5.large × 8 (reducing from 96 vCPUs / 384 GiB to 16 vCPUs / 64 GiB total) and downsize the database primary and replicas from db.r6g.4xlarge to db.r6g.xlarge. Retain both read replicas and the PrivateLink/ALB network tier unchanged to preserve the reliability posture for this...
5,500
14,400
8,900
true
15
Reliability Focused Over-provisioning
mixed
A payment-processing platform configured for 99.99% SLA via heavy over-provisioning across all tiers and multi-AZ redundancy. Achieves near-zero error rate at substantial cost. The right next step is a business-context question, not a technical rightsizing.
diagnostic_deferral
deferred
deferred
null
Before making any rightsizing changes, confirm with the business whether the 99.99% availability SLA target is contractually required or aspirational. The current infrastructure is heavily over-provisioned (CPU p95 at 34%, memory p95 at 42%, DB connections p95 at 60) to achieve near-zero error rates and comfortable lat...
null
null
null
null
16
Partial Optimization
single_tier_mild_negative
A reporting dashboard whose compute tier shows mild under-utilization (slightly below the healthy band) while database and network sit in healthy ranges. The right action is a targeted single-step compute adjustment — not aggressive multi-tier rightsizing.
issue_found
compute
null
rightsizing
Reduce the compute fleet from 4 × m5.large to 3 × m5.large by setting count = 3 in the aws_instance.compute resource. Keep the instance class unchanged (m5.large). Do not modify the database (db.r6g.large) or network (ALB with least_outstanding_requests) tiers, which are correctly sized.
320
2,400
2,080
true
17
Cross-Tier Performance Degradation
diagnostic_deferral
A core API platform whose latency rises simultaneously across all three tiers during peak hours, with no clear lead-lag relationship. CPU and connection counts are within normal ranges on all tiers — the problem is latency-distributed, not capacity-driven. Root cause is ambiguous from the observable signals alone.
diagnostic_deferral
deferred
deferred
null
Defer any scaling, rightsizing, or infrastructure changes until a full end-to-end distributed trace analysis is deployed and analyzed across the compute, database, and network tiers. The simultaneous latency rise across all three tiers with zero lead-lag (Pearson coefficients 0.963 and 0.975 at lag 0 minutes) indicates...
null
null
null
null
18
Mostly Healthy with Minor Inefficiency
mostly_healthy
An internal tool that is mostly correctly sized. Compute shows slightly low utilization (below the ideal but not below the threshold that justifies aggressive action). All other tiers are in healthy ranges. The right action is a minor compute adjustment, no other changes.
issue_found
compute
null
rightsizing
Reduce the compute fleet from t3.medium × 5 to t3.medium × 4 by updating `aws_instance.compute` count from 5 to 4 in main.tf. This is a minor refinement — the system is mostly well-optimized and only this single adjustment is warranted. No changes to database, cache, or network tiers are needed.
110
1,900
1,790
true

Synthesized Cloud-Optimization Recommendations

18 scenarios that pair cloud telemetry with a hand-crafted optimization recommendation. Use them to train models or to evaluate AI agents.

Summary

Each scenario has multi-tier telemetry, a Terraform file describing the deployed infrastructure, and a gold-standard recommendation.

The dataset is built around a simple input-output mapping. The input is telemetry plus the infrastructure. The output is an optimization recommendation that says what to change and what the impact will be.

The dataset is synthesized. Telemetry was generated procedurally to match each scenario's narrative. Gold recommendations were hand-crafted and verified.

The dataset uses AWS vocabulary throughout. Instance types, service names, and field names match AWS. This makes the scenarios concrete instead of vendor-neutral.

Folder layout

README.md                                # this file
LICENSE                                  # MIT
EVAL.md                                  # what eval.py checks
eval.py                                  # Floor sanity check (smoke test)
sample_predictions.json                  # worked example of submission shape
scenarios_summary.jsonl                  # one row per scenario (viewer table)
scenarios/
  01/
    metadata.json                        # scenario summary + fixtures
    main.tf                              # Terraform for the infra
    compute_telemetry.json               # CPU, memory, latency
    database_telemetry.json              # query rates, pool stats
    cache_telemetry.json                 # hit rate, eviction
    network_telemetry.json               # bandwidth, packet loss
    correlation_evidence.json            # cross-tier correlations
    handcrafted_recommendation.json      # the gold answer
  02/
    ...

Each scenario covers a different optimization situation. Some are single-tier (only compute is wrong). Some span tiers (a database problem that surfaces in compute). One is a no-action case (06). Two are diagnostic-deferral cases (15 and 17); of these, scenario 15 asks for an SLA review instead of an infra change.

The summary table (scenarios_summary.jsonl)

The Hugging Face Dataset Viewer renders scenarios_summary.jsonl as a browsable table. Each row is one scenario and includes the headline fields from that scenario's metadata and gold recommendation.

The summary is for discovery only. The full inputs (telemetry, Terraform, correlation evidence) live in scenarios/NN/. Always train or evaluate on the full files, not on the summary.
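
As a rough sketch of using the summary for discovery, the snippet below parses JSON Lines text into dictionaries and filters rows by one column. The helper names `parse_jsonl` and `filter_rows` are illustrative and not part of the dataset. The two sample rows carry only a few columns, with values copied from the viewer table.

```python
import json
from typing import Any


def parse_jsonl(text: str) -> list[dict[str, Any]]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def filter_rows(rows: list[dict[str, Any]], column: str, value: Any) -> list[dict[str, Any]]:
    """Return the rows whose `column` equals `value`."""
    return [row for row in rows if row.get(column) == value]


# Two abbreviated rows (scenarios 03 and 05), standing in for the real file.
sample = "\n".join([
    json.dumps({"scenario_id": "03", "primary_tier": "database",
                "action_category": "rightsizing"}),
    json.dumps({"scenario_id": "05", "primary_tier": "network",
                "action_category": "load_balancer_reconfiguration"}),
])

rows = parse_jsonl(sample)
database_rows = filter_rows(rows, "primary_tier", "database")
print([row["scenario_id"] for row in database_rows])
```

To read the shipped file instead of the sample, pass the contents of scenarios_summary.jsonl to `parse_jsonl`.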

Columns in the summary table:

Column                  Source
scenario_id             folder name
scenario_name           metadata.scenario_name
scenario_type           metadata.scenario_type
what_this_demonstrates  metadata.narrative.what_this_demonstrates
finding_type            gold.finding_type
primary_tier            gold.primary_tier
secondary_tier          gold.secondary_tier
action_category         gold.action_category
specific_change         gold.specific_change
savings_monthly_usd     gold.cost_impact.savings_monthly_usd
current_monthly_usd     gold.cost_impact.current_monthly_usd
projected_monthly_usd   gold.cost_impact.projected_monthly_usd

Some scenarios have negative savings_monthly_usd. That is expected. For those scenarios the right action increases cost to fix a performance or reliability problem (for example, adding a read replica).
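
In every viewer row that has cost figures, savings_monthly_usd equals current_monthly_usd minus projected_monthly_usd. The sketch below checks that arithmetic for three rows copied from the table. Treating it as a guaranteed invariant of the dataset is an assumption, and `check_savings` is an illustrative name.

```python
def check_savings(current: float, projected: float, savings: float, tol: float = 0.01) -> bool:
    """True when savings matches current minus projected, within a small tolerance."""
    return abs((current - projected) - savings) <= tol


# (current_monthly_usd, projected_monthly_usd, savings_monthly_usd) from the viewer table.
rows = {
    "01": (4600.0, 1750.0, 2850.0),    # rightsizing lowers cost
    "04": (2100.0, 4400.0, -2300.0),   # performance fix raises cost
    "13": (7800.0, 10800.0, -3000.0),  # scaling plus replicas raises cost
}

results = {sid: check_savings(*values) for sid, values in rows.items()}
cost_increasing = sorted(sid for sid, (_, _, savings) in rows.items() if savings < 0)

print(results)
print(cost_increasing)
```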

Schema

Scenario inputs

Each scenarios/NN/ folder has these files.

File                             What it is
metadata.json                    scenario name, narrative, fixtures
main.tf                          Terraform for the deployed infra
compute_telemetry.json           per-window CPU, memory, latency
database_telemetry.json          per-window DB query rate, pool, slow queries
cache_telemetry.json             per-window hit rate, evictions
network_telemetry.json           per-window bandwidth, packet loss
correlation_evidence.json        cross-tier correlation pairs
handcrafted_recommendation.json  the gold answer
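
A minimal sketch of loading one scenario folder into an (inputs, gold) pair, using the file names listed above. The function name `load_scenario` and the choice to skip input files that are absent are assumptions for illustration; the internal structure of each JSON file is not assumed.

```python
import json
from pathlib import Path
from typing import Any

INPUT_JSON_FILES = [
    "metadata.json",
    "compute_telemetry.json",
    "database_telemetry.json",
    "cache_telemetry.json",
    "network_telemetry.json",
    "correlation_evidence.json",
]
GOLD_FILE = "handcrafted_recommendation.json"


def load_scenario(folder: "str | Path") -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (inputs, gold) for one scenarios/NN/ folder."""
    folder = Path(folder)
    inputs: dict[str, Any] = {}
    for name in INPUT_JSON_FILES:
        path = folder / name
        if path.exists():  # assumption: tolerate a missing input file
            inputs[name] = json.loads(path.read_text(encoding="utf-8"))
    terraform = folder / "main.tf"
    if terraform.exists():
        inputs["main.tf"] = terraform.read_text(encoding="utf-8")  # kept as raw text
    gold = json.loads((folder / GOLD_FILE).read_text(encoding="utf-8"))
    return inputs, gold
```

For example, `inputs, gold = load_scenario("scenarios/01")` keeps the gold answer separate from the inputs so it is not fed to a model by accident.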

Recommendation shape

{
  "scenario_id": "01",
  "finding_type": "issue_found",
  "specific_change": "...",
  "primary_tier": "compute",
  "secondary_tier": null,
  "action_category": "rightsizing",
  "conclusion": { ... },
  "evidence": {
    "telemetry_observations": [ ... ],
    "infrastructure_context": [ ... ],
    "correlation_observations": [ ... ]
  },
  "reasoning": "...",
  "projected_state": { ... },
  "cost_impact": { ... },
  "risk_assessment": { ... }
}

Allowed values

  • finding_type: issue_found, no_issue_found, diagnostic_deferral, insufficient_data
  • primary_tier: compute, database, cache, network, deferred, or null
  • secondary_tier: same set as primary_tier
  • action_category: rightsizing, scaling_policy_change, query_cache_optimization, cache_capacity_adjustment, pool_sizing, replica_adjustment, load_balancer_reconfiguration, network_topology_change, sla_review, or null

The deferred tier sentinel is used in diagnostic-deferral scenarios where the agent explicitly cannot pick a tier yet (scenarios 15 and 17). insufficient_data is reserved for future scenarios where the dataset is too sparse to support any finding; no current scenario uses it.
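
The allowed values above can be checked mechanically. The following is a minimal sketch with an illustrative helper, `enum_errors`; it is not the logic of eval.py, which may differ. JSON null is represented as Python `None`.

```python
# Allowed values copied from the lists above; None stands for JSON null.
FINDING_TYPES = ("issue_found", "no_issue_found", "diagnostic_deferral", "insufficient_data")
TIERS = ("compute", "database", "cache", "network", "deferred", None)
ACTION_CATEGORIES = (
    "rightsizing", "scaling_policy_change", "query_cache_optimization",
    "cache_capacity_adjustment", "pool_sizing", "replica_adjustment",
    "load_balancer_reconfiguration", "network_topology_change", "sla_review", None,
)

CHECKS = (
    ("finding_type", FINDING_TYPES),
    ("primary_tier", TIERS),
    ("secondary_tier", TIERS),
    ("action_category", ACTION_CATEGORIES),
)


def enum_errors(recommendation: dict) -> list[str]:
    """Return one message per field whose value is outside its allowed set."""
    errors = []
    for field, allowed in CHECKS:
        value = recommendation.get(field)  # a missing key is treated as null
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


# Enum values of scenario 15, as shown in the viewer table.
deferral = {"finding_type": "diagnostic_deferral", "primary_tier": "deferred",
            "secondary_tier": "deferred", "action_category": None}
# A hypothetical prediction with an unknown tier.
bad = {"finding_type": "issue_found", "primary_tier": "storage",
       "secondary_tier": None, "action_category": "rightsizing"}

print(enum_errors(deferral))
print(enum_errors(bad))
```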

Scenario coverage

ID  Type            Description
01  single-tier     compute over-provisioned
02  single-tier     compute peak windows, needs scheduled scaling
03  single-tier     database over-provisioned
04  single-tier     slow queries plus exhausted pool
05  single-tier     ALB round-robin causing uneven CPU
06  no-action       all tiers healthy
07  cross-tier      cache hit ratio degraded, cascading to database load
08  cross-tier      slow DB queries cascade to compute
09  cross-tier      weekday bimodal peaks, needs scheduled scaling
10  cross-tier      network latency cascades to compute
11  cross-tier      all three tiers over-provisioned
12  mixed           healthy compute, over-provisioned database
13  cross-tier      compute spike strains database
14  mixed           compute and database both over-provisioned
15  reliability     99.99% SLA via over-provisioning
16  mild            partial compute optimization
17  deferral        all tiers rise in lockstep, need more diagnosis
18  mostly healthy  minor compute inefficiency

How to use it

You can use this dataset two ways.

Train or fine-tune. Treat each scenario's telemetry, main.tf, and metadata as input. Use handcrafted_recommendation.json as the target output.

Evaluate AI agents. Run your agent on the scenario inputs. Compare its output to the hand-crafted recommendation in that scenario's folder.

Quick sanity check

python eval.py --predictions sample_predictions.json

This runs the bundled Floor sanity check. It confirms your predictions parse, have the required fields, and use allowed category values. It does NOT score recommendation quality. See EVAL.md for what is checked.

Prediction shape

See sample_predictions.json for a worked example. Required fields per prediction: scenario_id, finding_type, specific_change, primary_tier, action_category. Optional but useful for deeper scoring: secondary_tier, reasoning, evidence, projected_state, cost_impact, risk_assessment.
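
A small sketch of checking one prediction for the required fields named above. Because primary_tier and action_category may legitimately be null, the check tests for key presence rather than truthiness. Whether eval.py accepts present-but-null values in the same way is an assumption, and `missing_required` is an illustrative name.

```python
REQUIRED_FIELDS = (
    "scenario_id", "finding_type", "specific_change", "primary_tier", "action_category",
)


def missing_required(prediction: dict) -> list[str]:
    """Return the required field names that are absent from the prediction."""
    return [field for field in REQUIRED_FIELDS if field not in prediction]


# A no-action prediction in the style of scenario 06: null tier and category are present.
no_action = {
    "scenario_id": "06",
    "finding_type": "no_issue_found",
    "specific_change": "No changes recommended.",
    "primary_tier": None,
    "action_category": None,
}
# A hypothetical incomplete prediction.
incomplete = {"scenario_id": "01", "finding_type": "issue_found"}

print(missing_required(no_action))
print(missing_required(incomplete))
```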

How to score beyond the Floor check

The dataset ships gold answers and a Floor sanity check. It does not ship a quality scorer. Beyond the Floor check, the scoring method is up to you. Common options:

  • Exact match on the enum fields (finding_type, primary_tier, action_category).
  • Keyword or substring checks on specific_change.
  • Semantic similarity on the prose fields.
  • A custom rubric per scenario, comparing prediction fields against the matching handcrafted_recommendation.json.
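
The first two options can be sketched in a few lines. This is one possible scorer under stated assumptions, not something the dataset ships: the function names are illustrative, the gold enum values are copied from the viewer rows for scenarios 01 and 05, and the predictions are made up for the example.

```python
ENUM_FIELDS = ("finding_type", "primary_tier", "action_category")


def enum_exact_match(prediction: dict, gold: dict) -> dict[str, bool]:
    """Per-field exact match on the enum fields."""
    return {field: prediction.get(field) == gold.get(field) for field in ENUM_FIELDS}


def keyword_hits(text: str, keywords: list[str]) -> list[str]:
    """Keywords that occur in the text, compared case-insensitively."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword.lower() in lowered]


def score_enum_fields(predictions: list[dict], golds: list[dict]) -> float:
    """Mean fraction of enum fields matched, over scenario ids present in both lists."""
    gold_by_id = {gold["scenario_id"]: gold for gold in golds}
    fractions = []
    for prediction in predictions:
        gold = gold_by_id.get(prediction.get("scenario_id"))
        if gold is None:
            continue  # no gold answer for this id
        matches = enum_exact_match(prediction, gold)
        fractions.append(sum(matches.values()) / len(ENUM_FIELDS))
    return sum(fractions) / len(fractions) if fractions else 0.0


golds = [
    {"scenario_id": "01", "finding_type": "issue_found",
     "primary_tier": "compute", "action_category": "rightsizing"},
    {"scenario_id": "05", "finding_type": "issue_found",
     "primary_tier": "network", "action_category": "load_balancer_reconfiguration"},
]
predictions = [
    {"scenario_id": "01", "finding_type": "issue_found",
     "primary_tier": "compute", "action_category": "rightsizing"},
    {"scenario_id": "05", "finding_type": "issue_found",
     "primary_tier": "compute", "action_category": "rightsizing",
     "specific_change": "Switch the ALB target group to least_outstanding_requests."},
]

score = score_enum_fields(predictions, golds)
hits = keyword_hits(predictions[1]["specific_change"],
                    ["least_outstanding_requests", "round_robin"])
print(score, hits)
```

Exact match alone can be misleading: in this example the scenario 05 prediction names the right algorithm in specific_change yet misses two of the three enum fields, which is why combining several of the options above may be useful.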

Intended uses

  • Train or fine-tune a model that maps cloud telemetry to an optimization recommendation.
  • Evaluate AI agents on cloud-optimization reasoning.
  • Compare single-shot vs orchestrated agent designs.

License

MIT. See LICENSE.

Citation

@misc{synthesized_cloud_optimization_recommendations_2026,
  title = {Synthesized Cloud-Optimization Recommendations},
  author = {Alexander Meau},
  year = {2026},
  version = {1.0.0}
}