Dataset: neurips26-PSML/SIIB-Time


1- Scope

The increasing penetration of inverter-based resources (IBRs), e.g., renewable generation and energy storage systems, is fundamentally reshaping power grid dynamics. Unlike conventional resources, IBRs interact with the grid through power electronics operating at microsecond timescales, introducing ultrafast dynamic phenomena that conventional time-domain simulation methods, e.g., RMS techniques, fail to capture [1]. Electromagnetic transient (EMT) simulations can capture these fast dynamics but require integration time steps of 1–50 microseconds, making system-wide studies computationally intractable. This creates a critical bottleneck for stability analysis, contingency planning, and control design in modern power systems, as time-domain simulation has been a fundamental tool for analyzing system stability and dynamic performance [2]. Recent grid incidents, such as the April 2025 blackout in Spain and Portugal, underscore these limitations and the need for scalable analysis tools. Overcoming this computational barrier is crucial to the stable integration of renewable energy sources as mandated by climate change mitigation policies [3]. In response, researchers have in recent years shifted their focus toward machine learning (ML)-based surrogate modeling for power system time-domain simulation. This dataset was created to support such efforts.

Several datasets have been proposed to study the time-domain dynamics of the grid, including those that are released as is [4-8], released through scripts that regenerate the dataset [9, 10], or announced for future release by their authors [11]. Several other datasets have been proposed, but neither the data nor the generating scripts have been released [12, 13]. The available datasets are based on a diverse set of grids, including: the IEEE 9-bus system with 3 synchronous generators; the IEEE 36-bus system including several IBRs [19]; the New York–New England power grid model [18]; Kundur’s two-area system [9]; a detailed 4th-order synchronous machine connected to a bus with varying voltage [10]; an inverter-based microgrid digital twin [11]; and a set of different models for IBRs [14]. Of particular interest to our problem setting is [20], which generates trajectories using a temporal resolution sampled from the range [1, 40] ms, but uses only RMS-based simulation and does not include IBRs. All other datasets release data at a fixed temporal resolution. Reference [21] produces data in both the EMT and RMS regimes, but the released data contains only a few trajectories and is not designed with machine learning applications in mind. Finally, several simulation platforms have been proposed that enable the joint production of RMS and EMT simulation trajectories [15, 16], but no specific datasets have been released from these platforms.

This dataset is specifically designed for the problem of simulation time step-invariance, wherein a model trained on coarse-resolution data can generalize to fine-resolution dynamics without retraining. This is a nascent research direction with only a handful of publications to date, and no publicly available dataset exists that provides paired EMT and RMS simulation trajectories of an inverter-based system under both grid-forming (GFM) and grid-following (GFL) control modes across a large and diverse set of operational scenarios. This dataset fills that gap by providing: (1) high-fidelity EMT trajectories from PSCAD alongside RMS trajectories from MATLAB Simulink for the same scenarios, enabling cross-domain resolution studies; (2) coverage of both GFM and GFL control modes; and (3) 3,000 distinct operational scenarios spanning a wide range of disturbance types and initial conditions, making it suitable for training and benchmarking data-driven surrogate models.

Reuse of this dataset is most naturally suited to researchers working at the intersection of power systems and machine learning on ML-based surrogate modeling, using approaches such as neural operators and physics-informed learning, with power systems as the sole application domain. Researchers outside the power systems field may find the paired coarse/fine-resolution structure valuable as a benchmark for resolution-invariant time-domain simulation of physical systems. However, users should be aware that the system studied, a single inverter infinite bus (SIIB), is a canonical but simplified test case. Conclusions drawn from models trained on this dataset may not directly generalize to large-scale multi-machine or multi-inverter systems without further validation. The validity of the data for reuse is unlikely to be affected by changes in the social, political, or historical context. That said, this dataset supports research relevant to the energy transition, so its uptake may vary across regions and over time as social and political support for energy transition mandates changes.

The synthetic dataset creation plan was driven by three requirements derived from the formulation as follows:

Paired resolution: To study time step-invariance, the dataset must contain trajectories of the same scenarios simulated at fundamentally different resolutions. EMT simulation in PSCAD captures fast switching and electromagnetic dynamics at 50 microsecond resolution; RMS simulation in MATLAB Simulink operates at 1 millisecond resolution capturing electromechanical dynamics. Pairing these for identical scenarios enables direct study of cross-domain generalization.
Control mode coverage: Both GFM and GFL inverter control modes are included because they represent qualitatively different dynamic behaviors; GFM converters regulate voltage and frequency autonomously while GFL converters synchronize to an existing grid [17], and surrogate models must be evaluated across both.
Scenario diversity: To support generalizable learning, 3,000 distinct operational scenarios were generated by stochastically sampling initial conditions (power references, grid impedance, grid voltage) and disturbance types (load changes and short-circuit faults), spanning both stable and unstable system responses.

The choice of the SIIB as the test system was deliberate; it is the canonical reduced-order representation of an IBR connected to a stiff grid, sufficiently rich to exhibit practical stability phenomena while simple enough to generate large simulation datasets at tractable computational cost. That said, all data are generated from simulation models rather than real-world measurements, so results depend on model fidelity and parameter assumptions. One intrinsic bias is worth noting: the choice of disturbances, parameter ranges, and operating conditions defines the distribution of system behaviors represented in the dataset, which does not fully cover all real-world scenarios. Despite these limitations, the dataset is intentionally designed to provide a controlled, diverse, and physically grounded benchmark for developing and evaluating ML methods for power system time-domain simulation.

2- Ethicality and Reflexivity

This dataset contains no data about human subjects. All data is generated entirely from physics-based simulation environments, i.e., PSCAD and MATLAB Simulink, under controlled operational scenarios. Accordingly, informed consent is not applicable.

The primary benefit of this dataset is the acceleration of research on computationally tractable power systems simulation and analysis tools. The inability to efficiently simulate IBR-dominated grids at high resolution is a practical barrier to the stable integration of renewable energy resources. Datasets that enable surrogate modeling research support the development of tools that grid operators, planners, and researchers need to manage the energy transition safely. This dataset is made publicly available without restriction to maximize the benefits to the research community. The potential harms of releasing this dataset are minimal. The SIIB test system is a canonical academic test case with no direct correspondence to any real grid infrastructure. No proprietary control parameters, real network topology, or operational data from any utility or grid operator is included. The dataset does not contain information that could be used to identify vulnerabilities in real infrastructure, facilitate cyberattacks, or compromise grid security. Furthermore, the dataset explicitly focuses on a simplified system (SIIB), and this limitation is documented to discourage inappropriate use in more complex settings.

An alternative approach to constructing this dataset would have been to collect real operational measurement data from grid-connected inverters using, e.g., synchro-waveform recordings [18]. This approach was not pursued for two main reasons. First, real-world data is unnecessary for the scope of the dataset, and it is nearly impossible to obtain real-world data that provides both the paired EMT and RMS representation of the system and the required diversity of operational scenarios. Inverter control parameters are typically proprietary to manufacturers and are not disclosed, making it impossible to construct the well-characterized scenarios that surrogate model training requires. Moreover, and critically for the intended ML application, real-world grid operation rarely produces trajectories that are unstable or near the stability boundary. Second, even if such well-characterized data could be obtained from real-world systems, it would be subject to data sharing agreements, confidentiality obligations, and regulatory constraints that would most probably prevent open public release, limiting the dataset's utility to the broader research community. The synthetic simulation approach was therefore chosen for its scientific advantages, i.e., full control over scenario parameters, ground truth availability, and the ability to generate paired EMT/RMS trajectories, as well as its accessibility benefits.

Overall, the benefits of enabling research in scalable and accurate power system simulation are considered to outweigh the potential risks, provided that users are aware of the dataset’s scope and limitations and apply appropriate validation when extending results to real-world systems.

2.1- Domain Knowledge Requirements

The synthetic data is generated based on the model shown in Figure 1. Developing the dataset required expertise spanning multiple technical domains, as follows:

Power system modeling: Expertise in IBR modeling in both the EMT and RMS domains is essential, including the distinction between GFM (switch state 1 in Figure 1) and GFL (switch state 2 in Figure 1) control architectures, the structure and parameterization of inner and outer control loops, PLL controllers, and the design of LCL filters at the inverter terminal. Understanding the SIIB system, including its governing differential-algebraic equations and the conditions under which it exhibits oscillatory or unstable behavior, was a prerequisite for meaningful scenario design.
Power system time domain simulation: Proficiency in EMT simulation using PSCAD was required to construct the high-fidelity EMT model, configure appropriate integration time steps, and implement signal logging. Proficiency in MATLAB Simulink was required to construct the equivalent RMS phasor-domain model and ensure that both simulators represented the same physical system under matched scenario conditions. Cross-validating simulation outputs between the two platforms required the ability to interpret and reconcile differences arising from differences in modelling resolution rather than modelling error.
Data processing: The dataset required a systematic design of a stochastic scenario sampling procedure, structured file organization across thousands of simulation runs, consistent naming and indexing conventions, and the development of automated logging and verification pipelines to ensure that saved outputs correctly correspond to their intended simulation scenarios. Data processing, wrangling, and packaging were performed in Python, including parsing and aligning time-series outputs across simulators, organizing scenario metadata, and preparing the dataset in a structured format.

Users of this dataset are expected to have foundational knowledge of power systems and an applied ML skill set. These are necessary to leverage the dataset for its primary intended purpose of training and benchmarking surrogate models for time-domain simulation. Understanding time-series modeling, sequence-to-sequence learning, and the concept of discretization invariance will be essential for interpreting model behavior along the paired EMT/RMS resolution axis. Moreover, users should understand the physical meaning of the signals recorded, i.e., voltages and currents in the d-q reference frame, active and reactive power injections, and phase angle, as well as the significance of the control mode distinction between GFM and GFL operation [17]. On the data side, users with basic data handling skills in any language should be able to work with the dataset directly, as all signal data is provided in CSV format. Python is not a requirement for access or use.

Figure 1: SIIB physical and control layers

2.2- Positionality

This dataset was created by researchers working at the intersection of ML and power systems engineering. The team brings expertise in power systems modeling, simulation, and ML-based applications, combined with experience in both academic research and industry deployment. This positions us to make informed choices about scenario design, simulation fidelity, and the selection of a canonical test system; it also means that our framing of what constitutes a meaningful and representative dataset is shaped by the conventions and priorities of the power systems engineering community, which may differ from those of researchers approaching this problem from a pure ML or scientific computing perspective. Our work is motivated by the practical challenge of enabling stable integration of IBRs through scalable simulation tools. This motivation reflects a specific orientation toward the energy transition and the computational needs of grid operators and planners. Researchers with different priorities, for example, those focused on social and economic dimensions of power systems, may find the dataset's scope and framing less directly applicable to their needs.

Two distinct field epistemologies are in tension in the design of this dataset. The power systems engineering tradition prioritizes physical fidelity, grounded in first-principles differential-algebraic equation models, validated against known physical phenomena, and evaluated against established test cases. This epistemology is evident in our choice of the SIIB system as the test case, in the use of industry-standard simulation platforms, and in the cross-validation of outputs. The assumption embedded in this tradition is that a well-modeled simplified system is a valid proxy for studying fundamental dynamic phenomena, an assumption that is widely accepted within power systems but deserves explicit acknowledgment. The machine learning tradition, by contrast, prioritizes statistical generalization, benchmark comparability, and scale. This epistemology shaped our decision to generate 3,000 diverse scenarios through stochastic sampling, to provide paired data across two simulation domains to support resolution-invariance studies, and to structure the dataset for straightforward ingestion by standard ML pipelines. Users should be aware that this dataset was designed with ML for surrogate time-domain modeling as the primary downstream task. Design choices that appear neutral (the parameter sampling ranges, the disturbance types included, and the choice of signals logged) reflect judgments made from within this dual epistemological framing. Different choices would have produced a different dataset, and those differences would matter for downstream applications.

2.3- Carbon Footprint

On average, each scenario simulation requires approximately 7 minutes of CPU execution time. Assuming an average computational power draw of 100 W for the host system [19], the energy consumption per simulation run is 7 min × (1 hr / 60 min) × 0.1 kW = 0.0117 kWh per simulation. The data was generated in a region with an emission intensity of 0.47 kg CO₂e/kWh for electricity generation, thus the carbon footprint per simulation run is approximately 0.0117 kWh × 0.47 kg CO₂e/kWh ≈ 5.5 g CO₂e. Across all 3,000 simulation runs, the total estimated carbon footprint is 3,000 × 5.5 g ≈ 16.5 kg CO₂e. For reference, this is roughly equivalent to driving a passenger vehicle approximately 80 kilometers [20], making the dataset generation process environmentally modest relative to the potential research impact of enabling more computationally efficient grid simulation tools.
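The estimate above can be reproduced with a few lines of Python. This is only an illustrative sketch: the constants are the figures quoted in the paragraph, while the function and variable names are hypothetical and not part of the dataset tooling.

```python
# Reproduce the carbon-footprint estimate from the figures quoted in the text.
MINUTES_PER_RUN = 7.0          # average CPU time per scenario (from the text)
POWER_KW = 0.1                 # assumed 100 W host power draw (from the text)
INTENSITY_KG_PER_KWH = 0.47    # regional emission intensity (from the text)
N_RUNS = 3000                  # number of simulation runs (from the text)


def footprint(minutes, power_kw, intensity_kg_per_kwh, n_runs):
    """Return (kWh per run, g CO2e per run, kg CO2e in total)."""
    energy_kwh = minutes / 60.0 * power_kw
    grams_per_run = energy_kwh * intensity_kg_per_kwh * 1000.0
    total_kg = grams_per_run * n_runs / 1000.0
    return energy_kwh, grams_per_run, total_kg


energy, grams, total = footprint(
    MINUTES_PER_RUN, POWER_KW, INTENSITY_KG_PER_KWH, N_RUNS
)
print(f"{energy:.4f} kWh/run, {grams:.2f} g CO2e/run, {total:.2f} kg CO2e total")
```

Without intermediate rounding the total comes to roughly 16.45 kg CO₂e, which agrees with the 16.5 kg figure obtained from the rounded per-run value.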

The dataset creation involves a trade-off between simulation fidelity and computational and environmental costs. EMT simulations were necessary to accurately capture fast inverter dynamics, provide reliable ground truth for machine learning models, and enable cross-resolution learning between EMT and RMS domains. At the same time, the dataset size of 3,000 scenarios was selected to provide sufficient diversity for machine learning applications, while keeping the overall computational footprint within a manageable range.

3- Data Pipeline

All data were produced by running 3,000 distinct operational scenarios across GFM and GFL control modes, each simulated in both the EMT (PSCAD) and RMS phasor (MATLAB/Simulink) domains, yielding matched trajectory pairs for each scenario. Each simulation starts with the system in steady state and runs for 6.5 simulated seconds. A disturbance is applied at t = 0.5 s across the scenarios.

The simulations are grounded in established models of inverter-based systems and standard control strategies. That said, the dataset does not capture the nuances and complexities of real-world power system dynamics, and must be interpreted as a significant approximation of real-world behavior. Moreover, the EMT and phasor-domain models are not perfectly equivalent representations of the same physical system, as they differ in modeling fidelity by design. The EMT model captures electromagnetic dynamics through the explicit representation of the inverter’s DC link and switching stage, and allows three-phase unbalanced operation. The phasor-domain model operates at a coarser resolution, averaging switching behavior and representing the system in the positive sequence. This resolution gap is the primary axis of variation the dataset is designed to study, and the differences between EMT and phasor trajectories for matched scenarios are therefore a feature, not a defect.

3.1- Model Parameters:

All simulation models share a common SIIB system architecture, consisting of the following:

A three-phase upstream grid source:
- Nominal voltage: 3.3 kV LL RMS
- Nominal frequency: 60 Hz
- Grounded Neutral
Three-phase transformer:
- Nominal rating: 5 MVA
- Leakage reactance (EMT): 0.05 pu
- Leakage reactance (RMS): L_1 = 0.03 pu, L_2 = 0.02 pu
- Grid-side winding: Delta
- Load-side winding: Yg
Inverter parameters:
- V_DC = 1.5 kV, C_DC = 3900 µF (EMT).
- Switching frequency: 8000 Hz (EMT).
- The active power reference P_ref (EMT and RMS) and the reactive power reference Q_ref (RMS) are varied across scenarios to sample a wide range of operating points. They are drawn from a uniform distribution over the interval [0.5, 1.7] pu.
Inverter LCL filter:
- L_1f = 60 µH, C_f = 1 mF, L_2f = 35 µH,
- Series damping resistor, R_f = 0.01 Ω
A constant baseline load (R_L = 2 Ω, L_L = 0.01 H).
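As a quick sanity check on the filter values listed above, the undamped LCL resonance frequency can be computed from the standard expression f_res = (1/2π)·sqrt((L_1f + L_2f) / (L_1f · L_2f · C_f)). The sketch below is illustrative only: it ignores the damping resistor R_f and any transformer or grid impedance, and the function name is hypothetical.

```python
import math

# Filter values as listed in the parameter table above.
L1F = 60e-6    # H
L2F = 35e-6    # H
CF = 1e-3      # F
F_SW = 8000.0  # Hz, EMT switching frequency


def lcl_resonance_hz(l1, l2, c):
    """Undamped resonance of an ideal LCL filter (damping and grid impedance ignored)."""
    return math.sqrt((l1 + l2) / (l1 * l2 * c)) / (2.0 * math.pi)


f_res = lcl_resonance_hz(L1F, L2F, CF)
print(f"LCL resonance: {f_res:.0f} Hz, switching frequency: {F_SW:.0f} Hz")
```

Under these simplifying assumptions the resonance lies near 1.07 kHz, well above the 60 Hz fundamental and well below the 8 kHz switching frequency.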

The GFM control architecture implements a cascaded voltage-current control structure with active power droop [17]. The outer voltage control loops (V_d and V_q) use PI controllers (K_p = 14, T_i = 0.0007 s in EMT; K_p = 15, K_i = 1500 in RMS) with anti-windup back-calculation, feed-forward, and saturation blocks that enforce current limits of 2.365 pu. The inner current control loops (I_d and I_q) use PI controllers (K_p = 0.14, T_i = 0.07 s in EMT; K_p = 0.15, K_i = 15 in RMS). Active power droop is implemented with a droop coefficient of 1.5 (EMT) and 1.5×10^-6 (RMS).

The GFL control architecture synchronizes to the grid via a Phase-Locked Loop (PLL; K_p = 90, K_i = 1500, base frequency 60 Hz in EMT). Active and reactive power are controlled independently through separate power control loops that produce d- and q-axis current references, which are then tracked by inner PI current controllers (K_p = 1, T_i = 0.1 s in EMT; K_p = 1, K_i = 10 in phasor domain).

Both modes implement a black-start ramp-up procedure at initialization; the active power reference is gradually increased from zero to its setpoint via an integrator and saturation block to avoid large transient overshoots that would corrupt the early portion of the trajectory data.

The control gains, filter components, and droop coefficients listed above are representative but not universal. Real inverters from different manufacturers implement variations of these structures with proprietary parameterizations, and the dataset does not capture that diversity. Models trained on this dataset will be most directly applicable to inverters whose control structure and parameter ranges are compatible with those implemented here.

3.2- Scenario Sampling:

Scenario diversity is achieved through stochastic parameterization of initial conditions and two disturbance categories, all generated in Python and injected into the simulation models via their respective APIs.

Initial conditions are sampled as follows:

Power references: to be filled.
Grid impedance scaling: to be filled.
Grid voltage sag scaling factor: to be filled.

Scenarios start from a steady-state point. A disturbance is applied at t = 0.5 s. Disturbances are sampled as follows:

Load disturbances: A random load of stochastically sampled magnitude is connected to the network. The three-phase random load can be unbalanced across phases in the EMT models. The load parameters are independently drawn from uniform distributions over predefined ranges: R_L ∈ [0.2,2] Ω, L_L ∈ [0.001,0.05] H, and C_L ∈ [1×10^(-6),50×10^(-6)] F. To account for phase imbalance, per-phase parameters are independently resampled from uniform distributions within ±15% of their respective average values, yielding a maximum inter-phase imbalance of 30%. The load connection time is uniformly sampled over the interval [0.5, 5] s.
Fault disturbances: Short-circuit events are introduced with randomly sampled durations. The fault type is randomly selected among the 10 possible fault types of a three-phase system. Fault ride-through (FRT) behavior is implemented: upon fault detection, the active power reference is set to zero and the inverter prioritizes reactive current injection for voltage support. The fault occurrence time is uniformly sampled over the interval [0.5, 5] s, and the fault duration is uniformly sampled within [0.02, 0.2] s.
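The disturbance sampling described above could be sketched as follows. This is not the authors' generation script: the function names and dictionary keys are hypothetical, and it assumes that each per-phase value is drawn uniformly within ±15% of a single sampled average, so per-phase values may fall slightly outside the nominal ranges.

```python
import random

# Ranges quoted in the text for the average load parameters.
R_RANGE = (0.2, 2.0)        # ohm
L_RANGE = (0.001, 0.05)     # H
C_RANGE = (1e-6, 50e-6)     # F
IMBALANCE = 0.15            # +/-15% per-phase spread around the average
EVENT_WINDOW = (0.5, 5.0)   # s, disturbance start time
FAULT_DURATION = (0.02, 0.2)  # s


def sample_phases(rng, low, high, spread=IMBALANCE):
    """Draw an average value, then three per-phase values around it."""
    avg = rng.uniform(low, high)
    return [rng.uniform(avg * (1 - spread), avg * (1 + spread)) for _ in range(3)]


def sample_load_disturbance(rng):
    """Sample an (optionally unbalanced) three-phase RLC load step."""
    return {
        "R": sample_phases(rng, *R_RANGE),
        "L": sample_phases(rng, *L_RANGE),
        "C": sample_phases(rng, *C_RANGE),
        "t_on": rng.uniform(*EVENT_WINDOW),
    }


def sample_fault(rng):
    """Sample a short-circuit event with one of 10 fault type identifiers."""
    return {
        "sc_type": rng.randint(1, 10),  # inclusive on both ends
        "t_on": rng.uniform(*EVENT_WINDOW),
        "duration": rng.uniform(*FAULT_DURATION),
    }


rng = random.Random(0)  # fixed seed for reproducibility
print(sample_load_disturbance(rng))
print(sample_fault(rng))
```

Passing an explicit `random.Random` instance keeps the draw reproducible per scenario, which makes it easier to regenerate a given descriptor.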

Measurements are taken at the LCL filter output shown in Figure 1 using the default multimeters embedded in the simulation platforms (PSCAD and MATLAB/Simulink). Signals recorded include three-phase voltages and currents transformed to the d-q reference frame, active and reactive powers, and voltage phase angle. The data collection process introduces several intrinsic biases as follows:

Model-Based bias: The dataset reflects the assumptions, structure, and parameterization of the underlying simulation models. Any modeling inaccuracies will propagate into the dataset.
Scenario selection bias: The choice of disturbances, parameter ranges, and operating modes defines the distribution of system behaviors represented, which does not fully capture real-world variability.
Simplified system bias: The SIIB system abstracts away network-level complexity, thus limiting representation of multi-node interactions and large-scale grid behavior.

3.3- Data Processing, Wrangling, and Annotation

Data processing for this dataset consists exclusively of post-simulation wrangling; no transformations, normalizations, or alterations of signal values are applied at any stage. The goal is purely organizational: to convert, structure, and package raw simulation outputs into a consistent and reusable format without modifying their physical content.

The Python API for the RMS simulation models in MATLAB Simulink writes four signal files directly to disk in CSV format at the end of each simulation run: d- and q-axis voltages (Vd and Vq in kV), currents (id and iq in kA), powers (P in MW, Q in Mvar), and voltage angle (in radians). These files are sampled at 1 ms resolution, yielding approximately 6,501 rows per file for a 6.5-second simulation. No conversion step is required, as MATLAB writes these directly in the target format. The decision to write CSVs directly from Simulink was made to minimize pipeline wrangling and reduce the risk of conversion errors. The Python API for PSCAD writes recorded signals in COMTRADE format first, and a dedicated Python conversion script converts these files to the same four-CSV structure as the MATLAB outputs. For each simulation run in both MATLAB and PSCAD, Python scripts assemble one scenario descriptor file, recording the initial conditions and the disturbance information. This file is generated directly from the Python scenario-sampling scripts and written alongside the simulation outputs, ensuring that every scenario folder is self-describing. Note that the EMT outputs use a 50 µs time step while the RMS outputs use 1 ms; users working with paired EMT/RMS data must account for this resolution difference explicitly, and studying it is the primary intended use case of this dataset.
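One simple way to place an EMT trajectory on the RMS time grid is plain subsampling. The sketch below assumes the EMT signals are logged at every 50 µs step, that both trajectories start at t = 0, and that 1 ms is an exact multiple of the EMT step; it applies no anti-aliasing filter, so it is a starting point rather than a recommended preprocessing step. All names are hypothetical.

```python
EMT_DT = 50e-6   # s, EMT resolution stated in the dataset description
RMS_DT = 1e-3    # s, RMS resolution stated in the dataset description
T_END = 6.5      # s, simulated duration


def n_samples(t_end, dt):
    """Number of rows for a trajectory sampled from t = 0 to t_end inclusive."""
    return round(t_end / dt) + 1


def decimate(samples, factor):
    """Keep every `factor`-th sample, starting with the first one."""
    return samples[::factor]


factor = round(RMS_DT / EMT_DT)  # EMT steps per RMS step
emt_time = [k * EMT_DT for k in range(n_samples(T_END, EMT_DT))]
emt_on_rms_grid = decimate(emt_time, factor)

print(factor, len(emt_time), len(emt_on_rms_grid))
```

The same `decimate` call can be applied to any signal column read from an EMT file; interpolating the RMS trajectory up to the EMT grid is the complementary option when the fine resolution must be preserved.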

No labels were created prior to simulation or by external human annotators. Annotation is fully automated and embedded within the simulation models. The dataset contains 3,000 scenarios, each identified by a four-digit, zero-padded index (0001-3000). Each scenario contributes 9 files: one metadata file and 8 signal CSV files, four from the EMT simulation and four from the RMS phasor-domain simulation. Signal files follow the naming convention xxxx_[M|L]_[EMT|RMS]_[A|I|P|V].csv, where M denotes GFM control mode, L denotes GFL control mode, and the final letter denotes signal type: A for voltage angle, I for current, P for powers, and V for voltage. All signal files share a common time column, t, in seconds, and the rows correspond to simulation time steps; simulations run from t = 0 to t = 6.5 s.

xxxx_meta_data.csv: This file provides information needed to reconstruct or verify the simulation conditions for any scenario, and serves as the primary scenario-level annotation for downstream tasks. It is the authoritative annotation of each scenario's initial and disturbance conditions and records 16 rows, as follows:
- Pref: float, active power reference (per unit)
- Qref: float, reactive power reference (per unit; GFL mode only)
- grid_impedance_scale: float, initial condition of grid impedance scaling factor (null if not applicable).
- voltage_sag_factor: float, initial condition of grid voltage magnitude sag factor (null if not applicable)
- disturbance type: string, either "Short circuit" or "Load Step up"
- disturbance duration: float, duration of the short circuit in seconds, otherwise 0
- sc type: integer, 0 for load step up, 1-10 for short circuit type identifier
- R1, R2, R3: float, per-phase resistance values of the stochastically sampled random load (Ω), used for load disturbance.
- L1, L2, L3: float, per-phase inductance values of the random load (H), used for load disturbance.
- C1, C2, C3: float, per-phase capacitance values of the random load (F), used for load disturbance.
xxxx_[M|L]_[EMT|RMS]_V.csv: This file includes two additional columns, Vd and Vq, measuring the d- and q-axis voltages in kV, respectively.
xxxx_[M|L]_[EMT|RMS]_I.csv: This file includes two additional columns, id, iq, measuring the d- and q-axis currents in kA, respectively.
xxxx_[M|L]_[EMT|RMS]_P.csv: This file includes two additional columns, P and Q, measuring active power in MW and reactive power in Mvar, respectively.
xxxx_[M|L]_[EMT|RMS]_A.csv: This file includes the additional column Theta, measuring phase angle of the voltage in radians, derived from the inverter's internal synchronization signal (PLL output for GFL; droop-based frequency integration for GFM).
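The file layout above lends itself to small helpers for building and parsing file names. The following is a minimal sketch, assuming the underscore-separated form shown in the per-file descriptions (for example 0042_M_EMT_V.csv); the helper names are hypothetical and not part of any released tooling.

```python
import re

MODES = ("M", "L")            # M = GFM, L = GFL
DOMAINS = ("EMT", "RMS")
SIGNALS = ("A", "I", "P", "V")  # angle, current, powers, voltage
PATTERN = re.compile(r"^(\d{4})_([ML])_(EMT|RMS)_([AIPV])\.csv$")


def signal_filename(index, mode, domain, signal):
    """Build a signal file name such as 0042_M_EMT_V.csv."""
    if not 1 <= index <= 3000:
        raise ValueError("scenario index must be in 1..3000")
    if mode not in MODES or domain not in DOMAINS or signal not in SIGNALS:
        raise ValueError("unknown mode, domain, or signal code")
    return f"{index:04d}_{mode}_{domain}_{signal}.csv"


def parse_signal_filename(name):
    """Return the components of a signal file name, or None if it does not match."""
    match = PATTERN.match(name)
    if match is None:
        return None
    index, mode, domain, signal = match.groups()
    return {"index": int(index), "mode": mode, "domain": domain, "signal": signal}


def scenario_files(index, mode):
    """List the 9 files expected for one scenario: metadata plus 8 signal CSVs."""
    files = [f"{index:04d}_meta_data.csv"]
    files += [signal_filename(index, mode, d, s) for d in DOMAINS for s in SIGNALS]
    return files


print(scenario_files(42, "M"))
```

Checking each scenario folder against `scenario_files` is a cheap way to confirm that a download is complete before training.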

Since annotation is fully automated, inter-annotator disagreement does not apply.

4- Data Quality

4.1- Suitability

This dataset was designed specifically to support research on operator learning as a surrogate modeling approach for power systems resolution-invariant time-domain simulation. Its suitability for this purpose rests on three structural properties. First, the paired EMT/RMS design provides matched trajectory pairs for the same 3,000 scenarios at two fundamentally different simulation resolutions, EMT at 50 µs and RMS at 1 ms. No existing public dataset provides this pairing for inverter-based systems. Second, both GFM and GFL control modes are included, enabling benchmarking of surrogate models across qualitatively different inverter dynamics. Third, the 3,000 scenarios span a wide range of disturbance types, load conditions, grid impedance values, and power references, providing the diversity needed to train and evaluate generalizable operator learning models. Beyond operator learning, the dataset is also suitable for several adjacent research tasks: physics-informed machine learning for inverter dynamics, stability classification benchmarking, engineering education, and general ML-based surrogate modeling for power systems time-domain simulation.

The dataset is particularly well-suited for machine learning due to:
- A large number of scenarios provides diversity in system behavior.
- Trajectory-level data enables sequence modeling and operator learning.
- Paired multi-resolution data enables supervised learning across simulation fidelities.
- Multi-signal observability allows models to capture complex system dynamics.

The physical accuracy of the dataset is grounded in the fidelity of the underlying simulation models. The EMT model in PSCAD is a high-fidelity representation of inverter physics. The RMS phasor-domain models in MATLAB Simulink represent the same system at a coarser level of fidelity appropriate to electromechanical timescales. A subset of exported signal values was verified against scope outputs in both platforms prior to packaging, confirming that stored CSV values accurately represent the simulation trajectories. For the intended purpose of studying resolution-invariance, the EMT simulation is the ground truth against which RMS and surrogate model outputs should be evaluated.
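When comparing a surrogate or RMS trajectory against the EMT reference, one common choice in the operator-learning literature is the relative L2 error. The dataset description does not prescribe a metric, so the sketch below is only one reasonable option, and it assumes both trajectories have already been placed on the same time grid.

```python
import math


def relative_l2_error(prediction, reference):
    """Return ||prediction - reference||_2 / ||reference||_2 for two equal-length sequences."""
    if len(prediction) != len(reference):
        raise ValueError("trajectories must be sampled on the same time grid")
    num = sum((p - r) ** 2 for p, r in zip(prediction, reference))
    den = sum(r ** 2 for r in reference)
    if den == 0.0:
        raise ValueError("reference trajectory has zero norm")
    return math.sqrt(num / den)


# Toy example with made-up numbers (not taken from the dataset).
reference = [1.0, 2.0, 2.0]
prediction = [1.0, 2.0, 5.0]
print(relative_l2_error(prediction, reference))
```

Because the error is normalized by the reference norm, values are comparable across signals with different units, such as kV voltages and kA currents.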

Each of the 3,000 scenarios is represented by a complete set of 9 files, one metadata file and 8 signal CSVs. No partial scenarios or missing files are present in the distributed dataset. The metadata file records all parameters needed to fully characterize the simulation conditions, including initial operating point and disturbance type, duration, and configuration. The signal files collectively cover all physically meaningful quantities at the point of common coupling, voltage, current, power, and phase angle, providing a complete observational record for each scenario. The internal converter states are not reported, as these are not required for the intended operator learning application and were excluded by design to keep file numbers tractable.

The dataset was generated using simulation models reflecting the current state of practice in inverter control for renewable-integrated power systems. The cascaded voltage-current control structure for GFM and PLL-based current control for GFL are the dominant architectures in both academic research and industry deployment. The disturbance types included, short circuit faults and load steps, represent typical events studied in power system stability analysis. The dataset does not include dynamics associated with emerging control structures such as grid-forming virtual oscillator control or advanced grid-support functions, which are active research areas. Users should assess whether the control architectures represented remain current for their specific application at the time of use.

Consistency across the dataset is maintained through a uniform file naming convention and a fixed metadata format. First, the file naming convention (xxxx_[M|L][EMT|RMS][A|I|P|V].csv) is applied uniformly across all 3,000 scenarios, ensuring a machine-readable structure. Second, the metadata file format (xxxx_meta_data.csv) uses a fixed 16-field key-value structure for every scenario, with consistent field names and units. The main cross-scenario difference users should be aware of is that the GFM and GFL scenarios differ structurally in one metadata field (Qref is present for GFL, null for GFM), which should be accounted for in any joint modeling or analysis across control modes.
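When loading metadata for joint GFM and GFL analysis, the null Qref entry is easiest to handle if it is normalized at read time. The sketch below is illustrative only: it assumes each metadata row is a `key,value` pair, and the spellings in `NULL_TOKENS` as well as the exact key string `"Qref"` are assumptions that should be checked against the actual files. The helper names are hypothetical.

```python
import csv
import io

# Assumed spellings of a missing value; adjust after inspecting a real file.
NULL_TOKENS = {"", "null", "none", "nan"}


def read_metadata(text):
    """Parse key-value CSV text into a dict, mapping null-like values to None."""
    meta = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        key, value = row[0].strip(), row[1].strip()
        meta[key] = None if value.lower() in NULL_TOKENS else value
    return meta


def has_reactive_setpoint(meta, key="Qref"):
    """True when the scenario records a reactive power reference."""
    return meta.get(key) is not None
```

Values are kept as strings so that no unit or numeric format is assumed; callers can convert individual fields once the field list has been confirmed.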

While suitable for its intended purpose, the dataset has the following limitations:

- Simplified system scope: The SIIB system does not capture large-scale grid interactions, limiting suitability for network-level studies.
- Synthetic nature: The dataset reflects simulated behavior and may not fully capture real-world measurement noise, parameter uncertainty, or unmodeled dynamics.

4.2- Representativeness

The target population for this dataset is the space of all possible time-domain dynamic trajectories of a single inverter connected to a stiff grid under GFM and GFL control, subject to disturbances representative of real grid operation, specifically short circuit faults and load steps. This population is parameterized by three dimensions: (1) the system's initial operating point (Pref, Qref, grid impedance, voltage sag); (2) the nature of the disturbance (type, duration, configuration); and (3) the simulation domain (EMT vs. RMS). The dataset samples 3,000 distinct scenarios from this population, paired across EMT and RMS domains.

Despite its diversity, the dataset has inherent limitations. The SIIB system is a reduced-order abstraction of a single IBR connected to a strong grid; it is not a model of any specific real grid or installation. The population sampled is therefore the population of trajectories producible by this specific model family under the parameter ranges encoded in the scenario generation scripts, rather than the population of trajectories observable in real inverter installations. The produced sample does not represent multi-inverter interactions, network topology effects, or large-scale grid dynamics, among others. Moreover, the representativeness of the data depends on parameter selection and the underlying assumptions in control and system design. The distribution of scenarios is thus determined by the dataset design process rather than by real-world statistical distributions. Some operating conditions may be over- or under-represented, and rare or extreme events may not be fully captured. Additionally, the 3,000 scenarios are not stratified to guarantee any particular balance of stable and unstable outcomes. Although scenario parameters were adjusted by trial and error so that unstable cases are present, the proportions of stable and unstable cases are emergent properties of the physics rather than design targets, and users performing classification tasks should assess class balance before training.

In synthetic engineering datasets, extrinsic bias operates differently than in datasets derived from human-generated text or behavioral data. That said, several structural biases warrant explicit acknowledgment. The dataset is built around power system conventions, standards, and test cases that predominantly originate from North American infrastructure traditions, specifically a 60 Hz nominal frequency. Grids in the Global South, particularly in Sub-Saharan Africa and South Asia, as well as grids serving rural and remote communities in the Arctic and northern regions, often operate under fundamentally different conditions: they are weaker grids with lower short-circuit ratios, some with a 50 Hz nominal frequency, and with less standardized control methods. The SIIB system, as parameterized here, may not fully reflect those conditions.

4.3- Authenticity and Reliability

Every file in the dataset is a direct output of simulation models built and executed by the dataset creators, using the wrangling pipeline described previously. The identity of the dataset's creators and their institutional affiliations are verifiable through the associated publication and Hugging Face repository metadata; these are currently anonymized to support a double-blind review policy. The dataset is registered with a globally unique and persistent Digital Object Identifier (DOI). Users should cite this DOI rather than the Hugging Face URL, as DOIs provide permanent access independent of repository infrastructure changes.

The nature of the signals provided by each CSV file was discussed previously. All wrangling operations are purely structural and do not modify signal values. The integrity of the exported files was verified by randomly cross-checking the CSV files associated with a subset of scenarios against the corresponding scope outputs in MATLAB Simulink and PSCAD, confirming full agreement between stored values and simulation ground truth. In summary, as a synthetic dataset, both authenticity and reliability are established through controlled simulation pipelines, the use of well-defined physical models, and complete traceability from scenario definition to output signals. Users can verify the provenance of any scenario by cross-referencing the xxxx_meta_data.csv file with the simulation model documentation provided in this dataset card. Note that no cryptographic integrity checks, e.g., hashes, are currently provided to verify file-level integrity after distribution.
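Because no hashes ship with the dataset, users who need file-level integrity guarantees can record their own checksums immediately after download and compare against them later. The following is a minimal sketch using only the Python standard library; it produces a local manifest and does not correspond to any official checksum list. The function names `sha256_of_file` and `build_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each CSV path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*.csv"))
    }
```

Storing the resulting dictionary (for example as JSON) alongside the download allows later copies to be compared entry by entry; a differing digest indicates that a file changed after the manifest was created, not that the original distribution was faulty.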

Users can independently verify the dataset's reliability through several mechanisms. The physical relationship between d-q signals provides a built-in consistency check. Active power P should equal V_d × i_d + V_q × i_q, and reactive power Q should equal V_q × i_d – V_d × i_q at every time step. Deviations beyond floating-point precision would indicate a data integrity issue. Under steady-state conditions, V_q should converge toward zero and P and Q should stabilize at values consistent with P_ref and Q_ref recorded in xxxx_meta_data.csv. Users can verify this for any scenario by inspecting the steady-state portion of the trajectory.
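The two checks described above can be sketched as follows. This is an illustrative example rather than a prescribed procedure: it takes the signals as plain Python sequences, because the column layout of the CSV files is not assumed here, and it applies the power expressions exactly as stated in this card, without any additional scaling factor. The function names and the `fraction` parameter are hypothetical.

```python
def power_residuals(v_d, v_q, i_d, i_q, p, q):
    """Return the largest absolute mismatch in P and in Q over all samples.

    Uses the relations stated in this card:
        P = V_d * i_d + V_q * i_q
        Q = V_q * i_d - V_d * i_q
    """
    p_err = 0.0
    q_err = 0.0
    for vd, vq, cd, cq, pk, qk in zip(v_d, v_q, i_d, i_q, p, q):
        p_err = max(p_err, abs(pk - (vd * cd + vq * cq)))
        q_err = max(q_err, abs(qk - (vq * cd - vd * cq)))
    return p_err, q_err


def tail_mean(values, fraction=0.1):
    """Mean of the last `fraction` of the samples (at least one sample)."""
    count = max(1, int(len(values) * fraction))
    tail = values[-count:]
    return sum(tail) / len(tail)
```

In practice the acceptance tolerance should reflect the number of decimal digits written to the CSV files rather than machine epsilon. For the steady-state check, `tail_mean` applied to V_q should be close to zero, and applied to P and Q it should be close to the reference values recorded in the metadata; what counts as close depends on the scenario and is left to the user.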

5- Data Management

This dataset is publicly available on Hugging Face without authentication or registration requirements. Anyone can access, browse, and download the dataset directly from its Hugging Face repository page, which can be reached through the dataset's persistent DOI. No request, approval, or data sharing agreement is required. All 3,000 scenarios are available in a single repository. No auxiliary scripts are archived with the dataset; all files are in CSV format and are readable without specialist tools. The COMTRADE source files from PSCAD are not distributed; only the converted CSV outputs are included. Signal naming in this dataset follows the standard terminology and conventions of power systems engineering.

The dataset was generated between November 2025 and March 2026 using PSCAD and MATLAB Simulink simulation environments. The scenario generation scripts, simulation model construction, data wrangling pipeline, and dataset packaging were all carried out by the authors. No third-party data is incorporated. This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The CC BY 4.0 license was selected to maximize reusability while ensuring that the dataset creators receive appropriate attribution. The dataset is designed to satisfy the FAIR principles (Findable, Accessible, Interoperable, and Reusable).

Contact information for the dataset maintainers will be provided upon publication.

[1] Nikos Hatziargyriou, Jovica Milanovic, Claudia Rahmann, Venkataramana Ajjarapu, Claudio Canizares, Istvan Erlich, David Hill, Ian Hiskens, Innocent Kamwa, Bikash Pal, Pouyan Pourbeik, Juan Sanchez-Gasca, Aleksandar Stankovic, Thierry Van Cutsem, Vijay Vittal, and Costas Vournas. 2021. Definition and Classification of Power System Stability – Revisited & Extended. IEEE Transactions on Power Systems 36, 4 (2021), 3271–3281. https://doi.org/10.1109/TPWRS.2020.3041774

[2] Jose Daniel Lara, Rodrigo Henriquez-Auba, Deepak Ramasubramanian, Sairaj Dhople, Duncan S. Callaway, and Seth Sanders. 2024. Revisiting Power Systems Time-Domain Simulation Methods and Models. IEEE Transactions on Power Systems 39, 2 (2024), 2421–2437. https://doi.org/10.1109/TPWRS.2023.3303291

[3] Peter Lopion, Peter Markewitz, Martin Robinius, and Detlef Stolten. 2018. A review of current challenges and trends in energy systems modeling. Renewable and Sustainable Energy Reviews 96 (2018), 156–166. https://doi.org/10.1016/j.rser.2018.07.045

[4] Sunil Subedi, Manisha Rauniyar, Saima Ishaq, Timothy M. Hansen, Reinaldo Tonkoski, Mariko Shirazi, Richard Wies, and Phylicia Cicilio. 2021. Review of Methods to Accelerate Electromagnetic Transient Simulation of Power Systems. IEEE Access 9 (2021), 89714–89731. https://doi.org/10.1109/ACCESS.2021.3090320

[5] Christian Moya, Shiqi Zhang, Guang Lin, and Meng Yue. 2023. DeepONet-grid-UQ: A trustworthy deep operator framework for predicting the power grid’s post-fault trajectories. Neurocomputing 535 (2023), 166–182. https://doi.org/10.1016/j.neucom.2023.03.015

[6] Matthew Bossart, Jose Daniel Lara, Ciaran Roberts, Rodrigo Henriquez-Auba, Duncan S. Callaway, and Bri-Mathias Hodge. 2025. Acceleration of Power System Dynamic Simulations Using a Deep Equilibrium Layer and Neural ODE Surrogate. IEEE Transactions on Energy Conversion 40, 4 (2025), 2710–2722. https://doi.org/10.1109/TEC.2025.3563142

[7] Ignasi Ventura Nadal, Jochen Stiasny, and Spyros Chatzivasileiadis. 2025. Physics-Informed Neural Networks: a Plug and Play Integration into Power System Dynamic Simulations. Electric Power Systems Research 248 (2025), 111885. https://doi.org/10.1016/j.epsr.2025.111885

[8] Muhammad Sharjeel Javaid, Balarko Chaudhuri, Fei Teng, and Zohaib Akhtar. 2026. EMT-RMS Modeling Trade-Off for IBR-Driven Sub-Synchronous Oscillations. IEEE Transactions on Power Systems 41, 1 (2026), 425–437. https://doi.org/10.1109/TPWRS.2025.3588893

[9] Jochen Stiasny, Georgios S. Misyris, and Spyros Chatzivasileiadis. 2023. Transient Stability Analysis with Physics-Informed Neural Networks. https://doi.org/10.48550/arXiv.2106.13638

[10] Ioannis Karampinis, Petros Ellinas, Johanna Vorwerk, and Spyros Chatzivasileiadis. 2025. Neural Operators for Power Systems: A Physics-Informed Framework for Modeling Power System Components. https://doi.org/10.48550/arXiv.2511.05216

[11] Osasumwen Cedric Ogiesoba-Eguakun, Kaveh Ashenayi, and Suman Rath. 2026. High-Fidelity Digital Twin Dataset Generation for Inverter-Based Microgrids Under Multi-Scenario Disturbances. https://doi.org/10.48550/arXiv.2603.10262

[12] Jiaming Li, Meng Yue, Yue Zhao, and Guang Lin. 2020. Machine-Learning-Based Online Transient Analysis via Iterative Computation of Generator Dynamics. In 2020 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), 2020. 1–6. https://doi.org/10.1109/SmartGridComm47815.2020.9302975

[13] Tianshi Cheng, Ruogu Chen, Ning Lin, Tian Liang, and Venkata Dinavahi. 2025. Machine-Learning-Reinforced Massively Parallel Transient Simulation for Large-Scale Renewable-Energy-Integrated Power Systems. IEEE Transactions on Power Systems 40, 1 (2025), 970–981. https://doi.org/10.1109/TPWRS.2024.3409729

[14] Sunil Subedi, Nischal Guruwacharya, Bidur Poudel, Jesus D. Vasquez-Plaza, Fabio Andrade, Robert Fourney, Hossein Moradi Rekabdarkolaee, Timothy M. Hansen, and Reinaldo Tonkoski. 2023. Leveraging Data-Driven Models for Accurate Analysis of Grid-Tied Smart Inverters Dynamics. https://doi.org/10.48550/arXiv.2310.02056

[15] Qiuhua Huang and Vijay Vittal. 2016. OpenHybridSim: An open source tool for EMT and phasor domain hybrid simulation. In 2016 IEEE Power and Energy Society General Meeting (PESGM), 2016. 1–5. https://doi.org/10.1109/PESGM.2016.7741233

[16] 2026. Energinet-SimTools/MTB. Retrieved April 29, 2026 from https://github.com/Energinet-SimTools/MTB

[17] Nagaraju Pogaku, Milan Prodanovic, and Timothy C. Green. 2007. Modeling, Analysis and Testing of Autonomous Operation of an Inverter-Based Microgrid. IEEE Transactions on Power Electronics 22, 2 (2007), 613–625. https://doi.org/10.1109/TPEL.2006.890003

[18] Hamed Mohsenian-Rad and Wilsun Xu. 2023. Synchro-Waveforms: A Window to the Future of Power Systems Data Analytics. IEEE Power and Energy Magazine 21, 5 (2023), 68–77. https://doi.org/10.1109/MPE.2023.3288583

[19] Kurtis McKenney, Matthew Guernsey, Ratcharit Ponoum, and Jeff Rosenfeld. 2010. Commercial Miscellaneous Electric Loads: Energy Consumption Characterization and Savings Potential in 2008 by Building Type. TIAX LLC. Retrieved from https://www.energy.gov/sites/prod/files/2016/07/f33/2010-05-26%20TIAX%20CMELs%20Final%20Report_0.pdf

[20] Natural Resources Canada. 2019. Personal vehicles. Transportation energy efficiency. Retrieved April 29, 2026 from https://natural-resources.canada.ca/energy-efficiency/transportation-energy-efficiency/personal-vehicles
