Sample Rollout

License: MIT

✨ Why this dataset?

Emergent NCA Sequences 5M captures complex global behaviors generated entirely by frozen random Neural Cellular Automata (NCA). What makes this approach powerful?

  • Controlled Diversity: Each rollout uses a fresh set of random weights, creating massive diversity in dynamical systems without hand-crafting rules.
  • Stable Semantics: Continuous hidden states are compressed into a global 32-token vocabulary (centroids.pt), so sequences are structurally comparable across rollouts while remaining dynamically unique.
  • No Memorization: Because dynamics are deterministic-given-weights but highly diverse across rollouts, sequence models must genuinely internalize transition rules.

📊 Dataset Dynamics Analysis

The Emergent NCA Sequences dataset records the trajectories of rich spatiotemporal dynamical systems. Rather than static text or images, each rollout encodes grid states evolving under local rules for 500 steps. Below is an analysis of the emergent behaviors, dimensional distributions, and temporal dynamics.


1. Global Dynamics & Behavioral Taxonomy

We categorize and evaluate the emergent behaviors of the cellular automata across the dataset. The dynamics are overwhelmingly active and structured:


Emergent Behavior Distribution (Donut Chart)

NCA Phase Space: Volatility vs. Token Entropy
  • Static vs. Dynamic Rollouts: The dataset is designed to be highly active. Almost all rollouts remain continuously active or chaotic, with fewer than 0.1% collapsing into completely static attractors (frozen states).
  • Chaotic vs. Periodic (Oscillators): A significant portion of the sequences falls into robust periodic loops (breathers), allowing models to learn recurring temporal rhythms. Other sequences exhibit chaotic, turbulent, or wave-like diffusion across the grid.

2. Geometry & Spatial Dimensionality

Grid sizes vary from rollout to rollout (each rollout keeps a fixed size), making the dataset a useful testbed for variable-length sequence modeling and multi-scale generalization.


Distribution of Heights, Widths, Shape Ratios, and Tokens per Frame
  • Grid Dimensions: Observed heights range from 11 to 45 cells and observed widths from 9 to 46 cells, within the 8×8 → 48×48 range listed under Dataset Statistics.
  • Token Length Profile:
    • Average Tokens per Frame: 724.4 (median: 495.0), one token per cell.
    • Average Tokens per Rollout: 362,178.2 (maximum: 880,000 over 500 frames).
    • Because grid shapes vary, sequence models must handle sequence lengths, memory use, and attention cost that differ from rollout to rollout.
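The token counts above follow from simple arithmetic: one token per cell, 500 frames per rollout. The sketch below works through it; the function names are illustrative and not part of the repository.

```python
FRAMES_PER_ROLLOUT = 500  # rollout length stated in this card


def tokens_per_frame(height: int, width: int) -> int:
    """One token per grid cell."""
    return height * width


def tokens_per_rollout(height: int, width: int,
                       frames: int = FRAMES_PER_ROLLOUT) -> int:
    """Total tokens in a rollout of `frames` frames."""
    return tokens_per_frame(height, width) * frames


# Rollout 0 in this card is an 11x38 grid.
print(tokens_per_frame(11, 38))    # 418
print(tokens_per_rollout(11, 38))  # 209000

# The reported maximum of 880,000 tokens implies 1,760 cells per frame.
print(880_000 // FRAMES_PER_ROLLOUT)  # 1760
```

A budget like this is useful when choosing a context length or batch size for variable-shaped grids.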

3. Symbolic Vocabulary & Token Distribution

The continuous hidden states are mapped to a discrete 32-token vocabulary using MiniBatch KMeans. The state distribution reveals the density and structures of the active shapes:


Discrete Token Vocabulary Frequency (%)

Token Complexity (Unique Tokens used per Rollout)
  • Top Vocab Occupancy: The discrete cell states are non-uniformly distributed. The top 5 most frequent states dominate:
    • Token 19: 24.43%
    • Token 31: 16.62%
    • Token 7: 10.87%
    • Token 17: 9.83%
    • Token 24: 7.77%
  • Empty Space Mapping: Token 0 and dominant low-activity tokens represent background space, enabling sparse active boundaries and localized structures.
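A token-frequency table like the one above can be reproduced with a simple histogram. This is a minimal sketch that assumes a rollout is available as an integer array of token IDs in [0, 31]; the random array below is a synthetic stand-in, not real dataset content.

```python
import numpy as np

VOCAB_SIZE = 32  # vocabulary size stated in this card


def token_frequencies(tokens: np.ndarray, vocab_size: int = VOCAB_SIZE) -> np.ndarray:
    """Return the percentage of cells assigned to each token ID."""
    counts = np.bincount(tokens.ravel(), minlength=vocab_size)
    return 100.0 * counts / counts.sum()


def top_k_tokens(freqs: np.ndarray, k: int = 5) -> list:
    """Return (token_id, percent) pairs for the k most frequent tokens."""
    order = np.argsort(freqs)[::-1][:k]
    return [(int(t), float(freqs[t])) for t in order]


# Synthetic stand-in for a (frames, height, width) rollout of token IDs.
rng = np.random.default_rng(0)
demo = rng.integers(0, VOCAB_SIZE, size=(500, 11, 38))
freqs = token_frequencies(demo)
print(top_k_tokens(freqs))
```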

4. Rollout Evolution & Spatiotemporal Physics

Let's look at a selected rollout (Rollout 0, size 11×38) as it evolves over the 500-frame horizon:


Physical grid state evolution across time steps t ∈ [0, 499]
  • Initial Chaotic Phase: The automaton starts in an active state and rapidly diffuses local updates through 3x3 residual convolutions.
  • Attractor/Periodic State: Over time, the local transitions converge into highly structured, repeating spatial shapes and periodic breathers.

5. Transition Curves & Temporal Decay

By calculating frame-to-frame change dynamics, we can quantify the entropy and rate of change of the NCA:


Entropy, Novelty (Frame-to-Frame change), and Cumulative Difference

Temporal Similarity Decay: Chaotic phase (t = 0) vs. Stabilized phase (t = 100)
  • Entropy Curves: Quantify the diversity of cell states in each frame. Entropy stays high throughout, and the curve shows where the system transitions to stable states.
  • Transition Activity (Novelty): Shows the frame-to-frame changes. Novelty peaks early during the emergent phase and then stabilizes as the system converges to attractors.
  • Lag Similarity Decay: In the early chaotic phase (reference frame t = 0), the fraction of matching cells decays rapidly over a short time lag, showing high volatility. Once the attractor phase is reached (reference frame t = 100), the match fraction remains high even over hundreds of steps, consistent with robust periodic or stable attractors.
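The three curves described above can be computed along the following lines. The card does not give the exact formulas, so the definitions here (Shannon entropy in bits, fraction of changed cells, fraction of matching cells) are assumptions, and the function names are illustrative.

```python
import numpy as np


def frame_entropy(frame: np.ndarray, vocab_size: int = 32) -> float:
    """Shannon entropy (bits) of the token distribution within one frame."""
    counts = np.bincount(frame.ravel(), minlength=vocab_size)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def novelty(rollout: np.ndarray) -> np.ndarray:
    """Fraction of cells that change between consecutive frames, shape (T-1,)."""
    return (rollout[1:] != rollout[:-1]).mean(axis=(1, 2))


def lag_similarity(rollout: np.ndarray, start: int, max_lag: int) -> np.ndarray:
    """Fraction of cells matching frame `start` at lags 1..max_lag."""
    ref = rollout[start]
    return np.array([(rollout[start + lag] == ref).mean()
                     for lag in range(1, max_lag + 1)])


# Toy period-2 oscillator: frames alternate between all-0 and all-1.
a = np.zeros((2, 2), dtype=int)
b = np.ones((2, 2), dtype=int)
demo = np.stack([a, b] * 4)          # shape (8, 2, 2)
print(novelty(demo))                 # every cell changes at every step
print(lag_similarity(demo, 0, 4))    # similarity returns to 1.0 at even lags
```

For a periodic attractor, the lag-similarity curve returns to 1.0 at multiples of the period, which is the signature described for the stabilized phase.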

🧬 Neural Cellular Automata Architecture

The dataset employs a lightweight Residual NCA architecture, uniquely initialized for every single rollout:

Weights
⬇️ Inject
🧱 Local 3x3 Interaction Convolutions
⬇️ Flow
🔄 Residual Hidden-State Updates
⬇️ Add Noise
🌫️ Stochastic Perturbation Noise
⬇️ Produce
Channels

Why randomize? Each sequence uses a fresh set of random weights, which yields broad diversity in the dynamical systems while every rollout still shares one common symbolic vocabulary.
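The update pipeline above can be sketched in NumPy as follows. This is not the repository's TinyNCA implementation: the tanh nonlinearity, zero padding, weight scale, and noise level are all assumptions chosen for illustration, and only the overall shape (3×3 convolution, residual update, additive noise, 16 channels, frozen weights) comes from this card.

```python
import numpy as np


def conv3x3(state: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """3x3 cross-correlation with zero padding (padding mode is an assumption).

    state: (C_in, H, W); kernel: (C_out, C_in, 3, 3); returns (C_out, H, W).
    """
    _, h, w = state.shape
    padded = np.pad(state, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oi,ihw->ohw", kernel[:, :, dy, dx], patch)
    return out


def nca_step(state, kernel, rng, noise_std=0.01):
    """One residual update: state + f(conv(state)) + noise (f = tanh, assumed)."""
    update = np.tanh(conv3x3(state, kernel))
    noise = rng.normal(0.0, noise_std, size=state.shape)
    return state + update + noise


def simulate_rollout(height, width, steps=500, channels=16, seed=0):
    """Run a frozen, randomly initialized NCA; returns (steps, C, H, W)."""
    rng = np.random.default_rng(seed)
    kernel = rng.normal(0.0, 0.1, size=(channels, channels, 3, 3))  # frozen weights
    state = rng.normal(size=(channels, height, width))
    frames = [state]
    for _ in range(steps - 1):
        state = nca_step(state, kernel, rng)
        frames.append(state)
    return np.stack(frames)


hidden = simulate_rollout(11, 38, steps=8)
print(hidden.shape)  # (8, 16, 11, 38)
```

Changing `seed` draws a new kernel, which corresponds to the "fresh random weights per rollout" idea described above.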

🔤 Symbolic Vocabulary

Continuous hidden states are compressed into discrete symbolic tokens using MiniBatch KMeans clustering and cosine-similarity assignments.

🎲 Random NCA ➡️ 🎞️ 500 Frame Rollout ➡️ 🧩 Hidden-State Extraction ➡️ 🎯 KMeans Quantization ➡️ ✨ Symbolic Sequences

What is a token? A token represents a specific, quantized combination of the 16 hidden channels. By assigning each cell a discrete ID from 0 to 31, we map high-dimensional continuous dynamics into a text-like representation.

  • Vocabulary Size: 32 distinct symbols.
  • Shared Reference: The centroids.pt file defines this global vocabulary across all 5M+ rollouts. This means Token 7 in sequence A means exactly the same structural latent state as Token 7 in sequence B.
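A cosine-similarity assignment of hidden states to centroids might look like the sketch below. In practice the centroids would come from centroids.pt, whose internal layout this card does not document, so the example uses a small hand-made centroid matrix instead; the function name and array shapes are assumptions.

```python
import numpy as np


def quantize(hidden: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each cell to the centroid with the highest cosine similarity.

    hidden: (C, H, W) continuous state; centroids: (K, C).
    Returns an (H, W) integer array of token IDs in [0, K).
    """
    c, h, w = hidden.shape
    cells = hidden.reshape(c, h * w).T  # (H*W, C), one row per cell
    cells = cells / (np.linalg.norm(cells, axis=1, keepdims=True) + 1e-8)
    cents = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8)
    sims = cells @ cents.T              # (H*W, K) cosine similarities
    return sims.argmax(axis=1).reshape(h, w)


# Toy example: 3 channels, 3 centroids along the coordinate axes.
toy_centroids = np.eye(3)
toy_hidden = np.zeros((3, 1, 3))
toy_hidden[:, 0, 0] = [2.0, 0.0, 0.0]  # points along centroid 0
toy_hidden[:, 0, 1] = [0.0, 0.0, 5.0]  # points along centroid 2
toy_hidden[:, 0, 2] = [0.0, 1.0, 0.0]  # points along centroid 1
tokens = quantize(toy_hidden, toy_centroids)
print(tokens)  # [[0 2 1]]
```

Because cosine similarity ignores vector magnitude, two cells with the same direction in channel space receive the same token regardless of scale.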

📊 Dataset Statistics

Property | Value
Total Samples | 5M+
Rollout Length | 500 Frames
Hidden Channels | 16
Vocabulary | 32 Tokens
Grid Sizes | 8×8 → 48×48
Quantization | MiniBatch KMeans
Storage Format | .npz Shards
Dynamics | Frozen Random NCA

Shard Information: The full dataset is split into manageable .npz shards. Ensure your pipeline streams or handles shard loading efficiently to avoid memory bottlenecks.
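One way to keep memory bounded is to open one shard at a time and read arrays lazily. The sketch below relies only on `numpy.load`; the array names stored inside each shard are not documented in this card, so it iterates over whatever keys are present rather than assuming any. The directory name in the usage comment is taken from the repository layout described in this card.

```python
from pathlib import Path

import numpy as np


def iter_shard_arrays(shard_dir):
    """Yield (shard_name, key, array) for every array in every .npz shard.

    Shards are opened one at a time, and each array is decompressed only
    when it is accessed, so peak memory stays near the size of one array.
    """
    for path in sorted(Path(shard_dir).glob("*.npz")):
        with np.load(path) as shard:
            for key in shard.files:
                yield path.name, key, shard[key]


# Usage (hypothetical local path):
# for name, key, arr in iter_shard_arrays("nca_dataset"):
#     print(name, key, arr.shape, arr.dtype)
```

Printing the keys, shapes, and dtypes of the first shard is a quick way to discover the actual layout before building a training pipeline.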

📁 Repository Structure & Scripts

Data & Labels Mapping

The dataset is generated in massive chunks. Each data folder has a corresponding CSV file containing the computed behavioral metrics (e.g., activity, complexity, stable states) for every rollout:

  • dataset_labels_set.csv ➡️ Describes rollouts in nca_dataset/
  • dataset_labels_set2.csv ➡️ Describes rollouts in nca_dataset_set2/
  • dataset_labels_set3.csv ➡️ Describes rollouts in nca_dataset_set3/

Utility Scripts

The repository includes several Python scripts to help you generate, load, and visualize the data:

  • generate_local.py: The core dataset generator. It initializes a random TinyNCA model, runs the dynamics, quantizes the continuous hidden states using centroids.pt, and writes the 32-token symbolic sequences into compressed .npz shards.
  • sample_usage.py: A lightweight snippet demonstrating how to iterate through the data. It streams the .npz shards and yields individual frame transitions (frame[t] to frame[t+1]), which is the standard format for training sequence or world models.
  • visualize_dataset.py: A helper script that picks a random rollout from the shards, maps the symbolic tokens to grayscale values, and renders an animated .gif to let you visually inspect the emergent patterns.
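The two ideas behind sample_usage.py and visualize_dataset.py, yielding consecutive frame pairs and mapping tokens to grayscale, can be sketched as below. This is not the repository's code: the function names and the linear grayscale mapping are assumptions.

```python
import numpy as np


def frame_transitions(rollout: np.ndarray):
    """Yield (frame[t], frame[t+1]) pairs from a (T, H, W) token array."""
    for t in range(len(rollout) - 1):
        yield rollout[t], rollout[t + 1]


def tokens_to_grayscale(frame: np.ndarray, vocab_size: int = 32) -> np.ndarray:
    """Map token IDs 0..vocab_size-1 linearly onto 0..255 (uint8)."""
    scaled = frame.astype(np.float32) * 255.0 / (vocab_size - 1)
    return scaled.astype(np.uint8)


# Synthetic stand-in for a short rollout of token IDs.
demo = np.arange(3 * 2 * 2).reshape(3, 2, 2) % 32
pairs = list(frame_transitions(demo))
print(len(pairs))                                # 2
print(tokens_to_grayscale(np.array([[0, 31]])))  # [[  0 255]]
```

The grayscale frames can then be passed to any image library to assemble an animation.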

🧠 The Sparse Long-Horizon Prediction Objective

To bypass simple identity-mapping shortcuts (where sequence models learn to simply copy-paste consecutive frames), the dataset is optimized for a sparse long-horizon prediction task:

  • Context Inputs: t₀, t₁₆, t₃₂, t₄₈ (spaced 16 frames apart)
  • Prediction Target: t₁₁₂ (a large 64-frame gap after the context)

Sparse Long-Horizon Prediction: Mapping t₀, t₁₆, t₃₂, t₄₈ → t₁₁₂
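Building one training example for this objective amounts to indexing a rollout at the stated frames. In the sketch below the `offset` parameter, which slides the window along the rollout, is an assumption and not something the card describes.

```python
import numpy as np

CONTEXT_STEPS = (0, 16, 32, 48)  # context frames stated in this card
TARGET_STEP = 112                # prediction target stated in this card


def sparse_sample(rollout: np.ndarray, offset: int = 0):
    """Return (context, target) from a (T, H, W) rollout.

    context has shape (4, H, W); target has shape (H, W).
    `offset` shifts the whole window and is a hypothetical extension.
    """
    context = np.stack([rollout[offset + t] for t in CONTEXT_STEPS])
    target = rollout[offset + TARGET_STEP]
    return context, target


# Synthetic rollout where every cell of frame t holds the value t.
demo = np.arange(500)[:, None, None] * np.ones((1, 2, 2), dtype=int)
context, target = sparse_sample(demo)
print(context[:, 0, 0])  # [ 0 16 32 48]
print(target[0, 0])      # 112
```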

🎯 Task Difficulty & Mechanics

  • Bypassing the Continuity Shortcut: Between t₄₈ and t₁₁₂, approximately 83.01% of the cells change state. Copying the last context frame or learning an identity mapping therefore yields a high cross-entropy loss, forcing models to internalize the underlying ResNCA transition dynamics.
  • Temporal Abstraction: Models must learn high-level temporal transitions over the 64-step interval, testing their capacity for long-term reasoning, scale-generalization, and dynamic system emulation.
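The continuity shortcut can be measured directly as the accuracy of a copy baseline. The sketch below is illustrative (function names are not from the repository); with roughly 83.01% of cells changing between t₄₈ and t₁₁₂, copying t₄₈ would be correct on only about 17% of cells.

```python
import numpy as np


def change_fraction(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of cells whose token differs between two frames."""
    return float((a != b).mean())


def copy_baseline_accuracy(last_context: np.ndarray, target: np.ndarray) -> float:
    """Per-cell accuracy of predicting the target by copying the last context frame."""
    return 1.0 - change_fraction(last_context, target)


# Toy frames: 3 of 4 cells differ.
frame_a = np.array([[0, 1], [2, 3]])
frame_b = np.array([[0, 9], [9, 9]])
print(change_fraction(frame_a, frame_b))         # 0.75
print(copy_baseline_accuracy(frame_a, frame_b))  # 0.25
```

Reporting a model's accuracy alongside this baseline makes it clear how much of the score comes from modeling the dynamics and how much from static cells.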

🎯 Use Cases

  • Sequence Reasoning & Pretraining: Train/fine-tune small transformers on structured reasoning. The dataset acts as a synthetic "physics" engine for sequence models.
  • World Model Learning: Multi-scale grids (8×8 → 48×48) make this a perfect testbed for scale-generalization in predictive world models.
  • Evaluating Abstraction: Test if your SSM (Mamba, etc.) or Transformer generalizes rules instead of memorizing patterns.
  • Artificial Life Research: Study how lifelike behaviors (oscillators, diffusion) emerge from simple localized rules.
  • Anomaly Detection: Train a model on "normal" NCA dynamics and probe its detection of out-of-distribution transitions.

⚠️ Limitations

  • Uncontrolled Diversity: Because the NCA weights are completely random and frozen, the emergent phenomena are heavily diverse but not systematically curated or balanced.
  • Coarse Vocabulary: The 32-token limit compresses high-dimensional behavior heavily. Certain fine-grained structural changes might be smoothed out.

📄 Citation

If you use this dataset in your research, please cite it:

@misc{nca_sequences_5m,
  author = {Tejaskumar Reddy J},
  title = {Emergent NCA Sequences 5M: Massive-Scale Synthetic Symbolic Dynamics},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/Tejaskumar/Emergent-NCA-Sequences-5M}},
}

“Complexity emerging from locality.”

🌀 Local rules → emergent worlds.
