✨ Why this dataset?
Emergent NCA Sequences 5M generates complex global behaviors entirely from frozen random Neural Cellular Automata. What makes this approach powerful?
- Controlled Diversity: Each rollout uses a fresh set of random weights, creating massive diversity in dynamical systems without hand-crafting rules.
- Stable Semantics: Continuous hidden states are compressed into a global 32-token vocabulary (centroids.pt), guaranteeing structurally comparable but dynamically unique sequences.
- No Memorization: Because dynamics are deterministic-given-weights but highly diverse across rollouts, sequence models must genuinely internalize transition rules.
📊 Dataset Dynamics Analysis
The Emergent NCA Sequences dataset represents a rich, spatiotemporal dynamical system. Rather than static text or images, it encodes complex local rules that evolve over 500 steps. Below is a comprehensive analysis of the emergent behaviors, dimensional distributions, and temporal dynamics.
1. Global Dynamics & Behavioral Taxonomy
We categorize and evaluate the emergent behaviors of the cellular automata across the dataset. The dynamics are overwhelmingly active and structured:
Figures: Emergent Behavior Distribution (donut chart); NCA Phase Space: Volatility vs. Token Entropy.
- Static vs. Dynamic Rollouts: The dataset is designed to be highly active. Almost all rollouts remain continuously active or chaotic, with less than 0.1% collapsing into completely static attractors (frozen states).
- Chaos vs. Periodic (Oscillators): A significant portion of the sequences fall into robust periodic loops (breathers), allowing models to learn recurring temporal rhythms. Other sequences exhibit chaotic, turbulent, or wave-like diffusion across the grid.
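The static / periodic / active split above can be approximated with a simple check on the tail of a rollout. This is a minimal sketch, not the method used to produce the label CSVs: the function name `classify_rollout`, the `max_period` cutoff, and the exact-match criterion are all assumptions, and exact matching may be too strict for rollouts affected by perturbation noise.

```python
import numpy as np

def classify_rollout(tokens: np.ndarray, max_period: int = 64) -> str:
    """Label a (T, H, W) token rollout from its final frames (illustrative heuristic)."""
    last = tokens[-1]
    # Static: the final frame is identical to the one before it.
    if np.array_equal(last, tokens[-2]):
        return "static"
    # Periodic: the final frame reappears exactly p steps earlier.
    for p in range(2, max_period + 1):
        if p < len(tokens) and np.array_equal(last, tokens[-1 - p]):
            return f"periodic(p={p})"
    # Otherwise treat the rollout as still active (chaotic, turbulent, or wave-like).
    return "active"
```

A tolerance-based variant (for example, "at least 99% of cells match") would be a natural extension if exact matches turn out to be rare.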
2. Geometry & Spatial Dimensionality
Grid sizes vary dynamically across rollouts, presenting an ideal testbed for variable-length sequence modeling and multi-scale generalization.
Figure: Distribution of Heights, Widths, Shape Ratios, and Tokens per Frame.
- Grid Dimensions: Heights range from 11 to 45 cells; widths range from 9 to 46 cells.
- Token Length Profile:
  - Average Tokens per Frame: 724.4 cells (Median: 495.0).
  - Average Sequence Tokens per Rollout: 362,178.2 total tokens (reaching up to a maximum of 880,000 tokens over 500 frames).
- Variable-shaped grid dimensions ensure sequence models must dynamically allocate memory and scale attention mechanisms across varying lengths.
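The token counts above follow directly from grid area times rollout length. As a quick sanity check, assuming one token per cell per frame (the helper name `tokens_per_rollout` is illustrative only):

```python
def tokens_per_rollout(height: int, width: int, frames: int = 500) -> int:
    """Total tokens in one rollout: one token per cell per frame."""
    return height * width * frames

# An 11x38 grid has 418 cells per frame.
small = tokens_per_rollout(11, 38)   # 209000 tokens over 500 frames

# The reported maximum of 880,000 tokens corresponds to 1,760 cells per frame.
max_cells_per_frame = 880_000 // 500
```

This kind of estimate is useful when budgeting context length or batch memory for the largest grids.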
3. Symbolic Vocabulary & Token Distribution
The continuous hidden states are mapped to a discrete 32-token vocabulary using MiniBatch KMeans. The state distribution reveals the density and structures of the active shapes:
Figures: Discrete Token Vocabulary Frequency (%); Token Complexity (Unique Tokens used per Rollout).
- Top Vocab Occupancy: The discrete cell states are non-uniformly distributed. The top 5 most frequent states dominate:
  - Token 19: 24.43%
  - Token 31: 16.62%
  - Token 7: 10.87%
  - Token 17: 9.83%
  - Token 24: 7.77%
- Empty Space Mapping: Token 0 and dominant low-activity tokens represent background space, enabling sparse active boundaries and localized structures.
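A frequency table like the one above can be recomputed from any set of token arrays. The sketch below assumes tokens are stored as non-negative integers in the range 0 to 31; the function name `token_frequencies` is hypothetical. It also sums the five reported percentages, which together cover roughly 69.5% of all cells.

```python
import numpy as np

VOCAB_SIZE = 32

def token_frequencies(tokens: np.ndarray) -> np.ndarray:
    """Return the percentage of cells assigned to each of the 32 tokens."""
    counts = np.bincount(tokens.ravel(), minlength=VOCAB_SIZE)
    return 100.0 * counts / counts.sum()

# Percentages reported in this card for the five most frequent tokens.
reported_top5 = {19: 24.43, 31: 16.62, 7: 10.87, 17: 9.83, 24: 7.77}
top5_share = round(sum(reported_top5.values()), 2)  # combined share in percent
```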
4. Rollout Evolution & Spatiotemporal Physics
Let's look at a selected rollout (Rollout 0, size 11x38) as it evolves over the 500-frame horizon:
Figure: Physical grid state evolution across time steps t ∈ [0, 499].
- Initial Chaotic Phase: The automaton starts in an active state and rapidly diffuses local updates through 3x3 residual convolutions.
- Attractor/Periodic State: Over time, the local transitions converge into highly structured, repeating spatial shapes and periodic breathers.
5. Transition Curves & Temporal Decay
By calculating frame-to-frame change dynamics, we can quantify the entropy and rate of change of the NCA:
Figures: Entropy, Novelty (Frame-to-Frame change), and Cumulative Difference; Temporal Similarity Decay: Chaotic phase (t = 0) vs. Stabilized phase (t = 100).
- Entropy Curves: Quantifies cellular state diversity. The system maintains high structural complexity, with clear indicators of the transition to stable states.
- Transition Activity (Novelty): Shows the frame-to-frame changes. The novelty peaks early during the emergent phase and then stabilizes as the system converges to attractors.
- Lag Similarity Decay: In the early chaotic phase (t = 0), matching states decay rapidly over a short time lag, showing high volatility. In contrast, once the attractor phase (t = 100) is reached, matching remains high even over hundreds of steps, verifying robust periodic or stable attractors.
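The three curves described above can be sketched as follows. The exact definitions used for the plots are not given in this card, so these are plausible assumptions: Shannon entropy of the per-frame token histogram, the fraction of cells that change between consecutive frames, and the fraction of cells that match a reference frame after a given lag. All function names are illustrative.

```python
import numpy as np

def frame_entropy(frame: np.ndarray, vocab_size: int = 32) -> float:
    """Shannon entropy (bits) of the token histogram of one (H, W) frame."""
    counts = np.bincount(frame.ravel(), minlength=vocab_size)
    p = counts[counts > 0] / frame.size
    return float(-(p * np.log2(p)).sum())

def novelty(tokens: np.ndarray) -> np.ndarray:
    """Fraction of cells that change between consecutive frames; shape (T-1,)."""
    return (tokens[1:] != tokens[:-1]).mean(axis=(1, 2))

def lag_similarity(tokens: np.ndarray, start: int, max_lag: int) -> np.ndarray:
    """Fraction of cells equal to frame `start` after lags 1..max_lag."""
    ref = tokens[start]
    return np.array([(tokens[start + k] == ref).mean() for k in range(1, max_lag + 1)])
```

Comparing `lag_similarity(tokens, 0, 100)` with `lag_similarity(tokens, 100, 100)` on a (500, H, W) rollout reproduces the chaotic-versus-stabilized comparison in spirit.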
🧬 Neural Cellular Automata Architecture
The dataset employs a lightweight Residual NCA architecture, uniquely initialized for every single rollout:
⬇️ Inject
🧱 Local 3x3 Interaction Convolutions
⬇️ Flow
🔄 Residual Hidden-State Updates
⬇️ Add Noise
🌫️ Stochastic Perturbation Noise
⬇️ Produce
Why randomize? Each sequence uses a fresh set of random weights, creating unparalleled diversity in the dynamical systems while strictly sharing a common symbolic vocabulary.
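To make the update rule concrete, here is a minimal NumPy sketch of one residual NCA step with frozen random weights. It is not the repository's `TinyNCA` implementation: the `tanh` nonlinearity, the wrap-around padding, the weight scale, and the noise level are all assumptions chosen for illustration. Only the 3x3 neighborhood, the residual update, the added noise, and the 16 hidden channels come from this card.

```python
import numpy as np

def nca_step(state, weights, rng, noise_std=0.01):
    """One residual update. state: (C, H, W); weights: (C_out, C_in, 3, 3)."""
    c, h, w = state.shape
    # Assumption: toroidal (wrap-around) boundary; the real model may pad differently.
    padded = np.pad(state, ((0, 0), (1, 1), (1, 1)), mode="wrap")
    update = np.zeros_like(state)
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # neighbor at offset (dy-1, dx-1)
            update += np.einsum("oi,ihw->ohw", weights[:, :, dy, dx], patch)
    noise = rng.normal(0.0, noise_std, size=state.shape)
    # Residual hidden-state update plus stochastic perturbation.
    return state + np.tanh(update) + noise

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 16, 3, 3)) * 0.1   # frozen: sampled once per rollout
state = rng.normal(size=(16, 11, 38))             # 16 hidden channels on an 11x38 grid
for _ in range(5):                                # a full rollout would run 500 steps
    state = nca_step(state, weights, rng)
```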
🔤 Symbolic Vocabulary
Continuous hidden states are compressed into discrete symbolic tokens using MiniBatch KMeans clustering and cosine-similarity assignments.
🎲 Random NCA ➡️ 🎞️ 500 Frame Rollout ➡️ 🧩 Hidden-State Extraction ➡️ 🎯 KMeans Quantization ➡️ ✨ Symbolic Sequences
What is a token? A token represents a specific, quantized combination of the 16 hidden channels. By assigning each cell a discrete ID from 0 to 31, we map high-dimensional continuous dynamics into a text-like representation.
- Vocabulary Size: 32 distinct symbols.
- Shared Reference: The centroids.pt file defines this global vocabulary across all 5M+ rollouts, so Token 7 in sequence A denotes exactly the same structural latent state as Token 7 in sequence B.
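The cosine-similarity assignment can be sketched as below. This assumes the centroids form a (32, 16) array, one row per token over the 16 hidden channels; the function name `quantize` and the random stand-in centroids are illustrative only. The real centroids live in centroids.pt, whose loading is not shown here.

```python
import numpy as np

def quantize(hidden: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map hidden states (C, H, W) to token IDs (H, W) by highest cosine similarity."""
    c, h, w = hidden.shape
    cells = hidden.reshape(c, -1).T                       # (H*W, C), one row per cell
    cells = cells / (np.linalg.norm(cells, axis=1, keepdims=True) + 1e-8)
    cents = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8)
    sims = cells @ cents.T                                # (H*W, K) cosine similarities
    return sims.argmax(axis=1).reshape(h, w)

rng = np.random.default_rng(0)
stand_in_centroids = rng.normal(size=(32, 16))            # placeholder, NOT centroids.pt
tokens = quantize(rng.normal(size=(16, 11, 38)), stand_in_centroids)
```

Because every rollout is quantized against the same centroid table, token IDs stay comparable across sequences even though each rollout uses different NCA weights.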
📊 Dataset Statistics
| Property | Value | Property | Value |
|---|---|---|---|
| Total Samples | 5M+ | Grid Sizes | 8×8 → 48×48 |
| Rollout Length | 500 Frames | Quantization | MiniBatch KMeans |
| Hidden Channels | 16 | Storage Format | .npz Shards |
| Vocabulary | 32 Tokens | Dynamics | Frozen Random NCA |
Shard Information:
The full dataset is split into manageable .npz shards. Ensure your pipeline streams or handles shard loading efficiently to avoid memory bottlenecks.
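One way to keep memory bounded is to open a single shard at a time and yield its arrays lazily. The sketch below makes no assumption about the array names stored inside each shard: it simply walks every entry reported by NumPy. The function name `iter_shard_arrays` is hypothetical, and the repository's own sample_usage.py may organize this differently.

```python
from pathlib import Path
import numpy as np

def iter_shard_arrays(shard_dir):
    """Yield (shard_name, array_name, array) one array at a time from .npz shards."""
    for path in sorted(Path(shard_dir).glob("*.npz")):
        # np.load on an .npz file is lazy: each array is read only when indexed.
        with np.load(path) as shard:
            for name in shard.files:
                yield path.name, name, shard[name]
```

Only one decompressed array is held in memory at a time, so the peak footprint depends on the largest single array rather than on the shard or dataset size.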
📁 Repository Structure & Scripts
Data & Labels Mapping
The dataset is generated in massive chunks. Each data folder has a corresponding CSV file containing the computed behavioral metrics (e.g., activity, complexity, stable states) for every rollout:
- dataset_labels_set.csv ➡️ Describes rollouts in nca_dataset/
- dataset_labels_set2.csv ➡️ Describes rollouts in nca_dataset_set2/
- dataset_labels_set3.csv ➡️ Describes rollouts in nca_dataset_set3/
Utility Scripts
The repository includes several Python scripts to help you generate, load, and visualize the data:
- generate_local.py: The core dataset generator. It initializes a random TinyNCA model, runs the dynamics, quantizes the continuous hidden states using centroids.pt, and writes the 32-token symbolic sequences into compressed .npz shards.
- sample_usage.py: A lightweight snippet demonstrating how to iterate through the data. It streams the .npz shards and yields individual frame transitions (frame[t] to frame[t+1]), which is the standard format for training sequence or world models.
- visualize_dataset.py: A helper script that picks a random rollout from the shards, maps the symbolic tokens to grayscale values, and renders an animated .gif to let you visually inspect the emergent patterns.
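For orientation, the two operations described for sample_usage.py and visualize_dataset.py might look roughly like this. These are stand-in sketches, not copies of the scripts: the function names and the linear token-to-gray mapping are assumptions.

```python
import numpy as np

def frame_transitions(rollout: np.ndarray):
    """Yield (frame[t], frame[t+1]) pairs from a (T, H, W) token rollout."""
    for t in range(len(rollout) - 1):
        yield rollout[t], rollout[t + 1]

def tokens_to_grayscale(frame: np.ndarray, vocab_size: int = 32) -> np.ndarray:
    """Map token IDs 0..vocab_size-1 linearly onto 0..255 gray levels (uint8)."""
    scaled = frame.astype(np.float32) * 255.0 / (vocab_size - 1)
    return scaled.round().astype(np.uint8)
```

The resulting uint8 frames can be handed to any image library to write an animated .gif.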
🧠 The Sparse Long-Horizon Prediction Objective
To bypass simple identity-mapping shortcuts (where sequence models learn to simply copy-paste consecutive frames), the dataset is optimized for a sparse long-horizon prediction task:
- Context Inputs: t₀, t₁₆, t₃₂, t₄₈ (spaced 16 frames apart)
- Prediction Target: t₁₁₂ (a large 64-frame gap after the context)
Figure: Sparse Long-Horizon Prediction, mapping t₀, t₁₆, t₃₂, t₄₈ → t₁₁₂.
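Building one training example for this objective amounts to indexing a rollout at the listed steps. A minimal sketch, assuming a rollout is a (T, H, W) integer array; the `offset` parameter for sliding the window along the rollout is an addition for illustration, not something this card specifies.

```python
import numpy as np

CONTEXT_STEPS = (0, 16, 32, 48)   # four context frames, 16 steps apart
TARGET_STEP = 112                 # 64 steps after the last context frame

def make_sparse_sample(rollout: np.ndarray, offset: int = 0):
    """Return (context, target): context is (4, H, W), target is (H, W)."""
    context = np.stack([rollout[offset + t] for t in CONTEXT_STEPS])
    target = rollout[offset + TARGET_STEP]
    return context, target
```

With 500-frame rollouts, any `offset` from 0 to 387 keeps the target inside the rollout.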
🎯 Task Difficulty & Mechanics
- Bypassing the Continuity Shortcut: Between t₄₈ and t₁₁₂, approximately 83.01% of the cells undergo state changes. Copying the last context frame (an identity mapping) therefore yields a poor cross-entropy loss, forcing models to internalize the underlying ResNCA transition dynamics.
- Temporal Abstraction: Models must learn high-level temporal transitions over the 64-step interval, testing their capacity for long-term reasoning, scale-generalization, and dynamic system emulation.
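The change rate quoted above is straightforward to measure on any rollout, and it bounds how well a copy-the-last-frame baseline can do. A small sketch with hypothetical function names:

```python
import numpy as np

def changed_fraction(rollout: np.ndarray, t_from: int = 48, t_to: int = 112) -> float:
    """Fraction of cells whose token differs between two frames of a (T, H, W) rollout."""
    return float((rollout[t_from] != rollout[t_to]).mean())

def copy_baseline_accuracy(rollout: np.ndarray, t_from: int = 48, t_to: int = 112) -> float:
    """Per-cell accuracy of predicting frame t_to by copying frame t_from."""
    return 1.0 - changed_fraction(rollout, t_from, t_to)
```

At the reported 83.01% change rate, a copy baseline would be right on roughly 17% of cells, which gives a useful floor when evaluating a trained model.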
🎯 Use Cases
- Sequence Reasoning & Pretraining: Train/fine-tune small transformers on structured reasoning. The dataset acts as a synthetic "physics" engine for sequence models.
- World Model Learning: Multi-scale grids (8×8 → 48×48) make this a perfect testbed for scale-generalization in predictive world models.
- Evaluating Abstraction: Test if your SSM (Mamba, etc.) or Transformer generalizes rules instead of memorizing patterns.
- Artificial Life Research: Study how lifelike behaviors (oscillators, diffusion) emerge from simple localized rules.
- Anomaly Detection: Train a model on "normal" NCA dynamics and probe its detection of out-of-distribution transitions.
⚠️ Limitations
- Uncontrolled Diversity: Because the NCA weights are completely random and frozen, the emergent phenomena are heavily diverse but not systematically curated or balanced.
- Coarse Vocabulary: The 32-token limit compresses high-dimensional behavior heavily. Certain fine-grained structural changes might be smoothed out.
📄 Citation
If you use this dataset in your research, please cite it:
@misc{nca_sequences_5m,
author = {Tejaskumar Reddy J},
title = {Emergent NCA Sequences 5M: Massive-Scale Synthetic Symbolic Dynamics},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/Tejaskumar/Emergent-NCA-Sequences-5M}},
}
“Complexity emerging from locality.”
🌀 Local rules → emergent worlds.