Data Infrastructure for AI at Scale

AGI systems are fundamentally bound by their data layers. Training frontier models requires petabyte-scale data pipelines, and deploying them requires infrastructure that can serve millions of requests with sub-second latency. The SRE and data engineering challenges behind superintelligence are as hard as the ML itself.

Data Lakehouse & Analytical Processing

Name	Description	Links
Apache Iceberg	Open table format for massive analytic datasets. Provides ACID transactions, time travel, schema evolution, and partition evolution on data lakes. An emerging standard for AI training data management (adopted by Netflix, Apple, Snowflake). Format version 3 adds row-lineage tracking.	iceberg.apache.org, GitHub
Apache Spark	Unified analytics engine for large-scale data processing. Widely used in the data pipelines behind large-model training: ETL, feature engineering, and distributed data transformation at petabyte scale.	spark.apache.org, GitHub
Delta Lake	Open-source storage layer providing ACID transactions on data lakes. Originally developed by Databricks and now a Linux Foundation project. Delta UniForm provides interoperability with Iceberg and Hudi.	delta.io, GitHub
Greenplum	Massively Parallel Processing (MPP) database for large-scale analytics and AI workloads. Open-source, PostgreSQL-based, purpose-built for analytical queries across petabytes. Used in enterprise AI pipelines for feature computation and data preparation.	greenplum.org, GitHub
DuckDB	In-process analytical (OLAP) database that runs embedded in the host application, with no separate server. Fast columnar queries on local data. Increasingly used for dataset analysis, feature engineering, and rapid prototyping in ML workflows.	duckdb.org, GitHub
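
The ACID commits and time travel that the table formats above describe can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Iceberg or Delta Lake API: `Snapshot`, `SnapshotTable`, `commit`, and `scan` are hypothetical names, and the file names are made up. It assumes a table is a chain of immutable snapshots, each listing the data files visible at that point.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: an id plus the data files it contains."""
    snapshot_id: int
    files: Tuple[str, ...]


@dataclass
class SnapshotTable:
    """Toy table whose state is a chain of immutable snapshots (hypothetical)."""
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base_id: int, added: Sequence[str],
               removed: Sequence[str] = ()) -> Snapshot:
        # Optimistic concurrency: reject the commit if another writer
        # advanced the table after this writer read `base_id`.
        if base_id != self.current.snapshot_id:
            raise RuntimeError("stale base snapshot; re-read and retry")
        dropped = set(removed)
        kept = [f for f in self.current.files if f not in dropped]
        new = Snapshot(base_id + 1, tuple(kept) + tuple(added))
        # In this toy, appending one snapshot stands in for the atomic
        # metadata pointer swap a real table format performs.
        self.snapshots.append(new)
        return new

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        # Time travel: read any earlier snapshot by id; default is the latest.
        if snapshot_id is None:
            return self.current.files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(snapshot_id)


# Usage: two commits, the second replacing the first file (a compaction, say).
table = SnapshotTable()
s1 = table.commit(0, ["part-000.parquet"])
s2 = table.commit(s1.snapshot_id, ["part-001.parquet"], removed=["part-000.parquet"])
```

Because old snapshots are never mutated, `table.scan(s1.snapshot_id)` still returns the file list as it stood after the first commit, which is the property that makes a training run reproducible against a fixed version of a dataset.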

MLOps & Experiment Tracking

Name	Description	Links
MLflow	Open-source platform for the complete ML lifecycle: experiment tracking, model registry, deployment, and model evaluation. A widely adopted MLOps platform. 20k+ GitHub stars.	mlflow.org, GitHub
Weights & Biases (W&B)	ML experiment tracking, dataset versioning, and model management. Used by OpenAI, DeepMind, and many other frontier labs to track training runs.	wandb.ai
Kubeflow	ML toolkit for Kubernetes. Manages ML workflows: training pipelines, hyperparameter tuning, model serving, and notebook environments on Kubernetes clusters.	kubeflow.org, GitHub
Ray	Unified framework for scaling AI applications. Ray Train for distributed training, Ray Serve for inference, Ray Data for preprocessing. Commercially backed by Anyscale and used by OpenAI, Uber, and Spotify.	ray.io, GitHub
Feast	Open-source feature store for ML. Bridges the gap between training and serving by providing consistent access to feature data across offline training and online inference.	feast.dev, GitHub
Label Studio	Open-source data labeling platform for text, image, audio, video, and multi-modal tasks. Critical infrastructure for creating the human-annotated data that drives RLHF and supervised fine-tuning. 20k+ stars.	labelstud.io, GitHub
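
The training/serving consistency that a feature store provides can be sketched in a few lines. This is an illustrative toy in plain Python, not Feast's API: `ToyFeatureStore`, `write`, `as_of`, and `latest` are hypothetical names, and timestamps are plain integers. It assumes the key idea is a single store with two read paths: a point-in-time lookup for building training sets (so no future values leak into a training row) and a latest-value lookup for online inference.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyFeatureStore:
    """Keeps timestamped feature values per entity, ordered by time (hypothetical)."""

    def __init__(self):
        # entity_id -> sorted timestamps, and the values at those timestamps
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id, timestamp, value):
        # Insert while keeping each entity's history ordered by timestamp.
        idx = bisect_right(self._times[entity_id], timestamp)
        self._times[entity_id].insert(idx, timestamp)
        self._values[entity_id].insert(idx, value)

    def as_of(self, entity_id, timestamp):
        """Offline path: the value known at `timestamp`, never a later one."""
        idx = bisect_right(self._times.get(entity_id, []), timestamp)
        return self._values[entity_id][idx - 1] if idx else None

    def latest(self, entity_id):
        """Online path: the most recent value, for inference-time lookups."""
        values = self._values.get(entity_id, [])
        return values[-1] if values else None


# Usage: the same store answers both training and serving reads.
store = ToyFeatureStore()
store.write("user_1", 10, 0.2)
store.write("user_1", 20, 0.9)

# Each training row asks for the feature as it was at the row's event time.
training_rows = [("user_1", 15), ("user_1", 25)]
training_features = [store.as_of(entity, ts) for entity, ts in training_rows]

# Online inference asks for the freshest value.
serving_feature = store.latest("user_1")
```

The row with event time 15 sees only the value written at time 10, even though a newer value exists; using the newer value there would be the kind of leakage that causes training/serving skew.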