Baby VLM: Democratizing Research on the Pretraining of Vision Large Language Models
Pretraining vision foundation models (VFMs) is prohibitively expensive, making it a privilege of institutions with abundant resources and confining independent researchers to downstream tasks such as benchmarking, interpreting, and aligning VFMs. This situation is a crisis for computer vision research: as Richard Feynman famously wrote, “What I cannot create, I do not understand.” Independent researchers and the public cannot truly understand, trust, or safely use VFMs by passively consuming open weights or APIs. Meanwhile, the few privileged VFM creators may soon reach a plateau without the nurturing of the broad research community.
Hence, we propose democratizing VFM pretraining by scaling it down to a developmentally plausible framework that is scientifically sound and computationally affordable on university budgets. The goal is to promote exploration rather than exploitation in pretraining research and to enable independent researchers to build general-purpose VFMs that approach “baby intelligence,” thereby benefiting efforts toward “grown-up” AI. The framework closely mimics the minimal yet highly informative sensory experiences of human infants and encompasses 1) pretraining data curated from longitudinal, egocentric audiovisual recordings of babies; 2) a suite of developmentally aligned evaluation benchmarks that assess VFM capabilities against cognitive milestones such as object permanence, social skills, and language acquisition; and 3) a user-friendly pretraining codebase and baseline models.
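To make the scaled-down setting concrete, the sketch below shows one plausible shape such a pretraining codebase could take: a small contrastive vision-language objective over egocentric video frames paired with transcribed caregiver utterances. This is a minimal illustration under assumed design choices; the encoder sizes, vocabulary, class names (TinyVisionEncoder, TinyTextEncoder, clip_style_loss), and the frame/utterance pairing scheme are hypothetical and not the actual Baby VLM implementation.

```python
# Hypothetical sketch: contrastive pretraining on infant egocentric data.
# All model sizes and data shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVisionEncoder(nn.Module):
    """Small CNN encoder sized for a university-budget GPU (assumption)."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames):
        return self.proj(self.conv(frames).flatten(1))


class TinyTextEncoder(nn.Module):
    """Bag-of-embeddings encoder for transcribed caregiver speech (assumption)."""
    def __init__(self, vocab_size: int = 8192, embed_dim: int = 256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, token_ids):
        return self.embed(token_ids)


def clip_style_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss over matched frame/utterance pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy batch standing in for egocentric video frames and utterance tokens.
    frames = torch.randn(8, 3, 64, 64)
    utterances = torch.randint(0, 8192, (8, 12))

    vision, text = TinyVisionEncoder(), TinyTextEncoder()
    optimizer = torch.optim.AdamW(
        list(vision.parameters()) + list(text.parameters()), lr=1e-4)

    loss = clip_style_loss(vision(frames), text(utterances))
    loss.backward()
    optimizer.step()
    print(f"toy contrastive loss: {loss.item():.3f}")
```

A loop of this size trains in minutes on a single commodity GPU, which is the point of the scaled-down framework: the developmental data curation and milestone benchmarks carry the scientific weight, while the objective itself stays simple and inspectable.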