
This AI learned language by observing the world from a baby’s perspective

February 2, 2024

The artificial intelligence was taught using video and audio from a helmet-mounted camera worn by Sam, who was 18 months old in the recordings. Credit: Wai Keen Vong

An artificial intelligence (AI) model has learned to recognize words such as ‘crib’ and ‘ball’ by studying headcam recordings of a tiny slice of a single baby’s life. The results show that AI can help us to understand how humans learn, says Wai Keen Vong, co-author of the study and an AI researcher at New York University. That was previously unclear, because other language-learning models such as ChatGPT learn from billions of data points, which bear little resemblance to the real-world experiences of an infant, Vong says. “We don’t get given the internet when we’re born.”

The researchers hope that the findings, published in Science on 1 February, will feed into long-standing debates about how children acquire language. The AI learned only by building associations between the images and words it encountered together; it was not programmed with any prior knowledge of language. That challenges some cognitive-science theories which hold that, to attach meaning to words, babies need innate knowledge about how language works, Vong says.
The study offers “a fascinating approach” to understanding early language acquisition in children, says Heather Bortfeld, a cognitive scientist at the University of California, Merced.

Baby’s-eye view
Vong and his team used 61 hours of footage from a camera attached to a helmet worn by a baby boy named Sam, to capture experiences from the infant’s perspective. Sam, who lives near Adelaide in Australia, wore the camera for around one hour twice each week (roughly 1% of his waking hours), from the age of six months to around two years.

The researchers trained their neural network, an AI inspired by the structure of the brain, on frames from the video and words spoken to Sam, transcribed from the recordings. In total, the model was exposed to 250,000 words and the corresponding images, captured during activities such as playing, reading and eating. It used a technique called contrastive learning to work out which images and text tend to occur together and which do not, building up information that it could use to predict which images certain words, such as ‘ball’ and ‘bowl’, refer to.
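
The article does not detail the training objective, but a common way to implement contrastive learning over image–text pairs is a CLIP-style symmetric InfoNCE loss, in which matched frame–utterance pairs are pulled together in a shared embedding space and mismatched pairs are pushed apart. The sketch below is illustrative only; the embedding dimensions and temperature are placeholder assumptions, not the authors’ setup.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image-text pairs.

    The i-th image and i-th text are a matched pair (the diagonal of the
    similarity matrix); all other combinations serve as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)       # unit-length embeddings
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(len(logits))              # correct match is the diagonal
    # Cross-entropy in both directions: image-to-text and text-to-image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 video-frame embeddings paired with 8 utterance embeddings
frames, utterances = torch.randn(8, 256), torch.randn(8, 256)
print(contrastive_loss(frames, utterances))
```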
To evaluate the AI, the researchers asked the model to match a word with one of four candidate images, a test that is also used to assess children’s language abilities. The model chose the correct object 62% of the time, much better than the 25% expected by chance and comparable to a similar AI model that was trained on 400 million image–text pairs from outside this dataset. For certain words, such as ‘apple’ and ‘dog’, the model could correctly identify previously unseen examples, something humans generally find relatively easy; on average, it succeeded 35% of the time. The AI was best at identifying objects that occur frequently in the training data and vary little in appearance, but it struggled with words that can refer to a variety of different items, such as ‘toy’.
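
Such a four-alternative test is straightforward to score once words and images live in the same embedding space: pick the candidate image most similar to the word and count how often that pick is correct. The snippet below is a hypothetical illustration with random placeholder embeddings, which is why it scores near the 25% chance level rather than the model’s 62%.

```python
import torch
import torch.nn.functional as F

def four_afc_accuracy(word_embs, candidate_embs, target_idx):
    """Four-alternative forced choice over a shared embedding space.

    word_embs:      (N, D) embeddings of the N test words
    candidate_embs: (N, 4, D) embeddings of four candidate images per word
    target_idx:     (N,) index of the correct image among the four
    """
    word_embs = F.normalize(word_embs, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    sims = torch.einsum('nd,nkd->nk', word_embs, candidate_embs)  # cosine similarity
    return (sims.argmax(dim=-1) == target_idx).float().mean().item()

# Random placeholders: 100 trials, 4 candidates each, 256-dim embeddings
words = torch.randn(100, 256)
candidates = torch.randn(100, 4, 256)
targets = torch.randint(0, 4, (100,))
print(four_afc_accuracy(words, candidates, targets))  # ~0.25 by chance
```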

Lessons about learning
Because the study used data from just one child, questions may arise about the generalizability of its findings, given how much children’s experiences and environments vary, says Bortfeld. Nevertheless, the exercise revealed that a great deal can be learnt in the earliest stages of infancy purely by forming associations between different sensory sources, she adds. The findings also challenge scientists, such as the US linguist Noam Chomsky, who argue that language is too complex, and the input of information too sparse, for language acquisition to happen through general learning processes. “These are among the strongest data I’ve seen showing that such ‘special’ mechanisms are not necessary,” says Bortfeld.

Real-world language learning is far richer and more varied than the AI experienced. The researchers note that, because the AI was trained only on still images and written text, it could not experience the interactions that are inherent to a real baby’s life. The AI struggled to learn the word ‘hand’, for example, which is usually learnt early in an infant’s life, Vong says. “Babies have their own hands, they have a lot of experience with them. That’s definitely a missing component of our model.”

“The potential for further refinements to make the model more aligned with the complexities of human learning is vast, offering exciting avenues for advancements in cognitive sciences,” says Anirudh Goyal, a machine-learning scientist at the University of Montreal in Canada.
