Researchers from Brown University presented a study at the International Conference on Learning Representations in Rio de Janeiro, Brazil, probing whether language models encode knowledge about the real world. Michael Lepori, the PhD candidate who led the work, reports "some evidence that language models have encoded something like the causal constraints of the real world" and says the models' internal representations predict human plausibility judgments.
The team designed an experiment with sentences of varying plausibility — commonplace items like "Someone cooled a drink with ice," improbable examples such as "Someone cooled a drink with snow," impossible cases like "Someone cooled a drink with fire," and nonsensical ones like "Someone cooled a drink with yesterday." Using mechanistic interpretability, which the authors describe as "neuroscience for AI systems," the researchers examined the models' internal mathematical states to see what the systems encode.
The study tested several open-source models, including OpenAI's GPT-2, Meta's Llama 3.2, and Google's Gemma 2. The researchers found that sufficiently large models developed distinct internal vectors corresponding to plausibility categories and could even separate similar categories, such as improbable versus impossible, with roughly 85% accuracy. These vectors also reflected human uncertainty on ambiguous statements, and they began to appear in models with more than 2 billion parameters, a size that is small compared with today's trillion-plus-parameter systems.
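The core technique here is a linear "probe" trained on a model's internal activations. Below is a minimal, hypothetical sketch of that idea, not the study's actual code: it assumes the Hugging Face transformers library, PyTorch, and scikit-learn, and it reuses the article's four example sentences as a toy dataset (a real probe would be trained and evaluated on many sentences per category).

```python
# Minimal sketch of a linear probe over language-model hidden states.
# This is an illustration of the general technique, not the study's code.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Toy dataset: the article's four plausibility categories, one sentence each.
# A real probe would use many labeled sentences per category plus held-out data.
sentences = {
    "probable":   "Someone cooled a drink with ice.",
    "improbable": "Someone cooled a drink with snow.",
    "impossible": "Someone cooled a drink with fire.",
    "nonsense":   "Someone cooled a drink with yesterday.",
}

def hidden_state(text: str) -> np.ndarray:
    """Return the final-layer representation of the sentence's last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, -1].numpy()

X = np.stack([hidden_state(s) for s in sentences.values()])
y = list(sentences.keys())

# Linear probe: a simple classifier trained directly on hidden states.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical test sentence, just to show how the probe is applied.
print(probe.predict([hidden_state("Someone warmed soup with a stove.")]))
```

If a simple classifier like this can tell the categories apart from hidden states alone, that is evidence the model encodes plausibility information internally; the roughly 85% accuracy figure above comes from this kind of classification.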
Key points
- Mechanistic interpretability can reveal what models encode.
- Vectors map to human plausibility judgments.
- Findings may aid development of smarter, more trustworthy models.
Difficult words
- encode — store information in a different form
- plausibility — how likely or believable something seems
- mechanistic interpretability — study of internal mechanisms in machine learning models
- representation — internal model state that stores information
- vector — a mathematical list of numbers used by models
- parameter — a numeric value that controls model behavior
- causal constraint — a rule linking cause and effect in the real world
Discussion questions
- How could knowledge of internal vectors help developers make language models more trustworthy?
- What risks or limitations might remain even if models encode causal constraints?
- Given that these vectors appeared in models with more than two billion parameters, how should teams balance model size and explainability when choosing a model?
Related articles
People with AMD judge car arrival times like others
A virtual reality study compared adults with age-related macular degeneration (AMD) and adults with normal vision. Both groups judged vehicle arrival times similarly when using vision and sound together, and no multimodal benefit appeared.
AI coach helps medical students learn suturing
Researchers at Johns Hopkins developed an explainable AI tool that gives immediate text feedback to medical students practicing suturing. A small randomized study found faster learning for students with prior experience; beginners showed less benefit.
Algae-based synthetic gel supports mammary tissue growth
In 2020 a PhD student and her adviser at UC Santa Barbara developed an algae-based synthetic membrane to support mammary epithelial cells. Their tunable gel, reported in Science Advances, can direct cell growth by changing mechanical and biochemical cues.
Reducing unsafe responses in large language models
Researchers studied how large language models (LLMs) handle safety and tested training methods to reduce unsafe outputs while keeping performance. They identified key challenges and a technique that preserves safety during fine-tuning.