New research explains why modern large language models struggle with a seemingly simple task: multiplying multi-digit numbers. The study examines how current training methods affect a model's ability to store and reuse intermediate results, a capability that long calculations require because of their long-range dependencies: partial products and running sums must be carried through many steps.
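To see why this is hard, consider what grade-school long multiplication actually demands. The short Python sketch below (an illustration, not code from the study) makes the bookkeeping explicit: every digit-pair product must be stored in the right column, and each answer digit depends on a running column sum plus carries from earlier columns.

```python
# Illustrative sketch (not from the study): digit-by-digit long
# multiplication, showing the intermediate values a model would have
# to carry across many steps to emit each answer digit.

def long_multiply(a: int, b: int) -> int:
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant first
    b_digits = [int(d) for d in str(b)][::-1]
    # running[k] accumulates every digit-pair product landing in column k
    running = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        for j, db in enumerate(b_digits):
            running[i + j] += da * db  # partial product, stored for later
    # resolve carries: each output digit depends on earlier running sums
    carry, digits = 0, []
    for column_sum in running:
        carry, digit = divmod(column_sum + carry, 10)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        digits.append(digit)
    return int("".join(map(str, digits[::-1])))

assert long_multiply(1234, 5678) == 1234 * 5678
```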
Researchers led by Xiaoyan Bai and Chenhao Tan at the University of Chicago, with collaborators from MIT, Harvard, the University of Waterloo and Google DeepMind, compared standard fine-tuning with an alternative training method called Implicit Chain of Thought (ICoT). Under standard fine-tuning, models with two to 12 layers achieved less than 1% accuracy on four-digit multiplication because they fell into a local optimum: they learned superficial patterns in the data but did not develop a mechanism to store intermediate values for later steps. By contrast, the ICoT-trained model reached 100% accuracy.
Probes of the models' internal states showed that the ICoT model encodes intermediate values: the researchers could decode running sums from its hidden states. ICoT also organizes attention along distinct temporal pathways. Early layers compute and store digit-pair products at specific positions, while later layers retrieve those values to form each digit of the final answer. During training, the team also observed digit representations built on Fourier-like bases and a geometric operation resembling a Minkowski sum.
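The article does not give the exact basis the model learned, but "Fourier-like" digit representations are often described as placing each digit on unit circles at a few frequencies. The hedged sketch below shows that construction and a nearest-codeword readout, loosely analogous to decoding digits from hidden states with a probe; the frequency choices here are an assumption for illustration.

```python
import numpy as np

# Hedged sketch: one common form of "Fourier-like" digit encoding;
# the study's actual basis and frequencies may differ.
# Each digit d in 0..9 is placed on unit circles at frequencies k.

def fourier_digit_features(d: int, freqs=(1, 2, 5)) -> np.ndarray:
    angles = [2 * np.pi * k * d / 10 for k in freqs]
    return np.array([f(a) for a in angles for f in (np.cos, np.sin)])

codebook = np.stack([fourier_digit_features(d) for d in range(10)])

def decode_digit(vec: np.ndarray) -> int:
    # nearest-codeword readout, as a linear probe might recover a digit
    return int(np.argmax(codebook @ vec))

assert all(decode_digit(fourier_digit_features(d)) == d for d in range(10))
```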
The authors then added a training objective that explicitly teaches a model to track running sums at each step. Applied to a two-layer model, this objective raised accuracy to 99% without explicit chain-of-thought supervision; the model developed attention patterns and strategies for tracking multiple digit pairs similar to those learned under ICoT. The study highlights that scaling alone does not overcome such limits, and that architectural guidance and targeted objectives can enable multi-step reasoning. "As AI is increasingly integrated into critical decision-making, it’s essential to understand its unique ways of learning and thinking," says Tan.
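One plausible way to realize such a targeted objective is an auxiliary loss that asks a linear head to read the current running-sum digit out of each hidden state, added to the usual language-modeling loss. The PyTorch sketch below is a hypothetical illustration: the module name, tensor shapes, and loss weighting are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an auxiliary "running sum" objective, in the
# spirit described above; RunningSumProbeLoss and its inputs
# (hidden_states, sum_targets) are assumed names, not the paper's code.

class RunningSumProbeLoss(nn.Module):
    def __init__(self, hidden_dim: int, num_classes: int = 10):
        super().__init__()
        # linear head reads the current running-sum digit from each state
        self.head = nn.Linear(hidden_dim, num_classes)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, hidden_states: torch.Tensor,
                sum_targets: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        # sum_targets:   (batch, seq_len) running-sum digit at each step
        logits = self.head(hidden_states)
        return self.ce(logits.flatten(0, 1), sum_targets.flatten())

# Training would combine it with the usual LM loss, e.g.:
#   loss = lm_loss + aux_weight * running_sum_loss(h, targets)
```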
- Key mechanisms: encoded intermediate values
- Distinct attention pathways across time
- Fourier-like digit representation observed
- Targeted objectives can greatly improve performance
Difficult words
- intermediate — values or steps between start and end
- fine-tuning — adjusting a trained model with new data
- local optimum — a solution that is not globally best
- encode — store information in an internal form
- attention — mechanism to weigh or focus information
- temporal — related to time or sequence order
- objective — a specific goal used during training
- scale — increase model size or computing resources
- mechanism — a structure or process that produces behavior
Discussion questions
- How might targeted training objectives like the running-sum objective change the reliability of AI in real-world decisions?
- Can the attention and intermediate-value strategies described for multiplication be useful for other multi-step tasks? Why or why not?
- The article says scaling alone does not fix some limits. What trade-offs should researchers consider between scaling models and adding architectural guidance?