A team led by Xiaoyan Bai and Chenhao Tan at the University of Chicago, with collaborators from MIT, Harvard, the University of Waterloo and Google DeepMind, studied why state-of-the-art language models fail at long multiplication. They focused on long-range dependencies: the need to hold partial products and running sums to reach a correct final answer.
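To make the terms concrete, here is a minimal sketch of schoolbook long multiplication that records each partial product and the running sum after every step. The function name `long_multiply` is made up for this illustration, the code assumes non-negative whole numbers, and the study may define its running sums in a different way.

```python
def long_multiply(a: int, b: int):
    """Multiply a by b one digit at a time, keeping the intermediate values.

    Illustrative helper only; assumes a and b are non-negative integers.
    """
    partial_products = []
    running_sums = []
    total = 0
    # Walk through the digits of b from the ones place upward.
    for place, digit_char in enumerate(reversed(str(b))):
        # One partial product: a times a single digit, shifted to its place.
        partial = a * int(digit_char) * 10 ** place
        total += partial
        partial_products.append(partial)
        # The running sum is the total of all partial products so far.
        running_sums.append(total)
    return partial_products, running_sums


partials, sums = long_multiply(1234, 5678)
print(partials)  # [9872, 86380, 740400, 6170000]
print(sums)      # [9872, 96252, 836652, 7006652]
```

The last running sum is the final answer. If any earlier value is lost or wrong, the final answer is wrong too, which is why the task needs information to be kept across many steps.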
Under standard fine-tuning, models with 2 to 12 layers achieved less than 1% accuracy on four-digit multiplication; the researchers concluded these models fell into a local optimum by learning surface patterns rather than storing intermediate values. In contrast, a model trained with Implicit Chain of Thought (ICoT) reached 100% accuracy. Probing the ICoT model showed that its hidden states encoded intermediate values, and that the running sums could be decoded from them.
The team also tested a simple training objective that teaches a model to track running sums at each step. Adding that objective to a two-layer model raised accuracy to 99% and produced attention patterns similar to ICoT. The study argues that architectural guidance and targeted objectives can enable multi-step reasoning.
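One way to picture such an objective is as an extra penalty added to the normal training loss. The sketch below is an assumption, not the researchers' code: the names `combined_loss` and `aux_weight`, the use of mean squared error, and the example numbers are all hypothetical.

```python
def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length lists of numbers."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def combined_loss(task_loss, predicted_sums, true_sums, aux_weight=1.0):
    """Add a running-sum penalty to the main task loss.

    task_loss: the usual loss for predicting the answer (a plain number here).
    predicted_sums: the running sums read out from the model at each step.
    true_sums: the correct running sums for the same steps.
    aux_weight: hypothetical knob for how much the extra objective counts.
    """
    aux_loss = mean_squared_error(predicted_sums, true_sums)
    return task_loss + aux_weight * aux_loss


# The second running sum is off by 2, so the penalty is (0 + 4) / 2 = 2.0.
print(combined_loss(0.5, [1.0, 2.0], [1.0, 4.0]))  # 2.5
```

With a penalty like this, the model is rewarded for keeping correct intermediate values and not only for the final digits.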
Difficult words
- long-range dependency — need to keep information across many steps
- partial product — a number from one multiplication step
- running sum — a total that updates after each step
- fine-tuning — training a model on new task data
- local optimum — a solution that is not best overall
- implicit chain of thought (ICoT) — training method that encourages stepwise reasoning
Discussion questions
- Why is it helpful for a model to store intermediate values when doing long multiplication?
- Do you think the same training objective (tracking running sums) could help models in other multi-step tasks? Why or why not?
- Which is more important for multi-step reasoning: model architecture or specific training objectives? Explain with simple reasons.