Researchers at North Carolina State University studied how safety alignment works in large language models (LLMs) and tested new training techniques that reduce unsafe outputs while preserving model performance. Jung-Eun Kim, the corresponding author and an assistant professor, said they do not want LLMs to tell people to harm themselves or to give information that could harm others.
The team identified two main challenges. One is the alignment tax: safety training can reduce a model's accuracy. The other is superficial alignment, where a model decides whether a request is safe or unsafe early in response generation. Jianwei Li, the first author, gave the example of a request for instructions to steal money, where the stated motive can change the model's reply.
The researchers proposed the Superficial Safety Alignment Hypothesis (SSAH). They searched models for safety-critical neural components and showed that freezing those components during fine-tuning helps preserve the original safety behavior while the model learns domain tasks. The work will be presented at ICLR 2026, and supporting code is available online.
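The idea of freezing components during fine-tuning can be illustrated with a very small sketch. This is not the researchers' code or method: it is a toy linear model in plain Python, and the names `sgd_step`, `fine_tune` and `frozen`, as well as the choice of which weight counts as "safety-critical", are assumptions made only for illustration. It shows one thing: weights marked as frozen keep their original values while the other weights adapt to new data.

```python
# Toy illustration only: a linear model y = w . x trained with plain SGD.
# `frozen` marks hypothetical "safety-critical" weights that must not change.

def sgd_step(weights, frozen, x, target, lr):
    """One squared-error gradient step that leaves frozen weights untouched."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - target
    return [
        w if is_frozen else w - lr * 2 * error * xi
        for w, xi, is_frozen in zip(weights, x, frozen)
    ]

def fine_tune(weights, frozen, data, epochs=200, lr=0.05):
    """Repeat SGD steps over the new-task data, respecting the frozen mask."""
    for _ in range(epochs):
        for x, target in data:
            weights = sgd_step(weights, frozen, x, target, lr)
    return weights

# Assumed starting weights; the middle one is treated as safety-critical.
weights = [0.5, -1.0, 0.25]
frozen = [False, True, False]

# Made-up "domain task" examples: (input features, target value).
data = [
    ([1.0, 0.0, 1.0], 2.0),
    ([0.0, 1.0, 1.0], 0.0),
    ([1.0, 1.0, 0.0], 0.0),
]

tuned = fine_tune(weights, frozen, data)
print("before:", weights)
print("after: ", tuned)
```

In this sketch the second weight is never updated, so it stays at its starting value, while the first and third weights move to fit the new examples. A real LLM has billions of parameters, and finding which ones matter for safety is the hard part that the research addresses.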
Difficult words
- safety alignment — methods to make a model behave safely
- alignment tax — loss of model accuracy after safety training
- superficial alignment — early, surface-level safety decisions during reply generation
- fine-tuning — small additional training to adapt a model
- freeze — prevent parts of a model from changing
- unsafe output — text from a model that could cause harm
Discussion questions
- Do you think it is a good idea to freeze parts of a model during training? Why or why not?
- How could reducing the alignment tax help people who use language models?
- Have you ever seen a model give unsafe advice? What did you do then?