Large language models (LLMs) can give advice or instructions, so it is important that their responses are safe. A research team at a university studied how safety training works and tested new training ideas to reduce unsafe outputs while keeping good performance.
The researchers found two main problems. First, safety training can lower a model's accuracy, a problem called the alignment tax. Second, many models use a simple safety check that can be bypassed. The team proposed a hypothesis about this simple check and tested a method that freezes some model parts during fine-tuning to keep safety while the model learns new tasks. The work will be shown at an international conference.
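The idea of freezing can be shown with a toy sketch. It is not the researchers' code. It assumes a "model" is just two named numbers, and the names `safety_layer` and `task_layer` are made up for this example. The article does not say which parts the team froze.

```python
# Toy illustration only: a "model" here is a dict of named parameters.
# The names "safety_layer" and "task_layer" are hypothetical.

def fine_tune_step(params, grads, frozen, lr=0.1):
    """Apply one gradient-descent step, skipping every frozen parameter."""
    updated = {}
    for name, value in params.items():
        if name in frozen:
            # Frozen part: keep the value learned during safety training.
            updated[name] = value
        else:
            # Trainable part: move against the gradient to learn the new task.
            updated[name] = value - lr * grads[name]
    return updated


params = {"safety_layer": 1.0, "task_layer": 2.0}
grads = {"safety_layer": 0.5, "task_layer": 0.5}
frozen = {"safety_layer"}

new_params = fine_tune_step(params, grads, frozen)
print(new_params)
```

After the step, only `task_layer` has changed. The frozen `safety_layer` keeps its old value even though it had a gradient.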
Difficult words
- model (models, model's) — a computer program that generates or predicts text
- safety (safety training, safety check) — being protected from dangerous or harmful outputs
- alignment tax — loss of model accuracy after safety training
- fine-tuning — additional training to adapt a model to new tasks
- freeze (freezes) — stop changing some parts during model training
- accuracy — how correct or precise a model's outputs are
Discussion questions
- Do you think it is okay if safety training lowers a model's accuracy? Why or why not?
- Have you ever used advice from an AI or chatbot? Was it safe and helpful?
- How would you check if a model's safety check can be bypassed?