Large language models (LLMs) can give advice or instructions, so it is important that their responses are safe. A research team at a university studied how safety training works and tested new training ideas to reduce unsafe outputs while keeping good performance.
The researchers found two main problems. First, safety training can lower a model's accuracy, a problem called the alignment tax. Second, many models use a simple safety check that can be bypassed. The team proposed a hypothesis about this simple check and tested a method that freezes some model parts during fine-tuning to keep safety while the model learns new tasks. The work will be shown at an international conference.
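The idea of freezing can be shown with a very small example. This is only a sketch and is not the researchers' method: the names `fine_tune`, `toy_grads`, `safety_part` and `task_part` are invented for illustration, and the "model" is just two numbers. The training loop changes the task number but skips the frozen one.

```python
def fine_tune(params, frozen, grads_fn, lr=0.1, steps=10):
    """Plain gradient descent that skips every parameter named in `frozen`."""
    params = dict(params)  # copy, so the caller's dictionary is not changed
    for _ in range(steps):
        grads = grads_fn(params)
        for name, grad in grads.items():
            if name in frozen:
                continue  # frozen part: never updated
            params[name] -= lr * grad
    return params


def toy_grads(params):
    # Hypothetical new task: push every value toward 1.0.
    # The gradient of (p - 1)^2 with respect to p is 2 * (p - 1).
    return {name: 2.0 * (value - 1.0) for name, value in params.items()}


start = {"safety_part": 0.0, "task_part": 0.0}
result = fine_tune(start, frozen={"safety_part"}, grads_fn=toy_grads)
print(result)
```

After training, `safety_part` still has its starting value, while `task_part` has moved toward the new task's target. In a real model the frozen parts would be whole groups of weights, not single numbers.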
Difficult words
- model: a computer program that generates or predicts text (forms in the article: models, model's)
- safety: being protected from dangerous or harmful outputs (forms in the article: safety training, safety check)
- alignment tax: loss of model accuracy after safety training
- fine-tuning: additional training to adapt a model to new tasks
- freeze: stop changing some parts during model training (form in the article: freezes)
- accuracy: how correct or precise a model's outputs are
Discussion questions
- Do you think it is okay if safety training lowers a model's accuracy? Why or why not?
- Have you ever used advice from an AI or chatbot? Was it safe and helpful?
- How would you check if a model's safety check can be bypassed?