A disturbing policy shift inside a major tech company is raising alarms about the future of AI safety. The human trainers who act as the system’s conscience have been informed that the AI is now permitted to repeat racial slurs and other hate speech, a significant loosening of previous safety guardrails. This change signals a troubling new phase in the trade-off between capability and caution.
According to workers and internal documents, the new guideline distinguishes between the AI “generating” harmful content and merely “replicating” it. “It used to be that the model could not say racial slurs whatsoever,” explained one rater. “In February, that changed, and now, as long as the user uses a racial slur, the model can repeat it.” The same logic reportedly extends to sexist language, stereotypes, and other forms of harassment.
The change aligns with a quiet update to the company’s prohibited use policy, which now allows exceptions “where harms are outweighed by substantial benefits to the public,” such as in art or education. But a rater working under a tight deadline may lack the context to make such a nuanced judgment, and can end up approving content that would previously have been flagged.
Researchers see this as a familiar pattern in the tech industry: safety is paramount only until it begins to hinder market dominance. By carving out this loophole, the company can expand the AI’s capabilities while maintaining that it doesn’t “generate” hate speech. For the workers on the front lines, however, it feels like being asked to rubber-stamp a system that is becoming a more efficient amplifier of toxicity.