Mitigating harm in language models with conditional-likelihood filtration

by mathemakitten
Language models trained on datasets scraped from the open web have become foundational in natural language processing, but they reflect and amplify the biases and harms of their training data. We built a system that filters training data by conditional likelihood: documents that a language model rates as unusually likely when conditioned on human-written examples of harmful text are removed before training, letting us build more value-aligned models. It's imperfect: the internet is vast, harmful text keeps evolving, and human labels carry human biases. But little by little, we get closer to friendlier models!
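The filtering idea above can be sketched in a few lines. This is a toy illustration, not the production system: the scoring function below stands in for a real language model's conditional log-likelihood, and the trigger phrase, threshold, and function names are all illustrative assumptions.

```python
import math
from collections import Counter

def toy_log_likelihood(context: str, text: str) -> float:
    """Stand-in for an LM's log P(text | context).
    Here: length-normalized unigram overlap between the trigger
    context and the document, so more overlap -> higher score."""
    context_words = Counter(context.lower().split())
    words = text.lower().split()
    if not words:
        return float("-inf")
    overlap = sum(1 for w in words if w in context_words)
    return math.log((overlap + 1) / (len(words) + 1))

def filter_documents(documents, trigger, threshold):
    """Keep only documents whose conditional likelihood under the
    harmful trigger phrase stays below the threshold; documents the
    model finds likely given the trigger are dropped from training."""
    kept = []
    for doc in documents:
        score = toy_log_likelihood(trigger, doc)
        if score < threshold:
            kept.append(doc)
    return kept

docs = [
    "a recipe for banana bread with walnuts",
    "group x is inferior and should be excluded",
]
trigger = "group x is inferior"  # hypothetical human-written trigger phrase
clean = filter_documents(docs, trigger, threshold=-1.5)
```

With a real model, `toy_log_likelihood` would be replaced by the model's actual conditional log-probability of the document given the trigger, and the threshold would be tuned on labeled examples.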