Three prompting schemas (vanilla, identity and identity-cot) used to elicit toxicity scores from the models. Each schema introduces an additional aspect of context to the model. Bold fields include examples. Credit: arXiv (2024). DOI: 10.48550/arxiv.2406.00020

Study finds bias in language models against non-binary users

14 Nov 2024, 15:56 by Julia Cohen, University of Southern California · Tech Xplore

What happens when the technology meant to protect marginalized voices ends up silencing them? Rebecca Dorn, a research assistant at USC Viterbi's Information Sciences Institute (ISI), has uncovered how large language models (LLMs) used to moderate online content are failing queer communities by misinterpreting their language.

Non-binary visibility and algorithmic bias

In the paper, "Non-Binary Gender Expression in Online Interactions," Dorn, who is a fourth-year Ph.D. student in computer science at USC Viterbi School of Engineering, looked at non-binary users on social media platforms like X (formerly Twitter) and found that they often receive less engagement (such as likes or followers) than their binary counterparts. Additionally, their posts are frequently flagged as more toxic by content moderation algorithms, even when they contain no harmful content.

Dorn presented these findings virtually at the 16th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2024), held in Calabria, Italy from September 2-5, 2024.

The research revealed that non-binary users tend to be less active on platforms like X, potentially due to their underrepresentation in social media data, and that non-binary users receive fewer likes, retweets, and followers than binary users. This lack of visibility is alarming, as it can lead to non-binary voices being marginalized in important conversations, limiting their social influence and hindering their ability to advocate for issues important to their community.

Dorn's research also uncovered a troubling trend: tweets from non-binary users are more likely to be misclassified as toxic. Dorn said, "We found that the less representation of a gender group, the higher the toxicity scores for their tweets."

The researchers posit that this is likely the result of bias in the algorithms, which mistakenly interpret language commonly used in queer communities as harmful. This aligns with prior evidence showing that social media content from gender-variant groups, such as drag queens, is disproportionately flagged as hate speech, further highlighting the need for more nuanced and fair content moderation systems.

It was this finding that led to her follow-up paper, "Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias." The findings are published on the arXiv preprint server.

The problem with reclaimed slurs

In this second paper, Dorn and her co-author Lee Kezar, also a Ph.D. student in computer science at USC Viterbi, explored how LLMs routinely mislabel non-binary and queer speech—particularly the use of reclaimed slurs—as harmful. Reclaimed slurs, once used as insults, have been repurposed by the LGBTQ+ community as symbols of pride and empowerment.

However, AI-powered content moderation systems are failing to grasp these nuances, often mistaking empowering language for offensive content and silencing the voices of those they aim to protect.

"We found that existing models tend to flag these terms, even when they are not used in harmful ways. It's frustrating because it means that these systems are reinforcing the marginalization of these communities," Dorn explained.

"Queer people often use reclaimed slurs in ways that are affirming and positive, but the models aren't able to detect that context. That's a problem when those same models are being used to moderate platforms where queer voices are already marginalized."

To investigate this issue, Dorn and Kezar created QueerReclaimLex, a dataset of non-derogatory uses of LGBTQ+ slurs, annotated by gender-queer individuals. They tested five popular language models, revealing that these systems were often unable to discern the positive or neutral context of these terms when used by the very people they are intended to represent.

Across all models tested, the systems struggled the most when trying to identify reclaimed slurs used in a positive or neutral way by queer individuals. In some cases, the models were right less than 24% of the time, showing just how poorly they understood the context of these words.
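For readers curious how a figure like "right less than 24% of the time" can arise, here is a minimal Python sketch, not the authors' evaluation code. It assumes each post has a model toxicity score between 0 and 1 and a human annotation, and that a score at or above a cutoff counts as a "harmful" prediction. The `ScoredPost` type, the 0.5 threshold and the four sample scores are all hypothetical placeholders invented for illustration; none come from the paper or from QueerReclaimLex.

```python
from dataclasses import dataclass


@dataclass
class ScoredPost:
    """One post with a model score and a human label (hypothetical structure)."""
    text: str
    model_score: float       # assumed toxicity score in [0, 1]
    annotated_harmful: bool  # label assigned by human annotators


def accuracy(posts: list[ScoredPost], threshold: float = 0.5) -> float:
    """Fraction of posts where the thresholded score matches the annotation.

    The 0.5 threshold is an assumption made for this sketch.
    """
    if not posts:
        raise ValueError("need at least one post")
    correct = sum(
        (post.model_score >= threshold) == post.annotated_harmful
        for post in posts
    )
    return correct / len(posts)


# Made-up scores for four non-derogatory posts (illustration only).
posts = [
    ScoredPost("example A", 0.91, False),
    ScoredPost("example B", 0.12, False),
    ScoredPost("example C", 0.77, False),
    ScoredPost("example D", 0.85, False),
]

print(f"{accuracy(posts):.0%}")  # prints "25%"
```

In this toy sample every post is annotated as not harmful, so each score above the cutoff is a false positive. Three of the four posts are over-flagged, leaving the model correct on only one.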

What's next?

Dorn's work highlights a critical issue in AI-driven content moderation: while these systems are designed to protect users from harmful speech, they frequently misinterpret the language of historically marginalized communities, particularly queer and non-binary individuals. As these models continue to shape the digital spaces where these communities gather for support and self-expression, addressing these biases is essential.

ISI Senior Principal Scientist Kristina Lerman, who is a Research Professor in the USC Viterbi School of Engineering's Thomas Lord Department of Computer Science and a co-author of both papers, underscored the importance of this research: "This work reminds us as researchers that we cannot blindly trust the outputs of our AI models. The observations we are making of the world (in this case, online speech in gender-queer communities) may not accurately reflect reality."

More information: Non-Binary Gender Expression in Online Interactions. imyday.github.io/pub/asonam202 … /papers/1207_094.pdf
Rebecca Dorn et al, Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias, arXiv (2024). DOI: 10.48550/arxiv.2406.00020
Journal information: arXiv

Provided by University of Southern California