
Uncovering Bias in Large Language Models

The rapid advancement of artificial intelligence has brought to the forefront concerns about the biases inherent in Large Language Models (LLMs) and their profound societal implications. This research analyzes bias in two prominent models: OpenAI's GPT-4 and Anthropic's Claude v1. Building on an extensive literature review of existing work on language-model bias, the study aims to offer new insights into the biases prevailing in these models. Employing a quantitative methodology, it objectively assesses bias levels by querying both models with two well-established datasets, StereoSet and CrowS-Pairs.

The research uncovers a notable inclination towards anti-stereotypical responses in both GPT-4 and Claude v1, with GPT-4 demonstrating a slightly stronger preference. While this shift reflects a reduction in explicit stereotyped responses, it introduces fresh complexities in the pursuit of AI fairness. These findings underscore the intricate, nuanced landscape of AI biases and emphasize the ongoing need for refinement to achieve genuinely unbiased behavior, striking a balance between technological advancement and ethical considerations. The study highlights both the merits and the present shortcomings of LLM ethics and advocates for continued research toward equitable, responsible AI systems.

Introduction

Large Language Models (LLMs) represent a groundbreaking advancement in Artificial Intelligence (AI), revolutionizing the way machines understand and generate human language. These models, characterized by their massive size and complexity, have attracted attention and popularity thanks to their remarkable capabilities and widespread applications in diverse real-world scenarios. At their core, LLMs are sophisticated AI systems designed to comprehend, process, and generate human-like language. They are built on deep learning architectures, particularly Transformer neural networks, which allow them to process and produce text with unprecedented depth. Their ability to engage in intricate discussions opens up new frontiers across domains, extending into therapy, education, and other traditionally non-digital sectors.

It is therefore paramount to critically examine the potential issues associated with LLMs and to proactively detect any adverse impact on users. Among the most significant challenges is the inadvertent perpetuation of biases rooted in the training data. As a consequence, generated outputs may exhibit partiality or prejudice, with potentially severe effects ranging from the perpetuation of harmful stereotypes to the reinforcement of existing disparities and even the facilitation of misinformation. Addressing these faults is imperative to ensure the development of responsible and ethical LLMs.

Methodology

To query the LLMs and evaluate biased responses, we used two renowned datasets, StereoSet and CrowS-Pairs, widely recognized benchmarks for detecting bias in language models. To address the discrepancy in label counts across these datasets, we augmented some groups with samples generated by GPT-4. We then ran our experiments: a section of each dataset was extracted and supplied to the LLMs via prompting, embedded in queries crafted to elicit bias information. This direct approach was chosen to reflect the perspective of a standard user, to portray more closely the level of prejudice currently accessible to the general public, and to circumvent the investigation limits imposed by the proprietary nature of the models.

Dataset processing:
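As a rough illustration of how such a prompting pipeline can be wired up, the sketch below loads the intersentence split of StereoSet from the Hugging Face Hub and turns each example into a multiple-choice query. The dataset ID and field names follow the public StereoSet release, and `query_model` is a hypothetical placeholder for the GPT-4 or Claude v1 client; none of this reproduces the study's exact prompt wording.

```python
# Minimal sketch of the prompting step, not the study's exact prompts.
# Assumes the Hugging Face `datasets` package and the public "stereoset" dataset;
# `query_model` is a stand-in for the GPT-4 / Claude v1 chat client.
import random
from datasets import load_dataset

def build_prompt(context: str, options: list[str]) -> str:
    """Format a StereoSet-style example as a multiple-choice query."""
    lines = [f"Context: {context}", "Which continuation fits best?"]
    lines += [f"{i + 1}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the number of your choice.")
    return "\n".join(lines)

def query_model(prompt: str) -> str:
    """Placeholder for a call to GPT-4 or Claude v1 -- swap in the real API client here."""
    return "1"  # dummy answer so the sketch runs end to end

# Field names below are assumed from the public StereoSet release (intersentence split).
stereoset = load_dataset("stereoset", "intersentence", split="validation")
section = random.sample(range(len(stereoset)), k=50)  # a small section of the dataset

for idx in section:
    row = stereoset[idx]
    options = row["sentences"]["sentence"]  # stereotype / anti-stereotype / unrelated
    answer = query_model(build_prompt(row["context"], options))
    print(row["bias_type"], row["target"], "->", answer)
```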

Upon collecting responses from the LLMs under investigation, a comprehensive analysis was carried out using the purpose-built Bias Balance Inspection (BBI) method, the details of which are elaborated in the following section. The inspection involved categorizing biases and assessing their intensity, and culminated in a comprehensive visualization and comparison of the scrutinized models.
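Since the BBI itself is only defined in a later section, the snippet below is merely a hypothetical scoring sketch rather than the actual BBI formula: it tallies how often each model selected the stereotypical versus the anti-stereotypical option and plots the balance per bias attribute. The DataFrame columns and the balance formula are illustrative assumptions.

```python
# Hypothetical scoring sketch (NOT the study's exact BBI formula): for each model and
# bias attribute, compare how often the stereotypical vs. anti-stereotypical option
# was selected. Assumes a DataFrame of parsed responses with the columns shown below.
import pandas as pd
import matplotlib.pyplot as plt

responses = pd.DataFrame([
    # model, bias_type, choice in {"stereotype", "anti-stereotype", "unrelated"}
    {"model": "gpt-4", "bias_type": "gender", "choice": "anti-stereotype"},
    {"model": "claude-v1", "bias_type": "race", "choice": "stereotype"},
    # ... one row per collected response
])

def balance_score(group: pd.DataFrame) -> float:
    """Positive = leaning anti-stereotypical, negative = leaning stereotypical."""
    anti = (group["choice"] == "anti-stereotype").sum()
    stereo = (group["choice"] == "stereotype").sum()
    total = max(anti + stereo, 1)  # ignore "unrelated" picks, avoid division by zero
    return (anti - stereo) / total

scores = (
    responses
    .groupby(["model", "bias_type"])
    .apply(balance_score)
    .unstack("model")
)
scores.plot(kind="bar", title="Bias balance per attribute (illustrative)")
plt.axhline(0, color="black", linewidth=0.8)
plt.tight_layout()
plt.show()
```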

Conclusions

Our investigation into biases in Large Language Models, specifically GPT-4 and Claude v1, uncovers compelling trends and patterns. The Bias Balance Inspection (BBI) scores for both models indicate a tendency towards anti-stereotypical responses, with a slightly stronger inclination in GPT-4. The relevance scores suggest that both models were effective at selecting contextually relevant responses, implying a sophisticated level of contextual understanding. However, this aggregate view obscures subtleties revealed by attribute-level breakdowns. For example, both models showed a somewhat greater inclination toward gender-based stereotypes than toward other domains, an expected divergence in how social categories are handled. This echoes prior work indicating that certain attributes are intrinsically more susceptible to biased responses because of deep-seated cultural stereotypes, and it supports our decoupled, attribute-level methodology.

Furthermore, the inclination towards anti-stereotypical responses, while both expected under our hypotheses and ethically commendable, raises questions about potential over-correction in AI responses. The risk of diluting the objective value of outputs in favor of political correctness is a concern that needs addressing. In addition, the LLMs' handling of statements with inherently offensive content, as seen in the CrowS-Pairs dataset, highlights the complex nature of content moderation in AI systems, which may cross the boundaries of the Usage Policies vigorously enforced by both Anthropic and OpenAI.

Figure: frequency of targets in selected biased sentences (aggregated view of GPT-4 and Claude v1).

About

Insights into religious, gender, and racial bias in LLMs, using a quantitative approach and literature benchmarks.
