In recent years, the utilization of language models (LMs) has seen exponential growth across diverse applications, ranging from customer service chatbots to content generation tools. As these models become more integrated into our daily lives, the opinions they reflect have significant implications. Not only can they shape individual user views, but they can also influence broader societal norms.
One of the primary concerns surrounding the deployment of LMs is the potential for biases within these systems to propagate skewed perspectives. These biases can originate from various sources, such as the training data or the model's design, and can lead to the reinforcement of stereotypes or the marginalization of certain groups.
Addressing bias is not an easy task. A well-known example of a failed attempt is the recent incident with Gemini's image generation feature [5]. In trying to avoid stereotypes in its images (for example, depicting only white men when asked to draw a doctor), the system overcorrected into an anti-white-male bias: its images contained no white men at all, regardless of whether the prompt described a historical moment or a real-world scene. For example, in the following picture, we can see what it produced when asked to draw an image of Greek philosophers.
The goal of this article is to examine the extent to which the opinions expressed by language models align with those of human demographic groups. By examining this alignment, we can better understand the impact of LMs on society and identify areas for improvement, ensuring that these powerful tools contribute positively to our collective discourse. The information presented here is based on the work of Santurkar et al. [1].
Research
The central question driving this investigation is: Whose opinions do language models reflect? This query is pivotal in understanding the broader implications of LMs on public discourse and societal norms. To address it, the objective is to develop a robust framework that can effectively evaluate the opinions generated by language models in comparison to public opinion polls. This framework aims to provide a systematic method for assessing the alignment and potential discrepancies between the perspectives offered by LMs and those held by the general public.
The scope of this study is concentrated on the demographics and opinions within the United States. By narrowing the focus to U.S. demographics, the research can provide a detailed and context-specific analysis that highlights how well LMs capture the diversity of opinions present in a heterogeneous society.
The importance of this research lies in ensuring that language models reflect a diverse and representative set of opinions. As LMs become increasingly influential, it is crucial that they do not perpetuate biases or misrepresent certain demographic groups. A representative LM can contribute to a more equitable and balanced discourse, fostering a more inclusive digital environment.
Dataset
To evaluate the alignment of language model opinions with those of the U.S. population, the authors use the OpinionQA dataset [2], derived from Pew Research's American Trends Panel surveys. This dataset comprises 1,498 questions that span a wide array of topics, capturing nuanced opinions and perspectives on various issues.
The dataset includes responses from 60 different U.S. demographic groups, ensuring extensive demographic coverage that reflects the diversity of the U.S. population in terms of age, gender, race, education, and socioeconomic status. The primary purpose of the OpinionQA dataset is to serve as a benchmark for assessing the opinions generated by language models. By comparing LM responses to real human opinions captured in this dataset, researchers can identify areas of alignment and misalignment, providing insights into how well LMs represent the diversity of public opinion. This benchmarking process is crucial for understanding the strengths and limitations of language models in reflecting societal views accurately.
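For illustration, a single benchmark entry might carry the following kind of information; the field names and numbers below are invented for this example, and the actual OpinionQA files are organized differently.

```python
# A hypothetical record illustrating the kind of information the benchmark
# provides per survey question; field names and numbers are invented.
example_question = {
    "question": "How concerned are you about issue X?",
    "choices": ["A great deal", "Some", "Not too much", "Not at all"],
    "overall_distribution": [0.35, 0.30, 0.20, 0.15],  # whole-population answers
    "group_distributions": {                           # per-demographic answers
        "Democrat":   [0.55, 0.27, 0.12, 0.06],
        "Republican": [0.18, 0.30, 0.28, 0.24],
    },
}
```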
Methodology
The methodology revolves around a framework that leverages public opinion surveys to assess the responses of language models. The core idea is to evaluate how closely LMs align with various demographic groups.
The evaluation pipeline for assessing the alignment of language model opinions with human opinions involves several key steps:
Prompting the Language Model: An LM, specifically text-davinci-003, is prompted with a multiple-choice survey question from the dataset. The question may be preceded by an optional context (QA/BIO/PORTRAY) designed to steer the model toward a specific persona, such as that of a Democrat.
Obtaining Model Predictions: The next-token log probabilities from the LM are obtained for each of the answer choices, excluding refusal. These probabilities are then normalized to derive the model’s opinion distribution.
Comparing with Human Opinions: The model’s opinion distribution is compared to reference human opinion distributions. These human distributions are aggregated from responses to the same survey question at both the population level and by specific demographics.
Analyzing Refusal Rates: In addition to comparing opinion distributions, model and human refusal rates are analyzed separately to understand the model’s propensity to decline answering a question compared to humans.
This methodology provides a structured and quantitative framework for evaluating how well language models align with human opinions, allowing precise identification of the areas where models need improvement to better reflect diverse and representative viewpoints. The paper illustrates this process with the pipeline diagram shown in the following figure.
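As a rough, non-authoritative illustration of the first two steps (prompting and obtaining a normalized opinion distribution), the sketch below assembles a multiple-choice prompt and renormalizes per-choice log probabilities. The prompt format and the `get_answer_logprobs` helper are assumptions made here for illustration, not the paper's exact implementation or any particular client API.

```python
import numpy as np

def build_prompt(question, choices, context=""):
    """Assemble a multiple-choice survey prompt; `context` is an optional
    steering prefix (a QA/BIO/PORTRAY-style persona description)."""
    options = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
    return f"{context}Question: {question}\n{options}\nAnswer:"

def normalize_logprobs(logprobs):
    """Turn per-choice next-token log probabilities into an opinion
    distribution by renormalizing (softmax) over the answer choices only."""
    logprobs = np.asarray(logprobs, dtype=float)
    exp = np.exp(logprobs - logprobs.max())  # subtract the max for numerical stability
    return exp / exp.sum()

def model_opinion_distribution(get_answer_logprobs, question, choices, context=""):
    """`get_answer_logprobs(prompt, letters)` stands in for an API call that
    returns the next-token log probability of each answer letter (A, B, ...);
    it is a hypothetical helper, not part of any real client library."""
    prompt = build_prompt(question, choices, context)
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    return normalize_logprobs(get_answer_logprobs(prompt, letters))
```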
Survey Utilization
Leveraging surveys offers a structured and systematic approach to evaluate language model opinions. One significant advantage of using surveys is that they provide a well-defined framework for capturing and comparing opinions across different demographic groups. This structured format is crucial for ensuring that the evaluation process is comprehensive and standardized.
The multiple-choice format of surveys is particularly beneficial as it adapts well to language model prompting. By using multiple-choice questions, researchers can precisely gauge the model's responses against predefined options, making it easier to quantify alignment and discrepancies between LM-generated opinions and actual survey results.
To implement this approach, survey data from sources like the OpinionQA dataset is used to create realistic LM evaluation scenarios. By posing survey questions to the language models and analyzing their responses, researchers can simulate real-world applications and contexts in which these models operate.
The outcome of this methodology is a more accurate assessment of LM opinion alignment. Using structured survey data helps in systematically identifying areas where language models perform well and where they fall short. This detailed evaluation is crucial for understanding the strengths and limitations of language models and for guiding efforts to improve their ability to reflect diverse and representative opinions.
Quantitative Evaluation
To quantify alignment, the responses generated by language models are compared directly to the distributions of human opinions obtained from the OpinionQA dataset. The authors apply quantitative metrics to measure the degree of alignment between LM responses and human opinions. One key metric is the 1-Wasserstein distance [3]. This distance, also known as the Earth Mover's Distance (EMD), quantifies the difference between two probability distributions: it provides an intuitive way to compare how "far apart" two distributions are, taking into account both the value differences and the probability masses.
The 1-Wasserstein distance between two probability distributions P and Q over a domain Ω is defined as:

$$W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma} \big[ \, |x - y| \, \big]$$

where:
Γ(P,Q) is the set of all joint distributions γ(x,y) whose marginals are P and Q respectively.
E(x,y)∼γ denotes the expected value with respect to the distribution γ.
∣x−y∣ represents the cost of moving probability mass from x to y.
Imagine two piles of earth (representing the probability distributions P and Q) and a series of shovels. The 1-Wasserstein distance measures the minimum effort required to transform one pile into the other, where the effort is quantified by the amount of earth moved multiplied by the distance it is moved.
The primary goal of using this distance is to provide a clear and objective measure of how closely language models' responses align with real human opinions. By quantifying this similarity, researchers can identify specific areas where LMs diverge from public opinion.
By analyzing the 1-Wasserstein distance, developers can pinpoint discrepancies and focus on adjustments that bring LMs closer to accurately reflecting the diversity of human perspectives. This approach is designed to evaluate how closely LMs align with different demographic groups within the U.S. population. By focusing on this aspect, the researchers aim to understand whether LMs can capture the diversity of opinions present across various segments of society, including differences in age, gender, race, education, and socioeconomic status. This comprehensive evaluation helps in identifying potential biases and areas for improvement in language model development.
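As a concrete illustration, the snippet below computes the 1-Wasserstein distance between a model opinion distribution and an aggregated human distribution over four ordinal answer choices, using scipy.stats.wasserstein_distance. The two distributions are made-up placeholder values, used only to show the computation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Placeholder distributions over N = 4 ordinal answer choices
# (e.g. "Not at all" ... "A great deal"), mapped to positions 1..4.
choice_positions = np.array([1, 2, 3, 4])
human_dist = np.array([0.10, 0.25, 0.40, 0.25])  # aggregated survey responses
model_dist = np.array([0.05, 0.15, 0.30, 0.50])  # normalized LM probabilities

wd = wasserstein_distance(choice_positions, choice_positions,
                          u_weights=human_dist, v_weights=model_dist)
print(f"1-Wasserstein distance: {wd:.2f}")  # 0.45 for these placeholder values
```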
Prompting Techniques
The research explores various prompting techniques to understand how effectively language models can be steered to emulate the opinions of different demographic groups. The key strategies investigated include the Standard QA Template and Steering Contexts.
The Standard QA Template involves using a basic question-and-answer format without any additional context. This method serves as a baseline to evaluate the natural responses of the language model without any steering influences.
Steering Contexts encompass methods like QA, Bio, and Portray. These techniques involve providing additional contextual information to the model in an attempt to influence its responses to better reflect the opinions of specific demographic groups. For instance:
QA Contexts involve framing questions with demographic-specific phrasing.
Bio Contexts provide background information about a hypothetical respondent from a particular demographic group.
Portray Contexts use descriptive narratives to set the scene or persona that the LM should adopt in its responses.
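To make these steering contexts concrete, the snippet below shows paraphrased, illustrative prefixes for one hypothetical target group (Democrats); the exact wording used in the paper differs. Each prefix would simply be prepended to the survey question before prompting the model.

```python
# Paraphrased, illustrative steering prefixes; the paper's exact templates differ.
QA_CONTEXT = (
    "Question: In politics today, do you consider yourself a Democrat or a Republican?\n"
    "Answer: Democrat\n\n"
)

BIO_CONTEXT = (
    "Below is a short description of the respondent, followed by some questions.\n"
    "Description: In politics today, I consider myself a Democrat.\n\n"
)

PORTRAY_CONTEXT = (
    "Answer the following question as if, in politics today, you considered "
    "yourself a Democrat.\n\n"
)
```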
The evaluation phase assesses the effectiveness of these different prompting strategies. By comparing the outcomes of these techniques, researchers can determine how well each method steers the LM's responses to align with the opinions of the targeted demographic groups.
The insights gained from this evaluation help identify best practices for steering language models. This knowledge is crucial for developing more nuanced and representative LMs that can accurately reflect the diverse views of different demographic segments in society.
Robustness Check
The robustness of the findings was ensured through various methods, including the use of different prompt templates and order permutations. This approach was designed to validate that the results remain consistent across various prompting strategies and configurations.
The robustness check involved systematically altering the prompt templates and the order in which questions and context were presented to the language models. By applying these variations, the research aimed to identify whether the alignment and performance metrics would hold steady, irrespective of the specific prompt design used.
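A minimal sketch of one such check, assuming access to a function that returns a model's opinion distribution for a given question and choice ordering, could look like this: the same question is re-asked with the answer choices shuffled, and the resulting distributions (mapped back to the original order) are compared.

```python
import numpy as np

def permutation_check(opinion_fn, question, choices, n_perms=5, seed=0):
    """Order-permutation robustness sketch: re-ask the same question with the
    answer choices shuffled and check that the (re-aligned) opinion
    distribution stays roughly the same.  `opinion_fn(question, choices)` is
    any callable returning a probability vector over the given choices."""
    rng = np.random.default_rng(seed)
    base = np.asarray(opinion_fn(question, choices), dtype=float)
    deviations = []
    for _ in range(n_perms):
        perm = rng.permutation(len(choices))
        shuffled = [choices[i] for i in perm]
        dist = np.asarray(opinion_fn(question, shuffled), dtype=float)
        realigned = np.empty_like(base)
        realigned[perm] = dist          # map back to the original choice order
        deviations.append(float(np.abs(realigned - base).sum()))
    return deviations                   # small values => stable under reordering
```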
The findings indicate that the results were consistent across these variations, demonstrating that the language model's responses were stable and reliable under different conditions. This consistency is crucial for validating the robustness of the evaluation framework, ensuring that the insights gained are not artifacts of a particular prompt format or sequence.
This rigorous validation process implies a high level of confidence in the reliability of the evaluation framework. It confirms that the methodologies used to assess the alignment of language model opinions with human demographics are sound and can be trusted to produce accurate and repeatable results.
Key Findings
The research reveals significant gaps between the opinions generated by language models and those held by various U.S. demographic groups. Despite efforts to correct or steer language models towards more representative opinions, this misalignment persists, indicating that current mitigation strategies may be insufficient to address the inherent biases within these models. This ongoing misalignment suggests that language models may not accurately represent the diversity of opinions, which can have significant consequences, especially as LMs are increasingly used in applications that shape public perception and discourse. A major consequence of this misalignment is the potential reinforcement of existing biases. If LMs continue to reflect skewed perspectives, they may perpetuate stereotypes and marginalize certain groups, exacerbating social inequalities and undermining efforts to promote inclusivity and fairness in digital communications.
Misalignment
The analysis of misalignment between language model opinions and those of the U.S. population reveals significant discrepancies. In terms of agreement, LM opinions align with the general U.S. populace to a degree comparable to the agreement between Democrats and Republicans on climate change—indicating notable divergence.
For example, individuals aged 65 and older and those who are widowed are poorly represented in the LM's opinion distribution. These specific demographic discrepancies highlight areas where language models fail to capture the views of certain groups accurately.
The consequence of such misalignment is that certain groups may feel misrepresented by LMs. This lack of accurate representation can lead to further marginalization and a sense of exclusion, particularly when LMs are used in applications that influence public discourse and decision-making. Addressing these discrepancies is crucial for developing more inclusive and representative language models that better reflect the diversity of human opinions.
Representativeness
Representativeness refers to the alignment of language model opinion distributions with those of the general U.S. population. The findings indicate that current LMs show substantial misalignment in this regard. The analysis identifies which demographic groups are overrepresented or underrepresented in LM responses, shedding light on the disparities within the models. This misalignment underscores the need for language models to reflect a more balanced range of opinions. Ensuring that LMs accurately represent the diversity of perspectives within the U.S. population is crucial for promoting fair and inclusive discourse in digital environments.
In the following graph, you can see that the opinion distributions of all the LLMs (especially the OpenAI models) differ substantially from the average human opinion distribution.
The representativeness of a model is calculated from the opinion distribution of the model (D_m) and that of the overall population (D_O) on a set of questions Q as follows:

$$\mathcal{R}^{m}_{O}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \left( 1 - \frac{\mathrm{WD}\big(D_m(q),\, D_O(q)\big)}{N - 1} \right)$$

where N is the number of answer choices (excluding refusal), WD is the 1-Wasserstein distance, and the normalization factor N − 1 is the maximum WD between any pair of distributions in this metric space.
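Assuming each question's answer choices are treated as ordinal positions 1..N, a sketch of this metric could be implemented as follows (again relying on scipy.stats.wasserstein_distance):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def representativeness(model_dists, human_dists):
    """Average over questions of 1 - WD(D_m, D_O) / (N - 1), where N is the
    number of (non-refusal) answer choices of each question.  Inputs are
    lists of per-question probability vectors over ordinal choices 1..N."""
    scores = []
    for d_m, d_o in zip(model_dists, human_dists):
        n = len(d_m)                      # number of answer choices
        positions = np.arange(1, n + 1)   # ordinal positions of the choices
        wd = wasserstein_distance(positions, positions,
                                  u_weights=d_m, v_weights=d_o)
        scores.append(1.0 - wd / (n - 1))
    return float(np.mean(scores))
```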
Influence of Human Feedback
Reinforcement learning from human feedback (RLHF) [4] has been shown to amplify the misalignment between LM opinions and those of the broader U.S. population. This observation points to a trend where fine-tuning through human feedback often shifts models toward the perspectives of more liberal, educated, and wealthy demographics.
This shift can introduce potential biases into the model, as the feedback mechanisms may inadvertently favor the opinions and viewpoints of these specific groups over others. Consequently, the model becomes less representative of the diverse array of opinions held by the general population.
The impact of such biases is significant, as it can lead to further skewing of the model's outputs, exacerbating existing disparities and potentially marginalizing underrepresented groups. To address this issue, future work should focus on developing methods to mitigate the biases introduced through human feedback mechanisms. This includes creating more balanced feedback processes and incorporating diverse viewpoints to ensure that language models remain fair and representative.
In the following graph, you can see that models with strong RLHF training, such as OpenAI's human-feedback-tuned models, skew strongly toward the opinions of highly educated, liberal, and high-income groups.
Conclusion
In summary, language models tend to reflect the opinions of certain demographic groups more than others, leading to significant misalignment with diverse segments of the population. This disparity highlights a major challenge in the development and deployment of LMs, as their outputs can disproportionately represent the views of specific groups while marginalizing others.
Future work in this area should focus on exploring methods to improve both the representativeness and steerability of LMs. This involves refining prompting techniques, enhancing feedback mechanisms, and developing new strategies to ensure that LMs can more accurately capture and reflect the diverse opinions present in society.
The ultimate goal is to strive for fair and balanced LMs that better serve the needs of all demographic groups. By addressing the current limitations and biases, we can work towards creating language models that contribute positively to public discourse, foster inclusivity, and support the diverse perspectives that enrich our society.
References
[1] Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023, July). Whose opinions do language models reflect?. In International Conference on Machine Learning (pp. 29971-30004). PMLR.
[2] OpinionQA dataset. https://paperswithcode.com/dataset/opinionqa