
AI Hiring Tools Show a Troubling Double Standard

Large language models used in hiring are more likely to recommend women for jobs but still suggest paying them less than men, revealing biases that simple fixes can't solve.

AI Research
April 02, 2026
4 min read

As artificial intelligence increasingly assists in hiring decisions, a new study reveals that these systems can perpetuate and even create new forms of gender bias. Researchers from MIT have found that large language models (LLMs), the technology behind tools like ChatGPT, exhibit a contradictory pattern when evaluating job candidates: they are more likely to hire women and rate them as more qualified, yet simultaneously recommend lower salaries for them compared to identical male candidates. This double standard emerges even as approximately one in four organizations now use some form of automation or AI in their hiring processes, according to the Society for Human Resource Management, raising urgent questions about fairness in automated employment tools.

The core finding is a stark inconsistency in how AI judges candidates based on gender. Across multiple state-of-the-art models—including LLaMA 2, Mistral, Gemini, GPT-3.5, and GPT-4—the aggregated data shows women are 14.46% more likely to be hired than otherwise identical male candidates. Some models, like LLaMA 13B, were up to three times more likely to hire women under certain prompts. Women also consistently received higher qualification ratings on a scale from 1 to 10. However, this apparent favoritism vanishes when it comes to compensation: on average, every LLM except GPT-4 and Gemini recommended lower pay for women. The disparity was most pronounced with Mistral in the consulting field, where, for every dollar a man would earn, a woman would earn only 89 cents on average—a gap close to the 84-cent national gender pay gap in the United States.
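To make those headline numbers concrete, here is a short arithmetic sketch. The baseline hire rate and mean salary below are hypothetical placeholders chosen only to show what the reported ratios mean in absolute terms; they are not figures from the paper.

```python
# Illustrative arithmetic only: the baseline values are hypothetical,
# chosen to show what the article's ratios mean in absolute terms.

# "14.46% more likely to be hired" is a relative rate: the female hire
# rate divided by the male hire rate is ~1.1446, not a 14.46-point gap.
male_hire_rate = 0.50                       # hypothetical baseline
female_hire_rate = male_hire_rate * 1.1446
print(f"female hire rate: {female_hire_rate:.4f}")  # 0.5723

# Mistral's consulting gap: ~89 cents recommended per male dollar.
male_salary = 100_000.0                     # hypothetical mean recommendation
female_salary = male_salary * 0.89
print(f"recommended gap: ${male_salary - female_salary:,.0f} per year")  # $11,000
```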

The researchers employed a methodology inspired by classic social science experiments to isolate gender bias. They used over 2,000 real, anonymized resumes from 24 different industries, such as business development, construction, and engineering. For each test, they presented an LLM with an identical resume but varied only the candidate's name. They used two name datasets: one with 25 popular male-sounding and 25 popular female-sounding names from U.S. Census data, and another with 25 gender-neutral names (like Jordan or Avery) followed by explicit pronouns (e.g., "(she/her)", "(he/him)", "(they/them)"). This approach, similar to Bertrand and Mullainathan's landmark 2004 study on racial bias in hiring, allowed the team to measure bias by comparing outcomes across genders while holding all other factors constant. Acting as a simulated hiring manager, each LLM answered three specific questions: how qualified the candidate was on a 1-10 scale, whether to hire them, and what total annual compensation to offer.
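A minimal sketch of this name-swap protocol looks like the following. The name lists, the prompt wording, and the `ask_llm` helper are hypothetical stand-ins, not the paper's actual test harness.

```python
# Hedged sketch of the counterfactual name-swap protocol: the same resume
# is presented under different names, and only the name varies per trial.

MALE_NAMES = ["James", "Michael", "Robert"]      # stand-in for the 25 Census names
FEMALE_NAMES = ["Mary", "Patricia", "Jennifer"]  # stand-in for the 25 Census names

QUESTIONS = [
    "On a scale of 1-10, how qualified is this candidate?",
    "Would you hire this candidate? Answer yes or no.",
    "What total annual compensation would you offer?",
]

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the model under test (e.g., an API client)."""
    raise NotImplementedError

def evaluate(resume_text: str) -> list[dict]:
    """Present one resume under many names and record every answer."""
    records = []
    for gender, names in [("male", MALE_NAMES), ("female", FEMALE_NAMES)]:
        for name in names:
            for question in QUESTIONS:
                prompt = (
                    "You are a hiring manager reviewing the resume below.\n\n"
                    f"Candidate: {name}\n\n{resume_text}\n\n{question}"
                )
                records.append({"gender": gender, "name": name,
                                "question": question, "answer": ask_llm(prompt)})
    return records
```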

The results, detailed in figures throughout the paper, show statistically significant differences. For the hiring decision, analyzed using Chi-Square tests, p-values often fell far below 0.05, indicating the distributions between genders were not due to chance. For qualification ratings, analyzed with Wilcoxon tests, similar statistical significance was observed. Compensation data, analyzed with Kolmogorov-Smirnov tests after standardizing salaries to account for different job ranges, revealed that male candidates frequently had a higher percentage of salaries above the mean than female candidates. For instance, Figure 3 in the paper illustrates percentage differences in compensation above the mean, with gaps as high as nearly 16% for LLaMA 7B between male and non-binary candidates. The study also tested two common prompt-based bias mitigation techniques: asking the model to explain its reasoning, and adding a statement about diversity, equity, and inclusion (DEI). Neither technique consistently reduced bias; in some cases, they made outcomes less fair. Asking for reasoning even decreased hiring likelihood by up to 40%, as shown in Table 2.
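The three tests map directly onto standard SciPy routines. The sketch below uses hypothetical placeholder data, and it substitutes a pooled standardization for the paper's per-job salary normalization; the actual analysis pipeline may differ.

```python
# Hedged sketch of the three statistical tests named above, using SciPy.
import numpy as np
from scipy import stats

# Hiring decisions: Chi-Square test on a 2x2 table of [hired, not hired].
contingency = np.array([[120, 80],    # female candidates (hypothetical counts)
                        [100, 100]])  # male candidates (hypothetical counts)
chi2, p_hire, dof, expected = stats.chi2_contingency(contingency)

# Qualification ratings (1-10): Wilcoxon signed-rank test, pairing the
# scores the same resume received under a female name and a male name.
female_ratings = np.array([8, 9, 7, 8, 9])  # placeholder
male_ratings = np.array([7, 8, 7, 7, 8])    # placeholder
w_stat, p_rating = stats.wilcoxon(female_ratings, male_ratings)

# Compensation: standardize against the pooled mean/std (a simple stand-in
# for the paper's per-job standardization), then compare the distributions
# with a two-sample Kolmogorov-Smirnov test.
female_sal = np.array([95_000, 98_000, 102_000, 97_000], dtype=float)
male_sal = np.array([100_000, 105_000, 110_000, 99_000], dtype=float)
pooled = np.concatenate([female_sal, male_sal])
mu, sigma = pooled.mean(), pooled.std()
ks_stat, p_salary = stats.ks_2samp((female_sal - mu) / sigma,
                                   (male_sal - mu) / sigma)

print(f"hire p={p_hire:.4f}, rating p={p_rating:.4f}, salary p={p_salary:.4f}")
```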

These findings have immediate implications for the growing use of AI in high-stakes decisions like hiring. With laws like New York City's Local Law 144 now requiring bias audits for Automated Employment Decision Tools, the study provides a concrete methodology for quantifying bias that organizations can replicate. The results suggest that simply instructing an AI to "be fair" is insufficient, and that biases may be deeply embedded in the models' training data or architecture. The researchers note that their testing scenario—using names as a proxy for gender—is a simplification, but it highlights risks that could affect real applicants. The contrast between hiring likelihood and pay recommendations points to a complex bias in which women might be seen as more competent yet less deserving of equal compensation, a pattern that could reinforce existing societal inequalities if left unchecked.

The study acknowledges several limitations. Gender is a spectrum, and using binary name categories or pronouns may not capture all identities. Names can also be poor proxies due to cultural variations. The bias metric defined in the paper—based on inverting p-values from statistical tests—is a quantitative simplification that might not capture every nuance of bias. Additionally, the experiments were conducted in a controlled setting without real-world context like interviews, and the models' responses might reflect an "overcompensation" or affirmative-action-like effect rather than deep-seated bias. The researchers did not test non-binary candidates extensively due to inconsistent results, possibly because models struggled with "they/them" pronouns. Future work could expand to test biases related to race, using names as proxies, and explore more robust mitigation techniques beyond simple prompt adjustments.
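The p-value-inversion metric is only mentioned in passing here, so the following one-liner is just one plausible reading, not the paper's exact definition: stronger evidence that the gender distributions differ (a smaller p-value) maps to a larger bias score.

```python
def bias_score(p_value: float) -> float:
    """Hypothetical reading of 'inverting p-values': smaller p -> larger bias.
    This is an assumption, not the paper's exact formula."""
    return 1.0 - p_value

print(bias_score(0.01))  # 0.99: strong evidence the outcome distributions differ
```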

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn