Research summary · arXiv:2509.00462

Hiring AIs prefer resumes written by AI.

A 2026 study of 9 leading language models found that LLM screeners systematically favor resumes they themselves generated, even when the human-written version is objectively better. The shortlist gap reaches 60%.

arXiv:2509.00462v3 · cs.CY · 9 Feb 2026

AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights

Jiannan Xu (University of Maryland) · Gujie Li (National University of Singapore) · Jane Yi Jiang (The Ohio State University)

Read the preprint on arXiv
82%
Self-preference bias

GPT-4o picked its own resume over the human one 82% of the time, even after controlling for content quality.

+60%
Shortlist boost

Candidates whose resume matched the screener's model were up to 60% more likely to be invited to interview.

8/9
Models affected

Eight of the nine LLMs tested showed positive bias. Only LLaMA-3.2-1B, the tiniest model, was neutral.

How they tested it

The authors ran a controlled correspondence experiment: take a real human-written resume, have an LLM rewrite the executive summary, then ask another LLM to pick the better version. Same candidate, same facts, same job, only the wording differs.

2,245
Real resumes from LiveCareer.com
9
LLMs tested as screeners
24
Occupational categories
18
Human annotators on Prolific

Each pair was scored under two fairness metrics: statistical parity (raw selection rate) and equal opportunity (after controlling for content quality with conditional logistic regression and human-graded ground truth).

To rule out any first-position effect, every comparison was randomly counterbalanced. To rule out a verbosity effect, all summaries were length-matched.
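As a back-of-the-envelope illustration, the two metrics can be computed from pair-level outcomes like this. The data, field names, and function names below are invented for the sketch; the paper itself fits a conditional logistic regression over the full 2,245-pair dataset rather than taking raw rates.

```python
# Toy illustration of the two fairness metrics described above.
# Each record: did the screener pick the AI rewrite, and which version
# human graders judged better ("ai", "human", or "tie")?
pairs = [
    {"picked_ai": True,  "better": "human"},
    {"picked_ai": True,  "better": "ai"},
    {"picked_ai": True,  "better": "human"},
    {"picked_ai": False, "better": "human"},
    {"picked_ai": False, "better": "tie"},
    {"picked_ai": False, "better": "ai"},
]

def statistical_parity(pairs):
    """Raw rate at which the AI-written version is selected."""
    return sum(p["picked_ai"] for p in pairs) / len(pairs)

def equal_opportunity_gap(pairs):
    """AI-selection rate restricted to pairs where human graders judged
    the *human* version better -- any positive rate here is preference
    that content quality cannot explain."""
    worse_ai = [p for p in pairs if p["better"] == "human"]
    return sum(p["picked_ai"] for p in worse_ai) / len(worse_ai)

print(statistical_parity(pairs))     # 3/6 = 0.5
print(equal_opportunity_gap(pairs))  # 2/3, despite the human version being better
```

The point of the second metric: a 50% raw rate looks fair, but conditioning on quality exposes the bias.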

Bigger model, stronger bias

Self-preference is not a quirk of one model. It is widespread across model families, and it scales with size. Below, each bar shows how much more likely a model is to pick its own resume over a human one of equivalent quality.

Evaluator model · Equal-opportunity bias %
  • GPT-4o
    +81.9%
  • LLaMA-3.3-70B
    +78.9%
  • Qwen-2.5-72B
    +78.0%
  • DeepSeek-V3
    +71.6%
  • GPT-4o-mini
    +67.9%
  • GPT-4-turbo
    +66.9%
  • Mistral-7B
    +28.0%
  • LLaMA-3.2-3B
    +11.6%
  • LLaMA-3.2-1B
    -1.4%
Legend: production-scale model, used in real hiring tools · sub-7B parameter model, research-only

Every model with enough capacity to be deployed in real screening pipelines shows over 65% bias. The smallest model is the only roughly neutral one, but it is not what employers run.

Which jobs get hit hardest

The authors simulated 30 hiring rounds across 24 occupations. The occupations at the top of the list are the worst-affected fields, where AI-polished resumes were most disproportionately shortlisted. Business-facing roles take the heaviest hit.

Occupation · Shortlist boost for AI resumes
  • Sales
    +60%
  • Accountant
    +58%
  • Business development
    +56%
  • Finance
    +53%
  • Teacher
    +49%
  • HR
    +44%
  • Engineering
    +32%
  • Consultant
    +30%
  • Agriculture
    +24%
  • Automobile
    +23%

Values are average shortlist gaps across GPT-4o, DeepSeek-V3, and LLaMA-3.3-70B as evaluators, read directly off Figure 7 of the paper.

Why this happens

The mechanism is self-recognition. Modern LLMs can implicitly identify text they have generated, even without being told the source. The stronger that recognition ability, the stronger the preference.

Crucially, this is not a content-quality story. In human-graded comparisons, the human-written summary was often clearer, more coherent, more honest about the candidate. The model still picked its own version.
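The mechanism can be collapsed into a toy decision rule: the evaluator's pairwise choice combines perceived quality with a self-recognition signal, so its own text can win even at a quality deficit. Everything below is illustrative; the numbers are made up and are not estimates from the paper.

```python
# Toy model of self-preference: the screener picks its own text when
# perceived quality plus a recognition bonus beats the rival's quality.
def screener_picks_own(quality_own, quality_other, recognition_bonus):
    """Return True if the evaluator selects its own text."""
    return quality_own + recognition_bonus > quality_other

# The human version is objectively better (7 vs 6), yet a modest
# recognition bonus flips the decision.
print(screener_picks_own(6.0, 7.0, recognition_bonus=0.0))  # False
print(screener_picks_own(6.0, 7.0, recognition_bonus=1.5))  # True
```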

Repeat this across every screening cycle and you get a lock-in effect: the stylistic patterns of dominant LLMs become entrenched in applicant pools, slowly squeezing out anyone whose resume does not sound model-shaped.

Can the bias be mitigated?

Yes, but only if you control the screener. The authors tested two interventions and both reduced bias by 17-63% in relative terms.

Strategy 1

System prompting

Instruct the evaluator model: "You should not consider or infer whether the resumes were written by a human or by AI. Focus only on the quality of the content." Cheap, but does not eliminate the bias.

Bias reduction up to 63%
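A minimal sketch of how the prompting mitigation slots into a screening call. The system prompt is quoted from the paper; the chat-message shape, function name, and parameters are assumptions (they follow the common chat-completions format, not a specific vendor API). The randomized resume order mirrors the counterbalancing step from the methodology.

```python
import random

# Debiasing instruction quoted from the paper's Strategy 1.
DEBIAS_SYSTEM_PROMPT = (
    "You should not consider or infer whether the resumes were written "
    "by a human or by AI. Focus only on the quality of the content."
)

def build_screening_messages(resume_a, resume_b, job_desc, rng=random):
    """Assemble a chat-style evaluation request with the debiasing system
    prompt and randomized resume order (first-position counterbalancing)."""
    first, second = (resume_a, resume_b) if rng.random() < 0.5 else (resume_b, resume_a)
    user = (
        f"Job description:\n{job_desc}\n\n"
        f"Resume 1:\n{first}\n\nResume 2:\n{second}\n\n"
        "Which candidate is the better fit? Answer 'Resume 1' or 'Resume 2'."
    )
    return [
        {"role": "system", "content": DEBIAS_SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

The messages list would then be passed to whichever evaluator model the pipeline uses.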
Strategy 2

Majority voting ensemble

Combine the main evaluator with smaller models that have weaker self-recognition. The smaller models dilute the dominant model's preference for its own outputs.

Stable across model families
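The voting logic itself is simple. In this sketch the model names are stand-ins and the ensemble composition is an assumption; the paper's point is only that small models with weak self-recognition dilute the large model's vote.

```python
from collections import Counter

def ensemble_verdict(votes):
    """votes maps evaluator name -> its pick ('human' or 'ai').
    Returns the majority pick; use an odd number of voters to avoid ties."""
    tally = Counter(votes.values())
    return tally.most_common(1)[0][0]

# A large model's self-preferring vote is outvoted by two small models.
votes = {
    "gpt-4o": "ai",           # strong self-recognition, picks its own text
    "llama-3.2-1b": "human",  # weak self-recognition, near-neutral
    "mistral-7b": "human",
}
print(ensemble_verdict(votes))  # human
```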

The catch. Both fixes require the employer to change their screening pipeline. As a candidate, you have no visibility into which model is reading your resume, so the only lever you actually control is matching your writing to whatever the screener was trained to like.

What this means for you

Three things follow directly from the paper, with no marketing spin.

01

A human-only resume is now a structural disadvantage

If the screener is GPT-4o and your competitor used GPT-4o, the screener picks your competitor 82% of the time, even when your resume is objectively better. This is a measurement, not an opinion.

02

Which model you use matters

Self-preference is asymmetric across model pairs. Tools that polish through a single fixed model bet on which screener you will meet. Tools that route through multiple models hedge that bet.

03

Substance still matters, but presentation pays the toll

The paper is explicit: bias persists even when content is held constant. So the goal is not to fake credentials, it is to express real ones in a register the screener was trained to favor.

See what your resume is missing

20 free credits. No card required.

Get your CV roasted