AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights
Jiannan Xu (University of Maryland) · Gujie Li (National University of Singapore) · Jane Yi Jiang (The Ohio State University)
Read the preprint on arXiv
A 2026 study of 9 leading language models found that LLM screeners systematically favor resumes generated by themselves, even when the human-written version is objectively better. The shortlist gap reaches 60%.
GPT-4o picked its own resume over the human one 82% of the time, even after controlling for content quality.
Candidates whose resume was written by the same model as the screener were up to 60% more likely to be invited to interview.
Eight of the nine LLMs tested showed a positive self-preference bias. Only LLaMA-3.2-1B, the smallest model tested, was roughly neutral.
The authors ran a controlled correspondence experiment: take a real human-written resume, have an LLM rewrite the executive summary, then ask another LLM to pick the better version. Same candidate, same facts, same job; only the wording differs.
Each pair was scored under two fairness metrics: statistical parity (the raw selection rate) and equal opportunity (the selection rate after controlling for content quality, using conditional logistic regression against human-graded ground truth).
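To make the two metrics concrete, here is a minimal pandas sketch. It is not the authors' code: the column names are assumptions, and the quality conditioning is a simple stratification rather than the paper's conditional logistic regression.

```python
import pandas as pd

# Hypothetical results table: one row per resume pair shown to a screener.
# picked_ai    - 1 if the screener chose the LLM-rewritten summary
# human_better - 1 if human graders rated the human-written summary at least as good
df = pd.DataFrame({
    "picked_ai":    [1, 1, 0, 1, 1, 0, 1, 1],
    "human_better": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Statistical parity: the raw rate at which the AI-written version is selected.
# A neutral screener would sit near 0.5.
statistical_parity = df["picked_ai"].mean()

# Equal-opportunity-style check: the selection rate restricted to pairs where the
# human summary was graded at least as good, so content quality cannot explain the pick.
equal_opportunity = df.loc[df["human_better"] == 1, "picked_ai"].mean()

print(f"raw AI-selection rate: {statistical_parity:.2f}")
print(f"AI-selection rate when the human version is at least as good: {equal_opportunity:.2f}")
```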
To rule out any first-position effect, every comparison was randomly counterbalanced. To rule out a verbosity effect, all summaries were length-matched.
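A minimal sketch of one counterbalanced comparison, assuming a hypothetical ask_screener helper that calls the evaluator LLM and returns 'A' or 'B'; the prompt wording is illustrative, not the paper's.

```python
import random

def ask_screener(prompt: str) -> str:
    """Placeholder for a call to the evaluator LLM; returns 'A' or 'B'."""
    raise NotImplementedError

def compare(human_summary: str, ai_summary: str, job_description: str) -> str:
    """Run one counterbalanced comparison; returns 'human' or 'ai'."""
    # Randomize which version appears first so a first-position preference
    # cannot masquerade as self-preference.
    if random.random() < 0.5:
        first, second, order = human_summary, ai_summary, ("human", "ai")
    else:
        first, second, order = ai_summary, human_summary, ("ai", "human")

    prompt = (
        f"Job description:\n{job_description}\n\n"
        f"Resume A executive summary:\n{first}\n\n"
        f"Resume B executive summary:\n{second}\n\n"
        "Which candidate would you shortlist? Answer 'A' or 'B'."
    )
    choice = ask_screener(prompt)
    return order[0] if choice.strip().upper().startswith("A") else order[1]
```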
Self-preference is not a quirk of one model. It is widespread across model families, and it scales with size. Below, each bar shows how much more likely a model is to pick its own resume over a human one of equivalent quality.
Every model with enough capacity to be deployed in real screening pipelines shows over 65% bias. The smallest model is the only roughly neutral one, but it is not what employers run.
The authors simulated 30 hiring rounds across 24 occupations. Top values are the worst-affected fields, where AI-polished resumes were most disproportionately shortlisted. Business-facing roles take the heaviest hit.
Values are average shortlist gaps with GPT-4o, DeepSeek-V3, and LLaMA-3.3-70B as evaluators, read directly from Figure 7 of the paper.
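For intuition about what a shortlist gap measures, here is a toy simulation. The pool composition, the bias weight, and the gap definition are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_round(bias: float, pool_size: int = 20, shortlist_size: int = 5) -> float:
    """One hiring round: half the pool is AI-polished, half human-written.
    `bias` is the extra selection weight the screener gives AI-polished resumes.
    Returns the AI-polished share of the shortlist minus their share of the pool (0.5)."""
    pool = np.array(["ai"] * (pool_size // 2) + ["human"] * (pool_size // 2))
    weights = np.where(pool == "ai", 1.0 + bias, 1.0)
    shortlist = rng.choice(pool, size=shortlist_size, replace=False, p=weights / weights.sum())
    return (shortlist == "ai").mean() - 0.5

# Average shortlist gap over 30 simulated rounds, the number of rounds the paper uses.
gaps = [simulate_round(bias=0.8) for _ in range(30)]
print(f"average shortlist gap: {np.mean(gaps):+.2f}")
```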
The mechanism is self-recognition. Modern LLMs can implicitly identify text they have generated, even without being told the source. The stronger that recognition ability, the stronger the preference.
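One way to probe that recognition ability, sketched with a hypothetical ask_model call into the model under test; this mirrors the counterbalanced setup above and is not the authors' exact protocol.

```python
import random

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the model being probed; returns 'A' or 'B'."""
    raise NotImplementedError

def self_recognition_trial(own_text: str, human_text: str) -> bool:
    """One trial: without authorship labels on the texts, can the model pick out its own writing?"""
    if random.random() < 0.5:
        a, b, own_position = own_text, human_text, "A"
    else:
        a, b, own_position = human_text, own_text, "B"
    prompt = (
        "One of these two executive summaries was written by you and one by a human.\n\n"
        f"A:\n{a}\n\nB:\n{b}\n\n"
        "Which one did you write? Answer 'A' or 'B'."
    )
    return ask_model(prompt).strip().upper().startswith(own_position)
```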
Crucially, this is not a content-quality story. In human-graded comparisons, the human-written summary was often clearer, more coherent, more honest about the candidate. The model still picked its own version.
Repeat this across every screening cycle and you get a lock-in effect: the stylistic patterns of dominant LLMs become entrenched in applicant pools, slowly squeezing out anyone whose resume does not sound model-shaped.
The bias can be reduced, but only by whoever controls the screener. The authors tested two interventions, and both cut the bias by 17-63% in relative terms.
Instruct the evaluator model: "You should not consider or infer whether the resumes were written by a human or by AI. Focus only on the quality of the content." Cheap, but does not eliminate the bias.
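In practice the instruction is simply prepended to the evaluator's prompt. A sketch using the OpenAI Python client; the model name and the surrounding screening prompt are assumptions, only the quoted instruction comes from the paper.

```python
from openai import OpenAI

client = OpenAI()

MITIGATION = (
    "You should not consider or infer whether the resumes were written by a human "
    "or by AI. Focus only on the quality of the content."
)

def screen_pair(resume_a: str, resume_b: str, job_description: str) -> str:
    """Ask the evaluator to pick a resume, with the debiasing instruction prepended."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed screener; swap in whatever model the pipeline uses
        messages=[
            {"role": "system", "content": MITIGATION},
            {"role": "user", "content": (
                f"Job description:\n{job_description}\n\n"
                f"Resume A:\n{resume_a}\n\nResume B:\n{resume_b}\n\n"
                "Which candidate would you shortlist? Answer 'A' or 'B'."
            )},
        ],
    )
    return response.choices[0].message.content.strip()
```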
Combine the main evaluator with smaller models that have weaker self-recognition. The smaller models dilute the dominant model's preference for its own outputs.
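A sketch of the ensemble idea as simple majority voting across evaluators; the paper combines a main evaluator with smaller ones, but the exact combination rule here is an assumption.

```python
from collections import Counter

def ensemble_screen(evaluators, resume_a: str, resume_b: str, job_description: str) -> str:
    """Each evaluator is a callable (resume_a, resume_b, job_description) -> 'A' or 'B'.
    Pairing a large screener with smaller models, whose self-recognition is weaker,
    dilutes the large model's preference for its own outputs."""
    votes = Counter(ev(resume_a, resume_b, job_description) for ev in evaluators)
    return votes.most_common(1)[0][0]

# Usage: evaluators = [screen_with_gpt4o, screen_with_small_1, screen_with_small_2]
# winner = ensemble_screen(evaluators, resume_a, resume_b, job_description)
```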
The catch. Both fixes require the employer to change their screening pipeline. As a candidate, you have no visibility into which model is reading your resume, so the only lever you actually control is matching your writing to whatever the screener was trained to like.
Three things follow directly from the paper, with no marketing spin.
If the screener is GPT-4o and your competitor's resume was polished by GPT-4o, the screener picks your competitor 82% of the time, even when your resume is objectively better. This is a measurement, not an opinion.
Self-preference is asymmetric across model pairs. Tools that polish through a single fixed model bet on which screener you will meet. Tools that route through multiple models hedge that bet.
The paper is explicit: bias persists even when content is held constant. So the goal is not to fake credentials; it is to express real ones in a register the screener was trained to favor.
Original paper
Read AI Self-preferencing in Algorithmic Hiring on arXiv