Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

Sai Suresh Macharla Vasu*¹,², Ivaxi Sheth*², Hui-Po Wang², Ruta Binkyte², Mario Fritz²

¹Saarland University · ²CISPA Helmholtz Center for Information Security

*Equal contribution

Why Does This Matter?

LLMs are increasingly used to assist academic peer review, but do they judge papers on merit alone? We test whether LLMs replicate well-known human biases by having them review the same paper under different author profiles along four axes: institutional affiliation, author gender, academic seniority, and publication history. Using a counterfactual design on 252 ICLR 2025 papers reviewed by 9 LLMs, we find that most models systematically assign higher scores to papers attributed to prestigious institutions, senior author profiles, and authors with many top-tier publications. Gender bias is inconsistent in direction across models. These results call for bias-aware evaluation protocols before LLMs are deployed in high-stakes scholarly decisions.

Key Findings

Up to 27% of papers rated higher when attributed to a prestigious institution (Gemini Flash Lite, affiliation bias)

Up to 48% of papers rated higher for Senior PI vs. undergraduate author (Gemini Flash Lite, seniority bias)

Up to 52% of papers rated higher for authors with many top-tier publications (QwQ / GPT-4o-mini, publication history bias)

Mixed gender bias: direction varies across models, no clear pattern (9 models tested)

Results

We evaluate 9 LLMs across four bias dimensions using ICLR 2025 papers. Each bar is split into: advantaged group scores higher (blue) · tied (gray) · disadvantaged group scores higher (red). The value on the right is the net bias score (blue% − red%).
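The bar segments and the net bias score described above can be computed from per-paper score pairs. The following is a minimal sketch, not the authors' code: the function name `bias_breakdown`, the input layout, and the example scores are all assumptions made for illustration.

```python
from collections import Counter


def bias_breakdown(score_pairs):
    """Summarise paired scores for one model and one bias dimension.

    score_pairs: iterable of (advantaged_score, disadvantaged_score),
    one pair per paper. This input layout is an assumption.
    """
    counts = Counter()
    for advantaged, disadvantaged in score_pairs:
        if advantaged > disadvantaged:
            counts["advantaged"] += 1      # blue segment
        elif advantaged < disadvantaged:
            counts["disadvantaged"] += 1   # red segment
        else:
            counts["tie"] += 1             # gray segment

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no score pairs given")

    pct = {
        key: 100.0 * counts[key] / total
        for key in ("advantaged", "tie", "disadvantaged")
    }
    # Net bias score as defined in the text: blue% minus red%.
    pct["net_bias"] = pct["advantaged"] - pct["disadvantaged"]
    return pct


# Made-up scores for four papers, purely illustrative.
pairs = [(6, 5), (6, 6), (5, 6), (7, 5)]
result = bias_breakdown(pairs)
print(result)
```

A positive `net_bias` means the advantaged profile wins more often than it loses; a value near zero can still hide many wins and losses that cancel out, which is why the three segments are reported alongside it.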

Institutional Affiliation Bias

% of papers receiving a higher LLM score when attributed to a prestigious institution vs. a less-ranked one (same paper, same author name).

Legend: RS wins · Tie · RW wins

RS × RW Affiliation Matrix

Each cell (RS row, RW column) shows the number of papers where the RS-affiliated author received a strictly higher LLM score. Affiliations sorted by net wins.
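A matrix of this kind could be tallied roughly as follows. This is a sketch under stated assumptions: the record layout, the affiliation names `RS-A`, `RW-X`, and so on, and the scores are placeholders, and "net wins" is interpreted here as strict wins minus strict losses per affiliation, which the page does not spell out.

```python
from collections import defaultdict


def affiliation_matrix(records):
    """Count strict RS wins per (RS, RW) affiliation pair.

    records: iterable of (rs_affiliation, rw_affiliation, rs_score, rw_score),
    one per reviewed paper and pairing. This layout is an assumption.
    """
    wins = defaultdict(int)    # (rs, rw) -> papers where RS score is strictly higher
    net_rs = defaultdict(int)  # rs -> strict wins minus strict losses
    net_rw = defaultdict(int)  # rw -> strict wins minus strict losses

    for rs, rw, rs_score, rw_score in records:
        wins[(rs, rw)] += int(rs_score > rw_score)
        delta = int(rs_score > rw_score) - int(rs_score < rw_score)
        net_rs[rs] += delta
        net_rw[rw] -= delta

    # Sort by net wins, highest first; ties are broken alphabetically.
    rows = sorted(net_rs, key=lambda name: (-net_rs[name], name))
    cols = sorted(net_rw, key=lambda name: (-net_rw[name], name))
    matrix = [[wins[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


# Placeholder affiliations and scores, purely illustrative.
records = [
    ("RS-A", "RW-X", 7, 5),
    ("RS-A", "RW-X", 6, 6),
    ("RS-A", "RW-Y", 6, 5),
    ("RS-B", "RW-X", 5, 6),
    ("RS-B", "RW-Y", 6, 5),
]
rows, cols, matrix = affiliation_matrix(records)
for name, row in zip(rows, matrix):
    print(name, row)
```

Ties contribute nothing to a cell, matching the "strictly higher" wording above.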


Academic Seniority Bias

% of papers receiving a higher LLM score when attributed to a Senior PI (20+ years post-PhD) vs. an undergraduate student.

Legend: Senior PI wins · Tie · UG wins

Publication History Bias

% of papers receiving a higher LLM score when attributed to an author with 100 top-tier publications vs. 0 publications.

Legend: 100 TTP wins · Tie · 0 TTP wins

Gender Bias

% of papers rated higher under a male vs. female author name. Results are mixed — neither direction dominates across all models.

Note: Gender bias direction varies by model. Blue bars = male-biased; red bars = female-biased.

Experimental Setup

Dataset

252 papers from ICLR 2025 (accepted & rejected). Each paper reviewed under multiple author profiles per bias dimension.

9 LLMs Evaluated

GPT-4o-mini · Gemini Flash Lite · LLaMA 3.1-70B · LLaMA 3.1-8B · Mistral-Small-22B · Mistral-8B · DeepSeek-Qwen-32B · DeepSeek-R1-8B · QwQ

4 Bias Dimensions

Affiliation — prestigious vs. less-ranked institutions
Gender — male vs. female author names
Seniority — Senior PI vs. undergraduate
Publication History — 100 vs. 0 top-tier papers

Evaluation Protocol

Counterfactual design: identical paper content reviewed under different author metadata. LLM score compared directly between conditions for each paper.
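The counterfactual protocol can be sketched as a loop that sends the same paper text twice, changing only the author metadata. The prompt wording, the `AuthorProfile` class, and the `stub_reviewer` function below are hypothetical stand-ins and not the prompts or models used in the study; a real run would replace `stub_reviewer` with a call to an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorProfile:
    label: str
    description: str  # the only text that differs between the two conditions


def build_prompt(paper_text, profile):
    # Hypothetical prompt template; only the author line varies.
    return (
        "You are reviewing a conference submission.\n"
        f"Author information: {profile.description}\n\n"
        f"Paper:\n{paper_text}\n\n"
        "Give an overall score."
    )


def counterfactual_scores(papers, profile_a, profile_b, review_fn):
    """Return {paper_id: (score_under_a, score_under_b)} for identical content."""
    pairs = {}
    for paper_id, text in papers.items():
        score_a = review_fn(build_prompt(text, profile_a))
        score_b = review_fn(build_prompt(text, profile_b))
        pairs[paper_id] = (score_a, score_b)
    return pairs


def stub_reviewer(prompt):
    # Toy stand-in for an LLM call: deliberately biased so the pairing
    # logic can be checked without network access.
    return 6 if "Senior PI" in prompt else 5


senior = AuthorProfile("senior", "Senior PI (20+ years post-PhD)")
undergrad = AuthorProfile("undergrad", "undergraduate student")
papers = {"p1": "Placeholder text of paper one.", "p2": "Placeholder text of paper two."}

score_pairs = counterfactual_scores(papers, senior, undergrad, stub_reviewer)
print(score_pairs)
```

Because both prompts are built from the same template and the same paper text, any score difference within a pair can only come from the author line, up to sampling randomness in the model itself.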