ACL 2026 Findings

Justice in Judgment

Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

Sai Suresh Macharla Vasu*,1,2,  Ivaxi Sheth*,2,  Hui-Po Wang2,  Ruta Binkyte2,  Mario Fritz2

1Saarland University  ·  2CISPA Helmholtz Center for Information Security

*Equal contribution

Why Does This Matter?

LLMs are increasingly used to assist academic peer review, but do they judge papers on merit alone? We investigate whether LLMs replicate well-known human biases by reviewing the same paper under different author profiles across four axes: institutional affiliation, author gender, academic seniority, and publication history. Using a counterfactual design on 252 ICLR 2025 papers reviewed by 9 LLMs, we find that most models systematically assign higher scores to papers from prestigious institutions, senior author profiles, and authors with many top-tier publications. Gender bias is inconsistent in direction across models. These results call for bias-aware evaluation protocols before deploying LLMs in high-stakes scholarly decisions.

Key Findings

Up to 27%
of papers rated higher when attributed to a prestigious institution
Gemini Flash Lite (affiliation bias)

Up to 48%
of papers rated higher for Senior PI vs. undergraduate author
Gemini Flash Lite (seniority bias)

Up to 52%
of papers rated higher for authors with many top-tier publications
QwQ / GPT-4o-mini (publication history bias)

Mixed
gender bias: direction varies across models, no clear pattern
9 models tested (gender bias)

Results

We evaluate 9 LLMs across four bias dimensions using ICLR 2025 papers. Each bar is split into three segments: advantaged group scores higher (blue), tied (gray), and disadvantaged group scores higher (red). The value on the right is the net bias score (blue% − red%).
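The three bar segments and the net bias score can be recomputed from paired scores. The following is a minimal sketch, not the authors' code: the function name `bias_breakdown`, the input layout (two equally long score lists, one entry per paper), and the example scores are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def bias_breakdown(adv_scores: Sequence[float],
                   dis_scores: Sequence[float]) -> Dict[str, float]:
    """Summarise paired scores for one model and one bias dimension.

    adv_scores[i] and dis_scores[i] are the scores the same paper received
    under the advantaged and disadvantaged author profile (assumed layout).
    """
    if len(adv_scores) != len(dis_scores) or len(adv_scores) == 0:
        raise ValueError("need two non-empty score lists of equal length")

    n = len(adv_scores)
    pairs = list(zip(adv_scores, dis_scores))
    adv_wins = sum(a > d for a, d in pairs)   # blue segment
    dis_wins = sum(a < d for a, d in pairs)   # red segment
    ties = n - adv_wins - dis_wins            # gray segment

    adv_pct = 100.0 * adv_wins / n
    dis_pct = 100.0 * dis_wins / n
    return {
        "adv_pct": adv_pct,
        "tie_pct": 100.0 * ties / n,
        "dis_pct": dis_pct,
        "net_bias": adv_pct - dis_pct,        # blue% minus red%
    }


# Made-up scores for four papers, purely illustrative.
example = bias_breakdown([6, 5, 7, 4], [5, 5, 6, 5])
```

A positive `net_bias` means the advantaged profile wins more often than it loses; a value near zero can still hide many non-tied pairs, which is why the three segments are reported alongside it.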

Institutional Affiliation Bias

% of papers receiving a higher LLM score when attributed to a prestigious institution vs. a less-ranked one (same paper, same author name).

Legend: RS wins · Tie · RW wins

RS × RW Affiliation Matrix

Each cell (RS row, RW column) shows the number of papers where the RS-affiliated author received a strictly higher LLM score. Affiliations sorted by net wins.
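The cell counts described above can be tallied as in the sketch below. It is an illustration under stated assumptions: the record layout `(rs_affiliation, rw_affiliation, rs_score, rw_score)`, the function name `rs_win_counts`, and the placeholder affiliation labels are hypothetical. The sorting by net wins is not reproduced because its exact definition is not given here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (rs_affiliation, rw_affiliation, rs_score, rw_score): assumed record layout
Record = Tuple[str, str, float, float]


def rs_win_counts(records: Iterable[Record]) -> Dict[str, Dict[str, int]]:
    """Count, per (RS row, RW column), papers where the RS score is strictly higher."""
    matrix: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for rs_aff, rw_aff, rs_score, rw_score in records:
        # Adding 0 for ties and losses keeps the cell present in the matrix.
        matrix[rs_aff][rw_aff] += int(rs_score > rw_score)
    return {rs: dict(row) for rs, row in matrix.items()}


# Placeholder affiliations and made-up scores, purely illustrative.
demo_matrix = rs_win_counts([
    ("RS-A", "RW-X", 7, 6),
    ("RS-A", "RW-X", 6, 6),
    ("RS-A", "RW-Y", 5, 6),
    ("RS-B", "RW-X", 8, 5),
])
```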

[Figure: Affiliation bias heatmap]

Academic Seniority Bias

% of papers receiving a higher LLM score when attributed to a Senior PI (20+ years post-PhD) vs. an undergraduate student.

Legend: Senior PI wins · Tie · UG wins

Publication History Bias

% of papers receiving a higher LLM score when attributed to an author with 100 top-tier publications vs. 0 publications.

Legend: 100 TTP wins · Tie · 0 TTP wins

Gender Bias

% of papers rated higher under a male vs. female author name. Results are mixed: neither direction dominates across all models.

Note: Gender bias direction varies by model. Blue bars = male-biased; red bars = female-biased.

Experimental Setup

Dataset

252 papers from ICLR 2025 (accepted & rejected). Each paper is reviewed under multiple author profiles per bias dimension.

9 LLMs Evaluated

GPT-4o-mini · Gemini Flash Lite · LLaMA 3.1-70B · LLaMA 3.1-8B · Mistral-Small-22B · Mistral-8B · DeepSeek-Qwen-32B · DeepSeek-R1-8B · QwQ

4 Bias Dimensions

  • Affiliation: prestigious vs. less-ranked institutions
  • Gender: male vs. female author names
  • Seniority: Senior PI vs. undergraduate
  • Publication History: 100 vs. 0 top-tier papers

Evaluation Protocol

Counterfactual design: identical paper content is reviewed under different author metadata. The LLM score is then compared directly between the two conditions for each paper.
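A counterfactual comparison of this kind can be sketched as follows. This is a hedged illustration, not the prompt or pipeline used in the study: the prompt wording, the profile fields, and the helper names `build_prompt` and `counterfactual_gap` are assumptions, and `score_fn` stands in for whatever call sends a prompt to an LLM and parses a numeric score from the reply.

```python
from typing import Callable, Dict


def build_prompt(paper_text: str, profile: Dict[str, str]) -> str:
    """Attach author metadata to an otherwise unchanged paper (assumed format)."""
    metadata = "\n".join(f"{key}: {value}" for key, value in profile.items())
    return (
        "Author information:\n"
        f"{metadata}\n\n"
        "Paper:\n"
        f"{paper_text}\n\n"
        "Provide an overall review score."
    )


def counterfactual_gap(paper_text: str,
                       profile_a: Dict[str, str],
                       profile_b: Dict[str, str],
                       score_fn: Callable[[str], float]) -> float:
    """Score the same paper under two profiles and return score_a - score_b."""
    score_a = score_fn(build_prompt(paper_text, profile_a))
    score_b = score_fn(build_prompt(paper_text, profile_b))
    return score_a - score_b


def stub_scorer(prompt: str) -> float:
    """Stand-in for a real LLM call; deliberately biased so the gap is visible."""
    return 7.0 if "Senior PI" in prompt else 6.0


gap = counterfactual_gap(
    "Placeholder paper text.",
    {"Seniority": "Senior PI"},
    {"Seniority": "Undergraduate student"},
    stub_scorer,
)
```

Because only the metadata block differs between the two prompts, any nonzero gap is attributable to the author profile (plus sampling noise if the model is non-deterministic).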