
Pre-review to Peer Review | Pitfalls of Automating Reviews Using Large Language Models
$\textbf{Abstract}$ Large Language Models are versatile general-task solvers whose capabilities can genuinely assist scholarly peer review as $\textit{pre-review}$ agents, if not as fully autonomous $\textit{peer-review}$ agents. While potentially beneficial, automating academic peer review raises concerns about the safety, research integrity, and validity of the review process. The majority of studies that systematically evaluate frontier LLMs generating reviews across scientific disciplines fail to address the alignment/misalignment question and place no emphasis on assessing the effect of reviews on post-publication outcomes: $\textbf{citations}$, $\textbf{hit papers}$, $\textbf{novelty}$, and $\textbf{disruption}$. We present an experimental study that gathers ground-truth reviewer rating scores from OpenReview and uses several frontier open-weight LLMs ($\textbf{Gemma-3 27b, Qwen-3 32b, Phi-4, Olmo2-32b}$, and $\textbf{Llama 3.3 70b}$) to generate reviews of the same manuscripts, gauging the safety and reliability of involving language models in the scientific review pipeline. Connecting the safety and reliability of LLM-assisted academic peer review with post-publication outcomes makes it easier to highlight both the potential and the pitfalls of automating peer review with language models, and charts a pathway toward making the process agentic. We open-source our dataset $D_{LMRSD}$ to help the research community expand the safety framework for automating scientific reviews. ...
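
As a rough illustration of the pipeline the abstract describes, the sketch below pulls human reviewer ratings from OpenReview and prompts one of the open-weight models for a review of the same manuscript. This is a minimal sketch, not the paper's released code: the ICLR 2024 invitation id is an arbitrary example venue, and the use of a local Ollama server with the `gemma3:27b` tag and the review prompt are illustrative assumptions.

```python
# Minimal sketch (not the authors' released pipeline). Assumptions:
#   - openreview-py (API v1 client) and the ollama Python package are installed,
#   - an Ollama server is running locally with gemma3:27b pulled,
#   - the ICLR 2024 invitation id is used purely as an example venue.
import openreview
import ollama

client = openreview.Client(baseurl="https://api.openreview.net")

# Fetch a handful of submissions from the example venue.
submissions = client.get_notes(
    invitation="ICLR.cc/2024/Conference/-/Blind_Submission", limit=3
)

for paper in submissions:
    # Ground truth: official reviews posted on the paper's forum.
    forum_notes = client.get_notes(forum=paper.id)
    ratings = [
        n.content.get("rating")
        for n in forum_notes
        if n.invitation.endswith("Official_Review")
    ]

    # LLM-generated review of the same manuscript (abstract only, for brevity).
    prompt = (
        "Write a short peer review with a 1-10 rating for this abstract:\n\n"
        + paper.content.get("abstract", "")
    )
    reply = ollama.chat(
        model="gemma3:27b",
        messages=[{"role": "user", "content": prompt}],
    )

    print(paper.content.get("title"))
    print("  human ratings:", ratings)
    print("  LLM review:", reply["message"]["content"][:200], "...")
```

Comparing the extracted human ratings against scores parsed from the generated reviews would then give one concrete signal of alignment or misalignment per manuscript.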