What the study found
The study found that a human-supervised pipeline using multiple large language models, combined with a consensus scheme, can reduce manual effort in filtering papers for systematic literature reviews. The authors report that the approach achieved lower error rates than single human annotators.
Why the authors say this matters
The authors say this matters because systematic literature reviews require analyzing large research fields, and the initial retrieval and filtering of papers is time-consuming and labor-intensive. They conclude that responsible human-AI collaboration can accelerate and improve systematic literature reviews, and that modern open-source models may make the method accessible and cost-effective.
What the researchers tested
The researchers proposed a pipeline that classifies papers using descriptive prompts and then makes the final include/exclude decision by consensus across multiple large language models. The process was human-supervised through LLMSurver, an open-source visual analytics web interface that allows real-time inspection and correction of model outputs.
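The abstract does not specify how the consensus is computed, so the following is only a minimal sketch of one plausible scheme: each model labels a paper "include" or "exclude", a strict majority decides, and ties (or, optionally, any disagreement) are escalated to a human reviewer. The function and model names are hypothetical, not the authors' implementation.

```python
from collections import Counter
from typing import Callable

def consensus_filter(
    paper: str,
    models: list[Callable[[str], str]],
    unanimity_required: bool = False,
) -> str:
    """Combine per-model labels; disagreements go to a human reviewer."""
    votes = Counter(model(paper) for model in models)
    label, count = votes.most_common(1)[0]
    if unanimity_required and count < len(models):
        return "human_review"   # any disagreement -> human decides
    if count > len(models) // 2:
        return label            # strict majority wins
    return "human_review"       # tie -> human decides

# Toy stand-ins for three LLM annotators (pure assumptions; real calls
# to open-source models would replace these keyword checks).
m1 = lambda p: "include" if "visualization" in p else "exclude"
m2 = lambda p: "include" if "LLM" in p else "exclude"
m3 = lambda p: "include"

print(consensus_filter("LLM visualization survey", [m1, m2, m3]))
print(consensus_filter("protein folding study", [m1, m2, m3]))
```

In this sketch the `unanimity_required` flag trades throughput for safety: requiring unanimous agreement routes more borderline papers to the human, which matches the paper's emphasis on keeping a human in the loop.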
What worked and what didn't
According to the abstract, the pipeline significantly reduced manual effort, achieved lower error rates than single human annotators, and performed well with modern open-source models. The abstract does not report specific failure cases or detailed comparisons beyond these points.
What to keep in mind
The available summary does not describe detailed limitations, error modes, or boundary conditions. The evaluation used ground-truth data from one recent systematic literature review with 8,323 candidate papers, so the abstract only supports conclusions within that setting.
Key points
- A multi-LLM, human-supervised consensus pipeline was proposed for filtering papers in systematic literature reviews.
- The authors report lower error rates than single human annotators.
- The approach significantly reduced manual effort in the review-filtering process.
- Modern open-source models were reported as sufficient, suggesting the method may be cost-effective.
- The evaluation used ground-truth data from a recent review with 8,323 candidate papers.
Disclosure
- Research title: LLM consensus pipeline reduced review filtering effort
- Authors: Lucas Joos, Daniel A. Keim, Maximilian T. Fischer
- Institution: University of Konstanz
- Publication date: 2026-02-16
- OpenAlex record: available