PaperBanana & ScholarPeer: Can AI Agents Actually Help Academic Research?

Google Research recently released two AI agents aimed at improving the academic workflow: PaperBanana (automated figure generation for AI papers) and ScholarPeer (AI-assisted peer review, arXiv:2601.22638). Both are interesting enough to warrant a careful look, not just at what they do, but at what that means for how we do research.

I've been spending time with both, and I have thoughts that go beyond the press release.

PaperBanana: Automating Academic Illustration

If you've written an ML paper, you've spent a disproportionate amount of time making figures. Architecture diagrams, training curves, attention maps, algorithm flowcharts: figures are how ideas get communicated in papers, and making good ones is genuinely hard and time-consuming.

PaperBanana (arXiv:2601.23265) is an agent built to automate this. Given a paper's text, it generates publication-ready figures through a pipeline of five specialized sub-agents working in sequence:

  1. Retriever. Parses the manuscript to extract key claims, methods, and results that need visual support.
  2. Planner. Decides which figures are necessary, what type each should be, and what data they should encode.
  3. Stylist. Applies consistent formatting, color schemes, and labeling conventions across all figures.
  4. Visualizer. Generates the actual figure code (matplotlib, SVG, or similar).
  5. Critic. Reviews each output against the paper's claims and sends it back for refinement if it doesn't pass.
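The five steps above can be sketched as a sequential pipeline with a refinement loop around the Critic. This is purely illustrative: the agent names mirror the paper's pipeline, but every interface and heuristic below is my own invention, not PaperBanana's actual API.

```python
def retriever(manuscript: str) -> list[str]:
    """Extract claims that need visual support (toy keyword pass)."""
    return [line for line in manuscript.splitlines() if "result" in line.lower()]

def planner(claims: list[str]) -> list[dict]:
    """Decide one figure per claim (toy heuristic)."""
    return [{"claim": c, "type": "bar_chart"} for c in claims]

def stylist(figures: list[dict]) -> list[dict]:
    """Apply shared formatting conventions across all figures."""
    return [{**f, "palette": "colorblind", "font": "Helvetica"} for f in figures]

def visualizer(figure: dict) -> dict:
    """Stand-in for code generation (matplotlib/SVG in the real system)."""
    return {**figure, "code": f"plot({figure['claim']!r})"}

def critic(figure: dict) -> bool:
    """Accept only figures whose generated code references the claim."""
    return figure["claim"] in figure["code"]

def run_pipeline(manuscript: str, max_rounds: int = 3) -> list[dict]:
    figures = stylist(planner(retriever(manuscript)))
    outputs = []
    for fig in figures:
        for _ in range(max_rounds):  # Critic loop: regenerate until it passes
            candidate = visualizer(fig)
            if critic(candidate):
                outputs.append(candidate)
                break
    return outputs
```

The interesting design choice is the Critic's position: it gates each figure individually and sends failures back through the Visualizer, rather than reviewing the whole figure set once at the end.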

On PaperBananaBench, a human-evaluated benchmark, PaperBanana scores 60.2, above the 50.0 human baseline. The PaperVizAgent codebase is open source.

What It Gets Right

For standard figure types (bar charts, line plots, confusion matrices, comparison tables), PaperBanana produces reasonable results quickly. The five-agent pipeline's Critic step meaningfully improves consistency: generating 10 comparison figures that all maintain the same formatting and color conventions is tedious for humans, and the agent handles it well.
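The consistency win is worth making concrete: keeping ten figures on the same palette and label conventions is mechanical but error-prone by hand. A minimal stdlib-only sketch, where the figure-spec format and the `HOUSE_STYLE` conventions are invented for illustration (a real system would be enforcing matplotlib configuration):

```python
# Hypothetical shared style conventions for all figures in one paper.
HOUSE_STYLE = {
    "palette": ["#0173b2", "#de8f05", "#029e73"],  # fixed color order
    "font_size": 9,
}

def apply_style(spec: dict) -> dict:
    """Overwrite per-figure styling with the shared conventions."""
    styled = dict(spec)
    styled.update(HOUSE_STYLE)
    styled["ylabel"] = styled["ylabel"].capitalize()  # e.g. sentence-case labels
    return styled

def check_consistency(specs: list[dict]) -> bool:
    """The Critic-style check: every figure uses the same palette and font size."""
    return len({(tuple(s["palette"]), s["font_size"]) for s in specs}) == 1

figures = [apply_style({"title": f"Ablation {i}", "ylabel": "accuracy (%)"})
           for i in range(10)]
```

The point is that consistency is a property of the *set* of figures, which is why a final checking pass over all outputs catches drift that per-figure generation misses.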

What It Misses

The harder figure problem in ML papers isn't "make a bar chart of these numbers"; it's "create a figure that communicates a non-obvious conceptual point." Architecture figures for novel methods, diagrams showing why an existing approach fails, visualizations of emergent model behavior: these require human judgment about what the reader needs to understand.

PaperBanana struggles with figures that require genuine conceptual clarity rather than data visualization. It can always generate something, but the result is often technically correct yet communicatively mediocre. In papers, a mediocre figure that fails to convey the idea is worse than no figure.

ScholarPeer: AI-Assisted Peer Review

The second tool is more provocative. ScholarPeer (arXiv:2601.22638) is an AI agent that reads an academic paper and generates a structured review: evaluating novelty, experimental rigor, clarity, and significance; identifying weaknesses; suggesting improvements.

It uses a more sophisticated architecture than a simple LLM prompt: a dual-stream context acquisition system that combines a historian agent (for background literature), a baseline scout (for identifying missing comparisons), and a multi-aspect Q&A engine grounded in live web-scale literature search. This grounds the review in current work rather than just training data.
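The dual-stream idea can be sketched in a few lines: two retrieval streams feed one review context. Everything below, the function names, data shapes, and toy lookups, is my own invention to illustrate the architecture; the real system is grounded in live web-scale literature search rather than a static index.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewContext:
    background: list[str] = field(default_factory=list)        # historian stream
    missing_baselines: list[str] = field(default_factory=list) # scout stream

def historian(topic: str, index: dict[str, list[str]]) -> list[str]:
    """Background literature for the paper's topic (toy lookup)."""
    return index.get(topic, [])

def baseline_scout(cited: set[str], known_baselines: set[str]) -> list[str]:
    """Flag baselines the paper should compare against but doesn't cite."""
    return sorted(known_baselines - cited)

def build_context(topic: str, cited: set[str],
                  index: dict[str, list[str]],
                  known_baselines: set[str]) -> ReviewContext:
    """Merge both streams into one grounded context for the review engine."""
    return ReviewContext(
        background=historian(topic, index),
        missing_baselines=baseline_scout(cited, known_baselines),
    )
```

The separation matters: the historian answers "what is this paper building on?", while the scout answers "what comparison is conspicuously absent?", and the Q&A engine consumes both.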

Peer review is under enormous strain. Submission volumes at top venues (NeurIPS, ICML, ICLR) have grown 5–10× in the last decade, but the pool of qualified reviewers hasn't grown proportionally. So the problem ScholarPeer targets is real. The question is whether AI reviews are useful or just plausible-sounding noise.

What ScholarPeer Can Do

For papers that are clearly weak, ScholarPeer is genuinely useful. Identifying missing baselines, inconsistent claims, results that don't support conclusions, obvious related work omissions: these are pattern-matching tasks that an LLM with web-scale literature grounding does reasonably well.

The live literature search component is the differentiator here. Rather than relying solely on training data, ScholarPeer's baseline scout can flag recent competing work the authors may have missed. As an author-facing checklist before submission, this has real value.

The Hard Problem: Evaluating Novelty

The hardest part of peer review is evaluating genuine novelty. Is this contribution actually new? Does it advance the state of the art in a meaningful way, or is it an incremental variation of prior work?

This requires knowing the current frontier of a subfield โ€” not just indexed papers, but preprints, workshop papers, ongoing work, and informal community knowledge. It requires judgment about what constitutes a meaningful advance, which is inherently subjective.

Even with web-scale grounding, LLMs are unreliable at this. They can confidently claim novelty when there's direct overlap with a very recent preprint, or flag something as "not novel" based on superficial similarity to a different paper. These errors are hard to detect because the outputs sound authoritative.

My Take on AI Reviews

AI reviews are most valuable as author-facing tools, not reviewer-facing ones. Using ScholarPeer on your own paper before submission, to catch missing related work, weak experimental design, and unclear writing, is a genuine workflow improvement.

Using AI reviews as actual peer review outputs is risky. The risk isn't that they're always wrong; they're often not. The risk is that they're wrong in non-obvious ways: confident, well-structured reviews containing subtle errors about novelty or significance that a non-expert area chair can't detect.

The Broader Pattern: AI as Research Infrastructure

Both PaperBanana and ScholarPeer fit a broader pattern I'm watching: AI being used not to do research, but to handle the infrastructure around research. Figure generation, review writing, related work summarization, code generation for experiments: these are the parts of academic work that consume time without requiring the core intellectual contribution.

If these tools genuinely save 20–30% of a researcher's time on non-core tasks, that's a real productivity multiplier. The question is whether researchers use that time to do more interesting work, or to produce more papers at the same intellectual depth.

In my own workflow at Dell's CSG CTO Lab, I use LLMs heavily for code generation, literature search, and drafting โ€” but not for the core ideas. The ideas still require sitting with a problem until something clicks. No agent has replaced that part.