Creating high-quality academic presentation videos is notoriously time-consuming. Researchers often spend hours designing slides, recording voiceovers, editing footage, and syncing visuals—just to produce a 2–10 minute video for conferences, social media, or supplementary materials. What if this entire process could be automated while preserving scientific fidelity and communicative clarity?
Enter Paper2Video, the first end-to-end system that automatically generates complete academic presentation videos directly from scientific papers. Given a LaTeX paper project, a reference speaker image, and an audio clip, Paper2Video produces a polished video featuring synchronized slides, subtitles, cursor guidance, synthetic speech, and, optionally, a realistic talking-head presenter.
Developed by Show Lab at the National University of Singapore, Paper2Video isn’t just another video generator. It’s purpose-built for the unique demands of scholarly communication: dense multimodal content (text, figures, tables), precise information alignment, and audience accessibility. At its core lies PaperTalker, a multi-agent framework that coordinates specialized modules to ensure the final video is both informative and engaging.
Why Researchers Need Paper2Video
Academic videos serve a fundamentally different purpose than general-purpose videos: they must accurately convey complex ideas, highlight key contributions, and amplify visibility—not just look visually appealing. Yet traditional video-generation tools fall short because they lack understanding of scientific structure and discourse.
Paper2Video bridges this gap by:
- Eliminating manual labor: No more hours spent on slide design or video editing.
- Preserving scholarly integrity: Core ideas from the paper are faithfully reflected in slides and narration.
- Democratizing video creation: Non-native speakers or researchers without design/editing skills can produce professional-grade videos.
- Accelerating dissemination: Share your work faster on platforms like YouTube, X (Twitter), or conference portals.
How Paper2Video Works
Paper2Video follows a modular, agent-based pipeline called PaperTalker, which processes a scientific paper through five tightly integrated stages:
- Slide Generation: Converts LaTeX source into Beamer-style slides, intelligently extracting sections, figures, and equations.
- Layout Refinement: Uses a novel tree search–based visual choice mechanism to optimize slide layouts for clarity and visual flow.
- Subtitling & Speech Synthesis: Generates accurate subtitles aligned with synthesized speech derived from paper content.
- Cursor Grounding: Animates a cursor to guide viewer attention to key elements during playback.
- Talking-Head Rendering (Optional): Integrates with Hallo2 to generate a lifelike presenter using your reference image and voice.
Critically, the system supports parallelized slide-wise generation, dramatically improving efficiency without sacrificing quality.
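The five stages above can be sketched as a simple orchestration loop. The function names and data shapes below are illustrative placeholders, not the actual PaperTalker API; the sketch only shows how slide-wise stages compose and why they parallelize cleanly:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions; the real PaperTalker modules differ.
def generate_slides(latex_root):
    # Parse LaTeX sections/figures into per-slide content stubs.
    return [{"title": f"Slide {i}", "body": "..."} for i in range(1, 4)]

def refine_layout(slide):
    slide["layout"] = "refined"  # stand-in for the tree search-based layout choice
    return slide

def add_subtitles_and_speech(slide):
    slide["subtitle"] = slide["title"]  # stand-in for subtitle/TTS alignment
    return slide

def ground_cursor(slide):
    slide["cursor_track"] = [(0.5, 0.5)]  # placeholder attention path
    return slide

def build_slide(slide):
    # Each slide's stages are independent of other slides,
    # which is what makes slide-wise parallelism possible.
    return ground_cursor(add_subtitles_and_speech(refine_layout(slide)))

def run_pipeline(latex_root):
    slides = generate_slides(latex_root)
    with ThreadPoolExecutor() as pool:  # parallel slide-wise generation
        return list(pool.map(build_slide, slides))

video_plan = run_pipeline("/latex/project")
```

The talking-head stage is omitted here because it operates on the whole narration rather than per slide.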
Getting Started: From Paper to Video in Minutes
Using Paper2Video is straightforward if you have the right inputs and setup:
Input Requirements
- A complete LaTeX paper project (not just a PDF)
- A square reference image of the speaker (for talking-head mode)
- A short (~10-second) reference audio clip of the speaker’s voice
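A quick pre-flight check can catch input problems before a long pipeline run. This is an illustrative helper, not part of Paper2Video itself; it assumes a WAV reference clip and only verifies existence and rough duration:

```python
import pathlib
import wave

def check_inputs(latex_root, ref_img, ref_audio):
    """Pre-flight checks mirroring Paper2Video's input requirements (illustrative)."""
    root = pathlib.Path(latex_root)
    # A full LaTeX project must contain at least one .tex source file.
    if not any(root.rglob("*.tex")):
        raise ValueError("latex_root must contain a LaTeX project, not just a PDF")
    if not pathlib.Path(ref_img).is_file():
        raise ValueError("reference speaker image not found")
    # The reference audio should be a short (~10-second) clip of the speaker.
    with wave.open(str(ref_audio)) as wav:
        seconds = wav.getnframes() / wav.getframerate()
    if not 3 <= seconds <= 30:
        raise ValueError(f"reference audio is {seconds:.1f}s; aim for ~10s")
    return True
```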
Two Generation Modes
- Fast Mode (pipeline_light.py): Generates everything except the talking head; ideal for rapid prototyping or low-resource environments.
- Full Mode (pipeline.py): Includes a talking-head video rendered with Hallo2; requires a separate Python environment but delivers maximum realism.
Technical Setup
- APIs: Best results come from commercial models such as GPT-4.1 or Gemini 2.5 Pro for the language and vision components. Open-source alternatives like Qwen are also supported.
- Hardware: Minimum recommended GPU is an NVIDIA A6000 (48GB VRAM) due to multimodal processing demands.
- Dependencies: Separate conda environments are advised if using talking-head generation to avoid package conflicts.
A typical command for fast mode looks like this:
python pipeline_light.py --model_name_t gpt-4.1 --model_name_v gpt-4.1 --result_dir /output/path --paper_latex_root /latex/project --ref_img speaker.png --ref_audio voice.wav
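If you are batching many papers, the same invocation can be assembled programmatically. The flags below are taken directly from the fast-mode command above; everything else (function names, defaults) is illustrative:

```python
import subprocess
import sys

def build_cmd(latex_root, out_dir, ref_img, ref_audio, model="gpt-4.1"):
    # Mirrors the fast-mode flags shown above; swap the model name as needed.
    return [
        sys.executable, "pipeline_light.py",
        "--model_name_t", model,
        "--model_name_v", model,
        "--result_dir", out_dir,
        "--paper_latex_root", latex_root,
        "--ref_img", ref_img,
        "--ref_audio", ref_audio,
    ]

def make_video(*args, **kwargs):
    # check=True raises if the pipeline exits with an error.
    subprocess.run(build_cmd(*args, **kwargs), check=True)
```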
Beyond Visuals: How Paper2Video Evaluates Quality
Unlike generic video metrics (e.g., FVD or CLIP score), Paper2Video introduces four custom evaluation protocols designed specifically for academic content:
- Meta Similarity: Measures semantic alignment between the original paper and the video’s spoken/narrated content.
- PresentArena: Assesses how well the video communicates key claims compared to human-made baselines.
- PresentQuiz: Tests audience comprehension by generating QA pairs from the paper and evaluating video-based answers.
- IP Memory: Evaluates whether the video preserves the author’s intellectual identity and contributions.
These metrics ensure that success isn’t just about smooth animations—it’s about effective knowledge transfer.
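To make the PresentQuiz idea concrete, here is a toy version: generate QA pairs from the paper, answer them from the video, and score agreement. The scoring below is a plain case-insensitive string match, not the actual LLM-based protocol, and the QA pairs are invented for illustration:

```python
def present_quiz_score(paper_qa, video_answers):
    """Fraction of paper-derived questions the video answers correctly (toy metric).

    paper_qa:      {question: reference answer} pairs generated from the paper
    video_answers: {question: answer} pairs produced by watching the video
    """
    if not paper_qa:
        return 0.0
    correct = sum(
        1 for q, ref in paper_qa.items()
        if video_answers.get(q, "").strip().lower() == ref.strip().lower()
    )
    return correct / len(paper_qa)

qa = {"What framework powers Paper2Video?": "PaperTalker",
      "Which model renders the talking head?": "Hallo2"}
answers = {"What framework powers Paper2Video?": "papertalker",
           "Which model renders the talking head?": "unknown"}
print(present_quiz_score(qa, answers))  # → 0.5
```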
Ideal Use Cases
Paper2Video shines in real-world scenarios such as:
- Conference submissions: Automate video supplements required by NeurIPS, CVPR, ACL, and other venues.
- Rapid research sharing: Quickly create videos for X, LinkedIn, or lab websites to boost visibility.
- Teaching and outreach: Generate accessible explanations of technical papers for students or interdisciplinary collaborators.
- Accessibility support: Help researchers with limited English fluency or video-editing experience produce professional materials.
Limitations and Practical Notes
While powerful, Paper2Video has important constraints to consider:
- LaTeX dependency: Only accepts full LaTeX source projects—not PDFs or Word documents.
- Talking-head complexity: Requires installing Hallo2 in a separate environment and managing additional dependencies.
- API reliance: Highest quality depends on commercial LLMs/VLMs (GPT-4.1, Gemini), though local models are partially supported.
- Hardware demands: The full pipeline is resource-intensive, potentially limiting use on consumer-grade GPUs.
These trade-offs reflect the system’s focus on quality and fidelity over minimalism—making it best suited for labs, institutions, or cloud-based deployments.
Summary
Paper2Video redefines what’s possible in automated academic communication. By transforming raw LaTeX papers into complete, multi-channel presentation videos—with slides, speech, subtitles, cursor guidance, and even a virtual presenter—it removes a major bottleneck in research dissemination.
Backed by a purpose-built benchmark and evaluation suite, it’s not just generating videos—it’s ensuring they teach, inform, and represent the original work faithfully. For researchers, educators, and institutions looking to scale scholarly outreach without sacrificing rigor, Paper2Video offers a compelling, ready-to-deploy solution.
With its open-source code, public dataset, and modular design, it’s not just a tool—it’s a foundation for the future of academic video.