
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Andong Chen Wenxin Zhu Qiuyu Ding Yuchen Song Muyun Yang Tiejun Zhao
Harbin Institute of Technology
Benchmark results figure

📖 TL;DR

We propose "Thinking with Comics", a reasoning paradigm that uses sequential comic panels as an intermediate reasoning medium. Comics bridge the gap between static images and videos: they preserve temporal logic, embedded text, and visual storytelling, which enhances VLM reasoning.

SEE IT IN ACTION Example Comic Outputs Across Different Tasks

GSM8K Example 1
GSM8K

Math problem solved through visual storytelling

GSM8K Example 2
GSM8K

Slice-of-life Style

GSM8K Example 3
GSM8K

Multi-step arithmetic reasoning visualized

MATH500 Example 1
MATH500

Slice-of-life Style

MATH500 Example 2
MATH500

Role-playing Style

MATH500 Example 3
MATH500

Cartoon Style

CHAPTER 1

Abstract

Chain-of-Thought (CoT) has significantly improved the reasoning abilities of large language models, making "thinking with text" a core reasoning paradigm. Recent advances have extended this to "thinking with images" and "thinking with video," but each modality still has clear limitations.

To address these limitations, we introduce "Thinking with Comics". Comics are a distinctive narrative form that, like video, retains temporal logic, embedded text, dynamic reasoning, and imagination, but with higher information density and lower computational cost than video-based reasoning.

Our experiments demonstrate that comics serve as an effective reasoning medium that outperforms image-based reasoning on multi-step temporal tasks while requiring significantly lower cost than video-based approaches.

KEY FINDINGS

What We Discovered

🎭 Narrative Alignment

Detective-style comics provide a clearer causal structure for logical reasoning tasks, improving average accuracy by 28.5 points over the documentary style.

⚡ Efficiency

At only 13.4% of the generation cost, comics surpass the accuracy of video-based reasoning.

💬 Text Anchoring

Embedded text in comics provides semantic anchoring, improving accuracy by +18.1% on cultural understanding tasks.

🔗 Gutter Reasoning

Models actively use the logic between panels (the gutters): accuracy drops by 2.4% when panels are shuffled.

CHAPTER 2

Method

"Thinking with Comics" can be instantiated through two paths:

Path I: End-to-End Visualized Reasoning

Question → Comic Generation → Answer

Comic generation is the reasoning process. The answer is extracted from the final panel.

Path II: Comic as Conditioning Context

Question → Comic + VLM → Answer

The comic serves as an explicit intermediate reasoning representation, which a VLM processes jointly with the question to produce the answer.
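The two paths can be sketched as plain function compositions. This is only an illustrative sketch: the `Comic` class, the three callable types, and the `toy_*` stand-ins are hypothetical names introduced here, not the authors' implementation or any real model API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Comic:
    """A generated comic, reduced here to the text embedded in each panel."""
    panel_texts: List[str]


# Hypothetical model interfaces (assumptions for illustration, not real APIs).
ComicGenerator = Callable[[str], Comic]    # question -> comic
FinalPanelReader = Callable[[Comic], str]  # comic -> answer read off the last panel
VLM = Callable[[str, Comic], str]          # (question, comic) -> answer


def path_one(question: str, generate: ComicGenerator, read_final: FinalPanelReader) -> str:
    """Path I: generating the comic is the reasoning; read the answer from the final panel."""
    return read_final(generate(question))


def path_two(question: str, generate: ComicGenerator, vlm: VLM) -> str:
    """Path II: the comic is intermediate context that a VLM reads together with the question."""
    comic = generate(question)
    return vlm(question, comic)


# Toy stand-ins so the sketch runs without any model.
def toy_generate(question: str) -> Comic:
    return Comic(["Tom has 3 apples.", "He buys 2 more.", "Answer: 5"])


def toy_read_final(comic: Comic) -> str:
    return comic.panel_texts[-1].split("Answer:")[-1].strip()


def toy_vlm(question: str, comic: Comic) -> str:
    return toy_read_final(comic)


print(path_one("How many apples does Tom have?", toy_generate, toy_read_final))
print(path_two("How many apples does Tom have?", toy_generate, toy_vlm))
```

The only structural difference between the paths in this sketch is who produces the answer: the comic itself in Path I, or a separate VLM that sees both the question and the comic in Path II.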

🎨 Narrative Style Alignment

Different comic styles act as "Visual System Prompts" for different task types:

🕵️ Detective Style: best for logical reasoning (+32 points on GSM8K)
🎬 Documentary: educational presentation (baseline)
☕ Slice-of-Life: relatable scenarios (+19.1 points on average)
CHAPTER 3

Results

📊 Main Results on Reasoning and Context Understanding Benchmarks

* denotes results from Tong et al. (2025); ★ indicates evaluation on 50 sampled instances.

| Category | Model / Method | Notes | MATH-500 | GSM8K | MathVista | DocVQA | CulturalBench (E/H) |
|---|---|---|---|---|---|---|---|
| MLLM | GPT-5.2 | direct | 99.0 | 100.0 | 67.5 | 72.8 | 88.3 / 84.4 |
| MLLM | Gemini-3-Pro | direct | 100.0 | 99.0 | 71.5 | 94.5 | 90.4 / 90.0 |
| MLLM | Claude-Sonnet 4.5 | direct | 99.0 | 100.0 | 72.5 | 92.6 | 87.2 / 76.5 |
| Reasoning LLM | DeepSeek-R1 | CoT | 90.4 | 96.1 | - | - | 87.2 / 85.1 |
| Reasoning LLM | Qwen3-235B-A22B | CoT | 92.4 | 94.3 | - | - | 83.1 / 82.5 |
| Think with Image | TWI-Generated Photo | G-t-R | 70.2 | 69.4 | 63.6 | 67.5 | 69.7 / 71.4 |
| Think with Image | DREAMLLM | G-t-R | 12.6 | 18.4 | 35.9 | 65.5 | 52.3 / 42.8 |
| Think with Video | Sora 2 | V-o-T | 67.0* | 75.7* | 67.6★ | 50.5★ | 60.0★ / 70.0★ |
| Think with Comic | TwC (Ours) - Path I | direct | 90.0 | 100.0 | 75.0 | 92.8 | 70.0 / 80.5 |
| Think with Comic | TwC (Ours) - Path II | G-t-R | 92.3 | 95.4 | 85.8 | 99.4 | 88.3 / 82.2 |

🎭 Role-playing Narrative Alignment

Different comic styles act as "Visual System Prompts" for different task types:

| Style (Visual Prompt) | MathVista | GSM8K | Avg. Δ |
|---|---|---|---|
| Documentary (Base) | 60.0 | 68.0 | - |
| Slice-of-Life | 80.0 | 86.3 | +19.1 |
| 🕵️ Detective Style | 85.0 | 100.0 | +28.5 |

The detective style significantly outperforms the documentary baseline: average accuracy rises from 64.0 to 92.5, a +44.5% relative improvement. This confirms that role-playing narrative style is not merely visual decoration but a potent Visual System Prompt.
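The absolute and relative gains can be recomputed from the table. The sketch below only replays that arithmetic; the accuracies are copied from the style table, and the function names are illustrative.

```python
# Accuracies (%) copied from the style table, as (MathVista, GSM8K).
scores = {
    "Documentary": (60.0, 68.0),
    "Slice-of-Life": (80.0, 86.3),
    "Detective": (85.0, 100.0),
}


def mean(values):
    return sum(values) / len(values)


def absolute_gain(style, base="Documentary"):
    """Difference in average accuracy, in percentage points."""
    return mean(scores[style]) - mean(scores[base])


def relative_gain(style, base="Documentary"):
    """Absolute gain as a percentage of the baseline average."""
    return 100.0 * absolute_gain(style, base) / mean(scores[base])


detective_abs = absolute_gain("Detective")
detective_rel = relative_gain("Detective")
slice_abs = absolute_gain("Slice-of-Life")

print(f"{detective_abs:.1f} {detective_rel:.1f}")  # prints "28.5 44.5"
```

The +28.5 figure is therefore a gain in percentage points, while +44.5% is the same gain expressed relative to the documentary average of 64.0. The slice-of-life gain computes to about 19.15 points, which the table reports as +19.1.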

📈 Scaling the Panels

The performance-cost curve across different panel counts \(N\). Accuracy plateaus at \(N \in [4, 6]\). On the MATH-500 dataset, token cost ranges between 1100 and 1300. The shaded green region indicates the Pareto-optimal range.

📊 Panel Distribution Across Task Difficulties

Frequency distribution of generated panels across tasks with varying difficulty levels. The shift to the right indicates the model's adaptive allocation of reasoning steps for complex tasks.

⏱️ Effect of Temporal Structure on Reasoning

Model accuracy under two controlled manipulations of comic panel sequences: Complete Shuffle (blue), which disrupts temporal order, and Intermediate Deletion (orange), which removes intermediate panels while preserving relative order. Accuracy consistently decreases as perturbation intensity increases, with deletion causing a larger drop than shuffling.
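The two manipulations can be sketched as operations on an ordered list of panels. This is a minimal sketch, not the authors' code: the function names are hypothetical, and treating the first and last panels as non-deletable is an assumption about what "intermediate" means here.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def complete_shuffle(panels: Sequence[T], rng: random.Random) -> List[T]:
    """Disrupt temporal order: return all panels in a random order."""
    shuffled = list(panels)
    rng.shuffle(shuffled)
    return shuffled


def intermediate_deletion(panels: Sequence[T], n_delete: int, rng: random.Random) -> List[T]:
    """Drop n_delete interior panels; keep first and last (assumption), preserve relative order."""
    interior = list(range(1, len(panels) - 1))
    if n_delete > len(interior):
        raise ValueError("cannot delete more panels than there are intermediate ones")
    dropped = set(rng.sample(interior, n_delete))
    return [panel for i, panel in enumerate(panels) if i not in dropped]


rng = random.Random(0)  # fixed seed so a perturbed sequence is reproducible
panels = ["p1", "p2", "p3", "p4", "p5", "p6"]
shuffled = complete_shuffle(panels, rng)
deleted = intermediate_deletion(panels, 2, rng)
```

In this sketch `n_delete` is the perturbation intensity for deletion. How the paper varies intensity for shuffling is not stated in this section, so the shuffle is implemented as a full permutation only.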

💬 Ablation on Textual Anchoring

Ablation results on textual anchoring. Embedded text (bubbles, narration) provides precise semantic cues that significantly improve accuracy across all benchmarks.

⚡ Efficiency Analysis of TwC vs Think with Video

We define the visual signal generation cost function \(C(\cdot)\):

Video: \(C_{\text{video}}(t) = \alpha \cdot t\), with \(\alpha = \$0.10/\text{s}\) (time-dependent)
TwC: \(C_{\text{comic}} = \beta\), with \(\beta = \$0.134/\text{img}\) (constant)

Comparison of the two generation cost models. Video generation cost \(C_{\text{video}}\) scales linearly with task duration because of temporal redundancy, whereas TwC maintains a low, constant cost \(C_{\text{comic}}\) regardless of the event's temporal length. The shaded area represents the economic advantage of our approach. The break-even point, where \(\alpha t = \beta\), is at \(t \approx 1.34\) s.

Video cost (10 s task): $1.00 → TwC cost (same task): $0.134 (-86.6% cost)
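The break-even point and the 10-second comparison follow directly from the two cost functions. A small sketch, using the stated rates and otherwise illustrative function names:

```python
ALPHA = 0.10   # video generation cost in dollars per second (from the text)
BETA = 0.134   # comic generation cost in dollars per image (from the text)


def video_cost(duration_s: float) -> float:
    """Linear cost model for video: alpha * t."""
    return ALPHA * duration_s


def comic_cost() -> float:
    """Constant cost model for one comic image: beta."""
    return BETA


def break_even_seconds() -> float:
    """Duration at which a video costs the same as one comic: beta / alpha."""
    return BETA / ALPHA


def cost_reduction_percent(duration_s: float) -> float:
    """Saving of the comic relative to a video of the given duration, in percent."""
    return 100.0 * (1.0 - comic_cost() / video_cost(duration_s))


print(f"{break_even_seconds():.2f} s")           # prints "1.34 s"
print(f"{cost_reduction_percent(10.0):.1f} %")   # prints "86.6 %"
```

Under this model any task longer than about 1.34 s is cheaper to render as a single comic than as a video, and the saving grows with duration because the comic cost does not depend on \(t\).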
APPENDIX

Theoretical & Empirical Analysis

Why Comics Are a Privileged Visual Reasoning Medium?

📐 Theoretical Justification

Definition: Information-Efficiency

We characterize an intermediate representation \(z\) by its information-efficiency for task solving:

\[ \eta(z) \triangleq \frac{I(a; z \mid q)}{C(z)} \]

where \(a\) is the answer, \(q\) is the question, \(I(\cdot\,;\cdot \mid \cdot)\) is conditional mutual information, and \(C(z)\) is the media generation cost.

Proposition 1: Comics Outperform Single Images

If the answer \(a\) depends on multi-step temporal relations in the latent trajectory \(s_{1:T}\), then any single snapshot \(x = h(s_t)\) may discard relevant states:

\[ I(a; s_{1:T} \mid q) > I(a; x \mid q) \]

Comics represent a structured summary \(z_{\text{comic}} = (c_{1:K}, \tau)\) with \(K\) panels and embedded text \(\tau\). By the chain rule:

\[ I(a; z_{\text{comic}} \mid q) = I(a; c_{1:K} \mid q) + I(a; \tau \mid q, c_{1:K}) \]

The second term captures the additional semantic anchoring channel from embedded text.
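A toy case makes the snapshot inequality concrete. The sketch below is an illustration constructed here, not an experiment from the paper: the trajectory is two independent fair bits, the answer is their XOR, and the question \(q\) is held fixed so the conditioning on it is dropped. `mutual_information` is a hypothetical helper that enumerates equally likely outcomes.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """I(A; X) in bits, given a list of equally likely (a, x) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_x = Counter(x for _, x in pairs)
    return sum(
        (c / n) * log2((c / n) / ((count_a[a] / n) * (count_x[x] / n)))
        for (a, x), c in joint.items()
    )


def answer(trajectory):
    """The answer depends on both steps of the trajectory (XOR of the two bits)."""
    return trajectory[0] ^ trajectory[1]


# All four trajectories (s1, s2) are equally likely.
trajectories = list(product([0, 1], repeat=2))

# Observing the full trajectory versus a single snapshot x = s2.
full = mutual_information([(answer(s), s) for s in trajectories])
snapshot = mutual_information([(answer(s), s[1]) for s in trajectories])

print(full, snapshot)
```

Here the full trajectory carries 1 bit about the answer while the single snapshot carries none, which is the extreme case of the inequality: the answer depends on the relation between steps, and no individual state reveals it.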

Proposition 2: Comics Are More Efficient Than Videos Under a Budget

For a video \(v = (x_1, \dots, x_T)\) with \(T\) frames:

\[ I(a; v \mid q) = \sum_{t=1}^{T} I(a; x_t \mid q, x_{<t}) \]

Due to temporal redundancy, \(I(a; v \mid q)\) grows sublinearly with \(T\), while video cost grows linearly.

Comics select \(K \ll T\) key states to maximize task-relevant information:

\[ c_{1:K} \approx \arg\max_{|S| = K} I(a; x_S \mid q) \]

This leads to a higher \(\eta(z_{\text{comic}})\) than \(\eta(v)\) at the same budget.
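One way to see why the efficiency gap must eventually open is a simple bound. It is a sketch under two assumptions made here for illustration: video cost is linear in the number of frames, \(C(v) = \kappa T\) for some per-frame constant \(\kappa\) (a symbol introduced here; the efficiency analysis states the rate per second as \(\alpha\)), and the answer has finite conditional entropy \(H(a \mid q)\). The comic cost is the constant \(\beta\) from the efficiency analysis.

```latex
\begin{align*}
\eta(v) &= \frac{I(a; v \mid q)}{C(v)}
         \le \frac{H(a \mid q)}{\kappa T}
         \xrightarrow{\; T \to \infty \;} 0, \\
\eta(z_{\text{comic}}) &= \frac{I(a; z_{\text{comic}} \mid q)}{\beta}
         \quad \text{(independent of } T \text{)}.
\end{align*}
```

The bound uses only the fact that mutual information cannot exceed \(H(a \mid q)\). Under these assumptions, any comic that carries some task-relevant information, \(I(a; z_{\text{comic}} \mid q) > 0\), is more information-efficient than a sufficiently long video.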

🔬 Empirical Analysis A.1: Prompt-Induced Structural Stability

This experiment examines whether comics, compared to non-comic visual styles, more naturally and stably support multi-panel generation.

Comic Condition: "Draw a four-panel comic to solve the problem."
Non-Comic Condition: "Draw a four-step visual storyboard in a realistic style."

| Metric | Dataset | Comic | Non-Comic | Improvement |
|---|---|---|---|---|
| Layout Success Rate (%) (Panel Consistency) | MATH-500 | 95.0 | 70.0 | +25.0 |
| Layout Success Rate (%) (Panel Consistency) | MathVista | 90.0 | 65.0 | +25.0 |
| Reasoning Accuracy (%) | MATH-500 | 75.0 | 60.0 | +15.0 |
| Reasoning Accuracy (%) | MathVista | 70.0 | 55.0 | +15.0 |

Table A.1. Comparison of structural stability and reasoning accuracy between Comic and Non-Comic prompts. Comic prompts consistently induce structurally complete multi-panel layouts, while Non-Comic instructions frequently suffer from layout collapse.

🔬 Empirical Analysis A.2: Global vs. Incremental Visual Reasoning

This experiment compares Global Comic generation (complete multi-panel comic in a single pass) and Incremental image chaining (panels generated sequentially conditioned on previous outputs).

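| Benchmark | Method | ACC (%) ↑ | Logic ↑ | State ↑ | Quality ↑ |
|---|---|---|---|---|---|
| MATH-500 | Incremental | 80.0 | 4.17 | 3.72 | 3.58 |
| MATH-500 | Global (Ours) | 95.0 | 4.86 | 4.67 | 4.61 |
| MathVista | Incremental | 50.0 | 3.50 | 3.50 | 3.40 |
| MathVista | Global (Ours) | 85.0 | 4.47 | 4.45 | 4.58 |
| Average | Incremental | 65.0 | 3.83 | 3.61 | 3.49 |
| Average | Global (Ours) | 90.0 | 4.67 | 4.56 | 4.59 |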

Table A.2. Human evaluation results comparing Global and Incremental generation. We evaluate Accuracy (ACC) and three structural metrics (1-5 scale): Logic (reasoning flow), State (consistency between panels), and Quality (visual-textual fidelity). Global generation shows significant superiority in both objective performance and structural coherence.

Qualitative Comparison

Global Comic vs Incremental Non-Comic Comparison

Figure A.1. Qualitative comparison between (a) Global Comic (Ours) and (b) Incremental Non-Comic generation for a mathematical reasoning task (finding divisors of 196). Global generation maintains a consistent character (FactoBot) and smooth logical flow, whereas the incremental baseline exhibits static scenes and lacks narrative coherence.

💡 Key Insight

Global generation yields significantly stronger cross-panel coherence with more stable entity representations and smoother reasoning progression, whereas incremental generation suffers from error accumulation. This suggests that treating comics as a holistic structured representation is crucial for preserving multi-step reasoning quality.

REFERENCE

Citation

@article{chen2025thinkingcomics,
  title   = {Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling},
  author  = {Chen, Andong and Zhu, Wenxin and Ding, Qiuyu and Song, Yuchen and Yang, Muyun and Zhao, Tiejun},
  journal = {arXiv preprint arXiv:2602.02453},
  year    = {2026}
}