- Training-free agentic system combining VLMs and LLMs for sketch-to-diagram conversion, addressing diffusion models' limitations in spatial precision and symbolic structure
- Three-component iterative loop: Critic VLM identifies 1-3 key discrepancies, multiple LLMs generate candidate solutions with diverse strategies, Judge VLM selects best result
- Prioritizes qualitative reasoning over numerical estimates, preserving global constraints (alignment, connectivity) and enabling human-in-the-loop corrections
- Generates editable SVG programs rather than raster images, enabling extensibility to presentation tools via APIs
- Open-sourced implementation demonstrates superior performance vs GPT-5 and Gemini-2.5-Pro on flowchart reconstruction
Quick Comparison Table
| Paper | Year | Domain | Method | Dataset | Benchmark | Key Contribution |
|---|---|---|---|---|---|---|
| See it. Say it. Sorted | 2025 | Flowcharts / Sketch-to-Diagram | Agentic VLM+LLM (Critic-Generate-Judge) | ❌ No | ❌ No | Training-free SVG generation with iterative refinement |
| DeTikZify | 2024 | Scientific Figures / TikZ Graphics | Multimodal LLM (LLaVA/Idefics3) + MCTS Inference | ✅ Yes (DaTikZv2: 360K+, SketchFig, MetaFig) | ✅ Yes (TikZ compilation, visual fidelity) | First sketch/figure-to-TikZ synthesis with MCTS refinement |
| AutomaTikZ | 2024 | Scientific Figures / TikZ Graphics | LLaMA + CLiMA (CLIP-augmented) with LoRA fine-tuning | ✅ Yes (DaTikZ: 120K from arXiv, TeX SE, curated) | ✅ Yes (CLIPScore, BWS human evaluation, code metrics) | First large-scale text-to-TikZ with multimodal CLIP integration |
| SGP-Bench | 2025 | SVG & CAD Programs / LLM Reasoning | LLM evaluation benchmark with Symbolic Instruction Tuning (SIT) | ✅ Yes (1,085 SVG + 2,400 CAD programs; 72K SIT pairs) | ✅ Yes (Semantic understanding, SE(2) consistency, SGP-MNIST) | First benchmark for LLM "visual imagination" from symbolic programs; SIT improves general reasoning |
| Draw with Thought | 2025 | Scientific Diagrams / Image-to-mxGraph XML | Training-free MLLM with cognitive CoT (Perceptual → Semantic → XML) | ❌ No | ✅ Yes (Plot2XML: 247 diagrams; CLIP, DINO, FID, human eval) | First training-free cognitively guided framework for scientific diagrams; 89% XML validity; +10-20% semantic alignment vs GPT-4o/Claude/Gemini |
| StarVector | 2024 | SVG Vectorization / Image-to-SVG & Text-to-SVG | Multimodal LLM (CLIP/SigLip + StarCoder) with code generation | ✅ Yes (SVG-Stack: 2.1M samples; 4M text-image-SVG triplets) | ✅ Yes (SVG-Bench: 10 datasets, 3 tasks; DinoScore metric) | First to reframe vectorization as code generation; generates semantic primitives (3-6× more compact); diagram generation with text |
| CircuitSense | 2025 | Circuit Diagrams / Visual-Symbolic Engineering Reasoning | Evaluation benchmark (tests GPT-4o, Claude, Gemini, Qwen, LLaMA-VL) | ❌ No (evaluation-only) | ✅ Yes (Hierarchical visual-symbolic tasks; thousands of annotated circuits) | First benchmark unifying visual comprehension and symbolic circuit reasoning; reveals MLLMs fail at topology extraction and engineering analysis |
| MAPS | 2025 | Circuit Analysis / SPICE Simulation + Multi-Modal Reasoning | Physical Perception Model (PPM: CogVLM-17B) + NgSPICE Simulator + Chain-of-Simulation | ✅ Yes (ppm-syn-lprc: 20K synthetic pairs via CircuitTikz/LaTeX) | ✅ Yes (SimpleCircuitEval: 79 college-level LPRC problems) | First to bridge diagram perception and physics simulation via SPICE; 4.3× improvement over GPT-4V (7.6% → 32.9%); uses simulation results as intermediate reasoning steps |
See it. Say it. Sorted: Agentic System for Compositional Diagram Generation
Key Insights
Methods & Approach
- Critic VLM (Gemini-2.5-Pro): Compares target sketch with current image, provides targeted qualitative modifications focusing on spatial relationships
- Multi-candidate LLM Generation (Gemini-2.5-Flash): Generates multiple SVG solutions using conservative, moderate, aggressive, alternative, and focused strategies for exploration-exploitation balance
- Judge VLM (Gemini-2.5-Pro): Selects best candidate or reverts if no improvement; ensures stable convergence (typically 3 iteration steps)
- SVG Grammar: Supports basic primitives (circles, rectangles, ellipses, triangles) with parameters for position, scale, colors, stroke properties, rotation
- Constraint Preservation: LLMs maintain alignment and connectivity through instruction-based guidance
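To make the loop concrete, here is a minimal Python sketch of the Critic-Generate-Judge iteration. The helper callables (`render_svg`, `critic_vlm`, `generate_candidate`, `judge_vlm`) are assumptions standing in for the paper's prompts and model calls, which are not reproduced here.

```python
from typing import Callable, List

def refine_sketch_to_svg(
    target_sketch: bytes,
    initial_svg: str,
    render_svg: Callable[[str], bytes],                  # SVG source -> rasterized PNG bytes
    critic_vlm: Callable[[bytes, bytes], str],           # (target, current) -> 1-3 key discrepancies
    generate_candidate: Callable[[str, str, str], str],  # (svg, critique, strategy) -> candidate SVG
    judge_vlm: Callable[[bytes, List[bytes]], int],      # (target, options) -> index of best option
    max_iters: int = 3,
) -> str:
    """Critic -> multi-candidate generation -> Judge, repeated until the Judge sees no improvement."""
    current_svg = initial_svg
    strategies = ["conservative", "moderate", "aggressive", "alternative", "focused"]
    for _ in range(max_iters):
        current_png = render_svg(current_svg)
        critique = critic_vlm(target_sketch, current_png)   # qualitative critique, not pixel estimates
        candidates = [generate_candidate(current_svg, critique, s) for s in strategies]
        # The Judge sees every candidate plus the unchanged image; choosing the last index
        # means "revert / keep current", which keeps convergence stable.
        options = [render_svg(c) for c in candidates] + [current_png]
        best = judge_vlm(target_sketch, options)
        if best == len(candidates):
            return current_svg
        current_svg = candidates[best]
    return current_svg
```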
Dataset/Benchmark
No dedicated dataset proposed. Evaluation conducted on 10 flowchart sketches derived from published papers.
Evaluation approach: Qualitative comparison against GPT-5 and Gemini-2.5-Pro baseline systems, focusing on layout reconstruction, structural accuracy, and avoiding unwanted text insertion.
DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ
Key Insights
- First multimodal language model to automatically convert sketches and scientific figures into semantics-preserving TikZ graphics programs, eliminating manual figure recreation
- Introduces MCTS-based inference algorithm enabling iterative refinement without additional training, allowing extended generation (e.g., 10 min) to produce and score multiple candidates
- Creates three novel datasets: DaTikZv2 (360K+ TikZ graphics, largest to date), SketchFig (sketch-figure pairs), and MetaFig (diverse scientific figures with metadata)
- Outperforms commercial systems (GPT-4V, Claude 3) in both automatic and human evaluation of TikZ synthesis quality
- Generates vector graphics code rather than raster images, enabling editability, scalability, and semantic preservation
- DeTikZify v2.5-8B incorporates reinforcement learning from self-feedback (RLSF); TikZero enables zero-shot text-conditioned generation
Methods & Approach
- Model Architecture (v1): Built on LLaVA and AutomaTikZ foundations; accepts visual inputs (sketches/figures) and generates TikZ code
- Model Architecture (v2): Uses Idefics 3 (8B Llama3-based) as backbone for improved multimodal understanding
- Training Data: MetaFig and DaTikZv2 datasets plus synthetically generated sketches learned from SketchFig hand-drawn examples
- MCTS Inference: Adapted from VerMCTS methodology; generates multiple TikZ candidates, validates compilation, scores visual fidelity, selects best variant
- TikZero Adapters: Parameter-efficient fine-tuning (10B params) enabling zero-shot text-conditioning for guided synthesis
- Validation Pipeline: Automatic TikZ compilation checking and rasterization for output verification
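As an illustration of the compile-check-and-score idea, here is a much-simplified best-of-N sketch in Python; the actual system runs MCTS over partial TikZ programs with its own visual-fidelity scorer, so `sample_tikz` and `visual_score` below are assumed callables rather than the paper's components.

```python
import pathlib
import subprocess
import tempfile
from typing import Callable, Optional, Tuple

def compile_tikz(tikz_code: str) -> Optional[bytes]:
    """Try to compile a TikZ snippet with pdflatex; return PDF bytes, or None if it fails."""
    doc = ("\\documentclass[tikz]{standalone}\n"
           "\\begin{document}\n" + tikz_code + "\n\\end{document}\n")
    with tempfile.TemporaryDirectory() as tmp:
        tex = pathlib.Path(tmp, "fig.tex")
        tex.write_text(doc)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-output-directory", tmp, str(tex)],
            capture_output=True)
        pdf = pathlib.Path(tmp, "fig.pdf")
        return pdf.read_bytes() if result.returncode == 0 and pdf.exists() else None

def best_of_n_tikz(
    reference_image: bytes,
    sample_tikz: Callable[[], str],                 # one TikZ draw from the model
    visual_score: Callable[[bytes, bytes], float],  # assumed to rasterize the PDF and compare
    n: int = 16,
) -> Tuple[Optional[str], float]:
    """Keep only candidates that compile, score them against the reference, return the best."""
    best_code, best_score = None, float("-inf")
    for _ in range(n):
        code = sample_tikz()
        pdf = compile_tikz(code)
        if pdf is None:          # discard non-compiling programs
            continue
        score = visual_score(reference_image, pdf)
        if score > best_score:
            best_code, best_score = code, score
    return best_code, best_score
```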
Datasets & Benchmarks
DaTikZv2 Dataset
Size: 360,000+ human-created TikZ graphics programs (largest TikZ dataset to date)
Source: Extracted from academic papers; public release excludes arXiv content due to licensing, but dataset creation scripts provided for reproduction
Versions: DaTikZv2 (NeurIPS 2024), DaTikZv3 (updated)
SketchFig Dataset
Content: Hand-drawn sketches paired with corresponding scientific figures (Instruct-Pix2Pix)
Purpose: Training sketch-to-TikZ synthesis; synthetic sketch augmentation
MetaFig Dataset
Content: Diverse scientific figures with associated metadata
Purpose: Training multimodal understanding of scientific visualization conventions
Evaluation Benchmarks
Automatic Metrics: TikZ compilation success rate, visual fidelity scoring
Human Evaluation: Quality assessment comparing against GPT-4V and Claude 3 baselines
Results: DeTikZify outperforms both commercial baselines; MCTS significantly boosts performance
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
Key Insights
- First large-scale approach to generate scientific vector graphics via TikZ graphics language, producing editable, scalable figures from text descriptions
- TikZ provides human-oriented high-level commands that facilitate language modeling compared to low-level SVG paths, enabling complex scientific figures with minimal code
- Fine-tuned LLaMA models outperform commercial GPT-4 and Claude 2 in both automatic and human evaluation for generating figures similar to human-created ones
- Introduces CLiMA architecture integrating CLIP's multimodal projection layer with LLaMA using soft prompting, improving text-image alignment and enabling image-conditioned generation
- Demonstrates that GPT-4 and Claude 2 generate simpler, less complex figures and are susceptible to "typographic attacks" (copying input captions into output images to inflate similarity scores)
- Models exhibit minimal memorization issues with >80% novelty for n-grams (n>8), generating truly novel outputs rather than copying training data
Methods & Approach
- Base Model: LLaMA (7B and 13B parameters) chosen for clearly specified training data to avoid test set leakage; fine-tuned with LoRA for efficiency
- CLiMA Architecture: Augments LLaMA with CLIP ViT-H/14 using multimodal projection layer; connects CLIP output to LLaMA input via soft prompting with feed-forward adapter layer
- Training Strategy: LoRA applied to all linear layers; 12 epochs with AdamW, batch size 128, learning rate 5e-4; random 50% image/caption swapping for CLiMA data augmentation
- Iterative Resampling: Novel error correction method that reverses generation to before error line and continues sampling, with exponential backtracking (4^(i-1) lines)
- Evaluation Metrics: CLIPScore (caption-image), CLIPScoreimg (image-image), KID (distribution quality), CrystalBLEU (code similarity), EED (string distance), CSR (compilation rate)
- Human Evaluation: Best-worst scaling (BWS) with expert annotators for caption similarity and reference similarity; 4-tuples comparison methodology
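The iterative resampling step listed above can be sketched in a few lines; `generate_lines` and `first_error_line` are assumed callables standing in for the model and the LaTeX log parser.

```python
from typing import Callable, List, Optional

def iterative_resampling(
    generate_lines: Callable[[List[str]], List[str]],        # continue generation from a code prefix
    first_error_line: Callable[[List[str]], Optional[int]],  # compile; index of first failing line, or None
    max_attempts: int = 5,
) -> List[str]:
    """Error-driven resampling: back up exponentially far before the failing line and regenerate."""
    code = generate_lines([])
    for attempt in range(1, max_attempts + 1):
        err = first_error_line(code)
        if err is None:                       # compiles cleanly
            return code
        backtrack = 4 ** (attempt - 1)        # 1, 4, 16, ... lines (ending at the error line) are discarded
        prefix = code[: max(err - backtrack + 1, 0)]
        code = prefix + generate_lines(prefix)
    return code
```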
Datasets & Benchmarks
DaTikZ Dataset
Size: 120,000 TikZ drawings with captions (first large-scale TikZ dataset)
Sources:
- ArXiv Papers: 85,656 examples (67.75% augmented) - extracted from scientific papers with TikZ source
- TeX Stack Exchange: 29,238 examples (51.31% augmented) - Q&A converted to captions via WizardLM
- Artificial Examples: 3,914 examples (50% augmented) - GPT-4 generated via knowledge distillation
- Curated Examples: 981 examples (63.2% augmented) - high-quality from community websites
Caption Augmentation: 62.71% of captions with <30 tokens augmented using LLaVAR (5 candidates ranked by CLIPScore); improves CLIPScore from 24.76 to 29.12
Public Release: Excludes arXiv content due to licensing, but provides dataset creation scripts for reproduction
Evaluation Benchmarks
Test Set: 1,000 human-created examples sampled after December 2022 to avoid data leakage with LLaMA training
Automatic Metrics: CLIPScore, CLIPScoreimg, KID, CrystalBLEU, EED, Compilation Sampling Rate
Human Evaluation: Best-worst scaling with expert annotators on 100-item subset; Caption Similarity (CS) and Reference Similarity (RS)
Key Results:
- CLiMA13b achieves best CLIPScore (26.97) among fine-tuned models, outperforms LLaMA13b on 5/7 metrics
- GPT-4/Claude 2 show inflated CLIPScore (29.12/27.75) due to caption copying but poor CLIPScoreimg (78.63/75.59 vs 81.02 for CLiMA13b)
- Human evaluation: CLiMA13b has mode >0 for caption similarity; GPT-4 shows uniform distribution for reference similarity
- Code complexity: Human (916 tokens avg) > CLiMA/LLaMA (420 tokens) > GPT-4/Claude 2 (180 tokens)
Why TikZ > SVG: Semantic & Symbolic Understanding
Core Advantages of TikZ Over SVG
- High-level semantic commands vs. low-level paths: TikZ uses human-oriented commands like `\draw (A) -- (B)` expressing intent and relationships, while SVG uses primitive coordinates like `<path d="M 50,100 L 150,100"/>` with no semantic meaning
- Explicitly encodes relationships: Spatial relationships (`right of=input1`), hierarchical structures (node grouping), and connectivity patterns (all-to-all connections via loops) are captured in the code itself
- Captures symbolic meaning through abstraction: Concepts like "layers," "groups," and "patterns" are expressed through constructs like `\foreach` loops that reflect the underlying architecture (e.g., neural network layers)
- Maintains geometric precision with semantic intent: Commands like `\node at (0,0) {Center}` encode both the position AND the semantic relationship (the label belongs to the circle's center), not just coordinates
- Expresses complex structures concisely: Human TikZ code averages 916 tokens vs. GPT-4's simpler 187 tokens, but TikZ captures far more semantic complexity with commands that align with natural language descriptions
Example: Multi-Layer Perceptron
TikZ captures semantic structure:
% Define neuron layers with semantic meaning
\foreach \i in {1,...,4}
\node[neuron] (I-\i) at (0,-\i) {Input \#\i};
\foreach \i in {1,...,5}
\node[neuron] (H-\i) at (2,-\i-0.5) {};
% All-to-all connectivity pattern
\foreach \i in {1,...,4}
\foreach \j in {1,...,5}
\draw[->] (I-\i) -- (H-\j);
What this encodes: Layers as conceptual groups, all-to-all connectivity pattern between layers, iteration structure reflecting neural network architecture
SVG equivalent: Would require manually specifying every single line coordinate with no concept of "layers" or "connections" - just raw geometric primitives
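For contrast, a small Python snippet that emits the raw SVG `<line>` elements the same 4×5 all-to-all wiring would require; the coordinates are illustrative.

```python
# Emit the raw <line> elements an SVG version of the 4x5 all-to-all wiring would need.
# Every connection must be enumerated explicitly; no notion of "layer" or "connects-to"
# survives in the file. Coordinates are illustrative.
inputs  = [(50, 100 + 60 * i) for i in range(4)]    # 4 input neurons
hiddens = [(200, 70 + 60 * j) for j in range(5)]    # 5 hidden neurons

svg_lines = [
    f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>'
    for (x1, y1) in inputs
    for (x2, y2) in hiddens
]
print("\n".join(svg_lines))    # 20 separate primitives vs. one nested \foreach in TikZ
```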
Why This Matters for Language Models
- Natural language alignment: TikZ commands like `\draw`, `\node`, and `\foreach` align with how we describe figures in captions
- Compositional understanding: Models can learn that "multi-layer perceptron" → layers of nodes → all-to-all connections
- Fewer tokens, more meaning: The paper shows SVG methods "fail to maintain accurate geometric relations" (p. 3) while TikZ creates "complex figures with only a few commands" that preserve semantic relationships
Can Large Language Models Understand Symbolic Graphics Programs?
Key Insights
- First benchmark (SGP-Bench) to evaluate LLMs' ability to semantically understand symbolic graphics programs (SVG and CAD) without vision encoders, testing "visual imagination" from code alone
- Tests semantic understanding via multiple-choice questions on rendered images using only symbolic program input; introduces consistency tests via SE(2) transformations (rotation/translation) that drastically change code but preserve semantics
- Comprehensive evaluation: 1,085 SVG programs (4,340 questions across 5 types: semantic, color, shape, count, reasoning) + 2,400 CAD programs (3D, 3Dcomplex, 2D subsets from DeepCAD, Fusion360, SketchGraphs)
- Introduces Symbolic Instruction Tuning (SIT): 72K instruction pairs generated by querying GPT-4o on rendered images; improves Llama-3.1-8B from 46.5% to 51.4% on benchmark AND boosts general reasoning across 15+ benchmarks (GSM8k +3.3%, AGIEval +7.9%)
- Performance strongly correlates with reasoning ability: Claude 3.5 Sonnet best (67.4% SVG, 74.2% CAD); stronger reasoners (o1, GPT-4) outperform weaker models; shows clear scaling law effects
- Critical finding: Even GPT-4o achieves only 13% (chance-level) on SGP-MNIST dataset where handwritten digits in SVG form are easy for humans but extremely challenging for LLMs without semantic components
Methods & Approach
- Benchmark Creation Pipeline: Render symbolic programs → Query GPT-4o for semantic questions on images → Manual inspection → Validation via human study (500 samples, high agreement)
- SVG Understanding: 1,085 programs across 19 categories (accessory, animal, food, etc.); 4 questions per program testing semantic, color, shape, count, reasoning abilities
- CAD Understanding: 2,400 programs from 3 datasets with different syntax complexities; domain-specific language (DSL) syntax provided in-context; 1 semantic question per program
- Consistency Testing: 5 random translations (T) + 5 SE(2) perturbations (rotation+translation) per SVG; measures accuracy and consistency score (frequency of same answer across perturbations)
- SIT Data Generation: 72K instruction pairs: GPT-4o generates detailed semantic descriptions from rendered images; supports bidirectional use (original: program → description, reverse: description → program)
- Fine-tuning: Supervised fine-tuning with Orthogonal Fine-Tuning (OFT) on Llama-3.1-8B; also tested with LoRA (slightly worse); follows Alpaca instruction tuning procedure
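An illustrative sketch of the SE(2) consistency test described in the list above, assuming an `answer_fn` wrapper around the evaluated LLM; the exact perturbation ranges used by SGP-Bench are not reproduced here.

```python
import random
from typing import Callable

def perturb_svg(svg: str, rotate: bool) -> str:
    """Wrap the SVG body in a group with a random SE(2) transform (assumes the text starts with <svg ...>)."""
    tx, ty = random.uniform(-50, 50), random.uniform(-50, 50)
    angle = random.uniform(0, 360) if rotate else 0.0
    transform = f'translate({tx:.1f},{ty:.1f}) rotate({angle:.1f})'
    head, _, tail = svg.partition(">")              # split right after the opening <svg ...> tag
    body = tail.rsplit("</svg>", 1)[0]
    return f'{head}><g transform="{transform}">{body}</g></svg>'

def consistency_score(
    svg: str,
    question: str,
    answer_fn: Callable[[str, str], str],   # (program text, question) -> model's answer
    n: int = 5,
    rotate: bool = True,
) -> float:
    """Fraction of perturbed programs on which the model repeats its original answer."""
    reference = answer_fn(svg, question)
    repeats = sum(answer_fn(perturb_svg(svg, rotate), question) == reference for _ in range(n))
    return repeats / n
```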
Datasets & Benchmarks
SGP-Bench: SVG Dataset
Size: 1,085 SVG programs with 4,340 questions (4 per program)
Categories: 19 categories including accessory, animal, book, clothing, food, furniture, tools, etc.
Question Types:
- Semantic (1,085 questions): Global semantic meaning of object
- Color (864 questions): Color-related questions about specific object parts (tests localization)
- Shape (1,217 questions): Geometric shapes of object parts
- Count (819 questions): Counting occurrences of patterns or semantic parts
- Reasoning (355 questions): Higher-level reasoning about object properties
Difficulty Curve: Color (easiest, ~80% for best models) → Shape → Count → Semantic (hardest, ~37% for Llama3.1-405B)
SGP-Bench: CAD Dataset
Size: 2,400 CAD programs with 2,400 questions (1 per program)
Subsets:
- 3D (1,000 programs): From DeepCAD dataset
- 3Dcomplex (700 programs): From Fusion360 Reconstruction Dataset
- 2D (700 programs): From SketchGraphs dataset
Unique Challenge: Each subset uses different domain-specific language (DSL) syntax; requires in-context learning of syntax rules
SVG-Invariance Benchmark
Purpose: Test semantic consistency under SE(2) transformations that preserve visual semantics but drastically alter code
Perturbations: 5 translations (T) + 5 rotation+translation (SE(2)) per SVG sample
Key Findings: Most LLMs achieve >80% consistency (half >90%), suggesting fundamental understanding rather than memorization; no correlation between tree edit distance and consistency performance
SGP-MNIST Dataset
Size: 1,000 symbolic graphics programs (100 per digit 0-9)
Challenge: MNIST-like handwritten digits as SVG programs; no semantic components, only convoluted path trajectories with enclosed loops for "thickness"
Critical Result: Even GPT-4o achieves only 13% accuracy (barely above 10% chance level); demonstrates fundamental gap between LLM program understanding and human visual recognition
Why Symbolic Graphics Programs Test Sophisticated Reasoning
Required Reasoning Abilities
- "Visual Imagination": LLMs must mentally simulate how symbolic operations render visually without seeing pixels
- Long-range Sequential Reasoning: Operation order drastically affects semantics; requires tracking procedural generation steps
- Fine-grained Grounding: Locating semantic components in program structure demands precise understanding of code-to-visual mapping
- Multi-step Compositional Reasoning: E.g., "What is the object primarily used for?" requires: identify the object semantically → determine its function → answer (an error in any step fails)
- Numeric + Spatial + Geometric Understanding: Requires perceiving coordinates, dimensions, transformations, and their visual implications
Example: CAD Reasoning (OpenAI-o1)
Task: "How many protruding cylindrical shafts are visible in the CAD object?"
O1's Step-by-Step Process:
- Step 1: Main shaft created via circle extrusion (radius 0.32, extrude 3.175 units)
- Step 2: Base ring added (annulus extruded downward, doesn't add shaft)
- Step 3: Cutting features (removes material, no new shafts)
- Step 4: Small ring added (doesn't count as shaft)
- Steps 5-6: Upper shaft created and extended
- Answer: 2 shafts (main shaft + upper shaft)
Demonstrates: Numeric perception, spatial reasoning, geometric understanding, long-range planning, and common sense
Symbolic Instruction Tuning (SIT) Results
SGP-Bench Performance Gains
Llama-3.1-8B improvement: 46.5% → 51.4% (+4.9%) with 55K SIT pairs
Scaling with data size:
- 10K pairs: 48.0% (+1.5%)
- 25K pairs: 50.3% (+3.8%)
- 40K pairs: 51.2% (+4.7%)
- 55K pairs: 51.4% (+4.9%)
General Reasoning Improvements (Llama-3.1-8B)
Notable gains with OI-mixed-SIT (Open-Instruct + original + reverse SIT):
- Instruction Following: IFEval-inst +5.6%, IFEval-prompt +3.5%
- Math Reasoning: GSM8k +3.3%, ASDiv +2.8%, Arithmetic +2.0%, MathQA +1.4%
- General Reasoning: AGIEval +7.9%, BigBenchHard +1.7%, PIQA +0.5%
- Language Understanding: C-Eval +1.7%, XNLI +1.1%, MMLU +1.2%
- Reading Comprehension: SQuAD2.0 +2.7%, CoQA +1.2%
Key Insight: Reverse SIT (description → program generation) provides complementary reasoning abilities to original SIT; the mixed approach achieves the best overall performance
Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation
Key Insights
- First training-free framework using cognitively guided chain-of-thought to convert raster scientific diagrams into editable mxGraph XML (draw.io compatible)
- Addresses critical gap: rasterized diagrams lose symbolic structure (nodes, edges, hierarchy, semantic grouping, layout constraints) making them uneditable and non-reusable
- Two-stage cognitive pipeline: (1) Coarse-to-Fine Planning via perceptual structuring and semantic layout planning, (2) Structure-Aware Code Generation with multi-round XML refinement
- Introduces Plot2XML benchmark: 247 real scientific diagrams from actual papers with gold-standard mxGraph XML annotations and 5-dimensional complexity analysis
- Outperforms GPT-4o, Claude, Gemini, Grok, Qwen2.5-VL, Llama-3.2V by 10-20% semantic alignment and up to 40% visual fidelity on complex diagrams
- Achieves 89% XML validity rate; ablation shows removing hierarchical XML drops validity to 66%, confirming necessity of structured approach
Methods & Approach
- Stage I - Perceptual Structuring: MLLM extracts Gestalt grouping, hierarchical decomposition, visual encoding (colors → roles), and connector topology
- Stage I - Semantic Layout Planning: Produces regions (Input/Output blocks), typed elements (Process, Decision, Entity), and layout constraints (align, connect, layer)
- Stage II - Initial XML Generation: MLLM generates structured mxGraph code with Y_doc (XML root), Y_style (style dictionary), Y_node (symbolic nodes), Y_layout (positions + alignment), Y_edge (connectors + routing)
- Multi-Round XML Refinement: Iterative correction of formatting errors, application of XML schema constraints, draw.io validator verification for syntactic correctness and rendering success
- Cognitive Inspiration: Pipeline mirrors human diagram comprehension: visual perception → semantic understanding → symbolic representation
- Evaluation Metrics: CLIP (semantic alignment), DINO (visual features), FID (distribution quality), Aesthetic Score, plus human evaluation
Benchmark
Plot2XML Benchmark
Size: 247 real scientific diagrams from actual research papers
Annotations: Gold-standard mxGraph XML ground truth for each diagram
Unique Features: First benchmark focused on real-world scientific diagrams (vs. synthetic or UI-focused datasets)
Complexity Analysis: 5-dimensional evaluation framework:
- Connection Complexity: Number and types of edges, routing patterns
- Graphical Complexity: Number of nodes, shapes, hierarchical levels
- Color Usage: Color encoding schemes for semantic meaning
- Text Density: Labels, annotations, captions
- Special Elements: Mathematical notation, domain-specific symbols
Gap Addressed: Previous datasets (SVG, TikZ, UI-to-code) lacked rich relational structure required for scientific diagrams
Evaluation Results
Performance Gains vs. Best MLLMs:
- Semantic Alignment (CLIP): +10-20% improvement
- Visual Fidelity (DINO): Up to +40% on complex diagrams
- XML Validity Rate: 89% (vs. 66% without hierarchical structure)
- Human Evaluation: Preferred over GPT-4o, Claude, Gemini, Grok, Qwen2.5-VL, Llama-3.2V
Ablation Study Critical Findings:
- Removing perceptual structuring: ↓ semantic alignment & validity
- Removing layout planning: Major performance drop
- Removing hierarchical XML: Validity drops from 89% → 66%
Why mxGraph XML for Scientific Diagrams
Advantages of mxGraph Over Raster/SVG/TikZ
- Preserves Symbolic Structure: Nodes, edges, hierarchy, semantic grouping, and layout constraints are explicitly encoded
- Editable & Reusable: Compatible with draw.io, enabling direct editing, modification, and reuse in presentations/papers
- Semantically Meaningful: Unlike SVG paths, mxGraph represents diagrams as structured graphs with typed elements (Process, Decision, Entity)
- Layout Constraints: Explicit encoding of alignment, connection routing, and z-layering rules
- Renderability Guarantee: XML can be validated and rendered programmatically via draw.io engine
Problems with Existing Approaches
- SVG: Low-level shapes with poor semantic meaning, cannot express rich relational structures
- TikZ: Powerful but hard to parse, non-standardized, limited tool support
- Python Plot Code: Domain-specific, cannot express general diagram structures (flowcharts, architecture diagrams)
- Raw MLLMs: Struggle with structural accuracy, layout consistency, long XML generation, executable formatting
Cognitive Reasoning Pipeline (Algorithm 1)
Stage I: Coarse-to-Fine Planning
Step 1 - Perceptual Structuring:
- Extract: Gestalt grouping, hierarchical primitives, visual encodings (color/shape → meaning), connector topology
- Output: T_percept = (Gestalt, Hierarchy, Encoding, Connectors)
Step 2 - Semantic Specification:
- Generate: R (Regions: Input/Output/Modules), E (Typed elements: Process/Entity/Block/Arrow), L (Layout constraints: align, connect, layer)
- Output: T_hierarchy = (R, E, L), the semantic blueprint
Stage II: Structure-Aware Code Generation
Step 3 - Initial XML Generation:
- Generate: Y_doc (XML root), Y_style (style dictionary), Y_node (symbolic nodes), Y_layout (positions + alignment), Y_edge (connectors + routing)
Step 4 - Multi-Round XML Refinement:
- Iteratively correct: Missing tags, mis-nested elements, schema violations
- Verify via draw.io validator: Syntactic correctness, rendering success, structural consistency
- Early stopping once valid XML is achieved
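A minimal sketch of Steps 3-4: assembling a draw.io-compatible mxGraph document and running a cheap well-formedness check. The paper's generation is done by the MLLM and verified with the draw.io validator, so this only illustrates the target structure and a stand-in syntactic check.

```python
import xml.etree.ElementTree as ET

def minimal_mxgraph(nodes, edges):
    """Assemble a minimal draw.io-compatible mxGraph document: root cells, typed nodes, connectors."""
    model = ET.Element("mxGraphModel")
    root = ET.SubElement(model, "root")
    ET.SubElement(root, "mxCell", id="0")                      # the two required root cells
    ET.SubElement(root, "mxCell", id="1", parent="0")
    for nid, label, (x, y, w, h) in nodes:                     # symbolic nodes + layout
        cell = ET.SubElement(root, "mxCell", id=nid, value=label,
                             style="rounded=1;whiteSpace=wrap;", vertex="1", parent="1")
        ET.SubElement(cell, "mxGeometry", x=str(x), y=str(y),
                      width=str(w), height=str(h), **{"as": "geometry"})
    for eid, (src, dst) in edges:                              # connectors; routing left to draw.io
        cell = ET.SubElement(root, "mxCell", id=eid, style="edgeStyle=orthogonalEdgeStyle;",
                             edge="1", parent="1", source=src, target=dst)
        ET.SubElement(cell, "mxGeometry", relative="1", **{"as": "geometry"})
    return ET.tostring(model, encoding="unicode")

def is_well_formed(xml_text: str) -> bool:
    """Cheap syntactic check standing in for the draw.io validator used in the paper."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

xml = minimal_mxgraph(
    nodes=[("n1", "Input", (40, 40, 120, 40)), ("n2", "Process", (40, 140, 120, 40))],
    edges=[("e1", ("n1", "n2"))],
)
assert is_well_formed(xml)
```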
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Key Insights
- Reframes image vectorization as code generation rather than pixel reconstruction: treats SVG generation as autoregressive language modeling to generate semantic primitives (circles, rectangles, text) instead of decomposing everything into path curves
- First multimodal LLM for SVG generation handling both image-to-SVG and text-to-SVG with proper semantic understanding of primitives
- Achieves 3-6Γ more compact code than baselines while maintaining state-of-the-art quality (DinoScore 0.966 vs 0.939-0.992 for traditional methods)
- Generates editable, human-readable SVG with proper primitives: a circle becomes `<circle/>` (1 token), not 50 Bézier curves (500+ tokens)
- Unique capability: generates structured diagrams with text primitives and layouts (flowcharts, UI mockups) - impossible for path-only vectorization tools
- Introduces SVG-Stack (2M samples) and SVG-Bench (10 datasets, 3 tasks); proposes DinoScore metric with strong human correlation (0.62-0.76 vs 0.06 for MSE)
Methods & Approach
- Vision Encoder: StarVector-1B uses CLIP ViT-B/32 (224Γ224, 257 tokens); StarVector-8B uses SigLip (384Γ384, 576 tokens) for enhanced visual understanding
- Adapter Layer: Non-linear projection h_v = LayerNorm(W_L · Swish(W_h · z_v)) bridges vision features to the language model embedding space
- Language Model: StarCoder-1B (8K context) or StarCoder2-7B (16K context) for autoregressive SVG code generation
- Training Data: SVG-Stack (2.1M training samples) with aggressive augmentation (rotation, scale, color jitter, Perlin noise on Bézier curves); 4M text-image-SVG triplets with BLIP2/LLaVA captions filtered by CLIP Score ≥ 30
- Training Configuration: StarVector-1B trained 7 days on 8× A100 (batch 128); StarVector-8B trained 10 days on 64× H100 (batch 512); both use AdamW with lr=1e-5
- Inference Strategy: Sample multiple candidates at temperatures [0.0, 0.25, 0.5, 0.75, 1.0] with length_penalty=1.2; rerank by DinoScore for best perceptual quality
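A sketch of the multi-temperature sampling and DinoScore reranking listed above, treating DinoScore as distance in DINOv2 feature space (one plausible reading of the metric); `sample_svg` and `rasterize` are assumed callables, and input resizing/normalization for DINOv2 is omitted for brevity.

```python
import torch
from typing import Callable, Sequence

def rerank_by_dinoscore(
    target: torch.Tensor,                              # target image, shape (3, H, W), H/W multiples of 14
    sample_svg: Callable[[torch.Tensor, float], str],  # (image, temperature) -> one SVG program
    rasterize: Callable[[str], torch.Tensor],          # SVG -> rendered image tensor (3, H, W)
    temperatures: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> str:
    """Sample one candidate per temperature; keep the one closest to the target in DINOv2 feature space."""
    dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

    def feats(img: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return dino(img.unsqueeze(0)).squeeze(0)

    target_feat = feats(target)
    best_svg, best_dist = None, float("inf")
    for t in temperatures:
        svg = sample_svg(target, t)
        dist = torch.norm(feats(rasterize(svg)) - target_feat).item()
        if dist < best_dist:
            best_svg, best_dist = svg, dist
    return best_svg
```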
Datasets & Benchmarks
SVG-Stack Dataset
Size: 2.1M training + 108K validation + 5.7K test samples
Source: TheStack dataset with deduplication, preprocessing, and rasterization validation via CairoSVG
Statistics: Average length 1,822 ± 1,808 tokens; diverse primitives (icons, emojis, fonts, diagrams)
Captions: 4M text-image-SVG triplets generated via BLIP2 + LLaVA with CLIP Score filtering (threshold=30)
SVG-Bench Benchmark
Coverage: 10 datasets across 3 tasks
Tasks:
- Image-to-SVG: SVG-Stack, SVG-Fonts, SVG-Icons, SVG-Emoji (general vectorization quality)
- Text-to-SVG: SVG-Stack-Text, SVG-Fonts-Text, SVG-Icons-Text, SVG-Emoji-Text (multimodal generation)
- Diagram Generation: Diagram-FlowChart, Diagram-GraphPlot (structured graphics with text primitives)
Evaluation Metrics:
- DinoScore: L2 distance in DinoV2 feature space (strong human correlation of 0.62-0.76)
- Token Length: Code compactness indicator (semantic understanding proxy)
- Traditional Metrics: LPIPS, SSIM, MSE (shown to correlate poorly with human judgment)
- Text-to-SVG: FID, FID-CLIP, CLIP Score for distribution similarity and text-image alignment
Key Results:
- StarVector-8B: DinoScore 0.966, 5.3K tokens (vs ground truth 2.8K) in 74s
- LIVE baseline: DinoScore 0.939, 18.5K tokens in 1,412s (19× slower, 3.5× more bloated)
- AutoTrace/VTracer: DinoScore 0.988/0.975, 6.4K/8.1K tokens, path-only representation
- Diagram generation: 80% human preference vs 12% for LIVE (only StarVector generates valid text primitives)
- Text-to-SVG: FID 25.8, FID-CLIP 4.6, CLIP Score 0.781 (outperforms GPT-4V and CodeLlama-34B)
Why Code Generation > Pixel Reconstruction for Vectorization
Paradigm Shift: Vectorization as Code Generation
Traditional Approach (Inverse Rendering): Edge detection β contour tracing β curve fitting β path primitives
StarVector Approach (Code Generation): Visual understanding β semantic interpretation β SVG code synthesis
Key Advantage: Language models learn "these pixel patterns correspond to circle primitives" rather than applying fixed heuristics
Example: Circle Representation
Traditional Vectorization (AutoTrace/VTracer):
<!-- 50+ Bézier curve segments -->
<path d="M 100,50 C 100,22.4 77.6,0 50,0 C 22.4,0 0,22.4 0,50
C 0,77.6 22.4,100 50,100 C 77.6,100 100,77.6 100,50 Z"/>
<!-- ...49 more path segments... -->
<!-- Result: ~500 tokens, uneditable -->
StarVector Output:
<circle cx="50" cy="50" r="50"/> <!-- Result: ~1 token, editable, 50× more compact -->
Impact: Semantic understanding enables compact, human-readable, editable SVG code
Human Evaluation Results
Image-to-SVG Quality (1,948 assessments, 30 participants)
Question: "Which SVG better represents the input image?"
StarVector-8B Win Rate:
- vs. LIVE: 68% wins
- vs. AutoTrace: 74% wins
- vs. VTracer: 71% wins
Diagram Generation Task:
- vs. LIVE: 89% wins (diagrams with readable text)
- vs. AutoTrace: 92% wins
Critical Finding: Only StarVector generates valid diagrams with text and structured elements - baseline methods produce unreadable path approximations
CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
Key Insights
- First benchmark integrating visual circuit understanding and symbolic engineering reasoning, revealing major limitations in current multimodal LLMs
- Addresses critical gap: existing models excel at visual component recognition but fail at symbolic reasoning (applying Ohm's law, KCL/KVL, functional inference)
- Hierarchical evaluation spanning visual primitives → connectivity topology → functional modules → system-level behavior
- Comprehensive task taxonomy covering perception (component detection, layout recognition), symbolic computation (circuit equations, parameter inference), and system-level analysis (fault detection, design modification)
- Benchmark reveals MLLMs perform well on visual captioning but show large performance gaps vs. humans on hierarchical topology extraction and multi-step analytical problem solving
Benchmark Structure
- Visual Tasks: Component detection, connection reading, circuit graph reconstruction
- Symbolic & Analytical Tasks: Solving circuit equations, parameter inference from diagrams, module classification
- System-Level Reasoning: Circuit diagnosis, design modification impact analysis, step-by-step engineering reasoning
- Hierarchical Levels: Primitive symbols → Subcircuits → Functional blocks → Complete system diagrams
- Models Evaluated: GPT-4o, Claude 3.5/3.7, Qwen series, Gemini, LLaMA-based VL models
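To make "circuit graph reconstruction" scoring concrete, here is an illustrative comparison of predicted vs. reference connectivity, treating the circuit as a set of component-terminal edges. This is a sketch of one plausible scoring scheme, not CircuitSense's official metric.

```python
from typing import Set, Tuple

Edge = Tuple[str, str, str]   # (component_id, node_a, node_b); the two terminals are unordered

def normalize(edges: Set[Edge]) -> Set[Edge]:
    """Treat component terminals as unordered so (R1, n1, n2) == (R1, n2, n1)."""
    return {(c, *sorted((a, b))) for c, a, b in edges}

def topology_f1(predicted: Set[Edge], reference: Set[Edge]) -> float:
    """F1 over connectivity edges: a simple proxy for 'did the model recover the circuit graph?'."""
    pred, ref = normalize(predicted), normalize(reference)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: the model missed R2's connection to ground.
reference = {("V1", "n1", "gnd"), ("R1", "n1", "n2"), ("R2", "n2", "gnd")}
predicted = {("V1", "n1", "gnd"), ("R1", "n1", "n2"), ("R2", "n2", "n3")}
print(topology_f1(predicted, reference))   # ~0.667
```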
Benchmark
CircuitSense Benchmark
Type: Evaluation-only benchmark (no training dataset provided)
Content: Thousands of real and synthetic circuit diagrams with ground truth annotations
Annotations Include:
- Node/edge connectivity graphs
- Component attributes and parameters
- Symbolic circuit equations
- Engineering reasoning steps
Key Finding: All current MLLMs show significant failures in hierarchical topology extraction, symbolic circuit reasoning, and multi-step analytical problem solving despite good performance on basic visual tasks
Benchmark Contribution
Novel Aspects:
- First to unify visual perception and symbolic engineering reasoning in a single evaluation framework
- Comprehensive task taxonomy spanning low-level vision β mid-level graph reasoning β high-level analytical tasks
- Diagnostic analysis showing where and why MLLMs fail on engineering-domain reasoning
- Pathway for developing engineering-capable AI systems beyond basic diagram captioning
MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science
Key Insights
- Decomposes expert-level physical science reasoning into: Physical Perception Model (PPM) for diagram understanding + Physics Simulator for symbolic reasoning
- Uses SPICE (Nagel, 1975) as simulation language to bridge visual diagrams and numerical reasoning - enables direct access to fundamental circuit structure
- Achieves 4.3× improvement over GPT-4V on college-level circuits: 7.6% → 32.9% accuracy on the SimpleCircuitEval benchmark
- Chain-of-Simulation process: PPM generates SPICE netlist → NgSPICE simulator executes → numerical results guide MLLM final reasoning
- Ablation shows high simulator dependency: accuracy drops from 55% to 15% without simulation results, even with Python-based reasoning attempts
- Simulation language description alone reduces hallucination by 10% on non-simulatable problems (the netlist description functions like a scene graph)
SPICE Simulator: The Physical Reasoning Engine
What is SPICE?
SPICE (Simulation Program with Integrated Circuit Emphasis, Nagel 1975) is a general-purpose circuit simulation program that:
- Analyzes electronic circuits numerically by solving differential equations
- Industry standard for circuit simulation (used in IC design, education, research)
- Provides DC, AC, transient, and sensitivity analysis
- In MAPS: Uses NgSPICE as the execution engine
The Simulation Pipeline
1. INPUT: Circuit Diagram Image
   [Image with resistors, capacitors, voltage sources]
   ↓
2. PPM (Physical Perception Model)
   CogVLM-17B fine-tuned on 20K synthetic pairs
   Outputs: SPICE Netlist (Simulation Language)
   ↓
3. SPICE NETLIST EXAMPLE:
   * Voltage Divider Circuit
   V1 n1 0 DC 10V
   R1 n1 n2 5k
   R2 n2 0 10k
   .op
   .end
   ↓
4. NgSPICE SIMULATOR EXECUTION
   Solves Kirchhoff's laws, component equations
   ↓
5. SIMULATION RESULTS (Observations):
   V(n1) = 10.000V
   V(n2) = 6.667V   ← key intermediate value
   I(V1) = -0.667mA
   ↓
6. MLLM FINAL REASONING
   Uses simulation results as context
   "Based on simulation: V(n2) = 6.667V
    Voltage across R2 = 6.667V - 0V = 6.667V"
   ↓
7. OUTPUT: Final Answer with Explanation
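For concreteness, a small Python sketch of steps 3-5: write the netlist, run NgSPICE in batch mode, and scrape the operating-point values that become the MLLM's observations. The `print all` output format varies across ngspice versions, so the parsing regex below is best-effort.

```python
import re
import subprocess
import tempfile
from typing import Dict

NETLIST = """\
* Voltage divider (matches the pipeline example above)
V1 n1 0 DC 10
R1 n1 n2 5k
R2 n2 0 10k
.control
op
print all
.endc
.end
"""

def run_ngspice(netlist: str) -> Dict[str, float]:
    """Run ngspice in batch mode and scrape printed 'name = value' lines (format varies by version)."""
    with tempfile.NamedTemporaryFile("w", suffix=".cir", delete=False) as f:
        f.write(netlist)
        path = f.name
    out = subprocess.run(["ngspice", "-b", path], capture_output=True, text=True).stdout
    values = {}
    for name, value in re.findall(r"^\s*([\w()#.]+)\s*=\s*([-+0-9.eE]+)", out, re.MULTILINE):
        values[name] = float(value)
    return values

observations = run_ngspice(NETLIST)
# These numeric observations (e.g. the node-2 voltage of about 6.667 V) are what the MLLM
# receives as context for its final chain-of-simulation answer.
print(observations)
```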
How Simulation Results Guide Reasoning
NOT Traditional RL: MAPS doesn't use SPICE results as RL rewards during inference. Instead, simulation results are intermediate reasoning steps:
- Single-shot inference: Simulator runs once per question (not iterative policy learning)
- Observations as context: Voltages/currents fed to MLLM alongside question and diagram
- Grounded reasoning: LLM explains answer using numerical facts from physics engine
- Validation: Results checked for validity before proceeding to final answer
Training Signals (Reward-like): During PPM training, three accuracy metrics act as supervision:
- Component Quantity Accuracy (ACC_CQ): Correct component count → binary reward
- Component Value Accuracy (ACC_CV): Correct resistances/capacitances → per-component reward
- Simulation Accuracy (ACC_sim): Key metric - do the generated and reference SPICE netlists produce the same results?
SPICE as "Tool Use" Framework
MAPS is better understood as multimodal tool use than RL:
| Aspect | Traditional RL | MAPS Approach |
|---|---|---|
| Interaction | Iterative policy improvement | Single-shot tool call |
| Feedback | Reward signal for action quality | Observations as reasoning context |
| Goal | Optimize cumulative reward | Answer question accurately |
| Exploration | Explore action space | No exploration (deterministic) |
LaTeX/CircuitTikz: Synthetic Data Generation Engine
Why LaTeX/CircuitTikz for Training Data?
MAPS uses CircuitTikz (LaTeX package) to generate 20,000 synthetic training pairs for the PPM. This choice is critical:
Key Advantage: One Source → Two Perfect Outputs
SINGLE CircuitTikz CODE:
\begin{circuitikz}
  \draw (0,0) to[V, v=$10V$] (0,3)
        to[R=$5k\Omega$] (3,3)
        to[R=$10k\Omega$] (3,0)
        to (0,0);
\end{circuitikz}
        ↓                        ↓
OUTPUT 1:                  OUTPUT 2:
[PNG Image]                SPICE Netlist
Circuit diagram            V1 n1 0 DC 10
with visual layout         R1 n1 n2 5k
                           R2 n2 0 10k
                           .op
                           .end
        ↓                        ↓
           TRAINING PAIR
   (Image, SPICE) - Perfect Alignment!
Synthetic Data Generation Pipeline
- Sample Circuit Parameters from hierarchical distribution:
- Component counts (2-10 components)
- Topology types (series, parallel, series-parallel, bridge, ladder)
- Component values (resistances: 1k-10k, capacitances: 1pF-100μF)
- Voltage sources (3V, 5V, 9V, 12V, etc.)
- Generate CircuitTikz LaTeX Code programmatically
- Compile to PDF/PNG using pdflatex + ImageMagick
- Parse CircuitTikz → SPICE Netlist (structured syntax makes this straightforward; see the paired-generation sketch after this list)
- Verify with NgSPICE - ensure valid circuit (reject invalid samples)
- Add to Dataset: ppm-syn-lprc (20K pairs)
Result: 20,000 perfectly aligned (diagram image, SPICE netlist) pairs with zero human annotation!
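A minimal sketch of the paired-generation idea for one topology (a series divider); the real generator samples many more topologies and component types, and the strings below are illustrative.

```python
import random

def sample_divider():
    """Sample parameters once, then derive both the CircuitTikz figure and the SPICE netlist."""
    v = random.choice([3, 5, 9, 12])                            # source voltage (V)
    r1, r2 = (random.choice(range(1, 11)) for _ in range(2))    # resistances (kOhm)

    circuitikz = rf"""\documentclass{{standalone}}
\usepackage{{circuitikz}}
\begin{{document}}
\begin{{circuitikz}}
  \draw (0,0) to[V, v=${v}V$] (0,3)
        to[R=${r1}k\Omega$] (3,3)
        to[R=${r2}k\Omega$] (3,0)
        to[short] (0,0);
\end{{circuitikz}}
\end{{document}}"""

    spice = (f"* auto-generated divider\n"
             f"V1 n1 0 DC {v}\n"
             f"R1 n1 n2 {r1}k\n"
             f"R2 n2 0 {r2}k\n"
             ".op\n.end\n")
    return circuitikz, spice   # compile the first with pdflatex, verify the second with ngspice

tikz_code, netlist = sample_divider()
```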
Why LaTeX > Other Approaches
| Method | Problem |
|---|---|
| Hand-draw circuits | ❌ Can't auto-extract SPICE; manual annotation needed |
| CAD tools (EAGLE, KiCAD) | ❌ Complex file formats; overkill for simple circuits |
| Direct image synthesis | ❌ Hard to ensure image ↔ SPICE correspondence |
| Screenshot textbooks | ❌ Copyright; limited diversity; no labels |
| Random pixel generation | ❌ Unrealistic; domain shift from real problems |
| ✅ CircuitTikz/LaTeX | ✅ Programmatic, scalable, perfect alignment, realistic appearance |
Hierarchical Complexity Control
LaTeX code enables curriculum learning by controlling circuit complexity:
- Simple (2-3 components): Series/parallel resistor networks
- Medium (4-6 components): Series-parallel combinations, voltage dividers
- Hard (7-10 components): Bridge circuits, ladder networks, mesh analysis
Key Benefit: Training data distribution matches real college-level problem difficulty, minimizing domain gap!
Architecture & Training
- Physical Perception Model (PPM): CogVLM-17B base, fine-tuned with LoRA (rank 50, lr 1e-5, batch 32)
- Training Objective: Minimize the negative log-likelihood (maximum-likelihood training) for SPICE generation from diagrams
- Synthetic Dataset: ppm-syn-lprc (20,000 paired circuit diagrams + SPICE netlists)
- Data Source: Hierarchical random sampling → CircuitTikz/LaTeX → PNG + SPICE pairs
- Evaluation Metrics: Component Quantity/Value Accuracy, Simulation Accuracy (cosine similarity of sim results)
- Inference: PPM generates SPICE → MLLM refines with text → NgSPICE executes → Results guide final reasoning
Results
Performance on SimpleCircuitEval (79 college-level LPRC problems):
| Model | Direct Prompting | + MAPS | Improvement |
|---|---|---|---|
| GPT-4V | 7.6% | 32.9% | +333% (4.3×) |
| Claude-3.5 | 12.7% | 38.0% | +199% (3.0×) |
| GLM-4V | 5.1% | 29.1% | +471% (5.7×) |
Ablation Study Key Finding:
- With simulation results: 55% accuracy
- Without simulation (Python reasoning): 15% accuracy
- Conclusion: Physics simulator is critical - symbolic reasoning alone insufficient
Why Such Huge Gains?
- LLMs are poor at circuit math - simulators provide accurate numerical values
- SPICE netlist reduces hallucination by constraining to valid circuit states
- Structured representation (netlist) eliminates ambiguity in diagram interpretation
- Simulation results serve as intermediate reasoning steps, not just final answers
Benchmark
SimpleCircuitEval Benchmark
Type: Evaluation benchmark for circuit analysis reasoning
Content: 79 college-level Linear Pure Resistive Circuit (LPRC) problems from 4 textbook chapters
Task Categories: Voltage/current calculation, power analysis, circuit reduction, Kirchhoff's laws
Training Dataset: ppm-syn-lprc
Size: 20,000 synthetic circuit diagram + SPICE netlist pairs
Generation: Automated via CircuitTikz/LaTeX → PNG + SPICE parsing
Validation: All circuits verified with NgSPICE before inclusion
Diversity: Hierarchical sampling across complexity levels and topology types
Current Evaluation Benchmarks for Diagrams & Circuits
Comprehensive benchmarks for evaluating multimodal models on diagram understanding, circuit analysis, and technical graphics
AMSBENCH
Focus: Analog and Mixed-Signal (AMS) Circuit Understanding
Tasks: Circuit recognition, component identification, topology analysis, function prediction
Scope: Comprehensive evaluation of MLLM capabilities on AMS circuit diagrams
EEE-Bench
Focus: Electrical & Electronics Engineering Multimodal Understanding
Tasks: Circuit analysis, component recognition, schematic interpretation, engineering problem solving
Scope: Comprehensive benchmark spanning multiple EEE domains and diagram types
StarVector (SVG-Bench)
Focus: SVG Generation & Vectorization Quality
Tasks: Image-to-SVG, text-to-SVG, diagram generation (flowcharts, graphs)
Scope: 10 datasets across 3 tasks; DinoScore metric for perceptual quality
SGP-Bench
Focus: Symbolic Graphics Program Understanding (SVG & CAD)
Tasks: Semantic understanding, visual imagination, SE(2) consistency, reasoning
Scope: 1,085 SVG + 2,400 CAD programs; tests LLM "visual imagination" from code
ElectroVizQA
Focus: Electronics Diagram Visual Question Answering
Tasks: Component identification, circuit analysis, diagram comprehension, QA tasks
Scope: Specialized benchmark for electronics diagram understanding and reasoning
MMCircuitEval
Focus: Multimodal Circuit Evaluation & Analysis
Tasks: Circuit understanding, component recognition, topology analysis, multimodal reasoning
Scope: Comprehensive evaluation framework for circuit diagram comprehension
CircuitSense
Focus: Hierarchical Circuit System Understanding with Visual-Symbolic Integration
Tasks: Visual comprehension (component detection, topology), symbolic reasoning (circuit laws, parameter inference), system-level analysis (diagnosis, design modification)
Scope: First benchmark bridging visual perception and symbolic engineering reasoning across hierarchical circuit levels (primitives → subcircuits → functional blocks → systems). Evaluation-only (no training dataset).
Benchmark Coverage Summary
Circuit Diagrams
AMSBENCH, EEE-Bench, ElectroVizQA, MMCircuitEval, CircuitSense
SVG Generation
StarVector (SVG-Bench)
Program Understanding
SGP-Bench (SVG & CAD)
Current Commercial Tools
Commercial non-VLM software demonstrating the practical value and market demand for vision-to-structured-output tasks
Mathpix
Function: Mathematical Equation Recognition and LaTeX Conversion
Capabilities: Converts images of equations (handwritten or printed) to LaTeX code, supports complex mathematical notation, integrates with various document editors
Use Case: Enables researchers, students, and academics to quickly digitize mathematical content from papers, textbooks, and handwritten notes into editable LaTeX format
Codia AI
Function: Image to HTML/CSS Code Converter
Capabilities: Converts design mockups, screenshots, and UI images into production-ready HTML and CSS code. Two-step process: Image → Figma design → Clean, semantic HTML/CSS. Supports responsive layouts, SEO best practices, and accessibility features.
Use Case: Bridges the gap between design and implementation for web developers, enabling rapid prototyping and pixel-perfect conversion of visual designs to functional web code
LilyPond
Function: Text-Based Music Notation and Sheet Music Generation
Capabilities: Compiles text-based music notation code into professional-quality sheet music (PDF, PNG, SVG). Similar to LaTeX for documents, LilyPond uses plain text syntax to define musical elements (notes, rhythms, dynamics, articulations). The inverse task - Optical Music Recognition (OMR) converting sheet music images to LilyPond code - represents another vision-to-structured-code challenge.
Use Case: Enables composers, music educators, and publishers to create beautifully engraved sheet music from text-based notation. The potential for automated conversion of printed/handwritten music scores to editable LilyPond format demonstrates the value of vision-to-code tasks in the music domain.
Key Insight
The existence of these commercial tools and formats demonstrates the practical value and market demand for vision-to-structured-output tasks. These non-VLM solutions validate that converting visual content (equations, UI designs, music scores) into structured formats (LaTeX, HTML/CSS, LilyPond) has real-world applications beyond academic research, supporting the commercial viability of diagram understanding and code generation tasks.