Engineering/Electronic Diagrams & CAD Survey

A comprehensive literature review of diagram understanding and CAD systems in machine learning


Quick Comparison Table

| Paper | Year | Domain | Method | Dataset | Benchmark | Key Contribution |
|---|---|---|---|---|---|---|
| See it. Say it. Sorted | 2025 | Flowcharts / Sketch-to-Diagram | Agentic VLM+LLM (Critic-Judge-Generate) | ❌ No | ❌ No | Training-free SVG generation with iterative refinement |
| DeTikZify | 2024 | Scientific Figures / TikZ Graphics | Multimodal LLM (LLaVA/Idefics3) + MCTS Inference | ✅ Yes (DaTikZv2: 360K+, SketchFig, MetaFig) | ✅ Yes (TikZ compilation, visual fidelity) | First sketch/figure-to-TikZ synthesis with MCTS refinement |
| AutomaTikZ | 2024 | Scientific Figures / TikZ Graphics | LLaMA + CLiMA (CLIP-augmented) with LoRA fine-tuning | ✅ Yes (DaTikZ: 120K from arXiv, TeX SE, curated) | ✅ Yes (CLIPScore, BWS human evaluation, code metrics) | First large-scale text-to-TikZ with multimodal CLIP integration |
| SGP-Bench | 2025 | SVG & CAD Programs / LLM Reasoning | LLM evaluation benchmark with Symbolic Instruction Tuning (SIT) | ✅ Yes (1,085 SVG + 2,400 CAD programs; 72K SIT pairs) | ✅ Yes (semantic understanding, SE(2) consistency, SGP-MNIST) | First benchmark for LLM "visual imagination" from symbolic programs; SIT improves general reasoning |
| Draw with Thought | 2025 | Scientific Diagrams / Image-to-mxGraph XML | Training-free MLLM with cognitive CoT (Perceptual→Semantic→XML) | ❌ No | ✅ Yes (Plot2XML: 247 diagrams; CLIP, DINO, FID, human eval) | First training-free cognitively guided framework for scientific diagrams; 89% XML validity; +10-20% semantic alignment vs GPT-4o/Claude/Gemini |
| StarVector | 2024 | SVG Vectorization / Image-to-SVG & Text-to-SVG | Multimodal LLM (CLIP/SigLip + StarCoder) with code generation | ✅ Yes (SVG-Stack: 2.1M samples; 4M text-image-SVG triplets) | ✅ Yes (SVG-Bench: 10 datasets, 3 tasks; DinoScore metric) | First to reframe vectorization as code generation; generates semantic primitives (3-6× more compact); diagram generation with text |
| CircuitSense | 2025 | Circuit Diagrams / Visual-Symbolic Engineering Reasoning | Evaluation benchmark (tests GPT-4o, Claude, Gemini, Qwen, LLaMA-VL) | ❌ No (evaluation-only) | ✅ Yes (hierarchical visual-symbolic tasks; thousands of annotated circuits) | First benchmark unifying visual comprehension and symbolic circuit reasoning; reveals MLLMs fail at topology extraction and engineering analysis |
| MAPS | 2025 | Circuit Analysis / SPICE Simulation + Multi-Modal Reasoning | Physical Perception Model (PPM: CogVLM-17B) + NgSPICE Simulator + Chain-of-Simulation | ✅ Yes (ppm-syn-lprc: 20K synthetic pairs via CircuitTikz/LaTeX) | ✅ Yes (SimpleCircuitEval: 79 college-level LPRC problems) | First to bridge diagram perception and physics simulation via SPICE; 4.3× improvement over GPT-4V (7.6% → 32.9%); uses simulation results as intermediate reasoning steps |

See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

📅 Year: 2025 ✍️ Authors: Hantao Zhang, Jingyang Liu, Ed Li 🏛️ Venue: arXiv 2508.15222
Flowcharts Sketch-to-Diagram Agentic System SVG Generation

Key Insights

  • Training-free agentic system combining VLMs and LLMs for sketch-to-diagram conversion, addressing diffusion models' limitations in spatial precision and symbolic structure
  • Three-component iterative loop: Critic VLM identifies 1-3 key discrepancies, multiple LLMs generate candidate solutions with diverse strategies, Judge VLM selects best result
  • Prioritizes qualitative reasoning over numerical estimates, preserving global constraints (alignment, connectivity) and enabling human-in-the-loop corrections
  • Generates editable SVG programs rather than raster images, enabling extensibility to presentation tools via APIs
  • Open-sourced implementation demonstrates superior performance vs GPT-5 and Gemini-2.5-Pro on flowchart reconstruction

Methods & Approach

  • Critic VLM (Gemini-2.5-Pro): Compares target sketch with current image, provides targeted qualitative modifications focusing on spatial relationships
  • Multi-candidate LLM Generation (Gemini-2.5-Flash): Generates multiple SVG solutions using conservative, moderate, aggressive, alternative, and focused strategies for exploration-exploitation balance
  • Judge VLM (Gemini-2.5-Pro): Selects best candidate or reverts if no improvement; ensures stable convergence (typically 3 iteration steps)
  • SVG Grammar: Supports basic primitives (circles, rectangles, ellipses, triangles) with parameters for position, scale, colors, stroke properties, rotation
  • Constraint Preservation: LLMs maintain alignment and connectivity through instruction-based guidance
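
A minimal sketch of this critic-generate-judge loop (Python; `critic`, `generate_svg`, `judge`, and `render` are hypothetical stand-ins for the Gemini API calls described above, not a released interface):

# Sketch of the iterative critic-generate-judge refinement loop.
STRATEGIES = ["conservative", "moderate", "aggressive", "alternative", "focused"]

def refine_sketch(target_sketch, initial_svg, render, critic, generate_svg, judge, max_iters=3):
    current = initial_svg
    for _ in range(max_iters):
        # 1. Critic VLM: compare the target sketch with the current rendering,
        #    returning 1-3 qualitative discrepancies (spatial relations, connectivity).
        issues = critic(target_sketch, render(current))
        if not issues:
            break  # converged: no remaining discrepancies
        # 2. Multi-candidate LLM generation: one candidate per strategy.
        candidates = [generate_svg(current, issues, strategy=s) for s in STRATEGIES]
        # 3. Judge VLM: pick the best candidate, or keep the current SVG
        #    if nothing improves on it (ensures stable convergence).
        current = judge(target_sketch, [current] + candidates, render)
    return current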

Dataset/Benchmark

No dedicated dataset proposed. Evaluation conducted on 10 flowchart sketches derived from published papers.

Evaluation approach: Qualitative comparison against GPT-5 and Gemini-2.5-Pro baseline systems, focusing on layout reconstruction, structural accuracy, and avoiding unwanted text insertion.

DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

📅 Year: 2024 ✍️ Authors: Jonas Belouadi, Simone Paolo Ponzetto, Steffen Eger 🏛️ Venue: NeurIPS 2024 (Spotlight)
Has Dataset Has Benchmark TikZ Synthesis Scientific Figures Multimodal LLM MCTS

Key Insights

  • First multimodal language model to automatically convert sketches and scientific figures into semantics-preserving TikZ graphics programs, eliminating manual figure recreation
  • Introduces MCTS-based inference algorithm enabling iterative refinement without additional training, allowing extended generation (e.g., 10 min) to produce and score multiple candidates
  • Creates three novel datasets: DaTikZv2 (360K+ TikZ graphics, largest to date), SketchFig (sketch-figure pairs), and MetaFig (diverse scientific figures with metadata)
  • Outperforms commercial systems (GPT-4V, Claude 3) in both automatic and human evaluation of TikZ synthesis quality
  • Generates vector graphics code rather than raster images, enabling editability, scalability, and semantic preservation
  • DeTikZify v2.5-8B incorporates reinforcement learning from self-feedback (RLSF); TikZero enables zero-shot text-conditioned generation

Methods & Approach

  • Model Architecture (v1): Built on LLaVA and AutomaTikZ foundations; accepts visual inputs (sketches/figures) and generates TikZ code
  • Model Architecture (v2): Uses Idefics 3 (8B Llama3-based) as backbone for improved multimodal understanding
  • Training Data: MetaFig and DaTikZv2 datasets plus synthetically generated sketches learned from SketchFig hand-drawn examples
  • MCTS Inference: Adapted from VerMCTS methodology; generates multiple TikZ candidates, validates compilation, scores visual fidelity, selects best variant
  • TikZero Adapters: Parameter-efficient fine-tuning (10B params) enabling zero-shot text-conditioning for guided synthesis
  • Validation Pipeline: Automatic TikZ compilation checking and rasterization for output verification
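
A simplified sketch of the budgeted refinement behind MCTS-style inference (the actual DeTikZify search additionally reuses partial programs across iterations; `generate_candidate`, `compile_tikz`, and `visual_score` are hypothetical helpers):

import time

def synthesize_tikz(figure_image, generate_candidate, compile_tikz, visual_score, budget_seconds=600):
    # Keep generating, compiling, and scoring candidates until the budget runs out.
    best_code, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        code = generate_candidate(figure_image, hint=best_code)
        rendered = compile_tikz(code)  # returns None if compilation fails
        if rendered is None:
            continue  # discard non-compiling programs
        score = visual_score(figure_image, rendered)  # perceptual similarity to input
        if score > best_score:
            best_code, best_score = code, score
    return best_code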

Datasets & Benchmarks

📊 DaTikZv2 Dataset

Size: 360,000+ human-created TikZ graphics programs (largest TikZ dataset to date)

Source: Extracted from academic papers; public release excludes arXiv content due to licensing, but dataset creation scripts provided for reproduction

Versions: DaTikZv2 (NeurIPS 2024), DaTikZv3 (updated)

📊 SketchFig Dataset

Content: Hand-drawn sketches paired with corresponding scientific figures; used to train an Instruct-Pix2Pix-style model for synthetic sketch generation

Purpose: Training sketch-to-TikZ synthesis; synthetic sketch augmentation

📊 MetaFig Dataset

Content: Diverse scientific figures with associated metadata

Purpose: Training multimodal understanding of scientific visualization conventions

🎯 Evaluation Benchmarks

Automatic Metrics: TikZ compilation success rate, visual fidelity scoring

Human Evaluation: Quality assessment comparing against GPT-4V and Claude 3 baselines

Results: DeTikZify outperforms both commercial baselines; MCTS significantly boosts performance

AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

📅 Year: 2024 ✍️ Authors: Jonas Belouadi, Anne Lauscher, Steffen Eger 🏛️ Venue: ICLR 2024
Has Dataset Has Benchmark TikZ Graphics Scientific Figures LLM Fine-tuning Multimodal Vector Graphics

Key Insights

  • First large-scale approach to generate scientific vector graphics via TikZ graphics language, producing editable, scalable figures from text descriptions
  • TikZ provides human-oriented high-level commands that facilitate language modeling compared to low-level SVG paths, enabling complex scientific figures with minimal code
  • Fine-tuned LLaMA models outperform commercial GPT-4 and Claude 2 in both automatic and human evaluation for generating figures similar to human-created ones
  • Introduces CLiMA architecture integrating CLIP's multimodal projection layer with LLaMA using soft prompting, improving text-image alignment and enabling image-conditioned generation
  • Demonstrates that GPT-4 and Claude 2 generate simpler, less complex figures and are susceptible to "typographic attacks" (copying input captions into output images to inflate similarity scores)
  • Models exhibit minimal memorization issues with >80% novelty for n-grams (n>8), generating truly novel outputs rather than copying training data

Methods & Approach

  • Base Model: LLaMA (7B and 13B parameters) chosen for clearly specified training data to avoid test set leakage; fine-tuned with LoRA for efficiency
  • CLiMA Architecture: Augments LLaMA with CLIP ViT-H/14 using multimodal projection layer; connects CLIP output to LLaMA input via soft prompting with feed-forward adapter layer
  • Training Strategy: LoRA applied to all linear layers; 12 epochs with AdamW, batch size 128, learning rate 5e-4; random 50% image/caption swapping for CLiMA data augmentation
  • Iterative Resampling: Novel error correction method that reverses generation to before error line and continues sampling, with exponential backtracking (4^(i-1) lines)
  • Evaluation Metrics: CLIPScore (caption-image), CLIPScore_img (image-image), KID (distribution quality), CrystalBLEU (code similarity), EED (string distance), CSR (compilation sampling rate)
  • Human Evaluation: Best-worst scaling (BWS) with expert annotators for caption similarity and reference similarity; 4-tuples comparison methodology
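
A small sketch of the iterative resampling idea, under stated assumptions: `compile_check` returns the index of the first failing line (or None) and `sample_continuation` resamples code from a truncated prefix; the backtracking window grows as 4^(i-1) lines:

def iterative_resample(lines, compile_check, sample_continuation, max_tries=5):
    for i in range(1, max_tries + 1):
        error_line = compile_check(lines)  # index of first failing line, or None
        if error_line is None:
            return lines  # code compiles: done
        # Back up exponentially further on each attempt: 1, 4, 16, ... lines.
        cut = max(0, error_line - 4 ** (i - 1))
        prefix = lines[:cut]
        lines = prefix + sample_continuation(prefix)  # resample the remainder
    return lines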

Datasets & Benchmarks

📊 DaTikZ Dataset

Size: 120,000 TikZ drawings with captions (first large-scale TikZ dataset)

Sources:

  • ArXiv Papers: 85,656 examples (67.75% augmented) - extracted from scientific papers with TikZ source
  • TeX Stack Exchange: 29,238 examples (51.31% augmented) - Q&A converted to captions via WizardLM
  • Artificial Examples: 3,914 examples (50% augmented) - GPT-4 generated via knowledge distillation
  • Curated Examples: 981 examples (63.2% augmented) - high-quality from community websites

Caption Augmentation: 62.71% of captions with <30 tokens augmented using LLaVAR (5 candidates ranked by CLIPScore); improves CLIPScore from 24.76 → 29.12

Public Release: Excludes arXiv content due to licensing, but provides dataset creation scripts for reproduction

🎯 Evaluation Benchmarks

Test Set: 1,000 human-created examples sampled after December 2022 to avoid data leakage with LLaMA training

Automatic Metrics: CLIPScore, CLIPScore_img, KID, CrystalBLEU, EED, Compilation Sampling Rate

Human Evaluation: Best-worst scaling with expert annotators on 100-item subset; Caption Similarity (CS) and Reference Similarity (RS)

Key Results:

  • CLiMA13b achieves best CLIPScore (26.97) among fine-tuned models, outperforms LLaMA13b on 5/7 metrics
  • GPT-4/Claude 2 show inflated CLIPScore (29.12/27.75) due to caption copying but poor CLIPScore_img (78.63/75.59 vs 81.02 for CLiMA13b)
  • Human evaluation: CLiMA13b has mode >0 for caption similarity; GPT-4 shows uniform distribution for reference similarity
  • Code complexity: Human (916 tokens avg) > CLiMA/LLaMA (420 tokens) > GPT-4/Claude 2 (180 tokens)

Why TikZ > SVG: Semantic & Symbolic Understanding

🎯 Core Advantages of TikZ Over SVG

  1. High-level semantic commands vs. low-level paths: TikZ uses human-oriented commands like \draw (A) -- (B) expressing intent and relationships, while SVG uses primitive coordinates like <path d="M 50,100 L 150,100"/> with no semantic meaning
  2. Explicitly encodes relationships: Spatial relationships (right of=input1), hierarchical structures (node grouping), and connectivity patterns (all-to-all connections via loops) are captured in the code itself
  3. Captures symbolic meaning through abstraction: Concepts like "layers," "groups," and "patterns" are expressed through constructs like \foreach loops that reflect underlying architecture (e.g., neural network layers)
  4. Maintains geometric precision with semantic intent: Commands like \node at (0,0) {Center} encode both position AND the semantic relationship (label belongs to circle's center), not just coordinates
  5. Expresses complex structures concisely: Human TikZ code averages 916 tokens vs. GPT-4's simpler 187 tokens, but TikZ captures far more semantic complexity with commands that align with natural language descriptions

💡 Example: Multi-Layer Perceptron

TikZ captures semantic structure:

% Define neuron layers with semantic meaning
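% (assumes a 'neuron' node style is defined, e.g. \tikzset{neuron/.style={circle, draw}})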
\foreach \i in {1,...,4}
    \node[neuron] (I-\i) at (0,-\i) {Input \#\i};
\foreach \i in {1,...,5}
    \node[neuron] (H-\i) at (2,-\i-0.5) {};
% All-to-all connectivity pattern
\foreach \i in {1,...,4}
    \foreach \j in {1,...,5}
        \draw[->] (I-\i) -- (H-\j);

What this encodes: Layers as conceptual groups, all-to-all connectivity pattern between layers, iteration structure reflecting neural network architecture

SVG equivalent: Would require manually specifying every single line coordinate with no concept of "layers" or "connections" - just raw geometric primitives

🧠 Why This Matters for Language Models

  • Natural language alignment: TikZ commands like \draw, \node, \foreach align with how we describe figures in captions
  • Compositional understanding: Models can learn that "multi-layer perceptron" → layers of nodes → all-to-all connections
  • Fewer tokens, more meaning: Paper shows SVG methods "fail to maintain accurate geometric relations" (p.3) while TikZ creates "complex figures with only a few commands" that preserve semantic relationships

Can Large Language Models Understand Symbolic Graphics Programs?

📅 Year: 2025 ✍️ Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, et al. 🏛️ Venue: ICLR 2025
Has Benchmark Has Dataset SVG Programs CAD Programs LLM Reasoning Visual Imagination Semantic Understanding

Key Insights

  • First benchmark (SGP-Bench) to evaluate LLMs' ability to semantically understand symbolic graphics programs (SVG and CAD) without vision encoders, testing "visual imagination" from code alone
  • Tests semantic understanding via multiple-choice questions on rendered images using only symbolic program input; introduces consistency tests via SE(2) transformations (rotation/translation) that drastically change code but preserve semantics
  • Comprehensive evaluation: 1,085 SVG programs (4,340 questions across 5 types: semantic, color, shape, count, reasoning) + 2,400 CAD programs (3D, 3Dcomplex, 2D subsets from DeepCAD, Fusion360, SketchGraphs)
  • Introduces Symbolic Instruction Tuning (SIT): 72K instruction pairs generated by querying GPT-4o on rendered images; improves Llama-3.1-8B from 46.5% to 51.4% on benchmark AND boosts general reasoning across 15+ benchmarks (GSM8k +3.3%, AGIEval +7.9%)
  • Performance strongly correlates with reasoning ability: Claude 3.5 Sonnet best (67.4% SVG, 74.2% CAD); stronger reasoners (o1, GPT-4) outperform weaker models; shows clear scaling law effects
  • Critical finding: Even GPT-4o achieves only 13% (barely above the 10% chance level) on the SGP-MNIST dataset, where handwritten digits in SVG form are easy for humans but extremely challenging for LLMs, since the programs contain no semantic components

Methods & Approach

  • Benchmark Creation Pipeline: Render symbolic programs → Query GPT-4o for semantic questions on images → Manual inspection → Validation via human study (500 samples, high agreement)
  • SVG Understanding: 1,085 programs across 19 categories (accessory, animal, food, etc.); 4 questions per program testing semantic, color, shape, count, reasoning abilities
  • CAD Understanding: 2,400 programs from 3 datasets with different syntax complexities; domain-specific language (DSL) syntax provided in-context; 1 semantic question per program
  • Consistency Testing: 5 random translations (T) + 5 SE(2) perturbations (rotation+translation) per SVG; measures accuracy and consistency score (frequency of same answer across perturbations)
  • SIT Data Generation: 72K instruction pairs: GPT-4o generates detailed semantic descriptions from rendered images; supports bidirectional use (original: program→description, reverse: description→program)
  • Fine-tuning: Supervised fine-tuning with Orthogonal Fine-Tuning (OFT) on Llama-3.1-8B; also tested with LoRA (slightly worse); follows Alpaca instruction tuning procedure
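
A minimal sketch of the SE(2) consistency protocol, assuming a hypothetical `ask` wrapper that queries an LLM with a (possibly perturbed) SVG program plus a multiple-choice question, and a `rebuild_svg` helper that serializes transformed coordinates back to SVG:

import math, random
from collections import Counter

def se2(points, theta, tx, ty):
    # Rigid-body transform (rotation + translation) of (x, y) points:
    # semantics-preserving visually, but it rewrites every coordinate in the code.
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def consistency_score(svg_points, question, rebuild_svg, ask, n_perturb=5):
    answers = []
    for _ in range(n_perturb):
        theta = random.uniform(0, 2 * math.pi)
        tx, ty = random.uniform(-50, 50), random.uniform(-50, 50)
        answers.append(ask(rebuild_svg(se2(svg_points, theta, tx, ty)), question))
    # Consistency = frequency of the most common answer across perturbations.
    return Counter(answers).most_common(1)[0][1] / n_perturb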

Datasets & Benchmarks

📊 SGP-Bench: SVG Dataset

Size: 1,085 SVG programs with 4,340 questions (4 per program)

Categories: 19 categories including accessory, animal, book, clothing, food, furniture, tools, etc.

Question Types:

  • Semantic (1,085 questions): Global semantic meaning of object
  • Color (864 questions): Color-related questions about specific object parts (tests localization)
  • Shape (1,217 questions): Geometric shapes of object parts
  • Count (819 questions): Counting occurrences of patterns or semantic parts
  • Reasoning (355 questions): Higher-level reasoning about object properties

Difficulty Curve: Color (easiest, ~80% for best models) → Shape → Count → Semantic (hardest, ~37% for Llama3.1-405B)

📊 SGP-Bench: CAD Dataset

Size: 2,400 CAD programs with 2,400 questions (1 per program)

Subsets:

  • 3D (1,000 programs): From DeepCAD dataset
  • 3Dcomplex (700 programs): From Fusion360 Reconstruction Dataset
  • 2D (700 programs): From SketchGraphs dataset

Unique Challenge: Each subset uses different domain-specific language (DSL) syntax; requires in-context learning of syntax rules

🎯 SVG-Invariance Benchmark

Purpose: Test semantic consistency under SE(2) transformations that preserve visual semantics but drastically alter code

Perturbations: 5 translations (T) + 5 rotation+translation (SE(2)) per SVG sample

Key Findings: Most LLMs achieve >80% consistency (half >90%), suggesting fundamental understanding rather than memorization; no correlation between tree edit distance and consistency performance

🎯 SGP-MNIST Dataset

Size: 1,000 symbolic graphics programs (100 per digit 0-9)

Challenge: MNIST-like handwritten digits as SVG programs; no semantic components, only convoluted path trajectories with enclosed loops for "thickness"

Critical Result: Even GPT-4o achieves only 13% accuracy (barely above 10% chance level); demonstrates fundamental gap between LLM program understanding and human visual recognition

Why Symbolic Graphics Programs Test Sophisticated Reasoning

🧠 Required Reasoning Abilities

  1. "Visual Imagination": LLMs must mentally simulate how symbolic operations render visually without seeing pixels
  2. Long-range Sequential Reasoning: Operation order drastically affects semantics; requires tracking procedural generation steps
  3. Fine-grained Grounding: Locating semantic components in program structure demands precise understanding of code-to-visual mapping
  4. Multi-step Compositional Reasoning: E.g., "What is the object primarily used for?" requires: identify object semantically → determine function → answer (an error in any step fails)
  5. Numeric + Spatial + Geometric Understanding: Requires perceiving coordinates, dimensions, transformations, and their visual implications

💡 Example: CAD Reasoning (OpenAI o1)

Task: "How many protruding cylindrical shafts are visible in the CAD object?"

o1's Step-by-Step Process:

  • Step 1: Main shaft created via circle extrusion (radius 0.32, extrude 3.175 units)
  • Step 2: Base ring added (annulus extruded downward, doesn't add shaft)
  • Step 3: Cutting features (removes material, no new shafts)
  • Step 4: Small ring added (doesn't count as shaft)
  • Steps 5-6: Upper shaft created and extended
  • Answer: 2 shafts (main shaft + upper shaft)

Demonstrates: Numeric perception, spatial reasoning, geometric understanding, long-range planning, and common sense

Symbolic Instruction Tuning (SIT) Results

✅ SGP-Bench Performance Gains

Llama-3.1-8B improvement: 46.5% → 51.4% (+4.9%) with 55K SIT pairs

Scaling with data size:

  • 10K pairs: 48.0% (+1.5%)
  • 25K pairs: 50.3% (+3.8%)
  • 40K pairs: 51.2% (+4.7%)
  • 55K pairs: 51.4% (+4.9%)

🚀 General Reasoning Improvements (Llama-3.1-8B)

Notable gains with OI-mixed-SIT (Open-Instruct + original + reverse SIT):

  • Instruction Following: IFEval-inst +5.6%, IFEval-prompt +3.5%
  • Math Reasoning: GSM8k +3.3%, ASDiv +2.8%, Arithmetic +2.0%, MathQA +1.4%
  • General Reasoning: AGIEval +7.9%, BigBenchHard +1.7%, PIQA +0.5%
  • Language Understanding: C-Eval +1.7%, XNLI +1.1%, MMLU +1.2%
  • Reading Comprehension: SQuAD2.0 +2.7%, CoQA +1.2%

Key Insight: Reverse SIT (description→program generation) provides complementary reasoning abilities to original SIT; mixed approach achieves best overall performance

Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

📅 Year: 2025 ✍️ Authors: Not specified in summary 🏛️ Venue: arXiv (Preprint)
Has Benchmark Scientific Diagrams mxGraph XML Cognitive Reasoning Training-Free Chain-of-Thought

Key Insights

  • First training-free framework using cognitively guided chain-of-thought to convert raster scientific diagrams into editable mxGraph XML (draw.io compatible)
  • Addresses critical gap: rasterized diagrams lose symbolic structure (nodes, edges, hierarchy, semantic grouping, layout constraints) making them uneditable and non-reusable
  • Two-stage cognitive pipeline: (1) Coarse-to-Fine Planning via perceptual structuring and semantic layout planning, (2) Structure-Aware Code Generation with multi-round XML refinement
  • Introduces Plot2XML benchmark: 247 real scientific diagrams from actual papers with gold-standard mxGraph XML annotations and 5-dimensional complexity analysis
  • Outperforms GPT-4o, Claude, Gemini, Grok, Qwen2.5-VL, Llama-3.2V by 10-20% semantic alignment and up to 40% visual fidelity on complex diagrams
  • Achieves 89% XML validity rate; ablation shows removing hierarchical XML drops validity to 66%, confirming necessity of structured approach

Methods & Approach

  • Stage I - Perceptual Structuring: MLLM extracts Gestalt grouping, hierarchical decomposition, visual encoding (colors→roles), and connector topology
  • Stage I - Semantic Layout Planning: Produces regions (Input/Output blocks), typed elements (Process, Decision, Entity), and layout constraints (align, connect, layer)
  • Stage II - Initial XML Generation: MLLM generates structured mxGraph code with Y_doc (XML root), Y_style (style dictionary), Y_node (symbolic nodes), Y_layout (positions + alignment), Y_edge (connectors + routing)
  • Multi-Round XML Refinement: Iterative correction of formatting errors, application of XML schema constraints, draw.io validator verification for syntactic correctness and rendering success
  • Cognitive Inspiration: Pipeline mirrors human diagram comprehension: visual perception → semantic understanding → symbolic representation
  • Evaluation Metrics: CLIP (semantic alignment), DINO (visual features), FID (distribution quality), Aesthetic Score, plus human evaluation

Benchmark

🎯 Plot2XML Benchmark

Size: 247 real scientific diagrams from actual research papers

Annotations: Gold-standard mxGraph XML ground truth for each diagram

Unique Features: First benchmark focused on real-world scientific diagrams (vs. synthetic or UI-focused datasets)

Complexity Analysis: 5-dimensional evaluation framework:

  • Connection Complexity: Number and types of edges, routing patterns
  • Graphical Complexity: Number of nodes, shapes, hierarchical levels
  • Color Usage: Color encoding schemes for semantic meaning
  • Text Density: Labels, annotations, captions
  • Special Elements: Mathematical notation, domain-specific symbols

Gap Addressed: Previous datasets (SVG, TikZ, UI-to-code) lacked rich relational structure required for scientific diagrams

🎯 Evaluation Results

Performance Gains vs. Best MLLMs:

  • Semantic Alignment (CLIP): +10-20% improvement
  • Visual Fidelity (DINO): Up to +40% on complex diagrams
  • XML Validity Rate: 89% (vs. 66% without hierarchical structure)
  • Human Evaluation: Preferred over GPT-4o, Claude, Gemini, Grok, Qwen2.5-VL, Llama-3.2V

Ablation Study Critical Findings:

  • Removing perceptual structuring: ↓ semantic alignment & validity
  • Removing layout planning: Major performance drop
  • Removing hierarchical XML: Validity drops from 89% → 66%

Why mxGraph XML for Scientific Diagrams

🎯 Advantages of mxGraph Over Raster/SVG/TikZ

  1. Preserves Symbolic Structure: Nodes, edges, hierarchy, semantic grouping, and layout constraints are explicitly encoded
  2. Editable & Reusable: Compatible with draw.io, enabling direct editing, modification, and reuse in presentations/papers
  3. Semantically Meaningful: Unlike SVG paths, mxGraph represents diagrams as structured graphs with typed elements (Process, Decision, Entity)
  4. Layout Constraints: Explicit encoding of alignment, connection routing, and z-layering rules
  5. Renderability Guarantee: XML can be validated and rendered programmatically via draw.io engine

💡 Problems with Existing Approaches

  • SVG: Low-level shapes with poor semantic meaning, cannot express rich relational structures
  • TikZ: Powerful but hard to parse, non-standardized, limited tool support
  • Python Plot Code: Domain-specific, cannot express general diagram structures (flowcharts, architecture diagrams)
  • Raw MLLMs: Struggle with structural accuracy, layout consistency, long XML generation, executable formatting

Cognitive Reasoning Pipeline (Algorithm 1)

📋 Stage I: Coarse-to-Fine Planning

Step 1 - Perceptual Structuring:

  • Extract: Gestalt grouping, hierarchical primitives, visual encodings (color/shape→meaning), connector topology
  • Output: T_percept = (Gestalt, Hierarchy, Encoding, Connectors)

Step 2 - Semantic Specification:

  • Generate: R (Regions: Input/Output/Modules), E (Typed elements: Process/Entity/Block/Arrow), L (Layout constraints: align, connect, layer)
  • Output: T_hierarchy = (R, E, L), the semantic blueprint

🔧 Stage II: Structure-Aware Code Generation

Step 3 - Initial XML Generation:

  • Generate: Y_doc (XML root), Y_style (style dictionary), Y_node (symbolic nodes), Y_layout (positions + alignment), Y_edge (connectors + routing)

Step 4 - Multi-Round XML Refinement:

  • Iteratively correct: Missing tags, mis-nested elements, schema violations
  • Verify via draw.io validator: Syntactic correctness, rendering success, structural consistency
  • Early stopping once valid XML is achieved
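
A compact sketch of this refinement loop under stated assumptions: `generate_xml` and `repair_xml` are hypothetical MLLM wrappers, and validation is approximated with Python's standard XML parser rather than the draw.io validator the paper uses:

import xml.etree.ElementTree as ET

def valid_mxgraph(xml_text):
    # Minimal well-formedness + root-tag check standing in for draw.io validation.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == "mxGraphModel"

def generate_diagram_xml(blueprint, generate_xml, repair_xml, max_rounds=4):
    xml_text = generate_xml(blueprint)  # initial structure-aware generation
    for _ in range(max_rounds):
        if valid_mxgraph(xml_text):
            return xml_text  # early stopping once valid XML is achieved
        # Ask the model to fix missing tags, mis-nested elements, schema violations.
        xml_text = repair_xml(blueprint, xml_text)
    return xml_text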

StarVector: Generating Scalable Vector Graphics Code from Images and Text

📅 Year: 2024 ✍️ Authors: Juan A. Rodriguez et al. (ServiceNow Research, Mila) 🏛️ Venue: AAAI 2025, CVPR 2025 📄 arXiv: 2312.11556
Has Dataset Has Benchmark SVG Generation Image-to-SVG Text-to-SVG Multimodal LLM Diagram Generation Code Generation

Key Insights

  • Reframes image vectorization as code generation rather than pixel reconstruction: treats SVG generation as autoregressive language modeling to generate semantic primitives (circles, rectangles, text) instead of decomposing everything into path curves
  • First multimodal LLM for SVG generation handling both image-to-SVG and text-to-SVG with proper semantic understanding of primitives
  • Achieves 3-6× more compact code than baselines while maintaining state-of-the-art quality (DinoScore 0.966 vs 0.939-0.992 for traditional methods)
  • Generates editable, human-readable SVG with proper primitives: a circle becomes <circle/> (1 token) not 50 Bézier curves (500+ tokens)
  • Unique capability: generates structured diagrams with text primitives and layouts (flowcharts, UI mockups) - impossible for path-only vectorization tools
  • Introduces SVG-Stack (2M samples) and SVG-Bench (10 datasets, 3 tasks); proposes DinoScore metric with strong human correlation (ρ=0.62-0.76 vs ρ=0.06 for MSE)

Methods & Approach

  • Vision Encoder: StarVector-1B uses CLIP ViT-B/32 (224×224, 257 tokens); StarVector-8B uses SigLip (384×384, 576 tokens) for enhanced visual understanding
  • Adapter Layer: Non-linear projection h_v = LayerNorm(W_L · Swish(W_h · z_v)) bridges vision features to the language model embedding space
  • Language Model: StarCoder-1B (8K context) or StarCoder2-7B (16K context) for autoregressive SVG code generation
  • Training Data: SVG-Stack (2.1M training samples) with aggressive augmentation (rotation, scale, color jitter, Perlin noise on Bézier curves); 4M text-image-SVG triplets with BLIP2/LLaVA captions filtered by CLIP Score ≥ 30
  • Training Configuration: StarVector-1B trained 7 days on 8× A100 (batch 128); StarVector-8B trained 10 days on 64× H100 (batch 512); both use AdamW with lr=1e-5
  • Inference Strategy: Sample multiple candidates at temperatures [0.0, 0.25, 0.5, 0.75, 1.0] with length_penalty=1.2; rerank by DinoScore for best perceptual quality
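
The adapter formula translates almost directly into code; a minimal PyTorch rendering (the dimension values below are assumptions, not the released configuration):

import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    # h_v = LayerNorm(W_L · Swish(W_h · z_v)); hidden sizes are placeholders.
    def __init__(self, vision_dim=768, hidden_dim=3072, llm_dim=2048):
        super().__init__()
        self.w_h = nn.Linear(vision_dim, hidden_dim)  # W_h
        self.w_l = nn.Linear(hidden_dim, llm_dim)     # W_L
        self.act = nn.SiLU()                          # Swish activation
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, z_v):  # z_v: (batch, tokens, vision_dim)
        return self.norm(self.w_l(self.act(self.w_h(z_v))))

# Project CLIP/SigLip patch tokens into the code-LLM embedding space.
tokens = torch.randn(2, 257, 768)
print(VisionAdapter()(tokens).shape)  # torch.Size([2, 257, 2048])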

Datasets & Benchmarks

📊 SVG-Stack Dataset

Size: 2.1M training + 108K validation + 5.7K test samples

Source: TheStack dataset with deduplication, preprocessing, and rasterization validation via CairoSVG

Statistics: Average length 1,822 ± 1,808 tokens; diverse primitives (icons, emojis, fonts, diagrams)

Captions: 4M text-image-SVG triplets generated via BLIP2 + LLaVA with CLIP Score filtering (threshold=30)

🎯 SVG-Bench Benchmark

Coverage: 10 datasets across 3 tasks

Tasks:

  • Image-to-SVG: SVG-Stack, SVG-Fonts, SVG-Icons, SVG-Emoji (general vectorization quality)
  • Text-to-SVG: SVG-Stack-Text, SVG-Fonts-Text, SVG-Icons-Text, SVG-Emoji-Text (multimodal generation)
  • Diagram Generation: Diagram-FlowChart, Diagram-GraphPlot (structured graphics with text primitives)

Evaluation Metrics:

  • DinoScore: L2 distance in DinoV2 feature space (strong human correlation ρ=0.62-0.76)
  • Token Length: Code compactness indicator (semantic understanding proxy)
  • Traditional Metrics: LPIPS, SSIM, MSE (shown to correlate poorly with human judgment)
  • Text-to-SVG: FID, FID-CLIP, CLIP Score for distribution similarity and text-image alignment
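
A hedged sketch of a DinoScore-style metric, assuming the public DINOv2 torch.hub entry point (the benchmark describes an L2 distance in DINOv2 feature space; a normalized similarity is shown here as a close stand-in, and the paper's exact preprocessing may differ):

import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def dino_similarity(img_a, img_b):
    # img_*: float tensors of shape (1, 3, 224, 224), ImageNet-normalized.
    feat_a = model(img_a)  # (1, embed_dim) global features
    feat_b = model(img_b)
    return F.cosine_similarity(feat_a, feat_b).item()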

Key Results:

  • StarVector-8B: DinoScore 0.966, 5.3K tokens (vs ground truth 2.8K) in 74s
  • LIVE baseline: DinoScore 0.939, 18.5K tokens in 1,412s (19× slower, 3.5× more bloated)
  • AutoTrace/VTracer: DinoScore 0.988/0.975, 6.4K/8.1K tokens, path-only representation
  • Diagram generation: 80% human preference vs 12% for LIVE (only StarVector generates valid text primitives)
  • Text-to-SVG: FID 25.8, FID-CLIP 4.6, CLIP Score 0.781 (outperforms GPT-4V and CodeLlama-34B)

Why Code Generation > Pixel Reconstruction for Vectorization

🎯 Paradigm Shift: Vectorization as Code Generation

Traditional Approach (Inverse Rendering): Edge detection → contour tracing → curve fitting → path primitives

StarVector Approach (Code Generation): Visual understanding → semantic interpretation → SVG code synthesis

Key Advantage: Language models learn "these pixel patterns correspond to circle primitives" rather than applying fixed heuristics

💡 Example: Circle Representation

Traditional Vectorization (AutoTrace/VTracer):

<!-- 50+ Bézier curve segments -->
<path d="M 100,50 C 100,22.4 77.6,0 50,0 C 22.4,0 0,22.4 0,50
         C 0,77.6 22.4,100 50,100 C 77.6,100 100,77.6 100,50 Z"/>
<!-- ...49 more path segments... -->
<!-- Result: ~500 tokens, uneditable -->

StarVector Output:

<circle cx="50" cy="50" r="50"/>
<!-- Result: ~1 token, editable, 50× more compact -->

Impact: Semantic understanding enables compact, human-readable, editable SVG code

Human Evaluation Results

✅ Image-to-SVG Quality (1,948 assessments, 30 participants)

Question: "Which SVG better represents the input image?"

StarVector-8B Win Rate:

  • vs. LIVE: 68% wins
  • vs. AutoTrace: 74% wins
  • vs. VTracer: 71% wins

Diagram Generation Task:

  • vs. LIVE: 89% wins (diagrams with readable text)
  • vs. AutoTrace: 92% wins

Critical Finding: Only StarVector generates valid diagrams with text and structured elements - baseline methods produce unreadable path approximations

CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

📅 Year: 2025 ✍️ Authors: Not specified 🏛️ Venue: arXiv (Preprint)
Has Benchmark Circuit Diagrams Symbolic Reasoning Engineering Design Hierarchical Understanding

Key Insights

  • First benchmark integrating visual circuit understanding and symbolic engineering reasoning, revealing major limitations in current multimodal LLMs
  • Addresses critical gap: existing models excel at visual component recognition but fail at symbolic reasoning (applying Ohm's law, KCL/KVL, functional inference)
  • Hierarchical evaluation spanning visual primitives → connectivity topology → functional modules → system-level behavior
  • Comprehensive task taxonomy covering perception (component detection, layout recognition), symbolic computation (circuit equations, parameter inference), and system-level analysis (fault detection, design modification)
  • Benchmark reveals MLLMs perform well on visual captioning but show large performance gaps vs. humans on hierarchical topology extraction and multi-step analytical problem solving

Benchmark Structure

  • Visual Tasks: Component detection, connection reading, circuit graph reconstruction
  • Symbolic & Analytical Tasks: Solving circuit equations, parameter inference from diagrams, module classification
  • System-Level Reasoning: Circuit diagnosis, design modification impact analysis, step-by-step engineering reasoning
  • Hierarchical Levels: Primitive symbols → Subcircuits → Functional blocks → Complete system diagrams
  • Models Evaluated: GPT-4o, Claude 3.5/3.7, Qwen series, Gemini, LLaMA-based VL models

Benchmark

🎯 CircuitSense Benchmark

Type: Evaluation-only benchmark (no training dataset provided)

Content: Thousands of real and synthetic circuit diagrams with ground truth annotations

Annotations Include:

  • Node/edge connectivity graphs
  • Component attributes and parameters
  • Symbolic circuit equations
  • Engineering reasoning steps
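
To make the annotation format concrete, a minimal sketch of a node/edge connectivity representation for a circuit (the layout below is illustrative, not CircuitSense's released schema):

from dataclasses import dataclass, field

@dataclass
class Component:
    name: str     # e.g. "R1"
    kind: str     # "resistor", "capacitor", "vsource", ...
    value: str    # e.g. "5k"
    nodes: tuple  # net names the terminals attach to, e.g. ("n1", "n2")

@dataclass
class CircuitGraph:
    components: list = field(default_factory=list)

    def nets(self):
        # All electrical nets (graph nodes) referenced by any component.
        return sorted({n for c in self.components for n in c.nodes})

g = CircuitGraph([
    Component("V1", "vsource", "10", ("n1", "0")),
    Component("R1", "resistor", "5k", ("n1", "n2")),
    Component("R2", "resistor", "10k", ("n2", "0")),
])
print(g.nets())  # ['0', 'n1', 'n2']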

Key Finding: All current MLLMs show significant failures in hierarchical topology extraction, symbolic circuit reasoning, and multi-step analytical problem solving despite good performance on basic visual tasks

🎯 Benchmark Contribution

Novel Aspects:

  • First to unify visual perception and symbolic engineering reasoning in a single evaluation framework
  • Comprehensive task taxonomy spanning low-level vision → mid-level graph reasoning → high-level analytical tasks
  • Diagnostic analysis showing where and why MLLMs fail on engineering-domain reasoning
  • Pathway for developing engineering-capable AI systems beyond basic diagram captioning

MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science

📅 Year: 2025 ✍️ Authors: THU-COAI 🏛️ Venue: ICLR 2025
Has Benchmark Circuit Analysis SPICE Simulation LaTeX/CircuitTikz Physical Perception Chain-of-Simulation

Key Insights

  • Decomposes expert-level physical science reasoning into: Physical Perception Model (PPM) for diagram understanding + Physics Simulator for symbolic reasoning
  • Uses SPICE (Nagel, 1975) as simulation language to bridge visual diagrams and numerical reasoning - enables direct access to fundamental circuit structure
  • Achieves 4.3× improvement over GPT-4V on college-level circuits: 7.6% → 32.9% accuracy on SimpleCircuitEval benchmark
  • Chain-of-Simulation process: PPM generates SPICE netlist → NgSPICE simulator executes → numerical results guide MLLM final reasoning
  • Ablation shows high simulator dependency: accuracy drops from 55% to 15% without simulation results, even with Python-based reasoning attempts
  • A simulation-language description alone reduces hallucination by 10% on non-simulatable problems, where the netlist functions as a scene-graph-like structured description

🔧 SPICE Simulator: The Physical Reasoning Engine

What is SPICE?

SPICE (Simulation Program with Integrated Circuit Emphasis, Nagel 1975) is a general-purpose circuit simulation program that:

  • Analyzes electronic circuits numerically by solving differential equations
  • Industry standard for circuit simulation (used in IC design, education, research)
  • Provides DC, AC, transient, and sensitivity analysis
  • In MAPS: Uses NgSPICE as the execution engine

🔄 The Simulation Pipeline

1. INPUT: Circuit Diagram Image
   [Image with resistors, capacitors, voltage sources]
        ↓
2. PPM (Physical Perception Model)
   CogVLM-17B fine-tuned on 20K synthetic pairs
   Outputs: SPICE Netlist (Simulation Language)
        ↓
3. SPICE NETLIST EXAMPLE:
   * Voltage Divider Circuit
   V1 n1 0 DC 10V
   R1 n1 n2 5k
   R2 n2 0 10k
   .op
   .end
        ↓
4. NgSPICE SIMULATOR EXECUTION
   Solves Kirchhoff's laws, component equations
        ↓
5. SIMULATION RESULTS (Observations):
   V(n1) = 10.000V
   V(n2) = 6.667V    ← Key intermediate value
   I(V1) = -0.667mA
        ↓
6. MLLM FINAL REASONING
   Uses simulation results as context
   "Based on simulation: V(n2) = 6.667V
    Voltage across R2 = 6.667V - 0V = 6.667V"
        ↓
7. OUTPUT: Final Answer with Explanation
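
A small sketch of step 6, folding simulator observations into the final reasoning prompt (the actual MAPS prompt template is not reproduced here; this format is illustrative only):

def build_final_prompt(question, netlist, observations):
    # Simulation results become intermediate reasoning context, not rewards.
    obs_lines = "\n".join(f"  {k} = {v}" for k, v in observations.items())
    return (
        "You are analyzing a circuit. The diagram was converted to this SPICE netlist:\n"
        f"{netlist}\n"
        f"NgSPICE simulation results:\n{obs_lines}\n"
        f"Using these numerical facts, answer: {question}"
    )

print(build_final_prompt(
    "What is the voltage across R2?",
    "V1 n1 0 DC 10V\nR1 n1 n2 5k\nR2 n2 0 10k\n.op\n.end",
    {"V(n1)": "10.000V", "V(n2)": "6.667V", "I(V1)": "-0.667mA"},
))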

⚡ How Simulation Results Guide Reasoning

NOT Traditional RL: MAPS doesn't use SPICE results as RL rewards during inference. Instead, simulation results are intermediate reasoning steps:

  • Single-shot inference: Simulator runs once per question (not iterative policy learning)
  • Observations as context: Voltages/currents fed to MLLM alongside question and diagram
  • Grounded reasoning: LLM explains answer using numerical facts from physics engine
  • Validation: Results checked for validity before proceeding to final answer

Training Signals (Reward-like): During PPM training, three accuracy metrics act as supervision:

  • Component Quantity Accuracy (ACC_CQ): Correct component count → binary reward
  • Component Value Accuracy (ACC_CV): Correct resistances/capacitances → per-component reward
  • Simulation Accuracy (ACC_sim): ⭐ Key metric - do generated and reference SPICE produce the same results?

📊 SPICE as "Tool Use" Framework

MAPS is better understood as multimodal tool use than RL:

| Aspect | Traditional RL | MAPS Approach |
|---|---|---|
| Interaction | Iterative policy improvement | Single-shot tool call |
| Feedback | Reward signal for action quality | Observations as reasoning context |
| Goal | Optimize cumulative reward | Answer question accurately |
| Exploration | Explore action space | No exploration (deterministic) |

πŸ“ LaTeX/CircuitTikz: Synthetic Data Generation Engine

Why LaTeX/CircuitTikz for Training Data?

MAPS uses CircuitTikz (LaTeX package) to generate 20,000 synthetic training pairs for the PPM. This choice is critical:

🔑 Key Advantage: One Source → Two Perfect Outputs

SINGLE CircuitTikz CODE:
\begin{circuitikz}
  \draw (0,0) to[V, v=$10V$] (0,3)
        to[R=$5k\Omega$] (3,3)
        to[R=$10k\Omega$] (3,0)
        to (0,0);
\end{circuitikz}

           ↙                    ↘
OUTPUT 1:              OUTPUT 2:
[PNG Image]              SPICE Netlist
Circuit diagram          V1 n1 0 DC 10
with visual layout       R1 n1 n2 5k
                         R2 n2 0 10k
                         .op .end

         ↘                    ↙
        TRAINING PAIR
    (Image, SPICE) - Perfect Alignment!

🔧 Synthetic Data Generation Pipeline

  1. Sample Circuit Parameters from hierarchical distribution:
    • Component counts (2-10 components)
    • Topology types (series, parallel, series-parallel, bridge, ladder)
    • Component values (resistances: 1k-10k, capacitances: 1pF-100μF)
    • Voltage sources (3V, 5V, 9V, 12V, etc.)
  2. Generate CircuitTikz LaTeX Code programmatically
  3. Compile to PDF/PNG using pdflatex + ImageMagick
  4. Parse CircuitTikz → SPICE Netlist (structured syntax makes this straightforward)
  5. Verify with NgSPICE - ensure valid circuit (reject invalid samples)
  6. Add to Dataset: ppm-syn-lprc (20K pairs)

Result: 20,000 perfectly aligned (diagram image, SPICE netlist) pairs with zero human annotation!
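
A condensed sketch of this loop under stated assumptions: `sample_circuit`, `tikz_to_spice`, and `render_tikz` are hypothetical helpers wrapping the steps above; only the NgSPICE batch-mode invocation (`ngspice -b`) is a real CLI call:

import os, random, subprocess, tempfile

def spice_is_valid(netlist):
    # Step 5: reject circuits NgSPICE cannot simulate.
    with tempfile.NamedTemporaryFile("w", suffix=".cir", delete=False) as f:
        f.write(netlist)
        path = f.name
    try:
        result = subprocess.run(["ngspice", "-b", path], capture_output=True, timeout=30)
        return result.returncode == 0
    finally:
        os.unlink(path)

def sample_pair(sample_circuit, tikz_to_spice, render_tikz):
    # Steps 1-2: sample parameters, then generate CircuitTikz code.
    tikz = sample_circuit(
        n_components=random.randint(2, 10),
        topology=random.choice(["series", "parallel", "series-parallel", "bridge", "ladder"]),
    )
    netlist = tikz_to_spice(tikz)  # Step 4: structured syntax -> netlist
    if not spice_is_valid(netlist):
        return None                # rejected sample
    return render_tikz(tikz), netlist  # Steps 3 + 6: aligned (PNG image, SPICE) pair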

✅ Why LaTeX > Other Approaches

| Method | Problem |
|---|---|
| Hand-draw circuits | ❌ Can't auto-extract SPICE; manual annotation needed |
| CAD tools (EAGLE, KiCad) | ❌ Complex file formats; overkill for simple circuits |
| Direct image synthesis | ❌ Hard to ensure image ↔ SPICE correspondence |
| Screenshot textbooks | ❌ Copyright; limited diversity; no labels |
| Random pixel generation | ❌ Unrealistic; domain shift from real problems |
| CircuitTikz/LaTeX | ✅ Programmatic, scalable, perfect alignment, realistic appearance |

📚 Hierarchical Complexity Control

LaTeX code enables curriculum learning by controlling circuit complexity:

  • Simple (2-3 components): Series/parallel resistor networks
  • Medium (4-6 components): Series-parallel combinations, voltage dividers
  • Hard (7-10 components): Bridge circuits, ladder networks, mesh analysis

Key Benefit: Training data distribution matches real college-level problem difficulty, minimizing domain gap!

Architecture & Training

  • Physical Perception Model (PPM): CogVLM-17B base, fine-tuned with LoRA (rank 50, lr 1e-5, batch 32); a hedged configuration sketch follows this list
  • Training Objective: Maximum-likelihood training, minimizing the negative log-likelihood of the reference SPICE netlist given the diagram
  • Synthetic Dataset: ppm-syn-lprc (20,000 paired circuit diagrams + SPICE netlists)
  • Data Source: Hierarchical random sampling → CircuitTikz/LaTeX → PNG + SPICE pairs
  • Evaluation Metrics: Component Quantity/Value Accuracy, Simulation Accuracy (cosine similarity of simulation results)
  • Inference: PPM generates SPICE → MLLM refines with text → NgSPICE executes → results guide final reasoning
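
For concreteness, a hedged sketch of such a LoRA setup using Hugging Face PEFT (the base checkpoint, lora_alpha, and target_modules below are placeholder assumptions, not the paper's exact configuration):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",  # placeholder CogVLM checkpoint
    trust_remote_code=True,
)
lora = LoraConfig(
    r=50,                  # rank reported in the paper
    lora_alpha=100,        # assumption: 2*r scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
# Train with lr=1e-5, batch size 32, minimizing the negative log-likelihood
# of the reference SPICE netlist given the diagram image.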

Results

Performance on SimpleCircuitEval (79 college-level LPRC problems):

| Model | Direct Prompting | + MAPS | Improvement |
|---|---|---|---|
| GPT-4V | 7.6% | 32.9% | +333% (4.3×) |
| Claude-3.5 | 12.7% | 38.0% | +199% (3.0×) |
| GLM-4V | 5.1% | 29.1% | +471% (5.7×) |

πŸ” Ablation Study Key Finding:

  • With simulation results: 55% accuracy
  • Without simulation (Python reasoning): 15% accuracy
  • Conclusion: Physics simulator is critical - symbolic reasoning alone insufficient

💡 Why Such Huge Gains?

  • LLMs are poor at circuit math - simulators provide accurate numerical values
  • SPICE netlist reduces hallucination by constraining to valid circuit states
  • Structured representation (netlist) eliminates ambiguity in diagram interpretation
  • Simulation results serve as intermediate reasoning steps, not just final answers

Benchmark

🎯 SimpleCircuitEval Benchmark

Type: Evaluation benchmark for circuit analysis reasoning

Content: 79 college-level Linear Pure Resistive Circuit (LPRC) problems from 4 textbook chapters

Task Categories: Voltage/current calculation, power analysis, circuit reduction, Kirchhoff's laws

🎯 Training Dataset: ppm-syn-lprc

Size: 20,000 synthetic circuit diagram + SPICE netlist pairs

Generation: Automated via CircuitTikz/LaTeX → PNG + SPICE parsing

Validation: All circuits verified with NgSPICE before inclusion

Diversity: Hierarchical sampling across complexity levels and topology types

📊 Current Evaluation Benchmarks for Diagrams & Circuits

Comprehensive benchmarks for evaluating multimodal models on diagram understanding, circuit analysis, and technical graphics

AMSBENCH

Focus: Analog and Mixed-Signal (AMS) Circuit Understanding

Tasks: Circuit recognition, component identification, topology analysis, function prediction

Scope: Comprehensive evaluation of MLLM capabilities on AMS circuit diagrams

EEE-Bench

Focus: Electrical & Electronics Engineering Multimodal Understanding

Tasks: Circuit analysis, component recognition, schematic interpretation, engineering problem solving

Scope: Comprehensive benchmark spanning multiple EEE domains and diagram types

StarVector (SVG-Bench)

Focus: SVG Generation & Vectorization Quality

Tasks: Image-to-SVG, text-to-SVG, diagram generation (flowcharts, graphs)

Scope: 10 datasets across 3 tasks; DinoScore metric for perceptual quality

SGP-Bench

Focus: Symbolic Graphics Program Understanding (SVG & CAD)

Tasks: Semantic understanding, visual imagination, SE(2) consistency, reasoning

Scope: 1,085 SVG + 2,400 CAD programs; tests LLM "visual imagination" from code

ElectroVizQA

Focus: Electronics Diagram Visual Question Answering

Tasks: Component identification, circuit analysis, diagram comprehension, QA tasks

Scope: Specialized benchmark for electronics diagram understanding and reasoning

MMCircuitEval

Focus: Multimodal Circuit Evaluation & Analysis

Tasks: Circuit understanding, component recognition, topology analysis, multimodal reasoning

Scope: Comprehensive evaluation framework for circuit diagram comprehension

CircuitSense

Focus: Hierarchical Circuit System Understanding with Visual-Symbolic Integration

Tasks: Visual comprehension (component detection, topology), symbolic reasoning (circuit laws, parameter inference), system-level analysis (diagnosis, design modification)

Scope: First benchmark bridging visual perception and symbolic engineering reasoning across hierarchical circuit levels (primitives → subcircuits → functional blocks → systems). Evaluation-only (no training dataset).

🎯 Benchmark Coverage Summary

Circuit Diagrams

AMSBENCH, EEE-Bench, ElectroVizQA, MMCircuitEval, CircuitSense

SVG Generation

StarVector (SVG-Bench)

Program Understanding

SGP-Bench (SVG & CAD)

πŸ› οΈ Current Commercial Tools

Commercial non-VLM software demonstrating the practical value and market demand for vision-to-structured-output tasks

Mathpix

Function: Mathematical Equation Recognition and LaTeX Conversion

Capabilities: Converts images of equations (handwritten or printed) to LaTeX code, supports complex mathematical notation, integrates with various document editors

Use Case: Enables researchers, students, and academics to quickly digitize mathematical content from papers, textbooks, and handwritten notes into editable LaTeX format

Codia AI

Function: Image to HTML/CSS Code Converter

Capabilities: Converts design mockups, screenshots, and UI images into production-ready HTML and CSS code. Two-step process: Image → Figma design → Clean, semantic HTML/CSS. Supports responsive layouts, SEO best practices, and accessibility features.

Use Case: Bridges the gap between design and implementation for web developers, enabling rapid prototyping and pixel-perfect conversion of visual designs to functional web code

LilyPond

Function: Text-Based Music Notation and Sheet Music Generation

Capabilities: Compiles text-based music notation code into professional-quality sheet music (PDF, PNG, SVG). Similar to LaTeX for documents, LilyPond uses plain text syntax to define musical elements (notes, rhythms, dynamics, articulations). The inverse task - Optical Music Recognition (OMR) converting sheet music images to LilyPond code - represents another vision-to-structured-code challenge.

Use Case: Enables composers, music educators, and publishers to create beautifully engraved sheet music from text-based notation. The potential for automated conversion of printed/handwritten music scores to editable LilyPond format demonstrates the value of vision-to-code tasks in the music domain.

💡 Key Insight

The existence of these commercial tools and formats demonstrates the practical value and market demand for vision-to-structured-output tasks. These non-VLM solutions validate that converting visual content (equations, UI designs, music scores) into structured formats (LaTeX, HTML/CSS, LilyPond) has real-world applications beyond academic research, supporting the commercial viability of diagram understanding and code generation tasks.