S1-MMAlign

A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure–Text Understanding

15.5M+
Image-Text Pairs
2.5M+
Source Papers
+18.21%
CLIP Score Improvement
9
Disciplines
📐 Mathematics ⚛️ Physics 🧪 Chemistry 🧬 Biology 🔭 Astronomy 🌍 Earth Science 🏥 Medicine ⚙️ Engineering 💻 Computer Science

📊 Dataset Samples

Sample #1
cond-mat/0412357 cond-mat.supr-con
Sample 1 Figure
📜 Paper Title: Theory of proximity effect in normal metal/d_{x²-y²}-wave superconductor interface in the presence of subdominant components of the pair potentials
🤖 AI Recaption (enhanced description generated by Qwen-VL)

This image displays two plots, labeled (a) and (b), showing the angle-resolved local density of states (LDOS) at the interface of a normal metal/d-wave superconductor junction. Both plots have the normalized LDOS, n_D(E,φ,0)/N₀, on the y-axis and the normalized energy, E/Δ_d0, on the x-axis. In both plots, the parameter R=0 and θ=π/6 are specified. Plot (a) shows two curves: a black solid line for φ = π/3 and a red solid line for φ = π/20. Both curves exhibit a dip at E/Δ_d0 = 0. Plot (b) shows two curves: a black solid line for φ = -π/3 and a red solid line for φ = -π/20. The black curve in plot (b) shows a sharp peak at E/Δ_d0 = 0, while the red curve shows a dip at the same energy.

📁 Image Path: images/0705/0412357.tar.gz/fig014.png
📏 Image Size: 1518 × 3073 px

Sample #2
cond-mat/0412357 cond-mat.str-el
Sample 2 Figure
📜 Paper Title: Theory of proximity effect in normal metal/d_{x²-y²}-wave superconductor interface in the presence of subdominant components of the pair potentials
🤖 AI Recaption (enhanced description generated by Qwen-VL)

This image displays two panels, (a) and (b), each showing a plot of the normalized local density of states (LDOS), labeled as N_D(E,0)/N₀, versus the normalized energy, E/Δ_d0, for a normal metal/d-wave superconductor (N/D) junction. Both panels are labeled with T_N/T_d = 0.01 and θ = π/4. In panel (a), three curves are shown: a black solid line for R=0, a red solid line for R=0.25, and a blue solid line for R=0.5. In panel (b), two curves are shown: a red solid line for R=0.75 and a black solid line for R=1. The y-axis ranges from 0 to 4, and the x-axis ranges from -2 to 2.

📁 Image Path: images/0705/0412357.tar.gz/fig004.png
📏 Image Size: 1525 × 3072 px

Sample #3
cond-mat/0412357 cond-mat.supr-con
Sample 3 Figure
📜 Paper Title: Theory of proximity effect in normal metal/d_{x²-y²}-wave superconductor interface in the presence of subdominant components of the pair potentials
🤖 AI Recaption (enhanced description generated by Qwen-VL)

This image displays a set of four plots showing the spatial dependencies of pair potentials in a normal metal/superconductor junction. The plots are arranged in a 2x2 grid. The top-left plot (a) shows the real parts of the pair potentials, Re[Δ_N] and Re[Δ_d], as functions of position x/ξ_d for T_N/T_d = 0.01 and θ = π/6. The solid black line represents Re[Δ_d] for R = 1, and the solid red line represents Re[Δ_N] for R = 0. The top-right plot shows Re[Δ_d] and Re[Δ_N] for different values of R (0.25, 0.5, 0.75) at the same temperature and angle. The bottom-left plot (b) shows the real and imaginary parts of the pair potentials, Re[Δ_s], Im[Δ_s], Re[Δ_d], and Im[Δ_d], as functions of position x/ξ_d for R = 1 (solid lines) and R = 0 (dot lines). The bottom-right plot shows the same quantities for different values of R (0.75 solid, 0.5 dash, 0.25 dot). In all plots, the vertical dashed line indicates the interface position at x = 0.

📁 Image Path: images/0705/0412357.tar.gz/fig010.png
📏 Image Size: 2169 × 1696 px

📦 Data Structure

JSONL Metadata Files

📍 Location: arxiv/jsonl/YY_recaption.jsonl

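  • arxiv_id: the paper's arXiv ID, e.g. "hep-th/0702063" or "0708.1630"
  • title: the paper title
  • image_path: the image path, in the format images/YYMM/paper_id.tar.gz/figNNN.png
  • recaption: 🌟 AI-enhanced figure description generated by the Qwen-VL model, drawing on the paper's abstract and citation context
  • categories: the arXiv category labels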
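
A minimal sketch of reading these metadata records, assuming each line of the JSONL file is one JSON object carrying the five fields listed above. The helper names (`iter_records`, `split_image_path`), the demo record, and the choice to store `categories` as a plain string are illustrative assumptions, not part of the dataset's tooling.

```python
import io
import json
import re

# Field names come from the list above; everything else here is illustrative.
REQUIRED_FIELDS = ("arxiv_id", "title", "image_path", "recaption", "categories")

# Pattern for images/YYMM/paper_id.tar.gz/figNNN.png
# (assumes YYMM is four digits and NNN is a run of digits).
IMAGE_PATH_RE = re.compile(
    r"^images/(?P<yymm>\d{4})/(?P<paper_id>[^/]+)\.tar\.gz/fig(?P<fig>\d+)\.png$"
)


def iter_records(lines):
    """Yield one dict per non-empty JSONL line, checking the documented fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def split_image_path(image_path):
    """Split an image_path into (YYMM, paper_id, figure number)."""
    match = IMAGE_PATH_RE.match(image_path)
    if match is None:
        raise ValueError(f"unexpected image_path: {image_path!r}")
    return match["yymm"], match["paper_id"], int(match["fig"])


# Demo record built from Sample #1; title and recaption are placeholders,
# and the string type of "categories" is an assumption.
demo_line = json.dumps({
    "arxiv_id": "cond-mat/0412357",
    "title": "placeholder title",
    "image_path": "images/0705/0412357.tar.gz/fig014.png",
    "recaption": "placeholder caption",
    "categories": "cond-mat.supr-con",
})

records = list(iter_records(io.StringIO(demo_line + "\n")))
parts = split_image_path(records[0]["image_path"])
```

With a real file, the `io.StringIO` object would be replaced by an open file handle for one of the `arxiv/jsonl/YY_recaption.jsonl` files.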

WebDataset Image Files

📍 Location: arxiv/images_YYYY.tar.gz

  • png: the scientific figure data in PNG format
  • __key__: the unique image identifier, in the format YYMM/paper_id.tar.gz/figNNN
  • __url__: the WebDataset source URL
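
The sketch below shows one way to iterate such an archive using only the Python standard library. It assumes that each archive member is named `<__key__>.png`, which is the usual WebDataset naming convention but is not confirmed by this page. The function names are hypothetical, and the demo archive is built in memory with a placeholder payload so the example runs without downloading anything.

```python
import io
import tarfile

PNG_SUFFIX = ".png"


def build_demo_archive():
    """Create a tiny in-memory .tar.gz with one fake PNG member (demo only)."""
    payload = b"\x89PNG\r\n\x1a\n"  # PNG signature only, not a decodable image
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="0705/0412357.tar.gz/fig014.png")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_png_samples(fileobj):
    """Yield {"__key__", "png"} dicts for every .png member of a tar.gz stream.

    Assumption: the member name minus ".png" equals the sample's __key__.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(PNG_SUFFIX):
                continue
            data = tar.extractfile(member).read()
            yield {"__key__": member.name[: -len(PNG_SUFFIX)], "png": data}


samples = list(iter_png_samples(io.BytesIO(build_demo_archive())))
```

For a real archive, the in-memory buffer would be replaced by a file opened in binary mode. The `__url__` field is normally attached by the loader rather than stored in the archive, so it is left out of this sketch.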

🔗 Matching Rule
JSONL.image_path with the prefix "images/" and the suffix ".png" removed = WebDataset.__key__

Example: images/0705/0412357.tar.gz/fig014.png → 0705/0412357.tar.gz/fig014
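
The matching rule can be expressed as a small helper and used to pair metadata records with image samples. This is a sketch under the rule as stated above; `image_path_to_key` and `join_by_key` are hypothetical names, and the demo records are placeholders shaped like the fields described in this section.

```python
PREFIX = "images/"
SUFFIX = ".png"


def image_path_to_key(image_path):
    """Strip the "images/" prefix and ".png" suffix to get the WebDataset key."""
    if not image_path.startswith(PREFIX) or not image_path.endswith(SUFFIX):
        raise ValueError(f"unexpected image_path: {image_path!r}")
    return image_path[len(PREFIX) : -len(SUFFIX)]


def join_by_key(records, samples):
    """Yield (record, sample) pairs whose derived key equals sample["__key__"]."""
    by_key = {image_path_to_key(r["image_path"]): r for r in records}
    for sample in samples:
        record = by_key.get(sample["__key__"])
        if record is not None:
            yield record, sample


# Placeholder data: one metadata record and one image sample.
demo_records = [
    {
        "image_path": "images/0705/0412357.tar.gz/fig014.png",
        "recaption": "placeholder caption",
    }
]
demo_samples = [{"__key__": "0705/0412357.tar.gz/fig014", "png": b""}]

key = image_path_to_key(demo_records[0]["image_path"])
pairs = list(join_by_key(demo_records, demo_samples))
```

Building the lookup dictionary from the JSONL side keeps the image archive streamable: samples can be read once in order and matched as they arrive.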