feat: Update image handling and refine AI prompt instructions

Refactor image data passing in `pdf_convertor.py` to use a direct base64 and mime_type format, aligning with updated API requirements for vision models.

Additionally, the `pdf_convertor_prompt.md` has been significantly refined to improve the clarity and specificity of instructions for the AI model, particularly concerning:

- **Image Content Explanation:** Added detailed rules to ensure the AI only processes existing image references, preserves paths, and focuses on descriptive text.
- **Mathematical Formulas:** Clarified conversion to LaTeX notation.
- **Heading Structure:** Enhanced rules and examples for adjusting heading levels and merging adjacent or duplicate headings to ensure logical document flow.
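
For reference, the two message-part shapes involved in the `pdf_convertor.py` change can be sketched side by side. This is a minimal illustration only: the dictionary keys are copied from the diff, while the helper names `old_image_part` and `new_image_part` and the `mime_type` parameter are hypothetical and do not appear in the repository.

```python
import base64


def old_image_part(png_bytes: bytes) -> dict:
    """Previous shape: the image is embedded as a data URL under "image_url"."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    }


def new_image_part(png_bytes: bytes, mime_type: str = "image/png") -> dict:
    """New shape: the base64 payload and MIME type are separate top-level keys."""
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return {"type": "image", "base64": encoded, "mime_type": mime_type}
```

Both helpers encode the same bytes; only the wrapping differs, so the payload no longer has to be parsed back out of a `data:` URL.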
2025-11-12 18:05:24 +11:00
parent 3b62c0f478
commit e8fa2617ba
2 changed files with 40 additions and 36 deletions
--- a/pdf_convertor.py
+++ b/pdf_convertor.py
@@ -170,10 +170,9 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
        )
        human_message_parts.append(
            {
-                "type": "image_url",
-                "image_url": {
-                    "url": f"data:image/png;base64,{base64.b64encode(images[image_name]).decode('utf-8')}"
-                },
+                "type": "image",
+                "base64": base64.b64encode(images[image_name]).decode("utf-8"),
+                "mime_type": "image/png",
            }
        )

--- a/pdf_convertor_prompt.md
+++ b/pdf_convertor_prompt.md
@@ -8,59 +8,64 @@ Examine the Markdown text and remove any conversion artifacts or strange formatt

 ## 2. Explain Image Content

-Refer to charts, diagrams, and images in the PDF, and add detailed descriptions after image references, so that readers can obtain complete information from the text description even without seeing the image.
+For image references that **already exist** in the original Markdown (format: `![...](...)`), refer to the corresponding charts, diagrams, and images in the PDF, and add detailed descriptions after each image reference.

- Add a blank line after the image reference to control line breaks.
+**Processing rules:**
+
+- Only process existing image references - never create new ones
+- Keep the image path (inside parentheses) completely unchanged
+- You may modify the description text (inside square brackets) to be more descriptive
+- If images exist in the PDF but have no corresponding Markdown reference, ignore them
+- Add a blank line after the image reference for proper formatting

 Example format:

-```
-![Brief image description](./image.png)
+```markdown
+![Brief image description](./images/0.png)

 A detailed explanation of the image, detailed enough to replace the image and help readers understand the content.
 ```

 ## 3. Correct List Formatting

-The conversion process may flatten nested lists. Please analyze the list structure in the PDF and restore the correct multi-level indentation in Markdown.
+The conversion process may flatten nested lists. Analyze the list structure in the PDF and restore the correct multi-level indentation in Markdown.

 ## 4. Correct Mathematical Formulas and Symbols

-Convert plain text formulas to correct formula notation, for example:
+Convert plain text formulas to correct LaTeX notation, for example:

 - `Kmin` should be `$K_{min}$`
-
 - `E = hc/λ` should be `$E = \frac{hc}{\lambda}$`

-## 5. Adjust Heading Structure (Important)
+## 5. Adjust Heading Structure (Critical)

-**This is the most critical task, please pay special attention!**
+**This is the most critical task - please pay special attention!**

 ### 5.1 Core Principles

 - **No content between same-level headings**: If two same-level headings are adjacent, they must be merged or their levels adjusted.
- **Avoid duplicate same-level headings**: Rename identical headings based on the different content of their sub-sections.
+- **Avoid duplicate same-level headings**: Rename identical headings based on their content differences.
 - **Maintain logical clarity**: Heading levels should reflect the organizational structure of the content.

 ### 5.2 Processing Rules

 #### Rule A: Adjacent Same-Level Headings (no content in between)

-When two same-level headings are found to be adjacent with no content in between:
+When two same-level headings are adjacent with no content in between:

- **Case 1**: If the second heading is a supplementary explanation to the first heading → Merge them into a single heading.
- **Case 2**: If the second heading is a sub-topic of the first heading → Demote the second heading to a sub-heading.
+- **Case 1**: If the second heading supplements the first → Merge them into one heading
+- **Case 2**: If the second heading is a sub-topic → Demote it to a lower-level heading

 #### Rule B: Duplicate Same-Level Headings

 When multiple identical headings appear at the same level:

- Add a distinguishing suffix to each heading based on their content differences.
- Or merge them into a single heading with different sub-headings.
+- Add distinguishing suffixes based on their content differences
+- Or merge them with different sub-headings

-### 5.3 Specific Examples
+### 5.3 Examples

-**Incorrect Example 1** (two adjacent same-level headings):
+**Incorrect** (adjacent same-level headings):

 ```markdown
 ## Software Testing
@@ -68,7 +73,7 @@ When multiple identical headings appear at the same level:
 ## Testing Strategies in Object-Oriented Analysis and Design (OOAD)
 ```

-**Correct Modification** (demote the second heading):
+**Correct** (demote the second heading):

 ```markdown
 ## Software Testing
@@ -76,14 +81,14 @@ When multiple identical headings appear at the same level:
 ### Testing Strategies in Object-Oriented Analysis and Design (OOAD)
 ```

-**Incorrect Example 2** (multiple duplicate headings):
+**Incorrect** (duplicate headings):

 ```markdown
 ## Convolutional Neural Networks: Weight Sharing with Multiple Filters

 ## Weight Sharing

-Multiple filters can be applied to detect the spatial distribution of various visual patterns.
+Multiple filters can be applied to detect the spatial distribution of various patterns.

 ## Convolutional Neural Networks: Weight Sharing and Translation Invariance

@@ -92,37 +97,37 @@ Multiple filters can be applied to detect the spatial distribution of various vi
 ## Translation Invariance:
 ```

-**Correct Modification**:
+**Correct**:

 ```markdown
 ## Convolutional Neural Networks

 ### Weight Sharing with Multiple Filters

-Multiple filters can be applied to detect the spatial distribution of various visual patterns.
+Multiple filters can be applied to detect the spatial distribution of various patterns.

 ### Weight Sharing and Translation Invariance

 #### Translation Invariance:
 ```

-### 5.4 Checklist
+### 5.4 Self-Check

-After completing the heading adjustments, please self-check:
+After adjusting headings, verify:

- [ ] Are there still adjacent same-level headings (with no content in between)?
- [ ] Are there still identical same-level headings?
- [ ] Do the heading levels clearly reflect the organizational structure of the content?
+- [ ] No adjacent same-level headings without content in between
+- [ ] No duplicate same-level headings
+- [ ] Heading levels clearly reflect content organization

 ## 6. Translation

-Translate the content into Simplified Chinese. Specialized terms should retain their original English names, for example, `Magnetic resonance imaging` should be translated as `磁共振成像(Magnetic resonance imaging, MRI)`. If a term appears multiple times, the original English name should be included each time.
+Translate the content into Simplified Chinese. Retain original English names for specialized terms in parentheses, e.g., `Magnetic resonance imaging` → `磁共振成像(Magnetic resonance imaging, MRI)`. Include the English name each time a term appears.

 ---

 ## Output Requirements

- Only output the adjusted Markdown text, without any other explanatory text.
- Do not use JSON format.
- Do not add ` ``` ` or ` ```markdown ` at the beginning or end.
- **Before outputting, please specifically check whether rule 5 (heading adjustment) has been fully executed.**
+- Output only the refined Markdown text without explanatory comments
+- Do not use JSON format or wrap output in code blocks (` ``` `)
+- Verify that heading structure (Rule 5) has been properly adjusted
+- Ensure all image references exist in the original input with unchanged paths
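
The first two items of the self-check in section 5.4 of the prompt could, in principle, also be verified mechanically on the model's output. The sketch below is an assumption-laden illustration, not part of this commit: `check_headings` is a hypothetical helper, it only recognizes ATX headings (`#` to `######`), it treats any non-blank line or fenced block as content, and it flags duplicates document-wide rather than per parent section.

```python
import re

# ATX heading: one to six '#', whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def check_headings(markdown: str) -> list[str]:
    """Report adjacent same-level headings and duplicate same-level headings."""
    problems = []
    prev = None      # (level, title) of the last heading if no content followed it
    seen = set()     # every (level, title) pair encountered so far
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            prev = None  # a fenced block counts as content
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if prev is not None and prev[0] == level:
                problems.append(
                    f"adjacent same-level headings: {prev[1]!r} / {title!r}"
                )
            if (level, title) in seen:
                problems.append(f"duplicate same-level heading: {title!r}")
            seen.add((level, title))
            prev = (level, title)
        elif line.strip():
            prev = None  # real content separates headings
    return problems
```

Run against the "Incorrect" example from section 5.3, this reports one adjacent-heading problem; the "Correct" version yields an empty list. The third checklist item (levels reflecting content organization) is a judgment call and is not covered.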