feat: Improve content refinement with SystemMessage and prompt updates
This commit refactors the content refinement process to leverage `SystemMessage` for the primary prompt, enhancing clarity and adherence to LLM best practices.

The `pdf_convertor.py` file was updated to:
- Import `SystemMessage` from `langchain_core.messages`.
- Modify the `refine_content` function to use `SystemMessage` for the main prompt, moving the prompt content from `human_message_parts`.
- Adjust `human_message_parts` to only contain the Markdown and image data for the `HumanMessage`.

The `pdf_convertor_prompt.md` file was updated to:
- Reformat the prompt with clearer headings and instructions for each task.
- Improve the clarity and conciseness of the instructions for cleaning up characters, explaining image content, and correcting list formatting.

Additionally, `.gitignore` was updated to include `.vscode/` to prevent IDE-specific files from being committed.

These changes improve the structure of the LLM interaction and make the prompt more readable and maintainable.
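The message layout described above can be sketched roughly as follows. This is a minimal illustration, not the code from `pdf_convertor.py`: the helper names `build_human_parts` and `build_messages`, the `image_url` part layout, and the `image/png` MIME type are all assumptions made for the example. Only `SystemMessage`, `HumanMessage`, and the `{"type": "text", ...}` part shape come from the diff.

```python
import base64


def build_human_parts(md: str, images: dict[str, bytes]) -> list[dict]:
    """Collect the Markdown and image data as HumanMessage content parts.

    Hypothetical helper: the prompt is deliberately NOT included here,
    because after this commit it travels in the SystemMessage instead.
    """
    parts: list[dict] = [{"type": "text", "text": md}]
    for data in images.values():
        encoded = base64.b64encode(data).decode("ascii")
        # Assumption: images are PNG and passed as base64 data URLs.
        parts.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encoded}"},
            }
        )
    return parts


def build_messages(prompt: str, md: str, images: dict[str, bytes]) -> list:
    """Pair the instructions (system) with the document payload (human)."""
    # Imported lazily so build_human_parts stays usable without LangChain.
    from langchain_core.messages import HumanMessage, SystemMessage

    return [
        SystemMessage(content=prompt),
        HumanMessage(content=build_human_parts(md, images)),
    ]
```

Keeping the instructions in the system message and the document in the human message means the prompt file can be edited without touching the code that assembles the payload.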
3
.gitignore
vendored
@@ -11,4 +11,5 @@ wheels/
 input/
 output/
 config.ini
-test.py
+test.py
+.vscode/
23
.vscode/launch.json
vendored
@@ -1,23 +0,0 @@
-{
-    "configurations": [
-        {
-            "name": "refine",
-            "type": "debugpy",
-            "request": "launch",
-            "program": "refine.py",
-            "console": "integratedTerminal",
-            "args": [
-                "--md-path",
-                "output/13/index.md"
-            ]
-        },
-        {
-            "name": "main",
-            "type": "debugpy",
-            "request": "launch",
-            "program": "main.py",
-            "console": "integratedTerminal",
-            "args": []
-        }
-    ]
-}
@@ -10,7 +10,7 @@ from docling_core.types.io import DocumentStream
 from docling.datamodel.settings import settings
 from docling.document_converter import DocumentConverter, PdfFormatOption
 from docling_core.types.doc.base import ImageRefMode
-from langchain_core.messages import HumanMessage
+from langchain_core.messages import HumanMessage, SystemMessage
 from langchain_google_genai import ChatGoogleGenerativeAI
 from langchain_ollama import ChatOllama
 from llm import set_api_key, get_model_name, get_temperature
@@ -146,7 +146,7 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
         prompt = f.read()

     # Add Markdown
-    human_message_parts = [{"type": "text", "text": prompt}]
+    human_message_parts = []
     if provider == "gemini":
         human_message_parts.append(
             {
@@ -223,6 +223,7 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
     doc.close()

     message_content = [
+        SystemMessage(content=prompt),
         HumanMessage(content=human_message_parts),  # type: ignore
     ]

@@ -1,67 +1,128 @@
-You are a professional technical document editor. Your task is to polish a Markdown text that has been automatically converted from a corresponding PDF document. Please use the original PDF as the sole reference for layout, images, and context.
+You are a professional technical documentation editor. Your task is to refine Markdown text automatically converted from a PDF. Please use the original PDF as the sole reference for layout, images, and context.

-Please perform the following operations based on the provided Markdown and PDF:
+Please process the provided Markdown and PDF according to the following operations:

-1. **Clean up extraneous characters**: Check the Markdown text and delete any conversion artifacts or strange formatting that does not exist in the original PDF.
-2. **Explain image content**: Refer to the charts, diagrams, and images in the PDF, and add a description after the image reference so that the full information can be obtained from the text description even without the image.
+## 1. Clean Up Redundant Characters

-- Add a blank line after the image reference to control line breaks.
+Examine the Markdown text and remove any conversion artifacts or strange formatting that does not exist in the original PDF.

-For example
+## 2. Explain Image Content

-```
-![Image](image_000000_hash.png)
+Refer to charts, diagrams, and images in the PDF, and add detailed descriptions after image references, so that readers can obtain complete information from the text description even without seeing the image.

-A detailed explanation of the image, detailed enough to replace the image and help the reader understand the content.
-```
+- Add a blank line after the image reference to control line breaks.

-3. **Correct list formatting**: The conversion process may flatten nested lists. Please analyze the list structure in the PDF and restore the correct multi-level indentation in Markdown.
-4. **Correct mathematical formulas and symbols**: Convert plain text formulas into correct formula notation, for example, `Kmin` should be `$K_{min}$`, and `E = hc/λ` should be `$E = \\frac{hc}{\\lambda}`.
-5. **Adjust headings**: Based on the different content within sub-chapters, rename headings with the same name to avoid duplicate headings and ensure a clear outline.
-6. **Clean up redundant headings**: If there is no content between adjacent headings of the same level, the headings should be adjusted to conform to standards.
+Example format:

-- For example, the following format is incorrect, there is no content between multiple peer headings, and there are duplicate heading names.
+```
+![Image](image_000000_hash.png)

-```
-## Convolutional Neural Networks: Weight Sharing with Multiple Filters
+A detailed explanation of the image, detailed enough to replace the image and help readers understand the content.
+```

-## Weight sharing
+## 3. Correct List Formatting

-Multiple filters can be applied to detect the spatial distributions of multiple visual patterns.
+The conversion process may flatten nested lists. Please analyze the list structure in the PDF and restore the correct multi-level indentation in Markdown.

-![Image](image_000001_a6a8c2ed4b6dd4d2a5d4d9e8f5b2e7a6b2c1e3a0e6e7d6a5b4e3c2d1e0f9a8b7.png)
+## 4. Correct Mathematical Formulas and Symbols

-This diagram consists of two parts. The left part illustrates how multiple filters (represented by connections of different colors) are applied to an input image, with each filter detecting a different pattern. The right part shows how a single filter (hidden unit / filter response) is convolved over the input to produce a feature map.
+Convert plain text formulas to correct formula notation, for example:

-## Convolutional Neural Networks: Weight Sharing and Translation Invariance
+- `Kmin` should be `$K_{min}$`

-## Weight sharing
+- `E = hc/λ` should be `$E = \frac{hc}{\lambda}$`

-## Translation invariance:
+## 5. Adjust Heading Structure (Important)

-* Captures statistics in local patches, and they are independent of location.
-```
+**This is the most critical task, please pay special attention!**

-It can be changed to
+### 5.1 Core Principles

-```
-## Convolutional Neural Networks
+- **No content between same-level headings**: If two same-level headings are adjacent, they must be merged or their levels adjusted.
+- **Avoid duplicate same-level headings**: Rename identical headings based on the different content of their sub-sections.
+- **Maintain logical clarity**: Heading levels should reflect the organizational structure of the content.

-### Weight Sharing with Multiple Filters
+### 5.2 Processing Rules

-Multiple filters can be applied to detect the spatial distributions of multiple visual patterns.
+#### Rule A: Adjacent Same-Level Headings (no content in between)

-![Image](image_000001_a6a8c2ed4b6dd4d2a5d4d9e8f5b2e7a6b2c1e3a0e6e7d6a5b4e3c2d1e0f9a8b7.png)
+When two same-level headings are found to be adjacent with no content in between:

-This diagram consists of two parts. The left part illustrates how multiple filters (represented by connections of different colors) are applied to an input image, with each filter detecting a different pattern. The right part shows how a single filter (hidden unit / filter response) is convolved over the input to produce a feature map.
+- **Case 1**: If the second heading is a supplementary explanation to the first heading → Merge them into a single heading.
+- **Case 2**: If the second heading is a sub-topic of the first heading → Demote the second heading to a sub-heading.

-### Weight Sharing and Translation Invariance
+#### Rule B: Duplicate Same-Level Headings

-#### Translation invariance:
+When multiple identical headings appear at the same level:

-* Captures statistics in local patches, and they are independent of location.
-```
+- Add a distinguishing suffix to each heading based on their content differences.
+- Or merge them into a single heading with different sub-headings.

-7. **Translate**: Translate the content into Simplified Chinese. When translating, proper nouns should retain their original expression, for example, `Magnetic resonance imaging` should be translated as `磁共振成像(Magnetic resonance imaging, MRI)`. If the term appears multiple times, the original expression should be included each time.
+### 5.3 Specific Examples

-Only output the adjusted Markdown text, without any other text content. Do not output in JSON format, and do not add ` ``` ` or ` ```markdown ` at the beginning or end of the Markdown.
+**Incorrect Example 1** (two adjacent same-level headings):
+
+```markdown
+## Software Testing
+
+## Testing Strategies in Object-Oriented Analysis and Design (OOAD)
+```
+
+**Correct Modification** (demote the second heading):
+
+```markdown
+## Software Testing
+
+### Testing Strategies in Object-Oriented Analysis and Design (OOAD)
+```
+
+**Incorrect Example 2** (multiple duplicate headings):
+
+```markdown
+## Convolutional Neural Networks: Weight Sharing with Multiple Filters
+
+## Weight Sharing
+
+Multiple filters can be applied to detect the spatial distribution of various visual patterns.
+
+## Convolutional Neural Networks: Weight Sharing and Translation Invariance
+
+## Weight Sharing
+
+## Translation Invariance:
+```
+
+**Correct Modification**:
+
+```markdown
+## Convolutional Neural Networks
+
+### Weight Sharing with Multiple Filters
+
+Multiple filters can be applied to detect the spatial distribution of various visual patterns.
+
+### Weight Sharing and Translation Invariance
+
+#### Translation Invariance:
+```
+
+### 5.4 Checklist
+
+After completing the heading adjustments, please self-check:
+
+- [ ] Are there still adjacent same-level headings (with no content in between)?
+- [ ] Are there still identical same-level headings?
+- [ ] Do the heading levels clearly reflect the organizational structure of the content?
+
+## 6. Translation
+
+Translate the content into Simplified Chinese. Specialized terms should retain their original English names, for example, `Magnetic resonance imaging` should be translated as `磁共振成像(Magnetic resonance imaging, MRI)`. If a term appears multiple times, the original English name should be included each time.
+
+---
+
+## Output Requirements
+
+- Only output the adjusted Markdown text, without any other explanatory text.
+- Do not use JSON format.
+- Do not add ` ``` ` or ` ```markdown ` at the beginning or end.
+- **Before outputting, please specifically check whether rule 5 (heading adjustment) has been fully executed.**
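The first two items of the 5.4 checklist in the prompt above could also be checked mechanically on the model's output. The sketch below is not part of this commit: `check_headings` is a hypothetical helper, and it assumes ATX-style (`#`) headings and treats "identical same-level headings" as identical anywhere in the document rather than only under the same parent.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def check_headings(markdown: str) -> list[str]:
    """Report adjacent same-level headings and duplicate same-level headings."""
    problems: list[str] = []
    seen: set[tuple[int, str]] = set()
    prev = None  # (level, title) of the last heading if no content followed it
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            prev = None  # a code fence counts as content
            continue
        if in_fence:
            continue  # '#' lines inside a fence are not headings
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if prev is not None and prev[0] == level:
                problems.append(f"adjacent: {prev[1]!r} and {title!r}")
            key = (level, title.lower())
            if key in seen:
                problems.append(f"duplicate: {title!r}")
            seen.add(key)
            prev = (level, title)
        elif stripped:
            prev = None  # ordinary content separates headings
    return problems
```

Running it on "Incorrect Example 1" from the prompt reports one adjacent pair, while the corrected version yields an empty list. The third checklist item (whether levels reflect the content's organization) is a judgment call and is left to the model.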