refactor: unify on the OpenAI-compatible API; support custom base_url/key/model
- Drop the separate Gemini and Ollama adapters; use ChatOpenAI + base_url everywhere
- Simplify config.ini to BASE_URL / API_KEY / MODEL / TEMPERATURE / MAX_RETRIES
- Add a config.example.ini sample configuration
- Remove the langchain-google-genai / langchain-ollama / pymupdf dependencies
- main.py gains resumable runs: existing index.md / index_refined.md are skipped
- LLM requests support automatic retries via max_retries (default 3)
- Polish the README
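For context, a minimal sketch of the unified setup this commit moves to: one `ChatOpenAI` client pointed at any OpenAI-compatible endpoint. The endpoint, key, and model below are illustrative placeholders (a local Ollama server, values taken from the README table), not anything required by the repo:

```python
import os
from langchain_openai import ChatOpenAI

# Any OpenAI-compatible endpoint works; Ollama accepts an arbitrary key.
os.environ.setdefault("OPENAI_API_KEY", "ollama")

llm = ChatOpenAI(
    model="gemma3:latest",                 # MODEL
    temperature=0.7,                       # TEMPERATURE
    base_url="http://localhost:11434/v1",  # BASE_URL
    max_retries=3,                         # MAX_RETRIES: retry failed requests automatically
)
print(llm.invoke("ping").content)
```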
README.md
@@ -1,63 +1,87 @@
-## Course Slide Organizer & Translator for Overseas Students
+# Course Slide Organizer & Translator for Overseas Students
 
-This tool aims to give overseas students an efficient, intelligent way to process course materials, addressing the language barrier and the tedious slide organization they face while studying.
+Automatically converts course PDF slides into structured Chinese Markdown, using an LLM to polish the layout, explain images, and translate while keeping English technical terms.
 
-Many students working with slides in English or another language must understand the technical content while also bridging the language gap; organizing and translating by hand is slow, tiring, and prone to missing key points, especially for slides full of figures.
+## Features
 
-### Features
+- **PDF → Markdown**: automatically converts PDFs into structured Markdown
+- **Smart layout**: the LLM polishes formatting, fixes heading levels, and repairs math formulas
+- **Image annotation**: recognizes image content and adds captions
+- **Chinese translation**: translates into Simplified Chinese while keeping the original English of technical terms (e.g. `磁共振成像(MRI)`)
+- **Resumable runs**: automatically skips finished steps so an interrupted run can continue
 
-1. **Automated extraction and conversion:** course slides in PDF are **automatically converted into structured Markdown** for easier editing and reading.
-2. **Smart formatting and enrichment:** a **large language model (LLM) post-processes the converted Markdown, polishing the layout and adding annotations to images** to aid understanding.
-3. **Accurate technical translation:** the content is **translated into Simplified Chinese while the English originals of technical terms are detected and kept**, avoiding ambiguity and letting students learn the English terminology alongside the Chinese.
+## Prerequisites
 
-### Prerequisites
+- Nvidia GPU (required for the docling conversion)
+- An OpenAI-compatible API (OpenAI / DeepSeek / 通义千问 / Ollama, etc.)
 
-- Nvidia GPU
-- LLMs API Key
-  - Gemini
-  - OpenAI
-  - Ollama
+## Installation
 
-### Installation
 
-1. **Install uv:** if you do not have `uv` yet, follow the official documentation; it can usually be installed with pip:
 
 ```bash
 pip install uv
-```
 
-2. **Install dependencies:** in the project root, install all required dependencies with `uv`:
 
-```bash
 uv venv
 uv sync
 ```
 
-### Configuration
+## Configuration
 
-This project uses a `config.ini` file to manage API keys. Before running the program, create `config.ini` in the project root in the following format:
+Copy `config.example.ini` to `config.ini` and fill in your API details:
 
+```bash
+cp config.example.ini config.ini
+```
 
+`config.ini` format:
 
 ```ini
 [llm]
-# openai/gemini/ollama
-PROVIDER = openai
-GEMINI_MODEL_NAME = gemini-2.5-flash
-OPENAI_MODEL_NAME = gpt-5-mini
-OLLAMA_MODEL_NAME = gemma3:latest
-OLLAMA_BASE_URL = http://localhost:11434
+BASE_URL = https://api.openai.com/v1
+API_KEY = sk-xxxx
+MODEL = gpt-4o
 TEMPERATURE = 0.7
-GOOGLE_API_KEY =
-OPENAI_API_KEY=
+MAX_RETRIES = 3
 ```
 
-### Usage
+**Common provider settings:**
+
+| Provider | BASE_URL | Example MODEL |
+|------|----------|-----------|
+| OpenAI | `https://api.openai.com/v1` | `gpt-4o` |
+| DeepSeek | `https://api.deepseek.com/v1` | `deepseek-chat` |
+| 通义千问 | `https://dashscope.aliyuncs.com/compatible-mode/v1` | `qwen-max` |
+| Ollama | `http://localhost:11434/v1` | `gemma3:latest` |
+
+> For Ollama, `API_KEY` can be any value (e.g. `ollama`).
+
+The environment variables `OPENAI_API_KEY` / `OPENAI_BASE_URL` are also supported, with lower priority than `config.ini`.
+
+## Usage
+
+Put your PDFs into the `input/` directory, then run:
 
-1. Put the PDF files to process into the `input` directory.
-2. Run the `main.py` script; it processes every PDF under `input`. Use `uv run` so the script executes in the correct virtual environment:
 
 ```bash
 uv run python main.py
 ```
 
+Output layout:
+
+```
+output/
+└── <course name>/
+    ├── index.md          # Markdown converted from the PDF
+    ├── index_refined.md  # LLM-refined translation
+    └── images/           # extracted images
+```
+
+### Resuming interrupted runs
+
+The program automatically skips steps that are already finished:
+
+- `index_refined.md` exists → **the whole file is skipped**
+- `index.md` exists but `index_refined.md` does not → **PDF conversion is skipped and only the LLM refinement runs**
+
+To reprocess a file, delete the corresponding output files.
+
 ## References
 
 - [docling](https://github.com/docling-project/docling)

@@ -67,8 +91,8 @@ OPENAI_API_KEY=
 
 ### docling errors out when converting a PDF
 
-The PDF may be malformed; try normalizing it with ghostscript.
+The PDF may be malformed; repair it with ghostscript:
 
-```shell
-gs -o <output.pdf> -sDEVICE=pdfwrite -dPDFSETTINGS=/default <input.pdf>
+```bash
+gs -o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/default input.pdf
 ```
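A note on the precedence rule stated in the README's configuration section: resolving the base URL amounts to roughly the sketch below, which is what `get_base_url` in llm.py further down implements — `config.ini` wins, and the environment variable is only a fallback.

```python
import configparser
import os

config = configparser.ConfigParser()
config.read("config.ini")

# BASE_URL from config.ini takes precedence; OPENAI_BASE_URL is only a fallback.
base_url = config.get("llm", "BASE_URL", fallback=None) or os.environ.get("OPENAI_BASE_URL")
```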
config.example.ini (new file)
@@ -0,0 +1,8 @@
+[llm]
+# API settings (any OpenAI-compatible API works)
+BASE_URL = https://api.openai.com/v1
+API_KEY = sk-xxxx
+MODEL = gpt-4o
+TEMPERATURE = 0.7
+# Number of automatic retries when a request fails (default 3)
+MAX_RETRIES = 3
llm.py
@@ -5,70 +5,36 @@ import os
 def set_api_key() -> None:
     config = configparser.ConfigParser()
     config.read("config.ini")
-    provider = config.get("llm", "PROVIDER", fallback="gemini")
-    if provider == "gemini":
-        set_gemini_api_key()
-    elif provider == "ollama":
-        set_ollama_config()
-    elif provider == "openai":
-        set_openai_api_key()
-
-
-def set_openai_api_key() -> None:
-    config = configparser.ConfigParser()
-    config.read("config.ini")
-    openai_api_key = config.get("llm", "OPENAI_API_KEY", fallback=None)
+    api_key = config.get("llm", "API_KEY", fallback=None)
     if not os.environ.get("OPENAI_API_KEY"):
-        if openai_api_key:
-            os.environ["OPENAI_API_KEY"] = openai_api_key
+        if api_key:
+            os.environ["OPENAI_API_KEY"] = api_key
         else:
             raise ValueError(
-                "Error: OPENAI_API_KEY not found in config.ini or environment variables"
+                "Error: API_KEY not found in config.ini or environment variables"
             )
-    return
 
 
-def set_gemini_api_key() -> None:
+def get_base_url() -> str | None:
     config = configparser.ConfigParser()
     config.read("config.ini")
-    google_api_key = config.get("llm", "GOOGLE_API_KEY", fallback=None)
-    if not os.environ.get("GOOGLE_API_KEY"):
-        if google_api_key:
-            os.environ["GOOGLE_API_KEY"] = google_api_key
-        else:
-            raise ValueError(
-                "Error: GOOGLE_API_KEY not found in config.ini or environment variables"
-            )
-    return
-
-
-def set_ollama_config() -> None:
-    config = configparser.ConfigParser()
-    config.read("config.ini")
-    ollama_base_url = config.get(
-        "llm", "OLLAMA_BASE_URL", fallback="http://localhost:11434"
-    )
-
-    if not os.environ.get("OLLAMA_BASE_URL"):
-        os.environ["OLLAMA_BASE_URL"] = ollama_base_url
-    return
+    base_url = config.get("llm", "BASE_URL", fallback=None)
+    return base_url or os.environ.get("OPENAI_BASE_URL")
 
 
 def get_model_name() -> str:
     config = configparser.ConfigParser()
     config.read("config.ini")
-    provider = config.get("llm", "PROVIDER", fallback="gemini")
-    if provider == "gemini":
-        return config.get("llm", "GEMINI_MODEL_NAME", fallback="gemini-2.5-flash")
-    elif provider == "ollama":
-        return config.get("llm", "OLLAMA_MODEL_NAME", fallback="gemma3:latest")
-    elif provider == "openai":
-        return config.get("llm", "OPENAI_MODEL_NAME", fallback="gpt-5-mini")
-    return "gemini-2.5-flash"  # Default fallback
+    return config.get("llm", "MODEL", fallback="gpt-4o")
 
 
 def get_temperature() -> float:
     config = configparser.ConfigParser()
     config.read("config.ini")
     return float(config.get("llm", "TEMPERATURE", fallback=0.7))
+
+
+def get_max_retries() -> int:
+    config = configparser.ConfigParser()
+    config.read("config.ini")
+    return int(config.get("llm", "MAX_RETRIES", fallback=3))
main.py
@@ -1,6 +1,7 @@
 import os
 from pdf_convertor import (
     convert_pdf_to_markdown,
+    load_md_file,
     save_md_images,
     refine_content,
 )

@@ -24,13 +25,30 @@ def main():
 
         current_output_dir.mkdir(parents=True, exist_ok=True)
 
+        index_md = current_output_dir / "index.md"
+        refined_md = current_output_dir / "index_refined.md"
+
+        # Full skip: the refined result already exists, skip this file entirely
+        if refined_md.exists():
+            print(f"Skipping {pdf_path.name}: already processed (index_refined.md exists)")
+            continue
+
         print(f"Processing {pdf_path} -> {current_output_dir}")
 
         with open(pdf_path, "rb") as pdf_file:
             pdf_content = pdf_file.read()
 
+        # Partial skip: the converted result already exists, skip PDF -> MD
+        if index_md.exists():
+            print(f" Skipping PDF→MD conversion: index.md already exists")
+            md, images = load_md_file(index_md)
+        else:
             md, images = convert_pdf_to_markdown(pdf_content)
             save_md_images(current_output_dir, md, images)
 
+        # Partial skip for the LLM refinement: refined result already exists
+        # (the full-skip check above already covers this, no need to re-check here)
 
         try:
             md = refine_content(md, images, pdf_content)
         except BaseException:
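The resume behaviour added above reduces to two path checks per PDF. A standalone restatement of the rules (the helper name is illustrative, not part of the repo):

```python
from pathlib import Path


def resume_action(output_dir: Path) -> str:
    """Illustrative restatement of the skip rules in main.py."""
    if (output_dir / "index_refined.md").exists():
        return "skip-file"           # everything already done
    if (output_dir / "index.md").exists():
        return "refine-only"         # reuse index.md, run only the LLM refinement
    return "convert-and-refine"      # full pipeline: PDF -> Markdown -> refinement
```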
pdf_convertor.py
@@ -1,6 +1,5 @@
 import re
 import base64
-import os
 from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
 from docling.datamodel.base_models import InputFormat
 from docling.datamodel.pipeline_options import (

@@ -11,14 +10,11 @@ from docling.datamodel.settings import settings
 from docling.document_converter import DocumentConverter, PdfFormatOption
 from docling_core.types.doc.base import ImageRefMode
 from langchain_core.messages import HumanMessage, SystemMessage
-from langchain_google_genai import ChatGoogleGenerativeAI
-from langchain_ollama import ChatOllama
 from langchain_openai import ChatOpenAI
-from llm import set_api_key, get_model_name, get_temperature
+from llm import set_api_key, get_model_name, get_temperature, get_base_url, get_max_retries
 from io import BytesIO
 from pathlib import Path
-import configparser
-import fitz
+import base64
 
 
 def save_md_images(

@@ -117,35 +113,21 @@ def convert_pdf_to_markdown(pdf: bytes) -> tuple[str, dict[str, bytes]]:
 def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
     """Refines the Markdown content using an LLM."""
 
-    config = configparser.ConfigParser()
-    config.read("config.ini")
-    provider = config.get("llm", "PROVIDER", fallback="gemini")
-
     set_api_key()
 
     try:
-        if provider == "gemini":
-            llm = ChatGoogleGenerativeAI(
-                model=get_model_name(), temperature=get_temperature()
-            )
-        elif provider == "ollama":
-            llm = ChatOllama(
-                model=get_model_name(),
-                temperature=get_temperature(),
-                base_url=os.environ["OLLAMA_BASE_URL"],
-                num_ctx=256000,
-                num_predict=-1,
-            )
-        elif provider == "openai":
-            llm = ChatOpenAI(
-                model=get_model_name(),
-                temperature=get_temperature(),
-            )
-        else:
-            raise ValueError(f"Unsupported LLM provider: {provider}")
+        kwargs = {
+            "model": get_model_name(),
+            "temperature": get_temperature(),
+        }
+        base_url = get_base_url()
+        if base_url:
+            kwargs["base_url"] = base_url
+        kwargs["max_retries"] = get_max_retries()
+        llm = ChatOpenAI(**kwargs)
     except Exception as e:
         raise BaseException(
-            f"Error initializing LLM. Make sure your LLM configuration is correct. Error: {e}"
+            f"Error initializing LLM. Make sure your configuration is correct. Error: {e}"
         )
 
     with open("pdf_convertor_prompt.md", "r") as f:

@@ -204,7 +186,6 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
             }
         )
 
-    if provider == "gemini" or provider == "openai":
     human_message_parts.extend(
         [
             {

@@ -215,25 +196,6 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
             },
         ]
     )
-    if provider == "ollama":
-        doc = fitz.open(stream=pdf, filetype="pdf")
-        for page_num in range(doc.page_count):
-            page = doc.load_page(page_num)
-            pix = page.get_pixmap()
-            img_bytes = pix.tobytes("png")
-            human_message_parts.append(
-                {
-                    "type": "text",
-                    "text": f"This is page {page_num + 1} of the original PDF file:\n",
-                }
-            )
-            human_message_parts.append(
-                {
-                    "type": "image_url",
-                    "image_url": f"data:image/png;base64,{base64.b64encode(img_bytes).decode('utf-8')}",
-                }
-            )
-        doc.close()
 
     message_content = [
         SystemMessage(content=prompt),

@@ -241,7 +203,7 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
     ]
 
     print(
-        f"Sending request to {provider} with the PDF, Markdown and referenced images... This may take a moment."
+        "Sending request to LLM with the PDF, Markdown and referenced images... This may take a moment."
     )
     try:
         response = llm.invoke(message_content)

@@ -250,7 +212,7 @@ def refine_content(md: str, images: dict[str, bytes], pdf: bytes) -> str:
         raise BaseException(f"An error occurred while invoking the LLM: {e}")
 
     if str(refined_content) == "":
-        raise BaseException(f"Response of {provider} is empty")
+        raise BaseException("Response of LLM is empty")
 
     return fix_output(str(refined_content))
 
pyproject.toml
@@ -7,9 +7,5 @@ requires-python = ">=3.13"
 dependencies = [
     "docling>=2.57.0",
     "langchain>=1.0.2",
-    "langchain-community>=0.4.1",
-    "langchain-google-genai>=3.0.0",
-    "langchain-ollama>=1.0.0",
     "langchain-openai>=1.0.2",
-    "pymupdf>=1.26.6",
 ]