This commit improves the structure and clarity of the prompt sent to the LLM (Gemini/OpenAI) in the `refine_content` function.
Changes include:
* Adding explicit introductory text for the Markdown, individual images, and PDF sections to guide the LLM on the purpose of each input.
* Introducing clear "START OF IMAGE" and "END OF IMAGE" delimiters for each image to better define their boundaries.
* Unifying the PDF attachment mechanism for both Gemini and OpenAI providers, simplifying the code and ensuring consistent handling of PDF input.
These changes aim to improve the LLM's understanding of the provided content, leading to more accurate and relevant refinements.
Refine the image processing instructions within the PDF conversion prompt to emphasize the critical importance of matching image descriptions to their exact filenames.
The previous instructions were ambiguous and could lead to incorrect image descriptions. This update adds:
- A "Critical" warning to match image names correctly.
- Detailed rules outlining how to process image references based on provided filenames.
- An example workflow to illustrate the correct matching process.
- A new "Critical" verification step in the final instructions to ensure image explanations correspond to their filenames.
This change aims to prevent errors where image descriptions might be mismatched or generated from the wrong image content, ensuring higher accuracy in the conversion process.
Refactor image data passing in `pdf_convertor.py` to use a direct base64 and mime_type format, aligning with updated API requirements for vision models.
Additionally, the `pdf_convertor_prompt.md` has been significantly refined to improve the clarity and specificity of instructions for the AI model, particularly concerning:
- **Image Content Explanation:** Added detailed rules to ensure the AI only processes existing image references, preserves paths, and focuses on descriptive text.
- **Mathematical Formulas:** Clarified conversion to LaTeX notation.
- **Heading Structure:** Enhanced rules and examples for adjusting heading levels and merging adjacent or duplicate headings to ensure logical document flow.
This commit integrates OpenAI as a new Large Language Model (LLM) provider,
expanding the available options for content refinement.
Key changes include:
- Added `set_openai_api_key` to handle OpenAI API key retrieval from
`config.ini` or environment variables.
- Modified `set_api_key` to dynamically read the LLM provider from `config.ini`
This commit refactors the content refinement process to leverage `SystemMessage` for the primary prompt, enhancing clarity and adherence to LLM best practices.
The `pdf_convertor.py` file was updated to:
- Import `SystemMessage` from `langchain_core.messages`.
- Modify the `refine_content` function to use `SystemMessage` for the main prompt, moving the prompt content from `human_message_parts`.
- Adjust `human_message_parts` to only contain the Markdown and image data for the `HumanMessage`.
The `pdf_convertor_prompt.md` file was updated to:
- Reformat the prompt with clearer headings and instructions for each task.
- Improve the clarity and conciseness of the instructions for cleaning up characters, explaining image content, and correcting list formatting.
Additionally, `.gitignore` was updated to include `.vscode/` to prevent IDE-specific files from being committed.
These changes improve the structure of the LLM interaction and make the prompt more readable and maintainable.
This commit introduces support for Ollama as an alternative Large Language Model (LLM) provider and enhances PDF image extraction capabilities.
- **Ollama Integration:**
- Implemented `set_ollama_config` to configure Ollama's base URL from `config.ini`.
- Modified `llm.py` to dynamically select and configure the LLM (Gemini or Ollama) based on the `PROVIDER` setting.
- Updated `get_model_name` to return provider-specific default model names.
- `pdf_convertor.py` now conditionally initializes `ChatGoogleGenerativeAI` or `ChatOllama` based on the configured provider.
- **PyMuPDF Image Extraction:**
- Added a new `extract_images_from_pdf` function using PyMuPDF (`fitz`) for direct image extraction from PDF files.
- Introduced `get_extract_images_from_pdf_flag` to control this feature via `config.ini`.
- `convert_pdf_to_markdown` and `refine_content` functions were updated to utilize this new image extraction method when enabled.
- **Refinement Flow:**
- Adjusted the order of `save_md_images` in `main.py` and added an option to save the refined markdown with a specific filename (`index_refined.md`).
- **Dependencies:**
- Updated `pyproject.lock` to include new dependencies for Ollama integration (`langchain-ollama`) and PyMuPDF (`PyMuPDF`), along with platform-specific markers for NVIDIA dependencies.
The main.py script was becoming monolithic, containing all the logic for PDF conversion, image path simplification, and content refinement. This change extracts these core functionalities into a new `pdf_convertor` module.
This refactoring improves the project structure by:
- Enhancing modularity and separation of concerns.
- Making the main.py script a cleaner, high-level orchestrator.
- Improving code readability and maintainability.
The functions `convert_pdf_to_markdown`, `save_md_images`, and `refine_content` are now imported from the `pdf_convertor` module and called from the main execution block.