瀏覽代碼

feat: Enable antialiasing for PDF image extraction to improve OCR accuracy (#7562)

### What problem does this PR solve?

When the PDF uses vector fonts, the rendered text in the captured page
image often has missing strokes, leading to numerous OCR errors and
incorrect characters. Similar issues also occur in the extracted chart
images.

**Before**

![0089e1f762](https://github.com/user-attachments/assets/a84f8cd7-48ae-4da4-81ca-fc0bd93320f1)

**After**

![03053149e9](https://github.com/user-attachments/assets/45fa5ebb-a2de-42b1-9535-1ea087877eb2)

You can use the following document for testing.

[Casio说明书.pdf](https://github.com/user-attachments/files/20119690/Casio.pdf)


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
tags/v0.19.0
liuzhenghua 5 月之前
父節點
當前提交
ea5e8caa69
No account linked to committer's email address
共有 1 個文件被更改,包括 1 次插入1 次删除
  1. 1
    1
      deepdoc/parser/pdf_parser.py

+ 1
- 1
deepdoc/parser/pdf_parser.py 查看文件

@@ -1015,7 +1015,7 @@ class RAGFlowPdfParser:
with sys.modules[LOCK_KEY_pdfplumber]:
with (pdfplumber.open(fnm) if isinstance(fnm, str) else pdfplumber.open(BytesIO(fnm))) as pdf:
self.pdf = pdf
self.page_images = [p.to_image(resolution=72 * zoomin).annotated for i, p in
self.page_images = [p.to_image(resolution=72 * zoomin, antialias=True).annotated for i, p in
enumerate(self.pdf.pages[page_from:page_to])]

try:

Loading…
取消
儲存