
Fix parsing of spaces in Russian-language PDFs (#1987) (#2427)

### What problem does this PR solve?

[#1987](https://github.com/infiniflow/ragflow/issues/1987)

When scanning PDF files character by character, the parser dropped a
space whenever the last character of the preceding text did not match
the regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but Cyrillic letters did not match the regex because it
only covered the Latin alphabet. As a result, Russian PDFs were parsed
with words run together and were almost unusable as a source. Fixed by
adding the Russian alphabet (`а-яА-Я`) to the regex.
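
A minimal sketch of the behavior described above, assuming a simplified merge loop over plain strings rather than the parser's real character boxes. The names `append_char` and `merge` are hypothetical and exist only for this illustration; the two patterns are copied from the diff.

```python
import re

# Patterns copied verbatim from the diff (old and new).
OLD_PATTERN = r"[0-9a-zA-Z,.?;:!%%]"
NEW_PATTERN = r"[0-9a-zA-Zа-яА-Я,.?;:!%%]"


def append_char(text, ch, pattern):
    """Hypothetical stand-in for the merge step in pdf_parser.py.

    A space is kept only if the last character already collected
    matches the pattern; any other character is appended as-is.
    """
    if ch == " " and text:
        if re.match(pattern, text[-1]):
            text += " "
    else:
        text += ch
    return text


def merge(chars, pattern):
    """Feed characters one by one, as the parser does with PDF chars."""
    text = ""
    for ch in chars:
        text = append_char(text, ch, pattern)
    return text


sample = "договор оферты"
print(merge(sample, OLD_PATTERN))  # договороферты
print(merge(sample, NEW_PATTERN))  # договор оферты
```

With the old pattern the space after `р` is dropped because `р` is not in the character class; with the new pattern it is kept.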

Other languages that use different alphabets might have the same
problem. I additionally tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816),
and the old `[a-zA-Z...]` regex parses it correctly, with spaces.
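
One way to check a text sample from another language is to list the characters that directly precede a space and are not covered by the regex. This is only a sketch: `unmatched_before_space` is a hypothetical helper, not part of the parser, and it works on a plain string instead of PDF character boxes.

```python
import re

# New pattern copied from the diff.
NEW_PATTERN = r"[0-9a-zA-Zа-яА-Я,.?;:!%%]"


def unmatched_before_space(text, pattern=NEW_PATTERN):
    """Return the set of characters that directly precede a space
    and do not match the pattern (hypothetical diagnostic helper)."""
    missed = set()
    for prev, cur in zip(text, text[1:]):
        if cur == " " and prev != " " and not re.match(pattern, prev):
            missed.add(prev)
    return missed


print(unmatched_before_space("договор оферты"))  # set()
print(unmatched_before_space("καλή μέρα"))
print(unmatched_before_space("ещё раз"))
```

An empty set means every space in the sample would be kept. Note that the range `а-я` does not include `ё`, which sits outside it in Unicode, so a word ending in `ё` would still be reported.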

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
tags/v0.11.0
Vitaliy Groshev, 1 year ago
Commit 7e75b9d778
1 file changed, with 1 addition and 1 deletion
deepdoc/parser/pdf_parser.py

         self.lefted_chars.append(c)
         continue
     if c["text"] == " " and bxs[ii]["text"]:
-        if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
+        if re.match(r"[0-9a-zA-Zа-яА-Я,.?;:!%%]", bxs[ii]["text"][-1]):
             bxs[ii]["text"] += " "
     else:
         bxs[ii]["text"] += c["text"]
