Fix: Embedding err when docx contains unsupported images (#1720)

### What problem does this PR solve?

Fixes embedding failures that occur when a docx document contains images
in an unsupported format.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
tags/v0.9.0
Yuhao Tsui, 1 year ago
Commit: a973b9e01f
1 file changed, 11 insertions and 4 deletions
  1. rag/app/naive.py (+11, -4)

--- a/rag/app/naive.py
+++ b/rag/app/naive.py

@@
 from PIL import Image
 from functools import reduce
 from markdown import markdown
+from docx.image.exceptions import UnrecognizedImageError
 class Docx(DocxParser):
     def __init__(self):
@@
         img = img[0]
         embed = img.xpath('.//a:blip/@r:embed')[0]
         related_part = document.part.related_parts[embed]
-        image = related_part.image
-        image = Image.open(BytesIO(image.blob)).convert('RGB')
-        return image
+        try:
+            image_blob = related_part.image.blob
+        except UnrecognizedImageError:
+            print("Unrecognized image format. Skipping image.")
+            return None
+        try:
+            image = Image.open(BytesIO(image_blob)).convert('RGB')
+            return image
+        except Exception as e:
+            return None
     def __clean(self, line):
         line = re.sub(r"\u3000", " ", line).strip()
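
The two-stage guard introduced by this change can be sketched on its own, without `python-docx` or Pillow. This is a minimal illustration and not the project's code: `load_image_or_none`, `UnsupportedImage`, and the `read_blob` and `decode` parameters are hypothetical names chosen for the example, and `logging` is used here in place of the `print` call in the diff.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


class UnsupportedImage(Exception):
    """Hypothetical stand-in for docx.image.exceptions.UnrecognizedImageError."""


def load_image_or_none(
    read_blob: Callable[[], bytes],
    decode: Callable[[bytes], Any],
) -> Optional[Any]:
    """Return the decoded image, or None if either stage fails."""
    # Stage 1: reading the raw bytes may fail for an unsupported format.
    try:
        blob = read_blob()
    except UnsupportedImage:
        logger.warning("Unrecognized image format. Skipping image.")
        return None

    # Stage 2: decoding may still fail on corrupt or unusual data.
    try:
        return decode(blob)
    except Exception as exc:
        logger.warning("Could not decode image (%s). Skipping image.", exc)
        return None
```

In the diff above, the role of `read_blob` is played by the `related_part.image.blob` access and the role of `decode` by `Image.open(BytesIO(image_blob)).convert('RGB')`. Because the function can now return `None`, any caller has to treat a missing image as a normal outcome rather than an error.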
