
Use `python-docx` to extract docx files (#2654)

tags/0.5.9
Bowen Liang, 1 year ago
Commit
b163545771
2 files changed, 12 insertions and 10 deletions
  1. api/core/rag/extractor/word_extractor.py (+11 -9)
  2. api/requirements.txt (+1 -1)

api/core/rag/extractor/word_extractor.py (+11 -9)

 class WordExtractor(BaseExtractor):
-    """Load pdf files.
+    """Load docx files.


     Args:

     def extract(self) -> list[Document]:
         """Load given path as single page."""
-        import docx2txt
-
-        return [
-            Document(
-                page_content=docx2txt.process(self.file_path),
-                metadata={"source": self.file_path},
-            )
-        ]
+        from docx import Document as docx_Document
+
+        document = docx_Document(self.file_path)
+        doc_texts = [paragraph.text for paragraph in document.paragraphs]
+        content = '\n'.join(doc_texts)
+
+        return [Document(
+            page_content=content,
+            metadata={"source": self.file_path},
+        )]

     @staticmethod
     def _is_valid_url(url: str) -> bool:
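
One behaviour worth checking in the new `extract`: as far as the python-docx documentation describes it, `document.paragraphs` lists body-level paragraphs only, so text inside table cells would probably not reach `page_content`. The sketch below illustrates that selection rule using only the standard library. It is an approximation for illustration, not python-docx's implementation, and `DOCUMENT_XML`, `build_sample_docx` and `body_paragraph_texts` are hypothetical names that do not appear in this commit.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical minimal document.xml: two body paragraphs around one table cell.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell text</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def build_sample_docx() -> bytes:
    """Zip the sample XML under the path a .docx uses for its main part."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", DOCUMENT_XML)
    return buffer.getvalue()


def body_paragraph_texts(docx_bytes: bytes) -> list[str]:
    """Return the text of each paragraph that is a direct child of w:body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find("w:body", NS)
    texts = []
    # "w:p" without a leading ".//" matches direct children only,
    # so the paragraph nested inside the table cell is skipped.
    for paragraph in body.findall("w:p", NS):
        texts.append("".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS)))
    return texts


# Same joining step as the new extract(): one paragraph per line.
content = "\n".join(body_paragraph_texts(build_sample_docx()))
print(content)
```

If table text matters for retrieval, the extractor would need to walk `document.tables` as well; whether the previous `docx2txt` output included such text is worth confirming against a real file before relying on either behaviour.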

api/requirements.txt (+1 -1)

 redis~=4.5.4
 openpyxl==3.1.2
 chardet~=5.1.0
-docx2txt==0.8
+python-docx~=1.1.0
 pypdfium2==4.16.0
 resend~=0.7.0
 pyjwt~=2.8.0
