
Refa: Improve ppt_parser better handle list (#6162)

### What problem does this PR solve?
This pull request (PR) updates the PPTX parsing code so that text in list
form is represented more precisely: each bulleted or numbered paragraph is
prefixed with `.` and indented by one space per list level.

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
tags/v0.18.0
Stephen Hu, 7 months ago
Parent
Current commit
79482ff672
1 changed file with 14 additions and 2 deletions

deepdoc/parser/ppt_parser.py (+14, -2)

     def __init__(self):
         super().__init__()

+    def __get_bulleted_text(self, paragraph):
+        is_bulleted = bool(paragraph._p.xpath("./a:pPr/a:buChar")) or bool(bool(paragraph._p.xpath("./a:pPr/a:buAutoNum")) )
+        if is_bulleted:
+            return f"{' '* paragraph.level}.{paragraph.text}"
+        else:
+            return paragraph.text
+
     def __extract(self, shape):
         if shape.shape_type == 19:
             tb = shape.table
...
             return "\n".join(rows)

         if shape.has_text_frame:
-            return shape.text_frame.text
+            text_frame = shape.text_frame
+            texts = []
+            for paragraph in text_frame.paragraphs:
+                if paragraph.text.strip():
+                    texts.append(self.__get_bulleted_text(paragraph))
+            return "\n".join(texts)

         if shape.shape_type == 6:
             texts = []
...
                     logging.exception(e)
             txts.append("\n".join(texts))

-        return txts
+        return txts
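
The bullet check in `__get_bulleted_text` can be tried without a PPTX file. The sketch below is not part of this commit: it is a hypothetical stand-in (`bulleted_text`) that applies the same rule to raw `<a:p>` XML using only the standard library. It assumes that the list level comes from the `lvl` attribute of `a:pPr` (defaulting to 0) and that the paragraph text is the concatenation of its `a:r/a:t` runs, which is a simplification of what python-pptx's `paragraph.text` returns.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, which the `a:` prefix in the XPath expressions refers to.
NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}


def bulleted_text(p_xml: str) -> str:
    """Hypothetical stand-in for __get_bulleted_text, operating on raw <a:p> XML."""
    p = ET.fromstring(p_xml)
    ppr = p.find("a:pPr", NS)
    # Simplification: join the text of all runs; line breaks and fields are ignored.
    text = "".join(t.text or "" for t in p.findall("a:r/a:t", NS))
    if ppr is None:
        return text
    # Same rule as the diff: an explicit character bullet or auto-number marks a list item.
    is_bulleted = (
        ppr.find("a:buChar", NS) is not None
        or ppr.find("a:buAutoNum", NS) is not None
    )
    if not is_bulleted:
        return text
    level = int(ppr.get("lvl", "0"))  # assumed source of the list level
    return f"{' ' * level}.{text}"


# Hand-written sample paragraphs (illustrative, not taken from a real deck).
A = 'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"'
plain = f"<a:p {A}><a:r><a:t>Agenda</a:t></a:r></a:p>"
bullet = (
    f'<a:p {A}><a:pPr lvl="1"><a:buChar char="•"/></a:pPr>'
    "<a:r><a:t>First item</a:t></a:r></a:p>"
)
numbered = (
    f'<a:p {A}><a:pPr><a:buAutoNum type="arabicPeriod"/></a:pPr>'
    "<a:r><a:t>Step one</a:t></a:r></a:p>"
)

for sample in (plain, bullet, numbered):
    print(repr(bulleted_text(sample)))
# 'Agenda'
# ' .First item'
# '.Step one'
```

Because the check looks only at the paragraph's own `a:pPr`, a bullet that is inherited from a layout or master placeholder rather than set on the paragraph would not be detected by this rule.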
