Perf: ignore concatenation between rows. (#8507)

### What problem does this PR solve?


### Type of change

- [x] Performance Improvement

tags/v0.20.0
Kevin Hu, 4 months ago
Commit: 6d256ff0f5
2 files changed, 5 insertions and 1640 deletions
  1. deepdoc/parser/pdf_parser.py (+5, -1)
  2. rag/res/ner.json (+0, -1639)

deepdoc/parser/pdf_parser.py (+5, -1)

@@ -479,6 +479,9 @@ class RAGFlowPdfParser:
         self.boxes = bxs

     def _concat_downward(self, concat_between_pages=True):
+        self.boxes = Recognizer.sort_Y_firstly(self.boxes, 0)
+        return
+
         # count boxes in the same row as a feature
         for i in range(len(self.boxes)):
             mh = self.mean_height[self.boxes[i]["page_number"] - 1]
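
The hunk above makes `_concat_downward` sort the boxes and return immediately, so the row-counting and merging code below the `return` no longer runs. The following is a minimal sketch of that control flow, not the project's implementation: `sort_y_first` is a hypothetical stand-in for `Recognizer.sort_Y_firstly(boxes, 0)`, and it assumes each box is a dict with `top` and `x0` keys and that a threshold of 0 amounts to a plain sort by top edge, then left edge.

```python
def sort_y_first(boxes):
    """Hypothetical stand-in for Recognizer.sort_Y_firstly(boxes, 0).

    Assumption: boxes are dicts with "top" and "x0" keys, ordered by
    top edge first and left edge second.
    """
    return sorted(boxes, key=lambda b: (b["top"], b["x0"]))


def concat_downward(boxes):
    """Sort only; the downward merge between rows is skipped."""
    ordered = sort_y_first(boxes)
    # Early return: every box is kept as its own entry, nothing is merged.
    return ordered


# Made-up sample boxes, deliberately out of reading order.
boxes = [
    {"text": "c", "top": 20.0, "x0": 5.0},
    {"text": "b", "top": 10.0, "x0": 50.0},
    {"text": "a", "top": 10.0, "x0": 5.0},
]
result = concat_downward(boxes)
order = [b["text"] for b in result]
```

In this sketch the number of boxes is unchanged after the call; only their order differs, which is the trade the commit title describes.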
@@ -1136,7 +1139,8 @@ class RAGFlowPdfParser:
                                     need_image, zoomin, return_html, False)
         return self.__filterout_scraps(deepcopy(self.boxes), zoomin), tbls

-    def remove_tag(self, txt):
+    @staticmethod
+    def remove_tag(txt):
         return re.sub(r"@@[\t0-9.-]+?##", "", txt)

     def crop(self, text, ZM=3, need_position=False):
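
Because `remove_tag` never used `self`, turning it into a static method lets callers strip position tags without building a parser instance. Here is a small sketch of how that might look from the caller's side. `PdfTextTools` is a hypothetical stand-in class that reproduces only the changed method, and the sample string (including the number of fields inside the tag) is made up for illustration.

```python
import re


class PdfTextTools:
    """Hypothetical stand-in for the parser class; only remove_tag is copied."""

    @staticmethod
    def remove_tag(txt):
        # Strip tags of the form @@<tabs, digits, dots, dashes>##
        return re.sub(r"@@[\t0-9.-]+?##", "", txt)


# Made-up sample: text followed by a tab-separated position tag.
sample = "Revenue grew in Q2@@3\t10.0\t200.5\t30.0\t42.0##"

# Callable on the class itself, no instance required.
cleaned = PdfTextTools.remove_tag(sample)

# Existing call sites that go through an instance keep working.
cleaned_via_instance = PdfTextTools().remove_tag(sample)
```

Both calls return the text with the tag removed, so the change is backward compatible for code that already calls the method on an instance.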

rag/res/ner.json (+0, -1639)
File diff is too large to display.

