ragflow

Commit Graph

Author	SHA1	Message	Date
pingguoCooler	cf0011be67	Feat: Upgrade html parser (#9675) ### What problem does this PR solve? parse more html content. ### Type of change - [x] Other (please describe):	2 months ago
Yongteng Lei	382458ace7	Feat: advanced markdown parsing (#9607) ### What problem does this PR solve? Using AST parsing to handle markdown more accurately, preventing components from being cut off by chunking. #9564 <img width="1746" height="993" alt="image" src="https://github.com/user-attachments/assets/4aaf4bf6-5714-4d48-a9cf-864f59633f7f" /> <img width="1739" height="982" alt="image" src="https://github.com/user-attachments/assets/dc00233f-7a55-434f-bbb7-74ce7f57a6cf" /> <img width="559" height="100" alt="image" src="https://github.com/user-attachments/assets/4a556b5b-d9c6-4544-a486-8ac342bd504e" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2 months ago
Kevin Hu	312f1a0477	Fix: enlarge raptor timeout limits. (#9600) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
Yongteng Lei	787e0c6786	Refa: OpenAI whisper-1 (#9552) ### What problem does this PR solve? Refactor OpenAI to enable audio parsing. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring	2 months ago
Yongteng Lei	eef43fa25c	Fix: unexpected truncated Excel files (#9500) ### What problem does this PR solve? Handle unexpected truncated Excel files. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
Jay Xu	6d1078b538	fix 'KeyError: "There is no item named 'word/NULL' in the archive"' (#9455) ### What problem does this PR solve? Issue referring to: https://github.com/python-openxml/python-docx/issues/797 Fix referring to: https://github.com/python-openxml/python-docx/issues/1105#issuecomment-1298075246 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
HaiyangP	79399f7f25	Support the case of one cell split by multiple columns. (#9225) ### What problem does this PR solve? Support the case of one cell split by multiple columns. Besides, the codes are compatible with the common cell case. #8606 can be fixed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) I provide a case of one cell split by multiple columns: [test.xlsx](https://github.com/user-attachments/files/21578693/test.xlsx) The chunk res: <img width="236" height="57" alt="2025-06-17 16-04-07 screenshot" src="https://github.com/user-attachments/assets/b0a499ac-349d-4c3d-8c6e-0931c8fc26de" />	2 months ago
Jay Xu	7f08ba47d7	Fix "no `tc` element at grid_offset" (#9375) ### What problem does this PR solve? fix "no `tc` element at grid_offset", just log warning and ignore. stacktrace: ``` Traceback (most recent call last): File "/ragflow/rag/svr/task_executor.py", line 620, in handle_task await do_handle_task(task) File "/ragflow/rag/svr/task_executor.py", line 553, in do_handle_task chunks = await build_chunks(task, progress_callback) File "/ragflow/rag/svr/task_executor.py", line 257, in build_chunks cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"], File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 447, in to_thread_run_sync return msg_from_thread.unwrap() File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap raise captured_error File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 373, in do_release_then_return_result return result.unwrap() File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap raise captured_error File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 392, in worker_fn ret = context.run(sync_fn, *args) File "/ragflow/rag/svr/task_executor.py", line 257, in <lambda> cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"], File "/ragflow/rag/app/naive.py", line 384, in chunk sections, tables = Docx()(filename, binary) File "/ragflow/rag/app/naive.py", line 230, in __call__ while i < len(r.cells): File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 438, in cells return tuple(_iter_row_cells()) File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 436, in _iter_row_cells yield from iter_tc_cells(tc) File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 424, in iter_tc_cells yield from iter_tc_cells(tc._tc_above) # pyright: ignore[reportPrivateUsage] File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 741, in _tc_above return self._tr_above.tc_at_grid_offset(self.grid_offset) File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 98, in tc_at_grid_offset raise ValueError(f"no `tc` element at grid_offset={grid_offset}") ValueError: no `tc` element at grid_offset=10 ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
yzz	550e65bb22	Fix: PlainParser using fix in presentation (#9239) ### What problem does this PR solve? A tiny fix to the use of `deepdoc.pdf_parser.PlainParser` in `rag.app.presentation.chunk`, modeled on how this class is used elsewhere. The fix is small enough that a separate issue seems unnecessary. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
Jay Xu	cae11201ef	fix "out of memory" if slide.get_thumbnail() to a huge image (#9211) ### What problem does this PR solve? fix "out of memory" if slide.get_thumbnail() to a huge image ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113) ### What problem does this PR solve? #9082 #6365 <u> WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore.</u> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	3 months ago
Yongteng Lei	39ef2ffba9	Feat: parsing supports jsonl or ldjson format (#9087) ### What problem does this PR solve? Supports jsonl or ldjson format. Feature request from [discussion](https://github.com/orgs/infiniflow/discussions/8774). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	3 months ago
Stephen Hu	92cfbcb382	Fix: when parse markdown support extract image at local (#8906) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8902 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
Yongteng Lei	e9b14142a5	Fix: fixed invalid save() arguments for slide thumbnails (#8851) ### What problem does this PR solve? Fixed invalid save() arguments for slide thumbnails. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
Yongteng Lei	51a8604dcb	Fix: fixed context loss caused by separating markdown tables from original text (#8844) ### What problem does this PR solve? Fix context loss caused by separating markdown tables from original text. #6871, #8804. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
Stephen Hu	ce140f1393	Fix:Better Support Table Value Type (#8822) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8782 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
Stephen Hu	2b7adbd2d1	Fix: Improve Memory Usage For Presentation (#8792) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8791 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
wenxuan.zhang	f586dd0a96	Fix: docx parse error. (#8600) ### What problem does this PR solve? docx parse error. ![image](https://github.com/user-attachments/assets/efbe6d1b-10c8-415e-b693-a86f73e1ffa6) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### What problem does this PR solve? Some docx parse with naive cause error. `block.style.name` in Function `__get_nearest_title` will be None in some case. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: wenxuan.zhang <wenxuan.zhang@chinacreator.com>	4 months ago
Tuan Le	6b1221d2f6	Fix parser_config access for layout_recognize in presentation.py (#8492) ### What problem does this PR solve? This PR addresses an issue in the presentation parser where the `layout_recognize` configuration was incorrectly retrieved from `kwargs.get("layout_recognize", "DeepDOC")`. Instead, it should be sourced from the `parser_config` parameter, specifically `parser_config.get("layout_recognize", "DeepDOC")`. This mismatch could cause the parser to default to the "DeepDOC" layout recognizer, ignoring any alternative recognition method specified in the parser configuration. As a result, PDF document parsing might use an incorrect recognition engine. The fix ensures the presentation parser consistently uses the `layout_recognize` setting from `parser_config`, aligning with the configuration access patterns used elsewhere in the codebase. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	4 months ago
liuzhenghua	5256980ffb	Fix: Solve the OOM issue when passing large PDF files while using QA chunking method. (#8464) ### What problem does this PR solve? Using the QA chunking method with a large PDF (e.g., 300+ pages) may lead to OOM in the ragflow-worker module. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	4 months ago
HaiyangP	d6a941ebf5	Fix the bug of long type value overflow (#8313) ### What problem does this PR solve? This PR fixes #8271 by extending the int type to float when any value in a column is outside the long type range. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	4 months ago
Jin Hai	4a2ff633e0	Fix typo in code (#8327) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	4 months ago
HaiyangP	baf32ee461	Display only the duplicate column names and corresponding original source. (#8138) ### What problem does this PR solve? This PR aims to solve #8120, which requests a better error display for duplicate column names. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	4 months ago
Kevin Hu	24625e0695	Fix: presentation of PDF using vlm. (#8133) ### What problem does this PR solve? #8109 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	4 months ago
Yongteng Lei	bd4678bca6	Fix: Unnecessary truncation in markdown parser (#7972) ### What problem does this PR solve? Fix unnecessary truncation in markdown parser. So that markdown can work perfectly like [this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576) in #7824, supporting multiple special delimiters. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	5 months ago
Kevin Hu	bfe97d896d	Fix: docx get image exception. (#7636) ### What problem does this PR solve? Close #7631 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	5 months ago
Kevin Hu	321a280031	Feat: add image preview to retrieval test. (#7610) ### What problem does this PR solve? #7608 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	5 months ago
alkscr	baa108f5cc	Fix: markdown table conversion error (#7570) ### What problem does this PR solve? Since `import markdown.markdown` has been changed to `import markdown` in `rag/app/naive.py`, the previous code for converting markdown tables would call the markdown module instead of a callable function. This causes an error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	5 months ago
WhiteBear	5352bdf4da	Error storing tag in Redis (#7541) ### What problem does this PR solve? The parameter positions were incorrect and have been corrected to use keyword argument passing ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	5 months ago
Stephen Hu	1a5608d0f8	Fix: Add title_tks for Pictures (#7365) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/7362 append title_tks ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	6 months ago
Stephen Hu	1662c7eda3	Feat: Markdown add image (#7124) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/6984 1. Markdown parser supports get pictures 2. For Native, when handling Markdown, it will handle images 3. improve merge and ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	6 months ago
QuintinTao	1b4016317e	fix bug chunking:expected string or bytes-like object (#7116) … bytes-like object ### What problem does this PR solve? Fix bug #6990: internal server error while chunking: expected string or bytes-like object. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: unknown <taoshi.ln@chinatelecom.cn>	6 months ago
Kevin Hu	ed5f81b02e	Fix: abnormal cell mergeing. (#6991) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	6 months ago
dylan	5aae73c230	Make error messages during PPT processing clearer. (#6980) ### What problem does this PR solve? Sometimes a slide may trigger a Proxy error (ArgumentException: Parameter is not valid) due to issues in the original file, and this error message can be confusing for users. ### Type of change - [x] Other (please describe):	6 months ago
Kevin Hu	14a3efd756	Fix: docx image exceptions. (#6839) ### What problem does this PR solve? Close #6784 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	6 months ago
Kevin Hu	ee5aa51d43	Fix: point in tag issue. (#6436) ### What problem does this PR solve? #6414 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
fansir	0e0ebaac5f	Feat: Adds hierarchical title path tracking for tables in DOCX documents to improve context association (#6374) ### What problem does this PR solve? Adds hierarchical title path tracking for tables in DOCX documents to improve context association. Previously, extracted tables lacked positional context within document structure. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Kevin Hu	95497b4aab	Fix: adapt to old configurations. (#6321) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Yongteng Lei	9611185eb4	Feat: add VLM-boosted DocX parser (#6307) ### What problem does this PR solve? Add VLM-boosted DocX parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Yongteng Lei	e4380843c4	Feat: add fallback for PDF figure parser (#6305) ### What problem does this PR solve? Add fallback for PDF figure parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Yongteng Lei	1d6760dd84	Feat: add VLM-boosted PDF parser (#6278) ### What problem does this PR solve? Add VLM-boosted PDF parser if VLM is set. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Yongteng Lei	5cf610af40	Feat: add vision LLM PDF parser (#6173) ### What problem does this PR solve? Add vision LLM PDF parser ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	7 months ago
Kevin Hu	1333d3c02a	Fix: float transfer exception. (#6197) ### What problem does this PR solve? #6177 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Kevin Hu	3a99c2b5f4	Refa: PARALLEL_DEVICES is a static parameter. (#6168) ### What problem does this PR solve? ### Type of change - [x] Refactoring	7 months ago
Kevin Hu	bfa8d342b3	Fix: retrieval debug mode issue. (#6150) ### What problem does this PR solve? #6139 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Debug Doctor	3e19044dee	Feat: add OCR's muti-gpus and parallel processing support (#5972) ### What problem does this PR solve? Add OCR's muti-gpus and parallel processing support ### Type of change - [x] New Feature (non-breaking change which adds functionality) @yuzhichang I've tried to resolve the comments in #5697. OCR jobs can now be done on both CPU and GPU. ( By the way, I've encountered a “Generate embedding error” issue #5954 that might be due to my outdated GPUs? idk. ) Please review it and give me suggestions. GPU: ![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e) ![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d) CPU: ![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)	7 months ago
Yongteng Lei	4ff609b6a8	Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027) ### What problem does this PR solve? Optimize OCR garbage identification to reduce unnecessary filtering. #5713 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Yongteng Lei	7cd37c37cd	Feat: add CSV file parsing support (#5989) ### What problem does this PR solve? Add CSV file parsing support #4552, #5849, #5870 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
hy89	b0c21b00d9	Refactor: Optimize error handling and support parsing of XLS(EXCEL97—2003) files. (#5633) Optimize error handling and support parsing of XLS(EXCEL97—2003) files.	8 months ago
Kevin Hu	b418ce5643	Fix table parser issue. (#5482) ### What problem does this PR solve? #1475 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	8 months ago

187 commits (2d89863fddbb360934c6687ecbbdf620f2a5dbbe)