ragflow

Commit graph

Author	SHA1	Message	Date
Jay Xu	6d1078b538	fix 'KeyError: "There is no item named 'word/NULL' in the archive"' (#9455) ### What problem does this PR solve? Issue referring to: https://github.com/python-openxml/python-docx/issues/797 Fix referring to: https://github.com/python-openxml/python-docx/issues/1105#issuecomment-1298075246 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
Jay Xu	7f08ba47d7	Fix "no `tc` element at grid_offset" (#9375) ### What problem does this PR solve? fix "no `tc` element at grid_offset", just log warning and ignore. stacktrace: ``` Traceback (most recent call last): File "/ragflow/rag/svr/task_executor.py", line 620, in handle_task await do_handle_task(task) File "/ragflow/rag/svr/task_executor.py", line 553, in do_handle_task chunks = await build_chunks(task, progress_callback) File "/ragflow/rag/svr/task_executor.py", line 257, in build_chunks cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"], File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 447, in to_thread_run_sync return msg_from_thread.unwrap() File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap raise captured_error File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 373, in do_release_then_return_result return result.unwrap() File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap raise captured_error File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 392, in worker_fn ret = context.run(sync_fn, *args) File "/ragflow/rag/svr/task_executor.py", line 257, in <lambda> cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"], File "/ragflow/rag/app/naive.py", line 384, in chunk sections, tables = Docx()(filename, binary) File "/ragflow/rag/app/naive.py", line 230, in __call__ while i < len(r.cells): File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 438, in cells return tuple(_iter_row_cells()) File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 436, in _iter_row_cells yield from iter_tc_cells(tc) File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 424, in iter_tc_cells yield from iter_tc_cells(tc._tc_above) # pyright: ignore[reportPrivateUsage] File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 741, in _tc_above return self._tr_above.tc_at_grid_offset(self.grid_offset) File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 98, in tc_at_grid_offset raise ValueError(f"no `tc` element at grid_offset={grid_offset}") ValueError: no `tc` element at grid_offset=10 ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2 months ago
Kevin Hu	d9fe279dde	Feat: Redesign and refactor agent module (#9113) ### What problem does this PR solve? #9082 #6365 <u> WARNING: it's not compatible with the older version of `Agent` module, which means that `Agent` from older versions can not work anymore.</u> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	3 months ago
Yongteng Lei	39ef2ffba9	Feat: parsing supports jsonl or ldjson format (#9087) ### What problem does this PR solve? Supports jsonl or ldjson format. Feature request from [discussion](https://github.com/orgs/infiniflow/discussions/8774). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	3 months ago
Stephen Hu	92cfbcb382	Fix: when parse markdown support extract image at local (#8906) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/8902 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
Yongteng Lei	51a8604dcb	Fix: fixed context loss caused by separating markdown tables from original text (#8844) ### What problem does this PR solve? Fix context loss caused by separating markdown tables from original text. #6871, #8804. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	3 months ago
wenxuan.zhang	f586dd0a96	Fix: docx parse error. (#8600) ### What problem does this PR solve? docx parse error. ![image](https://github.com/user-attachments/assets/efbe6d1b-10c8-415e-b693-a86f73e1ffa6) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### What problem does this PR solve? Some docx parse with naive cause error. `block.style.name` in Function `__get_nearest_title` will be None in some case. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: wenxuan.zhang <wenxuan.zhang@chinacreator.com>	4 months ago
Jin Hai	4a2ff633e0	Fix typo in code (#8327) ### What problem does this PR solve? Fix typo in code ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	4 months ago
Yongteng Lei	bd4678bca6	Fix: Unnecessary truncation in markdown parser (#7972) ### What problem does this PR solve? Fix unnecessary truncation in markdown parser. So that markdown can work perfectly like [this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576) in #7824, supporting multiple special delimiters. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	5 months ago
Kevin Hu	bfe97d896d	Fix: docx get image exception. (#7636) ### What problem does this PR solve? Close #7631 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	5 months ago
alkscr	baa108f5cc	Fix: markdown table conversion error (#7570) ### What problem does this PR solve? Since `import markdown.markdown` has been changed to `import markdown` in `rag/app/naive.py`, previous code for converting markdown tables would call a markdown module instead of a callable function. This cause error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	5 months ago
Stephen Hu	1662c7eda3	Feat: Markdown add image (#7124) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/6984 1. Markdown parser supports get pictures 2. For Native, when handling Markdown, it will handle images 3. improve merge and ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	6 months ago
Kevin Hu	14a3efd756	Fix: docx image exceptions. (#6839) ### What problem does this PR solve? Close #6784 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	6 months ago
fansir	0e0ebaac5f	Feat: Adds hierarchical title path tracking for tables in DOCX documents to improve context association (#6374) ### What problem does this PR solve? Adds hierarchical title path tracking for tables in DOCX documents to improve context association. Previously, extracted tables lacked positional context within document structure. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Kevin Hu	95497b4aab	Fix: adapt to old configurations. (#6321) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Yongteng Lei	9611185eb4	Feat: add VLM-boosted DocX parser (#6307) ### What problem does this PR solve? Add VLM-boosted DocX parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Yongteng Lei	e4380843c4	Feat: add fallback for PDF figure parser (#6305) ### What problem does this PR solve? Add fallback for PDF figure parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Yongteng Lei	1d6760dd84	Feat: add VLM-boosted PDF parser (#6278) ### What problem does this PR solve? Add VLM-boosted PDF parser if VLM is set. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Yongteng Lei	5cf610af40	Feat: add vision LLM PDF parser (#6173) ### What problem does this PR solve? Add vision LLM PDF parser ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	7 months ago
Kevin Hu	3a99c2b5f4	Refa: PARALLEL_DEVICES is a static parameter. (#6168) ### What problem does this PR solve? ### Type of change - [x] Refactoring	7 months ago
Kevin Hu	bfa8d342b3	Fix: retrieval debug mode issue. (#6150) ### What problem does this PR solve? #6139 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Debug Doctor	3e19044dee	Feat: add OCR's muti-gpus and parallel processing support (#5972) ### What problem does this PR solve? Add OCR's muti-gpus and parallel processing support ### Type of change - [x] New Feature (non-breaking change which adds functionality) @yuzhichang I've tried to resolve the comments in #5697. OCR jobs can now be done on both CPU and GPU. ( By the way, I've encountered a “Generate embedding error” issue #5954 that might be due to my outdated GPUs? idk. ) Please review it and give me suggestions. GPU: ![gpu_ocr](https://github.com/user-attachments/assets/0ee2ecfb-a665-4e50-8bc7-15941b9cd80e) ![smi](https://github.com/user-attachments/assets/a2312f8c-cf24-443d-bf89-bec50503546d) CPU: ![cpu_ocr](https://github.com/user-attachments/assets/1ba6bb0b-94df-41ea-be79-790096da4bf1)	7 months ago
Yongteng Lei	4ff609b6a8	Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027) ### What problem does this PR solve? Optimize OCR garbage identification to reduce unnecessary filtering. #5713 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	7 months ago
Yongteng Lei	7cd37c37cd	Feat: add CSV file parsing support (#5989) ### What problem does this PR solve? Add CSV file parsing support #4552, #5849, #5870 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	7 months ago
Kevin Hu	c28bc41a96	Fix docx table issue. (#5117) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	8 months ago
Kevin Hu	dd0ebbea35	Light GraphRAG (#4585) ### What problem does this PR solve? #4543 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	9 months ago
Jin Hai	3894de895b	Update comments (#4569) ### What problem does this PR solve? Add license statement. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	9 months ago
liuhua	1d65299791	Fix rerank_model bug in chat and markdown bug (#4061) ### What problem does this PR solve? Fix rerank_model bug in chat and markdown bug #4000 #3992 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>	10 months ago
Jin Hai	821fdf02b4	Fix parsing JSON file error (#3829) ### What problem does this PR solve? Close issue: #3828 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: jinhai <haijin.chn@gmail.com>	11 months ago
Jin Hai	08c1a5e1e8	Refactor parse progress (#3781) ### What problem does this PR solve? Refactor parse file progress ### Type of change - [x] Refactoring Signed-off-by: jinhai <haijin.chn@gmail.com>	11 months ago
Jin Hai	e079656473	Update progress info and start welcome info (#3768) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Refactoring --------- Signed-off-by: jinhai <haijin.chn@gmail.com>	11 months ago
Zhichang Yu	482c1b59c8	Check tika.parser return result (#3564) ### What problem does this PR solve? Check tika.parser return result. Close #3229 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	11 months ago
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	11 months ago
Zhichang Yu	a2a5631da4	Rework logging (#3358) Unified all log files into one. ### What problem does this PR solve? Unified all log files into one. ### Type of change - [x] Refactoring	11 months ago
Kevin Hu	1fce6caf80	make titles in markdown not be splited with following content (#2971) ### What problem does this PR solve? #2970 ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	1 year ago
lidp	20e63f8ec4	Fix docx images (#2756) ### What problem does this PR solve? #2755 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	1 year ago
yqkcn	570ad420a8	remove unused import (#2679) ### What problem does this PR solve? ### Type of change - [x] Refactoring	1 year ago
yqkcn	aea553c3a8	Add get_txt function (#2639) ### What problem does this PR solve? Add get_txt function to reduce duplicate code ### Type of change - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	1 year ago
Kevin Hu	78856703c4	make excel parsing configurable (#2517) ### What problem does this PR solve? #2516 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	1 year ago
Kevin Hu	01acc3fd5a	fix duplicated llm name betweeen different suppliers (#2477) ### What problem does this PR solve? #2465 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	1 year ago
Jin Hai	6b3a40be5c	Format file format from Windows/dos to Unix (#1949) ### What problem does this PR solve? Related source file is in Windows/DOS format, they are format to Unix format. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	1 year ago
Kevin Hu	d73a75506e	fix mind map bug (#1934) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	1 year ago
Kevin Hu	cafdee536f	add sql to naive parser (#1908) ### What problem does this PR solve? ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	1 year ago
Kung Quang	19ded65c66	Fix a "TypeError: expected string or buffer bug" in docx files extracted using Knowledge Graph.#1859 (#1865) ### What problem does this PR solve? Fix a "TypeError: expected string or buffer bug" in docx files extracted using Knowledge Graph. #1859 ``` Traceback (most recent call last): File "//Users/XXX/ragflow/rag/svr/task_executor.py", line 149, in build cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"], ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/XXX/ragflow/rag/app/knowledge_graph.py", line 18, in chunk chunks = build_knowlege_graph_chunks(tenant_id, sections, callback, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/XXX/ragflow/graphrag/index.py", line 87, in build_knowlege_graph_chunks tkn_cnt = num_tokens_from_string(chunks[i]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/XXX/github/ragflow/rag/utils/__init__.py", line 79, in num_tokens_from_string num_tokens = len(encoder.encode(string)) ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/XXX/tiktoken/core.py", line 116, in encode if match := _special_token_regex(disallowed_special).search(text): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: expected string or buffer ``` This type is `Dict` <img width="1689" alt="Pasted Graphic 3" src="https://github.com/user-attachments/assets/e5ba5c45-df1d-4697-98c9-14365c839f20"> The correct type should be ` Str` <img width="1725" alt="Pasted Graphic 2" src="https://github.com/user-attachments/assets/e54d5e60-4ce4-4180-b394-24e485013534"> ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	1 year ago
黄腾	ede733e130	add support for eml file parser (#1768) ### What problem does this PR solve? add support for eml file parser #1363 ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Zhedong Cen <cenzhedong2@126.com> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	1 year ago
Kevin Hu	fe797bcc66	be better chunks before graphrag (#1811) ### What problem does this PR solve? #1594 ### Type of change - [x] Refactoring	1 year ago
Kevin Hu	152072f900	Add graphrag (#1793) ### What problem does this PR solve? #1594 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	1 year ago
Yuhao Tsui	a973b9e01f	Fix: Embedding err when docx contains unsupported images (#1720) ### What problem does this PR solve? Fix the problem of not being able to embedding when docx document contains unsupported images. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	1 year ago
H	0cb588f7bf	Fix docx parser line bug (#1715) ### What problem does this PR solve? #1704 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	1 year ago
Zhedong Cen	a95c1d45f0	Support table for markdown file in general parser (#1278) ### What problem does this PR solve? Support extracting table for markdown file in general parser ### Type of change - [x] New Feature (non-breaking change which adds functionality)	1 year ago


91 Commits (6d1078b5385dd88d1ecbca23b381961be8f9feec)