### What problem does this PR solve?
fix "no `tc` element at grid_offset", just log warning and ignore.
stacktrace:
```
Traceback (most recent call last):
File "/ragflow/rag/svr/task_executor.py", line 620, in handle_task
await do_handle_task(task)
File "/ragflow/rag/svr/task_executor.py", line 553, in do_handle_task
chunks = await build_chunks(task, progress_callback)
File "/ragflow/rag/svr/task_executor.py", line 257, in build_chunks
cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],
File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 447, in to_thread_run_sync
return msg_from_thread.unwrap()
File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap
raise captured_error
File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 373, in do_release_then_return_result
return result.unwrap()
File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap
raise captured_error
File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 392, in worker_fn
ret = context.run(sync_fn, *args)
File "/ragflow/rag/svr/task_executor.py", line 257, in <lambda>
cks = await trio.to_thread.run_sync(lambda: chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],
File "/ragflow/rag/app/naive.py", line 384, in chunk
sections, tables = Docx()(filename, binary)
File "/ragflow/rag/app/naive.py", line 230, in __call__
while i < len(r.cells):
File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 438, in cells
return tuple(_iter_row_cells())
File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 436, in _iter_row_cells
yield from iter_tc_cells(tc)
File "/ragflow/.venv/lib/python3.10/site-packages/docx/table.py", line 424, in iter_tc_cells
yield from iter_tc_cells(tc._tc_above) # pyright: ignore[reportPrivateUsage]
File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 741, in _tc_above
return self._tr_above.tc_at_grid_offset(self.grid_offset)
File "/ragflow/.venv/lib/python3.10/site-packages/docx/oxml/table.py", line 98, in tc_at_grid_offset
raise ValueError(f"no `tc` element at grid_offset={grid_offset}")
ValueError: no `tc` element at grid_offset=10
```
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
#9082#6365
<u> **WARNING: it's not compatible with the older version of `Agent`
module, which means that `Agent` from older versions can not work
anymore.**</u>
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Feat: parsing supports jsonl or ldjson format (#9087)
### What problem does this PR solve?
Supports jsonl or ldjson format. Feature request from
[discussion](https://github.com/orgs/infiniflow/discussions/8774).
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Fix: fixed context loss caused by separating markdown tables from original text (#8844)
### What problem does this PR solve?
Fix context loss caused by separating markdown tables from original
text. #6871, #8804.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
docx parse error.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Some docx parse with naive cause error. `block.style.name` in Function
`__get_nearest_title` will be None in some case.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: wenxuan.zhang <wenxuan.zhang@chinacreator.com>
Fix: Unnecessary truncation in markdown parser (#7972)
### What problem does this PR solve?
Fix unnecessary truncation in markdown parser. So that markdown can work
perfectly like
[this](https://github.com/infiniflow/ragflow/issues/7824#issuecomment-2921312576)
in #7824, supporting multiple special delimiters.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Since `import markdown.markdown` has been changed to `import markdown`
in `rag/app/naive.py`, previous code for converting markdown tables
would call a markdown module instead of a callable function. This cause
error.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/6984
1. Markdown parser supports get pictures
2. For Native, when handling Markdown, it will handle images
3. improve merge and
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Feat: Adds hierarchical title path tracking for tables in DOCX documents to improve context association (#6374)
### What problem does this PR solve?
Adds hierarchical title path tracking for tables in DOCX documents to
improve context association. Previously, extracted tables lacked
positional context within document structure.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add fallback for PDF figure parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add VLM-boosted PDF parser if VLM is set.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add vision LLM PDF parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027)
### What problem does this PR solve?
Optimize OCR garbage identification to reduce unnecessary filtering.
#5713
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add CSV file parsing support #4552, #5849, #5870
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Fix rerank_model bug in chat and markdown bug (#4061)
### What problem does this PR solve?
Fix rerank_model bug in chat and markdown bug
#4000#3992
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: liuhua <10215101452@stu.ecun.edu.cn>
### What problem does this PR solve?
Close issue: #3828
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Signed-off-by: jinhai <haijin.chn@gmail.com>
Update progress info and start welcome info (#3768)
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Refactoring
---------
Signed-off-by: jinhai <haijin.chn@gmail.com>
### What problem does this PR solve?
Check tika.parser return result. Close #3229
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Use consistent log file names, introduced initLogger (#3403)
### What problem does this PR solve?
Use consistent log file names, introduced initLogger
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
make titles in markdown not be splited with following content (#2971)
### What problem does this PR solve?
#2970
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add get_txt function to reduce duplicate code
### Type of change
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Format file format from Windows/dos to Unix (#1949)
### What problem does this PR solve?
Related source file is in Windows/DOS format, they are format to Unix
format.
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
Fix a "TypeError: expected string or buffer bug" in docx files extracted using Knowledge Graph.#1859 (#1865)
### What problem does this PR solve?
Fix a "TypeError: expected string or buffer bug" in docx files extracted
using Knowledge Graph. #1859
```
Traceback (most recent call last):
File "//Users/XXX/ragflow/rag/svr/task_executor.py", line 149, in build
cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/XXX/ragflow/rag/app/knowledge_graph.py", line 18, in chunk
chunks = build_knowlege_graph_chunks(tenant_id, sections, callback,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/XXX/ragflow/graphrag/index.py", line 87, in build_knowlege_graph_chunks
tkn_cnt = num_tokens_from_string(chunks[i])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/XXX/github/ragflow/rag/utils/__init__.py", line 79, in num_tokens_from_string
num_tokens = len(encoder.encode(string))
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/XXX/tiktoken/core.py", line 116, in encode
if match := _special_token_regex(disallowed_special).search(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or buffer
```
This type is `Dict`
<img width="1689" alt="Pasted Graphic 3"
src="https://github.com/user-attachments/assets/e5ba5c45-df1d-4697-98c9-14365c839f20">
The correct type should be ` Str`
<img width="1725" alt="Pasted Graphic 2"
src="https://github.com/user-attachments/assets/e54d5e60-4ce4-4180-b394-24e485013534">
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
add support for eml file parser
#1363
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Zhedong Cen <cenzhedong2@126.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Fix: Embedding err when docx contains unsupported images (#1720)
### What problem does this PR solve?
Fix the problem of not being able to embedding when docx document
contains unsupported images.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
#1704
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Support table for markdown file in general parser (#1278)
### What problem does this PR solve?
Support extracting table for markdown file in general parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)