Fix broken data stream when writing image file (#9354)
### What problem does this PR solve?
fix "broken data stream when writing image file", just log warning and
ignore
Close #8379
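A minimal sketch of the log-and-ignore handling (the surrounding save call is illustrative, not the actual patch):
```python
import logging
from io import BytesIO
from PIL import Image

def try_write_jpeg(image: Image.Image) -> bytes | None:
    buf = BytesIO()
    try:
        image.save(buf, format="JPEG")
        return buf.getvalue()
    except OSError as e:
        # Pillow raises "broken data stream when writing image file" as an
        # OSError when the source image data is truncated or corrupt.
        logging.warning("Failed to write image: %s; skipping it.", e)
        return None
```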
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
FIX: If chunk["content_with_weight"] contains one or more unpaired surrogate characters (such as incomplete emoji or other special characters), then calling .encode("utf-8") directly will raise a UnicodeEncodeError. (#9246)
FIX: If chunk["content_with_weight"] contains one or more unpaired
surrogate characters (such as incomplete emoji or other special
characters), then calling .encode("utf-8") directly will raise a
UnicodeEncodeError.
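A minimal sketch of a defensive encode for this case (the helper name and exact handling are assumptions, not the actual patch):
```python
import logging

def safe_utf8(text: str) -> bytes:
    """Encode as UTF-8, degrading gracefully on unpaired surrogates."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        logging.warning("Unpaired surrogate in chunk content; replacing bad characters.")
        # errors="replace" substitutes "?" for each unencodable character.
        return text.encode("utf-8", errors="replace")
```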
### What problem does this PR solve?
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Fix: invalid save() arguments for slide thumbnails (#8851)
### What problem does this PR solve?
Fixed invalid save() arguments for slide thumbnails.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Fix memory leaks in PIL image and BytesIO handling during chunk processing (#8522)
### What problem does this PR solve?
This PR addresses critical memory leaks in the task executor's image
processing pipeline. The current implementation fails to properly
dispose of PIL Image objects and BytesIO buffers during chunk
processing, leading to progressive memory accumulation that can cause
the task executor to consume excessive memory over time.
### Background context
- The `upload_to_minio` function processes images from document chunks
and converts them to JPEG format for storage.
- PIL Image objects hold significant memory resources that must be
explicitly closed to prevent memory leaks.
- BytesIO objects also consume memory and should be properly disposed of
after use.
- In high-throughput scenarios with many image-containing documents,
these memory leaks can lead to out-of-memory errors and degraded
performance.
### Specific issues fixed
- PIL Image objects were not being explicitly closed after processing.
- BytesIO buffers lacked proper cleanup in all code paths.
- Converted images (RGBA/P to RGB) were not disposing of the original
image object.
- Memory references to large image data were not being cleared promptly.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Performance Improvement
### Changes made
- Added explicit `d["image"].close()` calls after image processing
operations.
- Implemented proper cleanup of converted images when changing formats
from RGBA/P to RGB.
- Enhanced BytesIO cleanup with `try/finally` blocks to ensure disposal
in all code paths.
- Added explicit `del d["image"]` to clear memory references after
processing.
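A minimal sketch of the resulting cleanup pattern (the function and dict layout follow the description above; the exact code is an assumption):
```python
from io import BytesIO

def chunk_image_to_jpeg(d: dict) -> bytes:
    """Convert a chunk's PIL image to JPEG bytes, releasing memory eagerly."""
    img = d["image"]
    buf = BytesIO()
    try:
        if img.mode in ("RGBA", "P"):
            rgb = img.convert("RGB")
            img.close()          # dispose of the original after conversion
            img = rgb
        img.save(buf, format="JPEG")
        return buf.getvalue()
    finally:
        buf.close()              # buffer freed on every code path
        img.close()              # release the PIL image resources
        del d["image"]           # clear the reference to the large image data
```
The `try/finally` guarantees the buffer and image are released even if `save` raises.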
This fix ensures stable memory usage during long-running document
processing tasks and prevents potential out-of-memory conditions in
production environments.
Refactor: improve the cancellation-check logic (#8524)
### What problem does this PR solve?
Improve the logic that checks for task cancellation.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Fix: some cases where a task returns without setting progress (#8469)
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/8466
Walking through the code, the current logic is as follows: when
`do_handle_task` raises an exception, `handle_task` sets the progress.
In some cases, however, `do_handle_task` returns early without setting
the correct progress; the Redis stream message is then acked even though
the task still appears to be running.
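A minimal sketch of the fix pattern (the early-exit condition and helper names are illustrative, not the exact code):
```python
async def do_handle_task(task):
    if task_canceled(task):  # hypothetical early-exit check
        # Record terminal progress before returning, so the Redis message
        # can be safely acked without leaving the task stuck in "running".
        set_progress(task["id"], prog=-1, msg="Task canceled.")
        return
    # ... normal processing ...
```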
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
FIX: Saving an RGBA image directly as JPEG will cause an error. If the… (#8399)
Saving an RGBA image directly as JPEG will cause an error. If the image
is in RGBA mode, convert it to RGB mode before saving it in JPG format.
### What problem does this PR solve?
During document parsing in the knowledge base, we occasionally encounter
the error 'cannot write mode RGBA as JPEG.' This occurs because images
in RGBA mode cannot be saved directly as JPEG; they must be converted to
RGB first.
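A minimal sketch of the guard (assuming Pillow; names are illustrative):
```python
from PIL import Image

def save_jpeg(img: Image.Image, path: str) -> None:
    # JPEG has no alpha channel, so RGBA (and palette) images must be
    # converted to RGB before saving, or Pillow raises
    # "cannot write mode RGBA as JPEG".
    if img.mode in ("RGBA", "P"):
        img = img.convert("RGB")
    img.save(path, format="JPEG")
```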
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Fix: document parsing via the API causes several problems (#8407)
### What problem does this PR solve?
#8391, #8404
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Feat: make document parsing and embedding batch sizes configurable via environment variables (#8266)
### Description
This PR introduces two new environment variables, `DOC_BULK_SIZE` and
`EMBEDDING_BATCH_SIZE`, to allow flexible tuning of batch sizes for
document parsing and embedding vectorization in RAGFlow. By making these
parameters configurable, users can optimize performance and resource
usage according to their hardware capabilities and workload
requirements.
### What problem does this PR solve?
Previously, the batch sizes for document parsing and embedding were
hardcoded, limiting the ability to adjust throughput and memory
consumption. This PR enables users to set these values via environment
variables (in `.env`, Helm chart, or directly in the deployment
environment), improving flexibility and scalability for both small and
large deployments.
- `DOC_BULK_SIZE`: Controls how many document chunks are processed in a
single batch during document parsing (default: 4).
- `EMBEDDING_BATCH_SIZE`: Controls how many text chunks are processed
in a single batch during embedding vectorization (default: 16).
This change updates the codebase, documentation, and configuration files
to reflect the new options.
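A minimal sketch of how these variables are typically consumed (module location and parsing details are assumptions):
```python
import os

# How many document chunks are bulk-inserted per batch during parsing.
DOC_BULK_SIZE = int(os.environ.get("DOC_BULK_SIZE", "4"))

# How many text chunks are embedded per call during vectorization.
EMBEDDING_BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", "16"))
```
Larger values generally increase throughput at the cost of memory, as the before/after screenshots below illustrate.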
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
### Additional context
- Updated `.env`, `helm/values.yaml`, and documentation to describe
the new variables.
- Modified relevant code paths to use the environment variables instead
of hardcoded values.
- Users can now tune these parameters to achieve better throughput or
reduce memory usage as needed.
Before (default value):
<img width="643" alt="image"
src="https://github.com/user-attachments/assets/086e1173-18f3-419d-a0f5-68394f63866a"
/>
After (10x):
<img width="777" alt="image"
src="https://github.com/user-attachments/assets/5722bbc0-0bcb-4536-b928-077031e550f1"
/>
Refa: revert to original task message collection logic (#8251)
### What problem does this PR solve?
Get rid of 'RedisDB.get_unacked_iterator queue rag_flow_svr_queue_1
doesn't exist'
----
Edit: revert to original message collection logic.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Refa: GraphRAG, with an explanation of GraphRAG stalling behavior on large files (#8223)
### What problem does this PR solve?
This PR investigates the cause of #7957.
TL;DR: Incorrect similarity calculations lead to too many candidates.
Since candidate selection involves interaction with the LLM, this causes
significant delays in the program.
What this PR does:
1. **Fix similarity calculation**:
When processing a 64-page government document, the corrected similarity
calculation reduces the number of candidates from over 100,000 to around
16,000. With a default batch size of 100 pairs per LLM call, this fix
reduces unnecessary LLM interactions from over 1,000 calls to around
160, a roughly 10x improvement.
2. **Add concurrency and timeout limits**:
Up to 5 entity types are processed in "parallel", each with a 180-second
timeout (see the sketch after this list). These limits may become
configurable in future updates.
3. **Improve logging**:
The candidate resolution process now reports progress in real time.
4. **Mitigate potential concurrency risks**
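A minimal sketch of the limits from item 2 (assuming trio, which the executor uses elsewhere; `resolve_type` is a hypothetical stand-in for the per-entity-type work):
```python
import trio

MAX_PARALLEL_TYPES = 5   # up to 5 entity types in flight at once
TYPE_TIMEOUT_S = 180     # per-entity-type timeout, in seconds

limiter = trio.CapacityLimiter(MAX_PARALLEL_TYPES)

async def resolve_type(entity_type):
    """Stand-in for the real candidate-resolution LLM calls."""
    await trio.sleep(0.1)

async def resolve_with_limits(entity_type):
    async with limiter:                           # cap the parallelism
        with trio.move_on_after(TYPE_TIMEOUT_S):  # abandon types that stall
            await resolve_type(entity_type)

async def resolve_all(entity_types):
    async with trio.open_nursery() as nursery:
        for et in entity_types:
            nursery.start_soon(resolve_with_limits, et)
```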
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
fix: single task executor getting all tasks from Redis queue (#7330)
### What problem does this PR solve?
Currently, as long as there are tasks in Redis, this loop keeps fetching
them. A single task executor ends up with many tasks in the pending
state, and we then have to wait for those pending tasks to be returned
to the queue.
In the first place, if we set `MAX_CONCURRENT_TASKS` to X, then only X
tasks should be picked from the queue; the rest should be left for other
`task_executors`, or picked up once a slot in the current executor frees
up. This PR ensures this behavior.
The additional changes are due to Ruff linting in pre-commit, which I
believe is expected to keep the coding style consistent.
### Type of change
- [X] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
### What problem does this PR solve?
https://github.com/infiniflow/ragflow/issues/7761
but it may be difficult to achieve zero delay (which would require
passing the cancel token to all parts).
Another solution is a zero-delay effect in the UI only: the cancellation
shows immediately, and the task stops later.
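A minimal sketch of that approach (the Redis key and helpers are hypothetical):
```python
import redis

r = redis.Redis()

def task_canceled(task_id: str) -> bool:
    """Hypothetical flag the UI sets as soon as the user cancels."""
    return r.get(f"task_cancel:{task_id}") == b"1"

def process_chunks(task_id: str, chunks: list) -> None:
    for i, chunk in enumerate(chunks):
        # Poll at checkpoints: the UI reports "canceled" immediately,
        # while the task itself stops at the next checkpoint.
        if i % 10 == 0 and task_canceled(task_id):
            return
        do_chunk_work(chunk)

def do_chunk_work(chunk):
    """Stand-in for the real per-chunk processing."""
    pass
```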
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Delete useless image blobs when the task executor hits edge cases.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
fix: concurrent execution limit in the task executor failing and causing OOM (issue #7580) (#7700)
### What problem does this PR solve?
## Cause of the bug:
Due to improper use of trio's CapacityLimiter, the configuration
parameter MAX_CONCURRENT_TASKS had no effect, causing the executor to
take a large number of tasks out of the Redis queue at once.
When a large number of tasks exist at the same time, this makes the task
executor occupy too much memory and get killed by the OS, suspending all
executing tasks.
## Fix:
Added a task_manager method at the entry point of
/rag/svr/task_executor.py so that the CapacityLimiter takes effect, and
deleted the ineffective async with statement.
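A minimal sketch of the resulting pattern (assuming trio; `handle_task` stands in for the real worker):
```python
import trio

MAX_CONCURRENT_TASKS = 5   # read from configuration in the real code
task_limiter = trio.CapacityLimiter(MAX_CONCURRENT_TASKS)

async def handle_task():
    """Stand-in for fetching one message from Redis and processing it."""
    await trio.sleep(1)

async def task_manager(token):
    try:
        await handle_task()
    finally:
        task_limiter.release_on_behalf_of(token)  # free the slot even on failure

async def main():
    async with trio.open_nursery() as nursery:
        while True:
            token = object()
            # Block until a slot is free, so at most MAX_CONCURRENT_TASKS
            # tasks are pulled from the Redis queue at any one time.
            await task_limiter.acquire_on_behalf_of(token)
            nursery.start_soon(task_manager, token)
```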
## Fix result:
After testing, the task executor behaves as expected: it executes at
most $MAX_CONCURRENT_TASKS tasks concurrently.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Fix: missing graph resolution and community extraction in graphrag tasks (#7586)
### What problem does this PR solve?
Whether to apply graph resolution and community extraction is stored in
`task["kb_parser_config"]`. However, the previous code read
`graphrag_conf` from `task["parser_config"]`, so `with_resolution` and
`with_community` were always false.
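A minimal sketch of the nature of the fix (exact keys and defaults are assumptions):
```python
# Before: wrong key, so the graphrag settings were always missing.
graphrag_conf = task["parser_config"].get("graphrag", {})

# After: read from the knowledge-base parser config, where the
# resolution/community flags are actually stored.
graphrag_conf = task["kb_parser_config"].get("graphrag", {})

with_resolution = graphrag_conf.get("resolution", False)
with_community = graphrag_conf.get("community", False)
```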
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Fix two cases where a local Elasticsearch tag search has results that
are filtered out by score:
1. The document has empty tags, and the LLM is not visited.
2. The code may use empty examples in the prompt for the LLM tag search.
Co-authored-by: huangfuqunze <huangfuqunze.hfqz@alibaba-inc.com>
### What problem does this PR solve?
When parsing documents containing images, the current code uses a
single-threaded approach to call the VL model, resulting in extremely
slow parsing speed (e.g., parsing a Word document with dozens of images
takes over 20 minutes).
By switching to a multithreaded approach to call the VL model, the
parsing speed can be improved to an acceptable level.
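A minimal sketch of the multithreaded approach (the `describe` call on the VL model is illustrative):
```python
from concurrent.futures import ThreadPoolExecutor

def describe_images(vl_model, images, max_workers=8):
    """Send many images to the VL model concurrently instead of one by one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls in parallel threads but preserves order,
        # so captions still line up with their source images.
        return list(pool.map(vl_model.describe, images))
```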
### Type of change
- [x] Performance Improvement
---------
Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
### What problem does this PR solve?
Fix the Redis lock always timing out (reorder the logic to release the
lock first).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
The lock is not released correctly when the task_executor exits abnormally.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
feat: Recover pending tasks when a pod restarts. (#7073)
### What problem does this PR solve?
If you deploy Ragflow using Kubernetes, the hostname will change during
a rolling update. This causes the consumer name of the task executor to
change, making it impossible to schedule tasks that were previously in a
pending state.
To address this, I introduced a recovery task that scans these pending
messages and re-publishes them, allowing the tasks to continue being
processed.
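A minimal sketch of such a recovery pass over the stream's consumer group (assuming redis-py; stream, group, and helper names are illustrative):
```python
import redis

r = redis.Redis()
STREAM, GROUP, CONSUMER = "rag_flow_svr_queue", "rag_flow_group", "pod-after-restart"

def recover_pending(min_idle_ms: int = 60_000) -> None:
    """Claim messages stuck with dead consumers and hand them to this pod."""
    cursor = "0-0"
    while True:
        # XAUTOCLAIM transfers ownership of messages that have been idle
        # longer than min_idle_ms, no matter which consumer held them.
        resp = r.xautoclaim(STREAM, GROUP, CONSUMER,
                            min_idle_time=min_idle_ms, start_id=cursor)
        cursor, messages = resp[0], resp[1]
        if not messages:
            break
        for msg_id, fields in messages:
            republish(msg_id, fields)

def republish(msg_id, fields):
    """Stand-in for re-publishing the task so it continues processing."""
    pass
```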
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
---------
Co-authored-by: liuzhenghua-jk <liuzhenghua-jk@360shuke.com>
fix(nursery): Fix Closure Trap Issues in Trio Concurrent Tasks (#7106)
## Problem Description
Multiple files in the RAGFlow project contain closure trap issues when
using lambda functions with `trio.open_nursery()`. This problem causes
concurrent tasks created in loops to reference the same variable,
resulting in all tasks processing the same data (the data from the last
iteration) rather than each task processing its corresponding data from
the loop.
## Issue Details
When using a `lambda` to create a closure function and passing it to
`nursery.start_soon()` within a loop, the lambda function captures a
reference to the loop variable rather than its value. For example:
```python
# Problematic code
async with trio.open_nursery() as nursery:
    for d in docs:
        nursery.start_soon(lambda: doc_keyword_extraction(chat_mdl, d, topn))
```
In this pattern, when concurrent tasks begin execution, `d` has already
become the value after the loop ends (typically the last element),
causing all tasks to use the same data.
## Fix Solution
Changed the way concurrent tasks are created with `nursery.start_soon()`
by leveraging Trio's API design to directly pass the function and its
arguments separately:
```python
# Fixed code
async with trio.open_nursery() as nursery:
    for d in docs:
        nursery.start_soon(doc_keyword_extraction, chat_mdl, d, topn)
```
This way, each task uses the parameter values at the time of the
function call, rather than references captured through closures.
## Fixed Files
Fixed closure traps in the following files:
1. `rag/svr/task_executor.py`: 3 fixes, involving document keyword
extraction, question generation, and tag processing
2. `rag/raptor.py`: 1 fix, involving document summarization
3. `graphrag/utils.py`: 2 fixes, involving graph node and edge
processing
4. `graphrag/entity_resolution.py`: 2 fixes, involving entity resolution
and graph node merging
5. `graphrag/general/mind_map_extractor.py`: 2 fixes, involving document
processing
6. `graphrag/general/extractor.py`: 3 fixes, involving content
processing and graph node/edge merging
7. `graphrag/general/community_reports_extractor.py`: 1 fix, involving
community report extraction
## Potential Impact
This fix resolves a serious concurrency issue that could have caused:
- Data processing errors (processing duplicate data)
- Performance degradation (all tasks working on the same data)
- Inconsistent results (some data not being processed)
After the fix, all concurrent tasks should correctly process their
respective data, improving system correctness and reliability.
### What problem does this PR solve?
Removed set_entity and set_relation to avoid accessing the doc engine
during graph computation.
Introduced GraphChange to avoid writing unchanged chunks.
### Type of change
- [x] Performance Improvement
Fix: Add a basic example when the example of content_tagging is empty (#6276)
### What problem does this PR solve?
When using an LLM for auto-tagging, if there are no examples, the tag
format generated by the LLM may be wrong, which causes Elasticsearch
insert errors. Adding a basic example avoids this problem.
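A minimal sketch of the fallback idea (the example content and tag format here are assumptions, not the actual default added by this PR):
```python
# Hypothetical fallback example so the LLM always sees one well-formed
# sample of the expected {tag: relevance} output.
DEFAULT_TAG_EXAMPLE = [{
    "content": "A short passage about deploying web services with Docker.",
    "tags": {"docker": 8, "deployment": 6},
}]

def tagging_examples(examples: list) -> list:
    return examples if examples else DEFAULT_TAG_EXAMPLE
```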
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)