
Feat: make document parsing and embedding batch sizes configurable via environment variables (#8266)

### Description

This PR introduces two new environment variables, `DOC_BULK_SIZE` and
`EMBEDDING_BATCH_SIZE`, to allow flexible tuning of batch sizes for
document parsing and embedding vectorization in RAGFlow. By making these
parameters configurable, users can optimize performance and resource
usage according to their hardware capabilities and workload
requirements.

### What problem does this PR solve?

Previously, the batch sizes for document parsing and embedding were
hardcoded, limiting the ability to adjust throughput and memory
consumption. This PR enables users to set these values via environment
variables (in `.env`, Helm chart, or directly in the deployment
environment), improving flexibility and scalability for both small and
large deployments.

- `DOC_BULK_SIZE`: Controls how many document chunks are processed in a
single batch during document parsing (default: 4).
- `EMBEDDING_BATCH_SIZE`: Controls how many text chunks are processed
in a single batch during embedding vectorization (default: 16).

This change updates the codebase, documentation, and configuration files
to reflect the new options.
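
For reference, both variables are read once at startup and fall back to the previous hardcoded values when unset, mirroring the `rag/settings.py` change in this PR; the override values in the trailing comment are illustrative only:

```python
import os

# Defaults preserve the previous hardcoded behaviour.
DOC_BULK_SIZE = int(os.environ.get("DOC_BULK_SIZE", 4))
EMBEDDING_BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", 16))

# Illustrative override, e.g. set in docker/.env or exported in the deployment environment:
#   DOC_BULK_SIZE=8
#   EMBEDDING_BATCH_SIZE=64
```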

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):

### Additional context
- Updated `.env`, `helm/values.yaml`, and documentation to describe
the new variables.
- Modified relevant code paths to use the environment variables instead
of hardcoded values.
- Users can now tune these parameters to achieve better throughput or
reduce memory usage as needed (see the sketch below).
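
To make the throughput/memory trade-off concrete, here is a small back-of-the-envelope helper (illustrative only, not part of the codebase) showing how the two batch sizes translate into the number of doc-store bulk inserts and embedding calls:

```python
import math

def batch_counts(n_chunks: int, doc_bulk_size: int = 4, embedding_batch_size: int = 16):
    """Return (doc-store bulk inserts, embedding calls) needed for n_chunks."""
    return (math.ceil(n_chunks / doc_bulk_size),
            math.ceil(n_chunks / embedding_batch_size))

# 1,000 chunks with the defaults vs. 10x larger batches:
print(batch_counts(1000))           # (250, 63)
print(batch_counts(1000, 40, 160))  # (25, 7)
```

Fewer, larger batches mean fewer round trips to the embedding model and the doc store, at the cost of more memory per call.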

Before (default values):
<img width="643" alt="image"
src="https://github.com/user-attachments/assets/086e1173-18f3-419d-a0f5-68394f63866a"
/>

After (10x):
<img width="777" alt="image"
src="https://github.com/user-attachments/assets/5722bbc0-0bcb-4536-b928-077031e550f1"
/>
Commit 8f9bcb1c74 by cutiechi, 4 months ago (tag: v0.19.1)

docker/.env (+8, -0)

# Note that neither `MAX_CONTENT_LENGTH` nor `client_max_body_size` sets the maximum size for files uploaded to an agent.
# See https://ragflow.io/docs/dev/begin_component for details.


# Controls how many documents are processed in a single batch.
# Defaults to 4 if DOC_BULK_SIZE is not explicitly set.
DOC_BULK_SIZE=${DOC_BULK_SIZE:-4}

# Defines the number of items to process per batch when generating embeddings.
# Defaults to 16 if EMBEDDING_BATCH_SIZE is not set in the environment.
EMBEDDING_BATCH_SIZE=${EMBEDDING_BATCH_SIZE:-16}

# Log level for the RAGFlow's own and imported packages.
# Available levels:
# - `DEBUG`

docker/README.md (+10, -0)

- `MAX_CONTENT_LENGTH`
  The maximum file size for each uploaded file, in bytes. You can uncomment this line if you wish to change the 128M file size limit. After making the change, ensure you update `client_max_body_size` in nginx/nginx.conf correspondingly.


### Doc bulk size

- `DOC_BULK_SIZE`
The number of document chunks processed in a single batch during document parsing. Defaults to `4`.

### Embedding batch size

- `EMBEDDING_BATCH_SIZE`
The number of text chunks processed in a single batch during embedding vectorization. Defaults to `16`.

## 🐋 Service configuration


[service_conf.yaml](./service_conf.yaml) specifies the system-level configuration for RAGFlow and is used by its API server and task executor. In a dockerized setup, this file is automatically created based on the [service_conf.yaml.template](./service_conf.yaml.template) file (replacing all environment variables by their values).

docs/faq.mdx (+7, -1)



All uploaded files are stored in Minio, RAGFlow's object storage solution. For instance, if you upload your file directly to a knowledge base, it is located at `<knowledgebase_id>/filename`.


---

### How to tune batch size for document parsing and embedding?

You can control the batch size for document parsing and embedding by setting the environment variables `DOC_BULK_SIZE` and `EMBEDDING_BATCH_SIZE`. Increasing these values may improve throughput for large-scale data processing, but will also increase memory usage. Adjust them according to your hardware resources.

---

docs/guides/agent/agent_component_reference/begin.mdx (+18, -12)



### ID


The ID is the unique identifier for the component within the workflow. Unlike the IDs of other components, the ID of the **Begin** component _cannot_ be changed.


### Opening greeting




You can set global variables within the **Begin** component, which can be either required or optional. Once established, users will need to provide values for these variables when interacting or chatting with the agent. Click **+ Add variable** to add a global variable, each with the following attributes:


- **Key**: _Required_
  The unique variable name.
- **Name**: _Required_
  A descriptive name providing additional details about the variable.
  For example, if **Key** is set to `lang`, you can set its **Name** to `Target language`.
- **Type**: _Required_
  The type of the variable:
  - **line**: Accepts a single line of text without line breaks.
  - **paragraph**: Accepts multiple lines of text, including line breaks.
  - **options**: Requires the user to select a value for this variable from a dropdown menu. And you are required to set _at least_ one option for the dropdown menu.
  - **file**: Requires the user to upload one or multiple files.
  - **integer**: Accepts an integer as input.
  - **boolean**: Requires the user to toggle between on and off.
- **Optional**: A toggle indicating whether the variable is optional.


:::tip NOTE
To pass in parameters from a client, call:

- HTTP method [Converse with agent](../../../references/http_api_reference.md#converse-with-agent), or
- Python method [Converse with agent](../../../references/python_api_reference.md#converse-with-agent).
:::


:::danger IMPORTANT

- If you set the key type as **file**, ensure the token count of the uploaded file does not exceed your model provider's maximum token limit; otherwise, the plain text in your file will be truncated and incomplete.
- If your agent's **Begin** component takes a variable, you _cannot_ embed it into a webpage.
:::

:::note
You can tune document parsing and embedding efficiency by setting the environment variables `DOC_BULK_SIZE` and `EMBEDDING_BATCH_SIZE`.
:::


## Examples


### Is the uploaded file in a knowledge base?


No. Files uploaded to an agent as input are not stored in a knowledge base and hence will not be processed using RAGFlow's built-in OCR, DLR or TSR models, or chunked using RAGFlow's built-in chunking methods.


### How to upload a webpage or file from a URL?




### File size limit for an uploaded file


There is no _specific_ file size limit for a file uploaded to an agent. However, note that model providers typically have a default or explicit maximum token setting, which can range from 8196 to 128k: The plain text part of the uploaded file will be passed in as the key value, but if the file's token count exceeds this limit, the string will be truncated and incomplete.


:::tip NOTE
The variables `MAX_CONTENT_LENGTH` in `/docker/.env` and `client_max_body_size` in `/docker/nginx/nginx.conf` set the file size limit for each upload to a knowledge base or **File Management**. These settings DO NOT apply in this scenario.
:::

helm/values.yaml (+6, -0)

# MAX_CONTENT_LENGTH: "134217728"
# After making the change, ensure you update `client_max_body_size` in nginx/nginx.conf correspondingly.


# The number of document chunks processed in a single batch during document parsing.
DOC_BULK_SIZE: 4

# The number of text chunks processed in a single batch during embedding vectorization.
EMBEDDING_BATCH_SIZE: 16

ragflow:
deployment:
strategy:

rag/settings.py (+2, -1)

REDIS = {}
pass
DOC_MAXIMUM_SIZE = int(os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024))

DOC_BULK_SIZE = int(os.environ.get("DOC_BULK_SIZE", 4))
EMBEDDING_BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", 16))
SVR_QUEUE_NAME = "rag_flow_svr_queue"
SVR_CONSUMER_GROUP_NAME = "rag_flow_svr_task_broker"
PAGERANK_FLD = "pagerank_fea"

rag/svr/task_executor.py (+6, -8)

email, tag
from rag.nlp import search, rag_tokenizer
from rag.raptor import RecursiveAbstractiveProcessing4TreeOrganizedRetrieval as Raptor
from rag.settings import DOC_MAXIMUM_SIZE, DOC_BULK_SIZE, EMBEDDING_BATCH_SIZE, SVR_CONSUMER_GROUP_NAME, get_svr_queue_name, get_svr_queue_names, print_rag_settings, TAG_FLD, PAGERANK_FLD
from rag.utils import num_tokens_from_string, truncate
from rag.utils.redis_conn import REDIS_CONN, RedisDistributedLock
from rag.utils.storage_factory import STORAGE_IMPL
async def embedding(docs, mdl, parser_config=None, callback=None):
if parser_config is None:
parser_config = {}
tts, cnts = [], []
for d in docs:
tts.append(d.get("docnm_kwd", "Title"))
tk_count += c


cnts_ = np.array([])
for i in range(0, len(cnts), EMBEDDING_BATCH_SIZE):
vts, c = await trio.to_thread.run_sync(lambda: mdl.encode([truncate(c, mdl.max_length-10) for c in cnts[i: i + EMBEDDING_BATCH_SIZE]]))
if len(cnts_) == 0:
cnts_ = vts
else:
chunk_count = len(set([chunk["id"] for chunk in chunks]))
start_ts = timer()
doc_store_result = ""


async def delete_image(kb_id, chunk_id):
try:
"Deleting image of chunk {}/{}/{} got exception".format(task["location"], task["name"], chunk_id))
raise


for b in range(0, len(chunks), DOC_BULK_SIZE):
doc_store_result = await trio.to_thread.run_sync(lambda: settings.docStoreConn.insert(chunks[b:b + DOC_BULK_SIZE], search.index_name(task_tenant_id), task_dataset_id))
task_canceled = TaskService.do_cancel(task_id)
if task_canceled:
progress_callback(-1, msg="Task has been canceled.")
error_message = f"Insert chunk error: {doc_store_result}, please check log file and Elasticsearch/Infinity status!"
progress_callback(-1, msg=error_message)
raise Exception(error_message)
chunk_ids = [chunk["id"] for chunk in chunks[:b + DOC_BULK_SIZE]]
chunk_ids_str = " ".join(chunk_ids)
try:
TaskService.update_chunk_ids(task["id"], chunk_ids_str)
