
Feat: Adds OpenSearch 2.19.1 as a vector database option (#7140) ### What problem does this PR solve? This PR adds support for the latest OpenSearch 2.19.1 as a store-engine & search-engine option for RAGFlow. ### Main Benefit 1. OpenSearch 2.19.1 is licensed under the Apache v2.0 License, which is far more permissive than Elasticsearch's license 2. For search, OpenSearch 2.19.1 supports full-text search, vector search, and hybrid search, all similar to Elasticsearch in schema 3. For storage, OpenSearch 2.19.1 stores text and vectors with a schema quite similar to Elasticsearch's ### Changes - Add an OpenSearch Python connector. Many adaptations were needed, since the schema and API methods differ between ES and OpenSearch in many ways (the kNN search in particular has a significant gap; see the sketch after this entry): rag/utils/opensearch_coon.py - Adapt static configuration: conf/service_conf.yaml, api/settings.py, rag/settings.py - Handle store & search schema differences between OpenSearch and ES: conf/os_mapping.json - Add the OpenSearch Python SDK: pyproject.toml - Add Docker configuration for OpenSearch 2.19.1: docker/.env, docker/docker-compose-base.yml, docker/service_conf.yaml.template ### How to use - The default is unchanged: ES remains the default doc/search engine. OpenSearch is used only when DOC_ENGINE=${DOC_ENGINE:-opensearch} is set in docker/.env. ### Others Our team tested many documents in our environment with OpenSearch as the vector database, and it works very well. All of the OpenSearch configuration is necessary. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yongteng Lei <yongtengrey@outlook.com> Co-authored-by: writinwaters <93570324+writinwaters@users.noreply.github.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
6 months ago
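The query-shape gap the PR mentions is easiest to see with a concrete request. Below is a minimal sketch using the `opensearch-py` client, not the PR's actual connector code in rag/utils/opensearch_coon.py; the host, index name, vector field name, and embedding dimension are hypothetical placeholders.

```python
from opensearchpy import OpenSearch

# Hypothetical connection details; RAGFlow reads the real ones from
# its service configuration (docker/service_conf.yaml.template).
client = OpenSearch(hosts=[{"host": "localhost", "port": 9201}])

query_vector = [0.1] * 768  # assumed embedding dimension

# OpenSearch nests kNN inside the regular `query` clause, keyed by the
# vector field name; Elasticsearch 8.x instead takes a top-level `knn`
# section, which is one source of the "significant gap" noted above.
body = {
    "size": 10,
    "query": {
        "knn": {
            "content_vector": {  # hypothetical vector field name
                "vector": query_vector,
                "k": 10,
            }
        }
    },
}

resp = client.search(index="ragflow_docs", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```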
Feat: make document parsing and embedding batch sizes configurable via environment variables (#8266) ### Description This PR introduces two new environment variables, `DOC_BULK_SIZE` and `EMBEDDING_BATCH_SIZE`, to allow flexible tuning of batch sizes for document parsing and embedding vectorization in RAGFlow. By making these parameters configurable, users can optimize performance and resource usage according to their hardware capabilities and workload requirements. ### What problem does this PR solve? Previously, the batch sizes for document parsing and embedding were hardcoded, limiting the ability to adjust throughput and memory consumption. This PR enables users to set these values via environment variables (in `.env`, the Helm chart, or directly in the deployment environment), improving flexibility and scalability for both small and large deployments. - `DOC_BULK_SIZE`: Controls how many document chunks are processed in a single batch during document parsing (default: 4). - `EMBEDDING_BATCH_SIZE`: Controls how many text chunks are processed in a single batch during embedding vectorization (default: 16). This change updates the codebase, documentation, and configuration files to reflect the new options; a sketch of how the two variables drive batching follows this entry. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [x] Performance Improvement - [ ] Other (please describe): ### Additional context - Updated `.env`, `helm/values.yaml`, and the documentation to describe the new variables. - Modified the relevant code paths to use the environment variables instead of hardcoded values. - Users can now tune these parameters to achieve better throughput or reduce memory usage as needed. Before (default value): <img width="643" alt="image" src="https://github.com/user-attachments/assets/086e1173-18f3-419d-a0f5-68394f63866a" /> After (10x): <img width="777" alt="image" src="https://github.com/user-attachments/assets/5722bbc0-0bcb-4536-b928-077031e550f1" />
4 months ago
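As a concrete illustration of how the two variables shape the pipeline, here is a minimal, self-contained sketch, not RAGFlow's actual task-executor code; `index_bulk` and `embed` are hypothetical stand-ins for the real bulk-indexing and embedding calls.

```python
import os

# Same names and defaults as rag/settings.py below.
DOC_BULK_SIZE = int(os.environ.get("DOC_BULK_SIZE", 4))
EMBEDDING_BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", 16))


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def index_bulk(docs):  # hypothetical bulk-indexing call
    print(f"indexed {len(docs)} chunks")


def embed(texts):  # hypothetical embedding call
    return [[0.0] * 768 for _ in texts]


chunks = [f"chunk-{i}" for i in range(40)]

# Parsing/indexing sends DOC_BULK_SIZE chunks per bulk request;
# larger values mean fewer requests but more memory per request.
for bulk in batched(chunks, DOC_BULK_SIZE):
    index_bulk(bulk)

# Embedding sends EMBEDDING_BATCH_SIZE chunks per model call;
# tune upward on GPUs with spare memory for better throughput.
for batch in batched(chunks, EMBEDDING_BATCH_SIZE):
    vectors = embed(batch)
```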
```python
#
#  Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
#
import os
import logging

from api.utils import get_base_config, decrypt_database_config
from api.utils.file_utils import get_project_base_directory

# Server
RAG_CONF_PATH = os.path.join(get_project_base_directory(), "conf")

# Get storage type and document engine from system environment variables
STORAGE_IMPL_TYPE = os.getenv('STORAGE_IMPL', 'MINIO')
DOC_ENGINE = os.getenv('DOC_ENGINE', 'elasticsearch')

ES = {}
INFINITY = {}
AZURE = {}
S3 = {}
MINIO = {}
OSS = {}
OS = {}

# Load only the configuration for the selected engine, so that missing
# config for an unused engine does not cause initialization errors
if DOC_ENGINE == 'elasticsearch':
    ES = get_base_config("es", {})
elif DOC_ENGINE == 'opensearch':
    OS = get_base_config("os", {})
elif DOC_ENGINE == 'infinity':
    INFINITY = get_base_config("infinity", {"uri": "infinity:23817"})

if STORAGE_IMPL_TYPE in ['AZURE_SPN', 'AZURE_SAS']:
    AZURE = get_base_config("azure", {})
elif STORAGE_IMPL_TYPE == 'AWS_S3':
    S3 = get_base_config("s3", {})
elif STORAGE_IMPL_TYPE == 'MINIO':
    MINIO = decrypt_database_config(name="minio")
elif STORAGE_IMPL_TYPE == 'OSS':
    OSS = get_base_config("oss", {})

try:
    REDIS = decrypt_database_config(name="redis")
except Exception:
    REDIS = {}

DOC_MAXIMUM_SIZE = int(os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024))
DOC_BULK_SIZE = int(os.environ.get("DOC_BULK_SIZE", 4))
EMBEDDING_BATCH_SIZE = int(os.environ.get("EMBEDDING_BATCH_SIZE", 16))

SVR_QUEUE_NAME = "rag_flow_svr_queue"
SVR_CONSUMER_GROUP_NAME = "rag_flow_svr_task_broker"
PAGERANK_FLD = "pagerank_fea"
TAG_FLD = "tag_feas"

PARALLEL_DEVICES = 0
try:
    import torch.cuda
    PARALLEL_DEVICES = torch.cuda.device_count()
    logging.info(f"found {PARALLEL_DEVICES} gpus")
except Exception:
    logging.info("can't import package 'torch'")


def print_rag_settings():
    logging.info(f"MAX_CONTENT_LENGTH: {DOC_MAXIMUM_SIZE}")
    logging.info(f"MAX_FILE_COUNT_PER_USER: {int(os.environ.get('MAX_FILE_NUM_PER_USER', 0))}")


def get_svr_queue_name(priority: int) -> str:
    if priority == 0:
        return SVR_QUEUE_NAME
    return f"{SVR_QUEUE_NAME}_{priority}"


def get_svr_queue_names():
    return [get_svr_queue_name(priority) for priority in [1, 0]]
```
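The priority-queue helpers at the end are easy to sanity-check. A minimal usage sketch, assuming the module imports as `rag.settings` from the repo root (importing it runs the config loading above, so it needs a configured RAGFlow environment):

```python
from rag.settings import get_svr_queue_name, get_svr_queue_names

assert get_svr_queue_name(0) == "rag_flow_svr_queue"
assert get_svr_queue_name(1) == "rag_flow_svr_queue_1"

# The priority-1 queue is listed first, so consumers drain high-priority
# tasks before falling back to the default queue.
print(get_svr_queue_names())  # ['rag_flow_svr_queue_1', 'rag_flow_svr_queue']
```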