
Added doc for switching elasticsearch to infinity (#3370)

### What problem does this PR solve?

Added doc for switching elasticsearch to infinity

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
tags/v0.14.0
Zhichang Yu · 11 months ago · commit 9d395ab74e

**.github/workflows/tests.yml** (+20 −1)

```diff
       echo "RAGFLOW_IMAGE=infiniflow/ragflow:dev" >> docker/.env
       sudo docker compose -f docker/docker-compose.yml up -d

-    - name: Run tests
+    - name: Run tests against Elasticsearch
       run: |
         export http_proxy=""; export https_proxy=""; export no_proxy=""; export HTTP_PROXY=""; export HTTPS_PROXY=""; export NO_PROXY=""
         export HOST_ADDRESS=http://host.docker.internal:9380

       if: always()  # always run this step even if previous steps failed
       run: |
         sudo docker compose -f docker/docker-compose.yml down -v

+    - name: Start ragflow:dev
+      run: |
+        sudo DOC_ENGINE=infinity docker compose -f docker/docker-compose.yml up -d
+
+    - name: Run tests against Infinity
+      run: |
+        export http_proxy=""; export https_proxy=""; export no_proxy=""; export HTTP_PROXY=""; export HTTPS_PROXY=""; export NO_PROXY=""
+        export HOST_ADDRESS=http://host.docker.internal:9380
+        until sudo docker exec ragflow-server curl -s --connect-timeout 5 ${HOST_ADDRESS} > /dev/null; do
+          echo "Waiting for service to be available..."
+          sleep 5
+        done
+        cd sdk/python && poetry install && source .venv/bin/activate && cd test && pytest --tb=short t_dataset.py t_chat.py t_session.py t_document.py t_chunk.py
+
+    - name: Stop ragflow:dev
+      if: always()  # always run this step even if previous steps failed
+      run: |
+        sudo DOC_ENGINE=infinity docker compose -f docker/docker-compose.yml down -v
```

**README.md** (+25 −4)

````diff
 $ docker compose -f docker-compose.yml up -d
 ```

-> - To download a RAGFlow slim Docker image of a specific version, update the `RAGFlow_IMAGE` variable in **docker/.env** to your desired version. For example, `RAGFLOW_IMAGE=infiniflow/ragflow:v0.13.0-slim`. After making this change, rerun the command above to initiate the download.
+> - To download a RAGFlow slim Docker image of a specific version, update the `RAGFLOW_IMAGE` variable in **docker/.env** to your desired version. For example, `RAGFLOW_IMAGE=infiniflow/ragflow:v0.13.0-slim`. After making this change, rerun the command above to initiate the download.
-> - To download the dev version of RAGFlow Docker image *including* embedding models and Python libraries, update the `RAGFlow_IMAGE` variable in **docker/.env** to `RAGFLOW_IMAGE=infiniflow/ragflow:dev`. After making this change, rerun the command above to initiate the download.
+> - To download the dev version of RAGFlow Docker image *including* embedding models and Python libraries, update the `RAGFLOW_IMAGE` variable in **docker/.env** to `RAGFLOW_IMAGE=infiniflow/ragflow:dev`. After making this change, rerun the command above to initiate the download.
-> - To download a specific version of RAGFlow Docker image *including* embedding models and Python libraries, update the `RAGFlow_IMAGE` variable in **docker/.env** to your desired version. For example, `RAGFLOW_IMAGE=infiniflow/ragflow:v0.13.0`. After making this change, rerun the command above to initiate the download.
+> - To download a specific version of RAGFlow Docker image *including* embedding models and Python libraries, update the `RAGFLOW_IMAGE` variable in **docker/.env** to your desired version. For example, `RAGFLOW_IMAGE=infiniflow/ragflow:v0.13.0`. After making this change, rerun the command above to initiate the download.

     * Running on http://x.x.x.x:9380
     INFO:werkzeug:Press CTRL+C to quit
     ```
-   > If you skip this confirmation step and directly log in to RAGFlow, your browser may prompt a `network abnormal` error because, at that moment, your RAGFlow may not be fully initialized.
+   > If you skip this confirmation step and directly log in to RAGFlow, your browser may prompt a `network anormal` error because, at that moment, your RAGFlow may not be fully initialized.

 5. In your web browser, enter the IP address of your server and log in to RAGFlow.
    > $ docker compose -f docker/docker-compose.yml up -d
    > ```

+### Switch doc engine from Elasticsearch to Infinity
+
+RAGFlow uses Elasticsearch by default for storing full text and vectors. To switch to [Infinity](https://github.com/infiniflow/infinity/), follow these steps:
+
+1. Stop all running containers:
+
+   ```bash
+   $ docker compose -f docker/docker-compose.yml down -v
+   ```
+
+2. Set `DOC_ENGINE` in **docker/.env** to `infinity`.
+
+3. Start the containers:
+
+   ```bash
+   $ docker compose -f docker/docker-compose.yml up -d
+   ```
+
+> [!WARNING]
+> Switching to Infinity on a Linux/arm64 machine is not yet officially supported.
+
 ## 🔧 Build a Docker image without embedding models

 This image is approximately 1 GB in size and relies on external LLM and embedding services.
````

**README_id.md** (+1 −1)

````diff
     * Running on http://x.x.x.x:9380
     INFO:werkzeug:Press CTRL+C to quit
     ```
-   > If you skip this step and log in to RAGFlow directly, your browser may show a `network abnormal` error because RAGFlow may not be fully ready.
+   > If you skip this step and log in to RAGFlow directly, your browser may show a `network anormal` error because RAGFlow may not be fully ready.

 5. Open your web browser, enter your server's IP address, and log in to RAGFlow.
````

**README_ko.md** (+1 −1)

````diff
     * Running on http://x.x.x.x:9380
     INFO:werkzeug:Press CTRL+C to quit
     ```
-   > If you skip the confirmation step and log in to RAGFlow right away, your browser may raise a `network abnormal` error because RAGFlow is not fully initialized.
+   > If you skip the confirmation step and log in to RAGFlow right away, your browser may raise a `network anormal` error because RAGFlow is not fully initialized.

 5. Enter your server's IP address in a web browser and log in to RAGFlow.
    > With the default settings, you only need to enter `http://IP_OF_YOUR_MACHINE` (no port number): the default HTTP service port `80` can be omitted under the default configuration.
````

**README_zh.md** (+1 −1)

````diff
     * Running on http://x.x.x.x:9380
     INFO:werkzeug:Press CTRL+C to quit
     ```
-   > If you skip this confirmation step and log in to RAGFlow directly, your browser may prompt `network abnormal` or `网络异常` because RAGFlow may not have fully started.
+   > If you skip this confirmation step and log in to RAGFlow directly, your browser may prompt `network anormal` or `网络异常` because RAGFlow may not have fully started.

 5. Enter your server's IP address in your browser and log in to RAGFlow.
    > In the example above, you only need to enter http://IP_OF_YOUR_MACHINE: with an unchanged configuration there is no need to enter a port (the default HTTP service port is 80).
````

**api/settings.py** (+6 −2)

```diff
 PRIVILEGE_COMMAND_WHITELIST = []
 CHECK_NODES_IDENTITY = False

-if 'hosts' in get_base_config("es", {}):
+DOC_ENGINE = os.environ.get('DOC_ENGINE', "elasticsearch")
+if DOC_ENGINE == "elasticsearch":
     docStoreConn = rag.utils.es_conn.ESConnection()
-else:
+elif DOC_ENGINE == "infinity":
     docStoreConn = rag.utils.infinity_conn.InfinityConnection()
+else:
+    raise Exception(f"Not supported doc engine: {DOC_ENGINE}")

 retrievaler = search.Dealer(docStoreConn)
 kg_retrievaler = kg_search.KGSearch(docStoreConn)
```
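The engine selection in this hunk moves from probing the `es` config section to reading the `DOC_ENGINE` environment variable. A minimal standalone sketch of that logic (hedged: the dotted names below stand in for the real connection classes, which require a running engine to instantiate):

```python
import os

def select_doc_engine(environ=None):
    """Pick the doc-store backend from the DOC_ENGINE env var.

    Returns the dotted name of the class that would be used; the real
    code instantiates ESConnection/InfinityConnection directly.
    """
    environ = environ if environ is not None else os.environ
    doc_engine = environ.get("DOC_ENGINE", "elasticsearch")
    if doc_engine == "elasticsearch":
        return "rag.utils.es_conn.ESConnection"
    elif doc_engine == "infinity":
        return "rag.utils.infinity_conn.InfinityConnection"
    # Any other value fails fast, mirroring the new `raise` branch.
    raise Exception(f"Not supported doc engine: {doc_engine}")
```

Unset or empty environments fall back to Elasticsearch, which keeps existing deployments working without any `.env` change.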



**conf/service_conf.yaml** (+2 −2)

```diff
   user: 'root'
   password: 'infini_rag_flow'
   host: 'mysql'
-  port: 3306
+  port: 5455
   max_connections: 100
   stale_timeout: 30
 minio:
   password: 'infini_rag_flow'
   host: 'minio:9000'
 es:
-  hosts: 'http://es01:9200'
+  hosts: 'http://es01:1200'
   username: 'elastic'
   password: 'infini_rag_flow'
 redis:
```

**docker/.env** (+11 −0)

```diff
+# The type of doc engine to use.
+# Supported values are `elasticsearch`, `infinity`.
+DOC_ENGINE=${DOC_ENGINE:-elasticsearch}
+
+# ------------------------------
+# docker env var for specifying vector db type at startup
+# (based on the vector db type, the corresponding docker
+# compose profile will be used)
+# ------------------------------
+COMPOSE_PROFILES=${DOC_ENGINE}
+
 # The version of Elasticsearch.
 STACK_VERSION=8.11.3
```
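The `${DOC_ENGINE:-elasticsearch}` line uses shell-style default expansion, which Docker Compose applies when reading **.env**: the variable keeps its value if set and non-empty, otherwise the default after `:-` is used. A small Python sketch of that resolution rule (`resolve_default` is a hypothetical illustration, not part of the codebase, and handles only the `${VAR:-default}` form, not plain `${VAR}`):

```python
import re

def resolve_default(expr, env):
    """Resolve a ${VAR:-default} expression against an env dict."""
    m = re.fullmatch(r"\$\{(\w+):-([^}]*)\}", expr)
    if not m:
        return expr  # plain literal such as 8.11.3
    name, default = m.groups()
    value = env.get(name)
    # `:-` treats an empty value the same as an unset one.
    return value if value else default
```

So with no environment override, `DOC_ENGINE` resolves to `elasticsearch`, and `COMPOSE_PROFILES` then enables only the matching compose profile.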



**docker/docker-compose-base.yml** (+29 −25)

```diff
 services:
   es01:
     container_name: ragflow-es-01
+    profiles:
+      - elasticsearch
     image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
     volumes:
       - esdata01:/usr/share/elasticsearch/data
       - ragflow
     restart: on-failure

-  # infinity:
-  #   container_name: ragflow-infinity
-  #   image: infiniflow/infinity:v0.5.0-dev2
-  #   volumes:
-  #     - infinity_data:/var/infinity
-  #   ports:
-  #     - ${INFINITY_THRIFT_PORT}:23817
-  #     - ${INFINITY_HTTP_PORT}:23820
-  #     - ${INFINITY_PSQL_PORT}:5432
-  #   env_file: .env
-  #   environment:
-  #     - TZ=${TIMEZONE}
-  #   mem_limit: ${MEM_LIMIT}
-  #   ulimits:
-  #     nofile:
-  #       soft: 500000
-  #       hard: 500000
-  #   networks:
-  #     - ragflow
-  #   healthcheck:
-  #     test: ["CMD", "curl", "http://localhost:23820/admin/node/current"]
-  #     interval: 10s
-  #     timeout: 10s
-  #     retries: 120
-  #   restart: on-failure
+  infinity:
+    container_name: ragflow-infinity
+    profiles:
+      - infinity
+    image: infiniflow/infinity:v0.5.0-dev2
+    volumes:
+      - infinity_data:/var/infinity
+    ports:
+      - ${INFINITY_THRIFT_PORT}:23817
+      - ${INFINITY_HTTP_PORT}:23820
+      - ${INFINITY_PSQL_PORT}:5432
+    env_file: .env
+    environment:
+      - TZ=${TIMEZONE}
+    mem_limit: ${MEM_LIMIT}
+    ulimits:
+      nofile:
+        soft: 500000
+        hard: 500000
+    networks:
+      - ragflow
+    healthcheck:
+      test: ["CMD", "curl", "http://localhost:23820/admin/node/current"]
+      interval: 10s
+      timeout: 10s
+      retries: 120
+    restart: on-failure

   mysql:
```

**docker/docker-compose.yml** (+0 −2)

```diff
     depends_on:
       mysql:
         condition: service_healthy
-      es01:
-        condition: service_healthy
     image: ${RAGFLOW_IMAGE}
     container_name: ragflow-server
     ports:
```

**docker/service_conf.yaml.template** (+1 −1)

```diff
 es:
   hosts: 'http://${ES_HOST:-es01}:9200'
   username: '${ES_USER:-elastic}'
-  password: '${ES_PASSWORD:-infini_rag_flow}'
+  password: '${ELASTIC_PASSWORD:-infini_rag_flow}'
 redis:
   db: 1
   password: '${REDIS_PASSWORD:-infini_rag_flow}'
```

**docs/guides/develop/build_docker_image.mdx** (+17 −0)

````diff
 ```

 </TabItem>
+<TabItem value="linux/arm64">
+
+## 🔧 Build a Docker image for linux arm64
+
+We are currently unable to regularly build multi-arch images with CI and have no plans to publish arm64 images in the near future.
+However, you can build an image yourself on a linux/arm64 host machine:
+
+```bash
+git clone https://github.com/infiniflow/ragflow.git
+cd ragflow/
+pip3 install huggingface-hub nltk
+python3 download_deps.py
+docker build --build-arg ARCH=arm64 -f Dockerfile.slim -t infiniflow/ragflow:dev-slim .
+docker build --build-arg ARCH=arm64 -f Dockerfile -t infiniflow/ragflow:dev .
+```
+</TabItem>

 </Tabs>
````

**rag/utils/es_conn.py** (+16 −12)

```diff
 import os
 from typing import List, Dict

+import elasticsearch
 import copy
 from elasticsearch import Elasticsearch
 from elasticsearch_dsl import UpdateByQuery, Q, Search, Index
 from rag.utils.doc_store_conn import DocStoreConnection, MatchExpr, OrderByExpr, MatchTextExpr, MatchDenseExpr, FusionExpr
 from rag.nlp import is_english, rag_tokenizer

+logger.info("Elasticsearch sdk version: " + str(elasticsearch.__version__))

 @singleton
 class ESConnection(DocStoreConnection):
     def __init__(self):
         self.info = {}
-        for _ in range(10):
+        logger.info(f"Use Elasticsearch {settings.ES['hosts']} as the doc engine.")
+        for _ in range(24):
             try:
                 self.es = Elasticsearch(
                     settings.ES["hosts"].split(","),
                 )
                 if self.es:
                     self.info = self.es.info()
-                    logger.info("Connect to es.")
                     break
-            except Exception:
-                logger.exception("Fail to connect to es")
-                time.sleep(1)
+            except Exception as e:
+                logger.warn(f"{str(e)}. Waiting Elasticsearch {settings.ES['hosts']} to be healthy.")
+                time.sleep(5)
         if not self.es.ping():
-            raise Exception("Can't connect to ES cluster")
-        v = self.info.get("version", {"number": "5.6"})
+            msg = f"Elasticsearch {settings.ES['hosts']} didn't become healthy in 120s."
+            logger.error(msg)
+            raise Exception(msg)
+        v = self.info.get("version", {"number": "8.11.3"})
         v = v["number"].split(".")[0]
         if int(v) < 8:
-            raise Exception(f"ES version must be greater than or equal to 8, current version: {v}")
+            msg = f"Elasticsearch version must be greater than or equal to 8, current version: {v}"
+            logger.error(msg)
+            raise Exception(msg)
         fp_mapping = os.path.join(get_project_base_directory(), "conf", "mapping.json")
         if not os.path.exists(fp_mapping):
-            raise Exception(f"Mapping file not found at {fp_mapping}")
+            msg = f"Elasticsearch mapping file not found at {fp_mapping}"
+            logger.error(msg)
+            raise Exception(msg)
         self.mapping = json.load(open(fp_mapping, "r"))
+        logger.info(f"Elasticsearch {settings.ES['hosts']} is healthy.")

 """
 Database operations
```
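Both connection classes now share the same startup pattern: 24 attempts with a 5-second sleep, roughly 120 seconds before giving up. A hedged standalone sketch of that pattern (`wait_until_healthy` is a hypothetical helper written for illustration, not part of the codebase):

```python
import time

def wait_until_healthy(connect, attempts=24, delay=5.0, sleep=time.sleep):
    """Call `connect` until it succeeds or the attempt budget runs out.

    `connect` is any callable that raises while the engine is still
    starting up and returns a value once it is healthy.
    """
    last_err = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as e:
            last_err = e
            sleep(delay)  # injectable for testing
    raise Exception(
        f"Doc engine didn't become healthy in {int(attempts * delay)}s."
    ) from last_err
```

The 24 × 5 s budget matches Infinity's own healthcheck cadence in the compose file, so the Python client and Docker give up on roughly the same timescale.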

**rag/utils/infinity_conn.py** (+25 −8)

```diff
 import os
 import re
 import json
+import time
 from typing import List, Dict
 import infinity
 from infinity.common import ConflictType, InfinityException
 from infinity.index import IndexInfo, IndexType
 from infinity.connection_pool import ConnectionPool
-from rag import settings
 from api.utils.log_utils import logger
+from rag import settings
 from rag.utils import singleton
 import polars as pl
 from polars.series.series import Series

         if ":" in infinity_uri:
             host, port = infinity_uri.split(":")
             infinity_uri = infinity.common.NetworkAddress(host, int(port))
-        self.connPool = ConnectionPool(infinity_uri)
-        logger.info(f"Connected to infinity {infinity_uri}.")
+        self.connPool = None
+        logger.info(f"Use Infinity {infinity_uri} as the doc engine.")
+        for _ in range(24):
+            try:
+                connPool = ConnectionPool(infinity_uri)
+                inf_conn = connPool.get_conn()
+                _ = inf_conn.show_current_node()
+                connPool.release_conn(inf_conn)
+                self.connPool = connPool
+                break
+            except Exception as e:
+                logger.warn(f"{str(e)}. Waiting Infinity {infinity_uri} to be healthy.")
+                time.sleep(5)
+        if self.connPool is None:
+            msg = f"Infinity {infinity_uri} didn't become healthy in 120s."
+            logger.error(msg)
+            raise Exception(msg)
+        logger.info(f"Infinity {infinity_uri} is healthy.")

 """
 Database operations

             _ = db_instance.get_table(table_name)
             self.connPool.release_conn(inf_conn)
             return True
-        except Exception:
-            logger.exception("INFINITY indexExist")
+        except Exception as e:
+            logger.warn(f"INFINITY indexExist {str(e)}")
         return False

         )
         if len(filter_cond) != 0:
             filter_fulltext = f"({filter_cond}) AND {filter_fulltext}"
-        # doc_store_logger.info(f"filter_fulltext: {filter_fulltext}")
+        # logger.info(f"filter_fulltext: {filter_fulltext}")
         minimum_should_match = "0%"
         if "minimum_should_match" in matchExpr.extra_options:
             minimum_should_match = (

         for k, v in d.items():
             if k.endswith("_kwd") and isinstance(v, list):
                 d[k] = " ".join(v)
-        ids = [f"{d['id']}" for d in documents]
+        ids = ["'{}'".format(d["id"]) for d in documents]
         str_ids = ", ".join(ids)
         str_filter = f"id IN ({str_ids})"
         table_instance.delete(str_filter)
         # logger.info(f"InfinityConnection.insert {json.dumps(documents)}")
         table_instance.insert(documents)
         self.connPool.release_conn(inf_conn)
-        doc_store_logger.info(f"inserted into {table_name} {str_ids}.")
+        logger.info(f"inserted into {table_name} {str_ids}.")
         return []

     def update(
```
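The `ids` change in the insert path exists because the delete filter is a SQL-like expression: unquoted string ids would make `id IN (...)` fail to parse as string comparisons. A hedged sketch of just that quoting step (`build_id_filter` is a hypothetical helper for illustration; it assumes ids contain no single quotes, as RAGFlow's generated ids do not):

```python
def build_id_filter(documents):
    """Build the `id IN (...)` filter used to replace existing rows."""
    ids = ["'{}'".format(d["id"]) for d in documents]  # quote each string id
    return f"id IN ({', '.join(ids)})"
```

For example, two documents with ids `abc` and `def` yield the filter `id IN ('abc', 'def')`, which Infinity can evaluate against its string `id` column.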

**sdk/python/test/t_chunk.py** (+2 −2)

```diff
     docs = ds.upload_documents(documents)
     doc = docs[0]
     chunk = doc.add_chunk(content="This is a chunk addition test")
-    # For ElasticSearch, the chunk is not searchable in a short time (~2s).
+    # For Elasticsearch, the chunk is not searchable in a short time (~2s).
     sleep(3)
     chunk.update({"content":"This is a updated content"})

     docs = ds.upload_documents(documents)
     doc = docs[0]
     chunk = doc.add_chunk(content="This is a chunk addition test")
-    # For ElasticSearch, the chunk is not searchable in a short time (~2s).
+    # For Elasticsearch, the chunk is not searchable in a short time (~2s).
     sleep(3)
     chunk.update({"available":0})
```


