sidebar_position: 5
RAGFlow supports deploying models locally using Ollama or Xinference. If you have locally deployed models to leverage or wish to enable GPU or CUDA for inference acceleration, you can bind Ollama or Xinference into RAGFlow and use either of them as a local “server” for interacting with your local models.
To deploy a local model, e.g., Llama3, using Ollama:
Ensure that your host machine’s firewall allows inbound connections on port 11434. For example:
```bash
sudo ufw allow 11434/tcp
```
Restart your system and use curl or your web browser to check whether the service URL of your Ollama service at http://localhost:11434 is accessible. The expected response is:

```
Ollama is running
```
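For example, a quick check from the command line might look like this (this assumes Ollama is listening on its default port 11434 on the same machine):

```bash
# Query the Ollama root endpoint; a healthy service replies with "Ollama is running".
curl http://localhost:11434
```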
Run your local model, e.g., Llama3:

```bash
ollama run llama3
```

If Ollama is installed through Docker, run the following instead:

```bash
docker exec -it ollama ollama run llama3
```
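Optionally, you can confirm that the model answers over Ollama's HTTP API, which is the same /api/chat endpoint RAGFlow will call later. A minimal sketch, assuming the model name llama3 and the default port:

```bash
# Send a one-off, non-streaming chat request to the locally served llama3 model.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": false
}'
```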
In RAGFlow, click on your logo on the top right of the page > Model Providers and add Ollama to RAGFlow:
In the popup window, complete basic settings for Ollama:
:::caution NOTE
- If RAGFlow and Ollama run on the same machine (and RAGFlow is not running in Docker), use `http://localhost:11434` as base URL.
- If RAGFlow runs inside Docker while Ollama runs on the same host machine, use `http://host.docker.internal:11434` as base URL.
- If Ollama runs on a different machine from RAGFlow, use `http://<IP_OF_OLLAMA_MACHINE>:11434` as base URL.
:::

:::danger WARNING
If your Ollama runs on a different machine, you may also need to set the OLLAMA_HOST environment variable to 0.0.0.0 in ollama.service (Note that this is NOT the base URL):
```bash
Environment="OLLAMA_HOST=0.0.0.0"
```
:::
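One way to apply this setting is through a drop-in override for the Ollama systemd unit; the unit name `ollama.service` assumes a default Ollama installation on a systemd-based Linux host:

```bash
# Open (or create) an override file for the Ollama systemd unit.
sudo systemctl edit ollama.service

# In the editor, add the environment variable under the [Service] section:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"

# Reload systemd and restart Ollama so the new setting takes effect.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```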
:::caution WARNING
Improper base URL settings will trigger the following error:
```
Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff98b81ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
:::
Click on your logo > Model Providers > System Model Settings to update your model:
*You should now be able to find llama3 from the dropdown list under Chat model.*
If your local model is an embedding model, you should find your local model under Embedding model.
Update your chat model accordingly in Chat Configuration:
If your local model is an embedding model, update it on the configuration page of your knowledge base.
To deploy a local model, e.g., Mistral, using Xinference:
Ensure that your host machine’s firewall allows inbound connections on port 9997.
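For example, on a Linux host using ufw (mirroring the Ollama step above):

```bash
# Allow inbound TCP connections to the Xinference port.
sudo ufw allow 9997/tcp
```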
Start a local Xinference instance:

```bash
$ xinference-local --host 0.0.0.0 --port 9997
```
Launch your local model (Mistral), ensuring that you replace `${quantization}` with your chosen quantization method:

```bash
$ xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
```
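Before adding it to RAGFlow, you may want to confirm that the model is actually being served. A minimal sketch, assuming Xinference's OpenAI-compatible endpoint is reachable locally on port 9997 and the model UID is `mistral` as launched above:

```bash
# List the models currently served by the local Xinference instance.
curl http://localhost:9997/v1/models

# Send a short chat request to the launched model.
curl http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
```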
In RAGFlow, click on your logo on the top right of the page > Model Providers and add Xinference to RAGFlow:
Enter an accessible base URL, such as `http://<your-xinference-endpoint-domain>:9997/v1`.
Click on your logo > Model Providers > System Model Settings to update your model.
*You should now be able to find mistral from the dropdown list under Chat model.*
If your local model is an embedding model, you should find your local model under Embedding model.
Update your chat model accordingly in Chat Configuration:
If your local model is an embedding model, update it on the configuration page of your knowledge base.
IPEX-LLM is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex, or Max) with very low latency.
To deploy a local model, e.g., Qwen2, using IPEX-LLM, follow the steps below:
Ensure that your host machine’s firewall allows inbound connections on port 11434. For example:
```bash
sudo ufw allow 11434/tcp
```
IPEX-LLM's support for Ollama is currently available on both Linux and Windows systems.
Visit the Run llama.cpp with IPEX-LLM on Intel GPU guide, follow the instructions in the Prerequisites section to set up your environment, and follow the Install IPEX-LLM cpp section to install IPEX-LLM with Ollama binaries.
After the installation, you should have created a conda environment (named llm-cpp, for instance) for running Ollama commands with IPEX-LLM.
Activate the llm-cpp conda environment and initialize Ollama by executing the commands below. A symbolic link to ollama will appear in your current directory.
For **Linux users**:

```bash
conda activate llm-cpp
init-ollama
```
For **Windows users**: run the following commands with administrator privileges in a Miniforge Prompt.

```cmd
conda activate llm-cpp
init-ollama.bat
```
:::note
If you have installed a higher version of `ipex-llm[cpp]` and want to upgrade your Ollama binary file, don't forget to remove the old binary files first and initialize again with `init-ollama` or `init-ollama.bat`.
:::
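The note above assumes the package was installed with pip; under that assumption, the upgrade itself would typically look something like the sketch below (check the IPEX-LLM guide you followed for the exact command):

```bash
# Upgrade the ipex-llm cpp extra, then re-run init-ollama (Linux) or init-ollama.bat (Windows).
pip install --pre --upgrade "ipex-llm[cpp]"
```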
You can now use this executable file following standard Ollama usage.
You may launch the Ollama service as below:
For **Linux users**:

```bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
```
For **Windows users**: run the following commands in a Miniforge Prompt.

```cmd
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```
Please set the environment variable `OLLAMA_NUM_GPU` to `999` to make sure all layers of your model are running on the Intel GPU; otherwise, some layers may run on the CPU.

If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before executing `ollama serve`:

```bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. The console will display messages similar to the following.

### 3. Pull and Run Ollama Model

Keep the Ollama service on, open another terminal, and run `./ollama pull <model_name>` in Linux (`ollama.exe pull <model_name>` in Windows) to automatically pull a model, e.g., `qwen2:latest`.

#### Run Ollama Model

For **Linux users**:

```bash
./ollama run qwen2:latest
```

For **Windows users**:

```cmd
ollama run qwen2:latest
```
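Because RAGFlow will reach this Ollama instance over HTTP, it can help to confirm that the service is reachable from the machine where RAGFlow runs before moving on to configuration; the placeholder IP below stands for the IPEX-LLM host:

```bash
# From the RAGFlow host: a reachable Ollama service replies with "Ollama is running".
curl http://<IP_OF_OLLAMA_MACHINE>:11434
```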
The configuration follows the steps in the Ollama section above: Section 4 (Add Ollama), Section 5 (Complete basic Ollama settings), Section 6 (Update System Model Settings), and Section 7 (Update Chat Configuration).