---
sidebar_position: 5
slug: /deploy_local_llm
---
# Deploy a local LLM
RAGFlow supports deploying models locally using Ollama or Xinference. If you have locally deployed models to leverage or wish to enable GPU or CUDA for inference acceleration, you can bind Ollama or Xinference into RAGFlow and use either of them as a local "server" for interacting with your local models.
RAGFlow seamlessly integrates with Ollama and Xinference, without the need for further environment configurations. You can use them to deploy two types of local models in RAGFlow: chat models and embedding models.
:::tip NOTE
This user guide does not cover the installation or configuration of Ollama or Xinference in detail; its focus is on configuration inside RAGFlow. For the most up-to-date information, check the official Ollama or Xinference documentation.
:::
## Deploy a local model using Ollama
[Ollama](https://github.com/ollama/ollama) enables you to run open-source large language models that you deployed locally. It bundles model weights, configurations, and data into a single package, defined by a Modelfile, and optimizes setup and configurations, including GPU usage.
:::note
- For information about downloading Ollama, see [here](https://github.com/ollama/ollama?tab=readme-ov-file#ollama).
- For information about configuring the Ollama server, see [here](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server).
- For a complete list of supported models and variants, see the [Ollama model library](https://ollama.com/library).
:::
To deploy a local model, e.g., **Llama3**, using Ollama:
### 1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 11434. For example:
```bash
sudo ufw allow 11434/tcp
```
### 2. Ensure Ollama is accessible
Restart your system and use `curl` or your web browser to check whether your Ollama service at `http://localhost:11434` is accessible. A running Ollama service responds with:
```bash
Ollama is running
```
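For example, a minimal check from the host (assuming Ollama listens on its default port 11434):
```bash
# A healthy Ollama service replies with "Ollama is running"
curl http://localhost:11434
```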
### 3. Run your local model
```bash
ollama run llama3
```
<details>
<summary>If your Ollama is installed through Docker, run the following instead:</summary>
```bash
docker exec -it ollama ollama run llama3
```
</details>
### 4. Add Ollama
In RAGFlow, click on your logo on the top right of the page **>** **Model Providers** and add Ollama to RAGFlow:
![add ollama](https://github.com/infiniflow/ragflow/assets/93570324/10635088-028b-4b3d-add9-5c5a6e626814)
### 5. Complete basic Ollama settings
In the popup window, complete basic settings for Ollama:
1. Because **llama3** is a chat model, choose **chat** as the model type.
2. Ensure that the model name you enter here *precisely* matches the name of the local model you are running with Ollama.
3. Ensure that the base URL you enter is accessible to RAGFlow.
4. OPTIONAL: Switch on the toggle under **Does it support Vision?** if your model includes an image-to-text model.
:::caution NOTE
- If your Ollama and RAGFlow run on the same machine, use `http://localhost:11434` as base URL.
- If your Ollama and RAGFlow run on the same machine and Ollama is in Docker, use `http://host.docker.internal:11434` as base URL.
- If your Ollama runs on a different machine from RAGFlow, use `http://<IP_OF_OLLAMA_MACHINE>:11434` as base URL.
:::
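A quick way to verify that the base URL is reachable from RAGFlow is to run `curl` from inside the RAGFlow container. This is a sketch that assumes the default `ragflow-server` container name and that Ollama runs in Docker on the same host; adjust both to your setup:
```bash
# Should print "Ollama is running" if RAGFlow can reach the Ollama base URL
docker exec -it ragflow-server curl http://host.docker.internal:11434
```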
:::danger WARNING
If your Ollama runs on a different machine, you may also need to set the `OLLAMA_HOST` environment variable to `0.0.0.0` in **ollama.service** (note that this is *NOT* the base URL):
```bash
Environment="OLLAMA_HOST=0.0.0.0"
```
See [this guide](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server) for more information.
:::
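On a systemd-based Linux host, one common way to apply this setting, following the linked guide (a sketch; adapt the unit name to your install), is:
```bash
# Open an override file and add the Environment line under the [Service] section
sudo systemctl edit ollama.service
# Reload systemd and restart Ollama so the new variable takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```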
:::caution WARNING
Improper base URL settings will trigger the following error:
```bash
Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff98b81ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
:::
### 6. Update System Model Settings
Click on your logo **>** **Model Providers** **>** **System Model Settings** to update your model:
*You should now be able to find **llama3** in the dropdown list under **Chat model**.*
> If your local model is an embedding model, you should find your local model under **Embedding model**.
### 7. Update Chat Configuration
Update your chat model accordingly in **Chat Configuration**.
> If your local model is an embedding model, update it on the configuration page of your knowledge base.
## Deploy a local model using Xinference
Xorbits Inference ([Xinference](https://github.com/xorbitsai/inference)) enables you to unleash the full potential of cutting-edge AI models.
:::note
- For information about installing Xinference, see [here](https://inference.readthedocs.io/en/latest/getting_started/).
- For a complete list of supported models, see the [Builtin Models](https://inference.readthedocs.io/en/latest/models/builtin/).
:::
To deploy a local model, e.g., **Mistral**, using Xinference:
### 1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 9997.
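For example, with `ufw` (assuming an Ubuntu/Debian-style host; use your distribution's firewall tool otherwise):
```bash
sudo ufw allow 9997/tcp
```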
### 2. Start an Xinference instance
```bash
xinference-local --host 0.0.0.0 --port 9997
```
### 3. Launch your local model
Launch your local model (**Mistral**), ensuring that you replace `${quantization}` with your chosen quantization method:
```bash
xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
```
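To confirm that the model launched successfully before wiring it into RAGFlow, you can list the models Xinference is currently serving (the exact output format may vary across Xinference versions):
```bash
# The launched model (UID "mistral") should appear in the list
xinference list
```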
### 4. Add Xinference
In RAGFlow, click on your logo on the top right of the page **>** **Model Providers** and add Xinference to RAGFlow:
![add xinference](https://github.com/infiniflow/ragflow/assets/93570324/10635088-028b-4b3d-add9-5c5a6e626814)
### 5. Complete basic Xinference settings
Enter an accessible base URL, such as `http://<your-xinference-endpoint-domain>:9997/v1`.
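Because Xinference serves an OpenAI-compatible API, a quick way to verify that this base URL is reachable from the machine running RAGFlow is to query the models endpoint (replace the placeholder host with your own):
```bash
# Returns a JSON list of the models Xinference is serving;
# a connection error usually points to a wrong host, port, or firewall rule
curl http://<your-xinference-endpoint-domain>:9997/v1/models
```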
### 6. Update System Model Settings
Click on your logo **>** **Model Providers** **>** **System Model Settings** to update your model.
*You should now be able to find **mistral** in the dropdown list under **Chat model**.*
> If your local model is an embedding model, you should find your local model under **Embedding model**.
### 7. Update Chat Configuration
Update your chat model accordingly in **Chat Configuration**.
> If your local model is an embedding model, update it on the configuration page of your knowledge base.
## Deploy a local model using IPEX-LLM
[IPEX-LLM](https://github.com/intel-analytics/ipex-llm) is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or discrete GPUs such as Arc, Flex, and Max) with very low latency.
To deploy a local model, e.g., **Qwen2**, using IPEX-LLM, follow the steps below:
### 1. Check firewall settings
Ensure that your host machine's firewall allows inbound connections on port 11434. For example:
```bash
sudo ufw allow 11434/tcp
```
### 2. Install and Start Ollama serve using IPEX-LLM
#### 2.1 Install IPEX-LLM for Ollama
IPEX-LLM's support for `ollama` is now available on both Linux and Windows.
Visit the [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md), follow the instructions in the [Prerequisites](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#0-prerequisites) section to set up your environment, and then follow the [Install IPEX-LLM cpp](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp) section to install IPEX-LLM with Ollama binaries.
**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `ollama` commands with IPEX-LLM.**
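In condensed form, those steps look roughly like the following (a sketch based on the linked quickstart; the environment name and Python version are simply the guide's defaults, so defer to the guide for current commands):
```bash
# Create and activate a conda environment for the IPEX-LLM Ollama/llama.cpp binaries
conda create -n llm-cpp python=3.11
conda activate llm-cpp
# Install IPEX-LLM with the C++ (llama.cpp / Ollama) backend
pip install --pre --upgrade "ipex-llm[cpp]"
```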
#### 2.2 Initialize Ollama
Activate the `llm-cpp` conda environment and initialize Ollama by executing the commands below. A symbolic link to `ollama` will appear in your current directory.
- For **Linux users**:
```bash
conda activate llm-cpp
init-ollama
```
- For **Windows users**:
Please run the following command with **administrator privilege in Miniforge Prompt**.
```cmd
conda activate llm-cpp
init-ollama.bat
```
:::note
If you have installed a higher version of `ipex-llm[cpp]` and want to upgrade your Ollama binary file, don't forget to remove the old binary files first and initialize again with `init-ollama` or `init-ollama.bat`.
:::
**You can now use this executable file following standard Ollama usage.**
#### 2.3 Run Ollama Serve
You may launch the Ollama service as below:
- For **Linux users**:
```bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
```
- For **Windows users**:
Please run the following command in Miniforge Prompt.
```cmd
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```
> Please set the environment variable `OLLAMA_NUM_GPU` to `999` to make sure all layers of your model run on the Intel GPU; otherwise, some layers may run on the CPU.
> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before executing `ollama serve`:
>
> ```bash
> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
> ```
> To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
The console will display messages similar to the following:
![](https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png)
### 3. Pull and Run Ollama Model
Keep the Ollama service running, open another terminal, and run `./ollama pull <model_name>` on Linux (`ollama.exe pull <model_name>` on Windows) to pull a model, e.g., `qwen2:latest`:
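For instance, on Linux:
```bash
./ollama pull qwen2:latest
```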
![](https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png)
#### Run Ollama Model
- For **Linux users**:
```bash
./ollama run qwen2:latest
```
- For **Windows users**:
```cmd
ollama run qwen2:latest
```
### 4. Configure RAGFlow to use IPEX-LLM accelerated Ollama
The configuration follows the steps in the Ollama section above:
- Section 4 [Add Ollama](#4-add-ollama)
- Section 5 [Complete basic Ollama settings](#5-complete-basic-ollama-settings)
- Section 6 [Update System Model Settings](#6-update-system-model-settings)
- Section 7 [Update Chat Configuration](#7-update-chat-configuration)