---
sidebar_position: 6
slug: /deploy_local_llm
---

# Deploy a local LLM

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

RAGFlow supports deploying models locally using Ollama, Xinference, IPEX-LLM, or jina. If you have locally deployed models to leverage or wish to enable GPU or CUDA for inference acceleration, you can bind Ollama or Xinference into RAGFlow and use either of them as a local "server" for interacting with your local models.

RAGFlow seamlessly integrates with Ollama and Xinference, without the need for further environment configurations. You can use them to deploy two types of local models in RAGFlow: chat models and embedding models.

:::tip NOTE
This user guide does not intend to cover much of the installation or configuration details of Ollama or Xinference; its focus is on configurations inside RAGFlow. For the most current information, you may need to check out the official site of Ollama or Xinference.
:::

## Deploy a local model using Ollama

[Ollama](https://github.com/ollama/ollama) enables you to run open-source large language models that you deployed locally. It bundles model weights, configurations, and data into a single package, defined by a Modelfile, and optimizes setup and configurations, including GPU usage.

:::note
- For information about downloading Ollama, see [here](https://github.com/ollama/ollama?tab=readme-ov-file#ollama).
- For information about configuring Ollama server, see [here](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server).
- For a complete list of supported models and variants, see the [Ollama model library](https://ollama.com/library).
:::

To deploy a local model, e.g., **Llama3**, using Ollama:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 11434. For example:

```bash
sudo ufw allow 11434/tcp
```

### 2. Ensure Ollama is accessible

Restart your system, then use curl or your web browser to check that your Ollama service is accessible at `http://localhost:11434`. If it is, you get the following response:

```bash
Ollama is running
```
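
The check above can also be scripted. The following is a minimal sketch, assuming `curl` is installed; `is_ollama_banner` and `check_ollama` are hypothetical helper names, not part of Ollama or RAGFlow.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Ollama or RAGFlow).

# Succeeds only if the given text is the banner shown in this guide.
is_ollama_banner() {
  [ "$1" = "Ollama is running" ]
}

# Fetches the service URL and compares the response body to the banner.
# Defaults to the local URL used in this guide.
check_ollama() {
  local base_url="${1:-http://localhost:11434}"
  local body
  body="$(curl -fsS --max-time 5 "$base_url" 2>/dev/null)" || return 1
  is_ollama_banner "$body"
}

# Usage (requires a running Ollama service, so it is not executed here):
#   check_ollama && echo "Ollama is reachable"
```

A non-zero exit status from `check_ollama` means either that nothing answered at the URL or that the response was not the expected banner.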

### 3. Run your local model

```bash
ollama run llama3
```

<details>
<summary>If your Ollama is installed through Docker, run the following instead:</summary>

```bash
docker exec -it ollama ollama run llama3
```

</details>

### 4. Add Ollama

In RAGFlow, click on your logo on the top right of the page **>** **Model Providers** and add Ollama to RAGFlow:

![add ollama](https://github.com/infiniflow/ragflow/assets/93570324/10635088-028b-4b3d-add9-5c5a6e626814)

### 5. Complete basic Ollama settings

In the popup window, complete basic settings for Ollama:

1. Because **llama3** is a chat model, choose **chat** as the model type.
2. Ensure that the model name you enter here *precisely* matches the name of the local model you are running with Ollama.
3. Ensure that the base URL you enter is accessible to RAGFlow.
4. OPTIONAL: Switch on the toggle under **Does it support Vision?** if your model includes an image-to-text model.
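
To avoid typos in the model name, you can copy it from Ollama's own listing. The sketch below assumes that `ollama list` prints a header row followed by one model per line with the name in the first column; `model_names` and `has_model` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Ollama or RAGFlow).

# Reads `ollama list` output on stdin and prints only the first column,
# skipping the header row (assumed layout: NAME first, one model per line).
model_names() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Reads `ollama list` output on stdin and succeeds only if the exact
# name given as the first argument appears among the local models.
has_model() {
  local wanted="$1"
  model_names | grep -Fxq -- "$wanted"
}

# Usage (requires a local Ollama install, so it is not executed here):
#   ollama list | model_names
#   ollama list | has_model "llama3:latest" && echo "name found"
```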

:::caution NOTE
- If your Ollama and RAGFlow run on the same machine, use `http://localhost:11434` as base URL.
- If your Ollama and RAGFlow run on the same machine and Ollama is in Docker, use `http://host.docker.internal:11434` as base URL.
- If your Ollama runs on a different machine from RAGFlow, use `http://<IP_OF_OLLAMA_MACHINE>:11434` as base URL.
:::
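
The three cases in the note can be summarized as a small lookup. This is only an illustrative sketch: `ollama_base_url` and its scenario names (`same-host`, `docker`, `remote`) are made up for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps a deployment scenario to the base URL
# suggested in the note above. Scenario names are illustrative only.
ollama_base_url() {
  local scenario="$1"
  local ip="${2:-}"
  case "$scenario" in
    same-host)
      echo "http://localhost:11434" ;;
    docker)
      echo "http://host.docker.internal:11434" ;;
    remote)
      if [ -z "$ip" ]; then
        echo "remote scenario needs the IP of the Ollama machine" >&2
        return 1
      fi
      echo "http://${ip}:11434" ;;
    *)
      echo "unknown scenario: $scenario" >&2
      return 1 ;;
  esac
}

# Example: Ollama on another machine (192.0.2.10 is a placeholder address).
ollama_base_url remote 192.0.2.10
```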

:::danger WARNING
If your Ollama runs on a different machine, you may also need to set the `OLLAMA_HOST` environment variable to `0.0.0.0` in **ollama.service** (Note that this is *NOT* the base URL):

```bash
Environment="OLLAMA_HOST=0.0.0.0"
```

See [this guide](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server) for more information.
:::
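
On a systemd-based Linux host, one common way to apply this setting is a drop-in file for the service. The sketch below assumes that Ollama runs as a systemd unit named `ollama.service`; `render_ollama_override` is a hypothetical helper that only prints the drop-in content.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a systemd drop-in that sets OLLAMA_HOST
# so that Ollama listens on all network interfaces.
render_ollama_override() {
  printf '%s\n' '[Service]' 'Environment="OLLAMA_HOST=0.0.0.0"'
}

# Applying it (assumes a unit named ollama.service; run manually as root):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   render_ollama_override | sudo tee /etc/systemd/system/ollama.service.d/override.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama
```

Listening on `0.0.0.0` exposes the service to every network the host is attached to, so you may want to pair it with a firewall rule that limits who can reach port 11434.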

:::caution WARNING
Improper base URL settings will trigger the following error:

```bash
Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff98b81ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
:::
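
`Connection refused` generally means that nothing accepted the TCP connection at the host and port in your base URL. A rough way to test this from the machine that runs RAGFlow is sketched below; it assumes `bash` and the coreutils `timeout` command are available, and `can_connect` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port can be
# opened within 3 seconds. Uses bash's built-in /dev/tcp pseudo-device.
can_connect() {
  local host="$1"
  local port="$2"
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (result depends on your environment, so it is not executed here):
#   can_connect localhost 11434 && echo "port is open" || echo "nothing is listening"
```

If RAGFlow itself runs in Docker, run the check inside the RAGFlow container, because `localhost` there refers to the container rather than to the host.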

### 6. Update System Model Settings

Click on your logo **>** **Model Providers** **>** **System Model Settings** to update your model:

*You should now be able to find **llama3** from the dropdown list under **Chat model**.*

> If your local model is an embedding model, you should find your local model under **Embedding model**.

### 7. Update Chat Configuration

Update your chat model accordingly in **Chat Configuration**:

> If your local model is an embedding model, update it on the configuration page of your knowledge base.

## Deploy a local model using Xinference

Xorbits Inference ([Xinference](https://github.com/xorbitsai/inference)) is a platform for deploying and serving AI models locally, including LLMs and embedding models.

:::note
- For information about installing Xinference, see [here](https://inference.readthedocs.io/en/latest/getting_started/).
- For a complete list of supported models, see the [Builtin Models](https://inference.readthedocs.io/en/latest/models/builtin/).
:::

To deploy a local model, e.g., **Mistral**, using Xinference:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 9997.

### 2. Start an Xinference instance

```bash
xinference-local --host 0.0.0.0 --port 9997
```

### 3. Launch your local model

Launch your local model (**Mistral**), ensuring that you replace `${quantization}` with your chosen quantization method:

```bash
xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
```

### 4. Add Xinference

In RAGFlow, click on your logo on the top right of the page **>** **Model Providers** and add Xinference to RAGFlow:

![add xinference](https://github.com/infiniflow/ragflow/assets/93570324/10635088-028b-4b3d-add9-5c5a6e626814)

### 5. Complete basic Xinference settings

Enter an accessible base URL, such as `http://<your-xinference-endpoint-domain>:9997/v1`.

> For rerank model, please use the `http://<your-xinference-endpoint-domain>:9997/v1/rerank` as the base URL.
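
The two URL shapes above differ only in the path suffix. As an illustrative sketch, the hypothetical helper `xinference_base_url` below builds either form from a host name; `xinference.example.com` is a placeholder, not a real endpoint.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the Xinference base URL described above.
# The second argument selects the model kind; only "rerank" changes the path.
xinference_base_url() {
  local host="$1"
  local kind="${2:-chat}"
  case "$kind" in
    rerank) echo "http://${host}:9997/v1/rerank" ;;
    *)      echo "http://${host}:9997/v1" ;;
  esac
}

# Example with a placeholder host name:
xinference_base_url xinference.example.com
xinference_base_url xinference.example.com rerank
```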

### 6. Update System Model Settings

Click on your logo **>** **Model Providers** **>** **System Model Settings** to update your model.

*You should now be able to find **mistral** from the dropdown list under **Chat model**.*

> If your local model is an embedding model, you should find your local model under **Embedding model**.

### 7. Update Chat Configuration

Update your chat model accordingly in **Chat Configuration**:

> If your local model is an embedding model, update it on the configuration page of your knowledge base.

## Deploy a local model using IPEX-LLM

[IPEX-LLM](https://github.com/intel-analytics/ipex-llm) is a PyTorch library for running LLMs on local Intel CPUs or GPUs (including iGPU or discrete GPUs like Arc, Flex, and Max) with low latency. It supports Ollama on Linux and Windows systems.

To deploy a local model, e.g., **Qwen2**, using IPEX-LLM-accelerated Ollama:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 11434. For example:

```bash
sudo ufw allow 11434/tcp
```

### 2. Launch Ollama service using IPEX-LLM

#### 2.1 Install IPEX-LLM for Ollama

:::tip NOTE
IPEX-LLM supports Ollama on Linux and Windows systems.
:::

For detailed information about installing IPEX-LLM for Ollama, see [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md):

- [Prerequisites](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#0-prerequisites)
- [Install IPEX-LLM cpp with Ollama binaries](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp)

*After the installation, you should have created a Conda environment, e.g., `llm-cpp`, for running Ollama commands with IPEX-LLM.*

#### 2.2 Initialize Ollama

1. Activate the `llm-cpp` Conda environment and initialize Ollama:

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
conda activate llm-cpp
init-ollama
```

</TabItem>
<TabItem value="windows">

Run these commands with *administrator privileges in Miniforge Prompt*:

```cmd
conda activate llm-cpp
init-ollama.bat
```

</TabItem>
</Tabs>

2. If the installed `ipex-llm[cpp]` requires an upgrade to the Ollama binary files, remove the old binary files and reinitialize Ollama using `init-ollama` (Linux) or `init-ollama.bat` (Windows).

*A symbolic link to Ollama appears in your current directory, and you can use this executable file following standard Ollama commands.*

#### 2.3 Launch Ollama service

1. Set the environment variable `OLLAMA_NUM_GPU` to `999` to ensure that all layers of your model run on the Intel GPU; otherwise, some layers may default to CPU.
2. For optimal performance on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), set the following environment variable before launching the Ollama service:

```bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

3. Launch the Ollama service:

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1

./ollama serve
```

</TabItem>
<TabItem value="windows">

Run the following commands *in Miniforge Prompt*:

```cmd
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1

ollama serve
```

</TabItem>
</Tabs>

:::tip NOTE
To enable the Ollama service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` rather than simply `./ollama serve`.
:::

*The console displays messages similar to the following:*

![](https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png)

### 3. Pull and Run Ollama model

#### 3.1 Pull Ollama model

With the Ollama service running, open a new terminal and run `./ollama pull <model_name>` (Linux) or `ollama.exe pull <model_name>` (Windows) to pull the desired model, e.g., `qwen2:latest`:

![](https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png)

#### 3.2 Run Ollama model

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
./ollama run qwen2:latest
```

</TabItem>
<TabItem value="windows">

```cmd
ollama run qwen2:latest
```

</TabItem>
</Tabs>

### 4. Configure RAGFlow

To enable IPEX-LLM accelerated Ollama in RAGFlow, you must also complete the configurations in RAGFlow. The steps are identical to those outlined in the *Deploy a local model using Ollama* section:

1. [Add Ollama](#4-add-ollama)
2. [Complete basic Ollama settings](#5-complete-basic-ollama-settings)
3. [Update System Model Settings](#6-update-system-model-settings)
4. [Update Chat Configuration](#7-update-chat-configuration)

## Deploy a local model using jina

To deploy a local model, e.g., **gpt2**, using jina:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 12345.

```bash
sudo ufw allow 12345/tcp
```

### 2. Install jina package

```bash
pip install jina
```

### 3. Deploy a local model

Step 1: Navigate to the **rag/svr** directory.

```bash
cd rag/svr
```

Step 2: Run **jina_server.py**, specifying either the model's name or its local directory:

```bash
python jina_server.py --model_name gpt2
```

> The script only supports models downloaded from Hugging Face.