---
sidebar_position: 2
slug: /deploy_local_llm
---

# Deploy local models

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Deploy and run local models using Ollama, Xinference, or other frameworks.

---
RAGFlow supports deploying models locally using Ollama, Xinference, IPEX-LLM, or jina. If you have locally deployed models to leverage or wish to enable GPU or CUDA for inference acceleration, you can bind Ollama or Xinference into RAGFlow and use either of them as a local "server" for interacting with your local models.

RAGFlow seamlessly integrates with Ollama and Xinference, without the need for further environment configurations. You can use them to deploy two types of local models in RAGFlow: chat models and embedding models.

:::tip NOTE
This user guide does not intend to cover much of the installation or configuration details of Ollama or Xinference; its focus is on configurations inside RAGFlow. For the most current information, you may need to check out the official site of Ollama or Xinference.
:::

## Deploy local models using Ollama

[Ollama](https://github.com/ollama/ollama) enables you to run open-source large language models locally. It bundles model weights, configurations, and data into a single package, defined by a Modelfile, and optimizes setup and configurations, including GPU usage.

:::note
- For information about downloading Ollama, see [here](https://github.com/ollama/ollama?tab=readme-ov-file#ollama).
- For a complete list of supported models and variants, see the [Ollama model library](https://ollama.com/library).
:::
### 1. Deploy Ollama using Docker

Ollama can be [installed from binaries](https://ollama.com/download) or [deployed with Docker](https://hub.docker.com/r/ollama/ollama). Here are the instructions to deploy it with Docker:

```bash
$ sudo docker run --name ollama -p 11434:11434 ollama/ollama
> time=2024-12-02T02:20:21.360Z level=INFO source=routes.go:1248 msg="Listening on [::]:11434 (version 0.4.6)"
> time=2024-12-02T02:20:21.360Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12]"
```

Ensure that Ollama is listening on all IP addresses:

```bash
$ sudo ss -tunlp | grep 11434
> tcp LISTEN 0 4096 0.0.0.0:11434 0.0.0.0:* users:(("docker-proxy",pid=794507,fd=4))
> tcp LISTEN 0 4096 [::]:11434 [::]:* users:(("docker-proxy",pid=794513,fd=4))
```
Pull models as you need. We recommend that you start with `llama3.2` (a 3B chat model) and `bge-m3` (a 567M embedding model):

```bash
$ sudo docker exec ollama ollama pull llama3.2
> pulling dde5aa3fc5ff... 100% ▕████████████████▏ 2.0 GB
> success
```

```bash
$ sudo docker exec ollama ollama pull bge-m3
> pulling daec91ffb5dd... 100% ▕████████████████▏ 1.2 GB
> success
```
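
To confirm that both pulls succeeded, you can list the models inside the container; `ollama list` is a standard Ollama command (the exact output columns may vary across versions):

```bash
$ sudo docker exec ollama ollama list
```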
### 2. Find the Ollama URL and ensure it is accessible

- If RAGFlow runs in Docker, the host machine's localhost is mapped inside the RAGFlow Docker container as `host.docker.internal`. If Ollama runs on the same host machine, the right URL to use for Ollama is `http://host.docker.internal:11434/`, and you should check that Ollama is accessible from inside the RAGFlow container:

```bash
$ sudo docker exec -it ragflow-server bash
$ curl http://host.docker.internal:11434/
> Ollama is running
```

- If RAGFlow is launched from source code and Ollama runs on the same host machine as RAGFlow, check if Ollama is accessible from RAGFlow's host machine:

```bash
$ curl http://localhost:11434/
> Ollama is running
```

- If RAGFlow and Ollama run on different machines, check if Ollama is accessible from RAGFlow's host machine (if this check fails, see the note after this list):

```bash
$ curl http://${IP_OF_OLLAMA_MACHINE}:11434/
> Ollama is running
```
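
If one of these checks fails because Ollama was installed from binaries and is listening only on the loopback interface, you can bind it to all interfaces and retry. This is a minimal sketch, assuming a non-Docker Ollama install on the Ollama machine; `OLLAMA_HOST` is Ollama's environment variable for the bind address:

```bash
# Assumption: Ollama installed from binaries (not the Docker deployment above).
# Bind the server to all interfaces so RAGFlow in a container or on another host can reach it.
OLLAMA_HOST=0.0.0.0 ollama serve
```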
### 3. Add Ollama

In RAGFlow, click on your logo on the top right of the page **>** **Model providers** and add Ollama to RAGFlow:

![add ollama](https://github.com/infiniflow/ragflow/assets/93570324/10635088-028b-4b3d-add9-5c5a6e626814)

### 4. Complete basic Ollama settings

In the popup window, complete the basic settings for Ollama:

1. Ensure that your model name and type match those pulled in step 1 (Deploy Ollama using Docker), for example, (`llama3.2` and `chat`) or (`bge-m3` and `embedding`).
2. In **Ollama base URL**, enter the URL you found in step 2 followed by `/v1`, i.e., `http://host.docker.internal:11434/v1`, `http://localhost:11434/v1`, or `http://${IP_OF_OLLAMA_MACHINE}:11434/v1`.
3. OPTIONAL: Switch on the toggle under **Does it support Vision?** if your model includes an image-to-text model.

:::caution WARNING
Improper base URL settings will trigger the following error:
```bash
Max retries exceeded with url: /api/chat (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff98b81ff0>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
:::
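
If you hit this error, it helps to rule out Ollama itself before revisiting the base URL. A minimal check, assuming Ollama is reachable at `http://localhost:11434` and `llama3.2` was pulled in step 1, is to call Ollama's `/api/generate` endpoint directly:

```bash
# Adjust the host to host.docker.internal or the Ollama machine's IP as applicable.
$ curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Say hello in one word.",
  "stream": false
}'
```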
### 5. Update System Model Settings

Click on your logo **>** **Model providers** **>** **System Model Settings** to update your model:

- *You should now be able to find **llama3.2** from the dropdown list under **Chat model**, and **bge-m3** from the dropdown list under **Embedding model**.*
- *If your local model is an embedding model, you should find it under **Embedding model**.*

### 6. Update Chat Configuration

Update your model(s) accordingly in **Chat Configuration**.
## Deploy a local model using Xinference

Xorbits Inference ([Xinference](https://github.com/xorbitsai/inference)) enables you to unleash the full potential of cutting-edge AI models.

:::note
- For information about installing Xinference, see [here](https://inference.readthedocs.io/en/latest/getting_started/).
- For a complete list of supported models, see the [Builtin Models](https://inference.readthedocs.io/en/latest/models/builtin/).
:::

To deploy a local model, e.g., **Mistral**, using Xinference:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 9997.
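
For example, on a host that uses `ufw` as its firewall (an assumption; adapt the command to your firewall):

```bash
sudo ufw allow 9997/tcp
```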
### 2. Start an Xinference instance

```bash
$ xinference-local --host 0.0.0.0 --port 9997
```

### 3. Launch your local model

Launch your local model (**Mistral**), ensuring that you replace `${quantization}` with your chosen quantization method:

```bash
$ xinference launch -u mistral --model-name mistral-v0.1 --size-in-billions 7 --model-format pytorch --quantization ${quantization}
```
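
Before wiring Xinference into RAGFlow, you can optionally confirm that the model is being served. Xinference exposes an OpenAI-compatible API; the check below assumes the default host and port used above:

```bash
# Lists the models currently served by Xinference; "mistral" should appear once the launch completes.
$ curl http://localhost:9997/v1/models
```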
### 4. Add Xinference

In RAGFlow, click on your logo on the top right of the page **>** **Model providers** and add Xinference to RAGFlow:

![add xinference](https://github.com/infiniflow/ragflow/assets/93570324/10635088-028b-4b3d-add9-5c5a6e626814)

### 5. Complete basic Xinference settings

Enter an accessible base URL, such as `http://<your-xinference-endpoint-domain>:9997/v1`.

> For a rerank model, use `http://<your-xinference-endpoint-domain>:9997/v1/rerank` as the base URL.

### 6. Update System Model Settings

Click on your logo **>** **Model providers** **>** **System Model Settings** to update your model.

*You should now be able to find **mistral** from the dropdown list under **Chat model**.*

> If your local model is an embedding model, you should find your local model under **Embedding model**.

### 7. Update Chat Configuration

Update your chat model accordingly in **Chat Configuration**:

> If your local model is an embedding model, update it on the configuration page of your knowledge base.
## Deploy a local model using IPEX-LLM

[IPEX-LLM](https://github.com/intel-analytics/ipex-llm) is a PyTorch library for running LLMs on local Intel CPUs or GPUs (including iGPUs and discrete GPUs like Arc, Flex, and Max) with low latency. It supports Ollama on Linux and Windows systems.

To deploy a local model, e.g., **Qwen2**, using IPEX-LLM-accelerated Ollama:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 11434. For example:

```bash
sudo ufw allow 11434/tcp
```
### 2. Launch Ollama service using IPEX-LLM

#### 2.1 Install IPEX-LLM for Ollama

:::tip NOTE
IPEX-LLM supports Ollama on Linux and Windows systems.
:::

For detailed information about installing IPEX-LLM for Ollama, see the [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md):

- [Prerequisites](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#0-prerequisites)
- [Install IPEX-LLM cpp with Ollama binaries](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp)

*After the installation, you should have created a Conda environment, e.g., `llm-cpp`, for running Ollama commands with IPEX-LLM.*
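
In practice, the install typically boils down to creating that Conda environment and installing the `ipex-llm[cpp]` package. This is a sketch based on the linked guide; check the guide for the current command and any platform-specific prerequisites:

```bash
# Create and activate the Conda environment used for IPEX-LLM's Ollama/llama.cpp binaries.
conda create -n llm-cpp python=3.11
conda activate llm-cpp
# Install IPEX-LLM with its C++ (llama.cpp/Ollama) backend.
pip install --pre --upgrade ipex-llm[cpp]
```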
#### 2.2 Initialize Ollama

1. Activate the `llm-cpp` Conda environment and initialize Ollama:

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
conda activate llm-cpp
init-ollama
```

</TabItem>
<TabItem value="windows">

Run these commands with *administrator privileges in Miniforge Prompt*:

```cmd
conda activate llm-cpp
init-ollama.bat
```

</TabItem>
</Tabs>

2. If the installed `ipex-llm[cpp]` requires an upgrade to the Ollama binary files, remove the old binary files and reinitialize Ollama using `init-ollama` (Linux) or `init-ollama.bat` (Windows).

*A symbolic link to Ollama appears in your current directory, and you can use this executable file following standard Ollama commands.*
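
As a quick sanity check (assuming the `ollama` symlink created by `init-ollama` in the current directory on Linux), you can invoke the linked binary with a standard Ollama command:

```bash
# Prints the Ollama client version; a warning about no running server is expected at this point.
./ollama --version
```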
#### 2.3 Launch Ollama service

1. Set the environment variable `OLLAMA_NUM_GPU` to `999` to ensure that all layers of your model run on the Intel GPU; otherwise, some layers may default to CPU.
2. For optimal performance on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), set the following environment variable before launching the Ollama service:

```bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

3. Launch the Ollama service:

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
```

</TabItem>
<TabItem value="windows">

Run the following command *in Miniforge Prompt*:

```cmd
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```

</TabItem>
</Tabs>

:::tip NOTE
To enable the Ollama service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` rather than simply `./ollama serve`.
:::

*The console displays messages similar to the following:*

![](https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png)
### 3. Pull and Run Ollama model

#### 3.1 Pull Ollama model

With the Ollama service running, open a new terminal and run `./ollama pull <model_name>` (Linux) or `ollama.exe pull <model_name>` (Windows) to pull the desired model, e.g., `qwen2:latest`:

![](https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png)
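
For example, on Linux, pulling the model used in the rest of this section looks like this (run it from the directory containing the `ollama` symlink created in step 2.2):

```bash
./ollama pull qwen2:latest
```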
#### 3.2 Run Ollama model

<Tabs
  defaultValue="linux"
  values={[
    {label: 'Linux', value: 'linux'},
    {label: 'Windows', value: 'windows'},
  ]}>
<TabItem value="linux">

```bash
./ollama run qwen2:latest
```

</TabItem>
<TabItem value="windows">

```cmd
ollama run qwen2:latest
```

</TabItem>
</Tabs>
### 4. Configure RAGFlow

To enable IPEX-LLM-accelerated Ollama in RAGFlow, you must also complete the configurations in RAGFlow. The steps are identical to those outlined in the *Deploy local models using Ollama* section:

1. [Add Ollama](#3-add-ollama)
2. [Complete basic Ollama settings](#4-complete-basic-ollama-settings)
3. [Update System Model Settings](#5-update-system-model-settings)
4. [Update Chat Configuration](#6-update-chat-configuration)
## Deploy a local model using jina

To deploy a local model, e.g., **gpt2**, using jina:

### 1. Check firewall settings

Ensure that your host machine's firewall allows inbound connections on port 12345.

```bash
sudo ufw allow 12345/tcp
```

### 2. Install jina package

```bash
pip install jina
```

### 3. Deploy a local model

Step 1: Navigate to the **rag/svr** directory.

```bash
cd rag/svr
```

Step 2: Run **jina_server.py**, specifying either the model's name or its local directory:

```bash
python jina_server.py --model_name gpt2
```

> The script only supports models downloaded from Hugging Face.