
indexing_runner.py

import concurrent.futures
import json
import logging
import re
import threading
import time
import uuid
from typing import Any, Optional, cast

from flask import current_app
from sqlalchemy.orm.exc import ObjectDeletedError

from configs import dify_config
from core.entities.knowledge_entities import IndexingEstimate, PreviewDetail, QAPreviewDetail
from core.errors.error import ProviderTokenNotInitError
from core.model_manager import ModelInstance, ModelManager
from core.model_runtime.entities.model_entities import ModelType
from core.rag.cleaner.clean_processor import CleanProcessor
from core.rag.datasource.keyword.keyword_factory import Keyword
from core.rag.docstore.dataset_docstore import DatasetDocumentStore
from core.rag.extractor.entity.extract_setting import ExtractSetting
from core.rag.index_processor.constant.index_type import IndexType
from core.rag.index_processor.index_processor_base import BaseIndexProcessor
from core.rag.index_processor.index_processor_factory import IndexProcessorFactory
from core.rag.models.document import ChildDocument, Document
from core.rag.splitter.fixed_text_splitter import (
    EnhanceRecursiveCharacterTextSplitter,
    FixedRecursiveCharacterTextSplitter,
)
from core.rag.splitter.text_splitter import TextSplitter
from core.tools.utils.web_reader_tool import get_image_upload_file_ids
from extensions.ext_database import db
from extensions.ext_redis import redis_client
from extensions.ext_storage import storage
from libs import helper
from libs.datetime_utils import naive_utc_now
from models.dataset import ChildChunk, Dataset, DatasetProcessRule, DocumentSegment
from models.dataset import Document as DatasetDocument
from models.model import UploadFile
from services.feature_service import FeatureService

logger = logging.getLogger(__name__)


class IndexingRunner:
    def __init__(self):
        self.storage = storage
        self.model_manager = ModelManager()

    def run(self, dataset_documents: list[DatasetDocument]):
        """Run the indexing process."""
        for dataset_document in dataset_documents:
            try:
                # get dataset
                dataset = db.session.query(Dataset).filter_by(id=dataset_document.dataset_id).first()
                if not dataset:
                    raise ValueError("no dataset found")

                # get the process rule
                processing_rule = (
                    db.session.query(DatasetProcessRule)
                    .where(DatasetProcessRule.id == dataset_document.dataset_process_rule_id)
                    .first()
                )
                if not processing_rule:
                    raise ValueError("no process rule found")

                index_type = dataset_document.doc_form
                index_processor = IndexProcessorFactory(index_type).init_index_processor()
                # extract
                text_docs = self._extract(index_processor, dataset_document, processing_rule.to_dict())
                # transform
                documents = self._transform(
                    index_processor, dataset, text_docs, dataset_document.doc_language, processing_rule.to_dict()
                )
                # save segment
                self._load_segments(dataset, dataset_document, documents)
                # load
                self._load(
                    index_processor=index_processor,
                    dataset=dataset,
                    dataset_document=dataset_document,
                    documents=documents,
                )
            except DocumentIsPausedError:
                raise DocumentIsPausedError(f"Document paused, document id: {dataset_document.id}")
            except ProviderTokenNotInitError as e:
                dataset_document.indexing_status = "error"
                dataset_document.error = str(e.description)
                dataset_document.stopped_at = naive_utc_now()
                db.session.commit()
            except ObjectDeletedError:
                logger.warning("Document deleted, document id: %s", dataset_document.id)
            except Exception as e:
                logger.exception("consume document failed")
                dataset_document.indexing_status = "error"
                dataset_document.error = str(e)
                dataset_document.stopped_at = naive_utc_now()
                db.session.commit()

    def run_in_splitting_status(self, dataset_document: DatasetDocument):
        """Run the indexing process when the index_status is splitting."""
        try:
            # get dataset
            dataset = db.session.query(Dataset).filter_by(id=dataset_document.dataset_id).first()
            if not dataset:
                raise ValueError("no dataset found")

            # get existing document segments and delete them
            document_segments = (
                db.session.query(DocumentSegment)
                .filter_by(dataset_id=dataset.id, document_id=dataset_document.id)
                .all()
            )
            for document_segment in document_segments:
                db.session.delete(document_segment)
                if dataset_document.doc_form == IndexType.PARENT_CHILD_INDEX:
                    # delete child chunks
                    db.session.query(ChildChunk).where(ChildChunk.segment_id == document_segment.id).delete()
            db.session.commit()

            # get the process rule
            processing_rule = (
                db.session.query(DatasetProcessRule)
                .where(DatasetProcessRule.id == dataset_document.dataset_process_rule_id)
                .first()
            )
            if not processing_rule:
                raise ValueError("no process rule found")

            index_type = dataset_document.doc_form
            index_processor = IndexProcessorFactory(index_type).init_index_processor()
            # extract
            text_docs = self._extract(index_processor, dataset_document, processing_rule.to_dict())
            # transform
            documents = self._transform(
                index_processor, dataset, text_docs, dataset_document.doc_language, processing_rule.to_dict()
            )
            # save segment
            self._load_segments(dataset, dataset_document, documents)
            # load
            self._load(
                index_processor=index_processor, dataset=dataset, dataset_document=dataset_document, documents=documents
            )
        except DocumentIsPausedError:
            raise DocumentIsPausedError(f"Document paused, document id: {dataset_document.id}")
        except ProviderTokenNotInitError as e:
            dataset_document.indexing_status = "error"
            dataset_document.error = str(e.description)
            dataset_document.stopped_at = naive_utc_now()
            db.session.commit()
        except Exception as e:
            logger.exception("consume document failed")
            dataset_document.indexing_status = "error"
            dataset_document.error = str(e)
            dataset_document.stopped_at = naive_utc_now()
            db.session.commit()

    def run_in_indexing_status(self, dataset_document: DatasetDocument):
        """Run the indexing process when the index_status is indexing."""
        try:
            # get dataset
            dataset = db.session.query(Dataset).filter_by(id=dataset_document.dataset_id).first()
            if not dataset:
                raise ValueError("no dataset found")

            # get existing document segments
            document_segments = (
                db.session.query(DocumentSegment)
                .filter_by(dataset_id=dataset.id, document_id=dataset_document.id)
                .all()
            )
            documents = []
            if document_segments:
                for document_segment in document_segments:
                    # transform segment to node
                    if document_segment.status != "completed":
                        document = Document(
                            page_content=document_segment.content,
                            metadata={
                                "doc_id": document_segment.index_node_id,
                                "doc_hash": document_segment.index_node_hash,
                                "document_id": document_segment.document_id,
                                "dataset_id": document_segment.dataset_id,
                            },
                        )
                        if dataset_document.doc_form == IndexType.PARENT_CHILD_INDEX:
                            child_chunks = document_segment.get_child_chunks()
                            if child_chunks:
                                child_documents = []
                                for child_chunk in child_chunks:
                                    child_document = ChildDocument(
                                        page_content=child_chunk.content,
                                        metadata={
                                            "doc_id": child_chunk.index_node_id,
                                            "doc_hash": child_chunk.index_node_hash,
                                            "document_id": document_segment.document_id,
                                            "dataset_id": document_segment.dataset_id,
                                        },
                                    )
                                    child_documents.append(child_document)
                                document.children = child_documents
                        documents.append(document)

            # build index
            index_type = dataset_document.doc_form
            index_processor = IndexProcessorFactory(index_type).init_index_processor()
            self._load(
                index_processor=index_processor, dataset=dataset, dataset_document=dataset_document, documents=documents
            )
        except DocumentIsPausedError:
            raise DocumentIsPausedError(f"Document paused, document id: {dataset_document.id}")
        except ProviderTokenNotInitError as e:
            dataset_document.indexing_status = "error"
            dataset_document.error = str(e.description)
            dataset_document.stopped_at = naive_utc_now()
            db.session.commit()
        except Exception as e:
            logger.exception("consume document failed")
            dataset_document.indexing_status = "error"
            dataset_document.error = str(e)
            dataset_document.stopped_at = naive_utc_now()
            db.session.commit()

    def indexing_estimate(
        self,
        tenant_id: str,
        extract_settings: list[ExtractSetting],
        tmp_processing_rule: dict,
        doc_form: Optional[str] = None,
        doc_language: str = "English",
        dataset_id: Optional[str] = None,
        indexing_technique: str = "economy",
    ) -> IndexingEstimate:
        """
        Estimate the indexing for the document.
        """
        # check document limit
        features = FeatureService.get_features(tenant_id)
        if features.billing.enabled:
            count = len(extract_settings)
            batch_upload_limit = dify_config.BATCH_UPLOAD_LIMIT
            if count > batch_upload_limit:
                raise ValueError(f"You have reached the batch upload limit of {batch_upload_limit}.")

        embedding_model_instance = None
        if dataset_id:
            dataset = db.session.query(Dataset).filter_by(id=dataset_id).first()
            if not dataset:
                raise ValueError("Dataset not found.")
            if dataset.indexing_technique == "high_quality" or indexing_technique == "high_quality":
                if dataset.embedding_model_provider:
                    embedding_model_instance = self.model_manager.get_model_instance(
                        tenant_id=tenant_id,
                        provider=dataset.embedding_model_provider,
                        model_type=ModelType.TEXT_EMBEDDING,
                        model=dataset.embedding_model,
                    )
                else:
                    embedding_model_instance = self.model_manager.get_default_model_instance(
                        tenant_id=tenant_id,
                        model_type=ModelType.TEXT_EMBEDDING,
                    )
        else:
            if indexing_technique == "high_quality":
                embedding_model_instance = self.model_manager.get_default_model_instance(
                    tenant_id=tenant_id,
                    model_type=ModelType.TEXT_EMBEDDING,
                )

        preview_texts = []  # type: ignore
        total_segments = 0
        index_type = doc_form
        index_processor = IndexProcessorFactory(index_type).init_index_processor()
        for extract_setting in extract_settings:
            # extract
            processing_rule = DatasetProcessRule(
                mode=tmp_processing_rule["mode"], rules=json.dumps(tmp_processing_rule["rules"])
            )
            text_docs = index_processor.extract(extract_setting, process_rule_mode=tmp_processing_rule["mode"])
            documents = index_processor.transform(
                text_docs,
                embedding_model_instance=embedding_model_instance,
                process_rule=processing_rule.to_dict(),
                tenant_id=tenant_id,
                doc_language=doc_language,
                preview=True,
            )
            total_segments += len(documents)
            for document in documents:
                if len(preview_texts) < 10:
                    if doc_form and doc_form == "qa_model":
                        preview_detail = QAPreviewDetail(
                            question=document.page_content, answer=document.metadata.get("answer") or ""
                        )
                        preview_texts.append(preview_detail)
                    else:
                        preview_detail = PreviewDetail(content=document.page_content)  # type: ignore
                        if document.children:
                            preview_detail.child_chunks = [child.page_content for child in document.children]  # type: ignore
                        preview_texts.append(preview_detail)

                # delete image files and related db records
                image_upload_file_ids = get_image_upload_file_ids(document.page_content)
                for upload_file_id in image_upload_file_ids:
                    image_file = db.session.query(UploadFile).where(UploadFile.id == upload_file_id).first()
                    if image_file is None:
                        continue
                    try:
                        storage.delete(image_file.key)
                    except Exception:
                        logger.exception(
                            "Delete image_files failed while indexing_estimate, image_upload_file_id: %s",
                            upload_file_id,
                        )
                    db.session.delete(image_file)

        if doc_form and doc_form == "qa_model":
            return IndexingEstimate(total_segments=total_segments * 20, qa_preview=preview_texts, preview=[])
        return IndexingEstimate(total_segments=total_segments, preview=preview_texts)  # type: ignore

    def _extract(
        self, index_processor: BaseIndexProcessor, dataset_document: DatasetDocument, process_rule: dict
    ) -> list[Document]:
        # load file
        if dataset_document.data_source_type not in {"upload_file", "notion_import", "website_crawl"}:
            return []

        data_source_info = dataset_document.data_source_info_dict
        text_docs = []
        if dataset_document.data_source_type == "upload_file":
            if not data_source_info or "upload_file_id" not in data_source_info:
                raise ValueError("no upload file found")

            file_detail = (
                db.session.query(UploadFile).where(UploadFile.id == data_source_info["upload_file_id"]).one_or_none()
            )

            if file_detail:
                extract_setting = ExtractSetting(
                    datasource_type="upload_file", upload_file=file_detail, document_model=dataset_document.doc_form
                )
                text_docs = index_processor.extract(extract_setting, process_rule_mode=process_rule["mode"])
        elif dataset_document.data_source_type == "notion_import":
            if (
                not data_source_info
                or "notion_workspace_id" not in data_source_info
                or "notion_page_id" not in data_source_info
            ):
                raise ValueError("no notion import info found")
            extract_setting = ExtractSetting(
                datasource_type="notion_import",
                notion_info={
                    "credential_id": data_source_info["credential_id"],
                    "notion_workspace_id": data_source_info["notion_workspace_id"],
                    "notion_obj_id": data_source_info["notion_page_id"],
                    "notion_page_type": data_source_info["type"],
                    "document": dataset_document,
                    "tenant_id": dataset_document.tenant_id,
                },
                document_model=dataset_document.doc_form,
            )
            text_docs = index_processor.extract(extract_setting, process_rule_mode=process_rule["mode"])
        elif dataset_document.data_source_type == "website_crawl":
            if (
                not data_source_info
                or "provider" not in data_source_info
                or "url" not in data_source_info
                or "job_id" not in data_source_info
            ):
                raise ValueError("no website import info found")
            extract_setting = ExtractSetting(
                datasource_type="website_crawl",
                website_info={
                    "provider": data_source_info["provider"],
                    "job_id": data_source_info["job_id"],
                    "tenant_id": dataset_document.tenant_id,
                    "url": data_source_info["url"],
                    "mode": data_source_info["mode"],
                    "only_main_content": data_source_info["only_main_content"],
                },
                document_model=dataset_document.doc_form,
            )
            text_docs = index_processor.extract(extract_setting, process_rule_mode=process_rule["mode"])

        # update document status to splitting
        self._update_document_index_status(
            document_id=dataset_document.id,
            after_indexing_status="splitting",
            extra_update_params={
                DatasetDocument.word_count: sum(len(text_doc.page_content) for text_doc in text_docs),
                DatasetDocument.parsing_completed_at: naive_utc_now(),
            },
        )

        # replace doc id with document model id
        text_docs = cast(list[Document], text_docs)
        for text_doc in text_docs:
            if text_doc.metadata is not None:
                text_doc.metadata["document_id"] = dataset_document.id
                text_doc.metadata["dataset_id"] = dataset_document.dataset_id

        return text_docs

    @staticmethod
    def filter_string(text):
        text = re.sub(r"<\|", "<", text)
        text = re.sub(r"\|>", ">", text)
        text = re.sub(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\xEF\xBF\xBE]", "", text)
        # Unicode U+FFFE
        text = re.sub("\ufffe", "", text)
        return text
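    # Illustrative behaviour of filter_string (example added for clarity, not in the
    # original source): special-token delimiters are de-fanged and control characters
    # are stripped, e.g.
    #   filter_string("<|assistant|>\x00hello") -> "<assistant>hello"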
    @staticmethod
    def _get_splitter(
        processing_rule_mode: str,
        max_tokens: int,
        chunk_overlap: int,
        separator: str,
        embedding_model_instance: Optional[ModelInstance],
    ) -> TextSplitter:
        """
        Get the TextSplitter object according to the processing rule.
        """
        if processing_rule_mode in ["custom", "hierarchical"]:
            # The user-defined segmentation rule
            max_segmentation_tokens_length = dify_config.INDEXING_MAX_SEGMENTATION_TOKENS_LENGTH
            if max_tokens < 50 or max_tokens > max_segmentation_tokens_length:
                raise ValueError(f"Custom segment length should be between 50 and {max_segmentation_tokens_length}.")

            if separator:
                separator = separator.replace("\\n", "\n")

            character_splitter = FixedRecursiveCharacterTextSplitter.from_encoder(
                chunk_size=max_tokens,
                chunk_overlap=chunk_overlap,
                fixed_separator=separator,
                separators=["\n\n", "。", ". ", " ", ""],
                embedding_model_instance=embedding_model_instance,
            )
        else:
            # Automatic segmentation
            automatic_rules: dict[str, Any] = dict(DatasetProcessRule.AUTOMATIC_RULES["segmentation"])
            character_splitter = EnhanceRecursiveCharacterTextSplitter.from_encoder(
                chunk_size=automatic_rules["max_tokens"],
                chunk_overlap=automatic_rules["chunk_overlap"],
                separators=["\n\n", "。", ". ", " ", ""],
                embedding_model_instance=embedding_model_instance,
            )

        return character_splitter  # type: ignore
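    # Illustrative note (not in the original source): a "custom" or "hierarchical"
    # rule, e.g. _get_splitter("custom", 500, 50, "\\n\\n", None), yields a
    # FixedRecursiveCharacterTextSplitter that honours the user separator; any other
    # mode falls back to DatasetProcessRule.AUTOMATIC_RULES with an
    # EnhanceRecursiveCharacterTextSplitter.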
    def _split_to_documents_for_estimate(
        self, text_docs: list[Document], splitter: TextSplitter, processing_rule: DatasetProcessRule
    ) -> list[Document]:
        """
        Split the text documents into nodes.
        """
        all_documents: list[Document] = []
        for text_doc in text_docs:
            # document clean
            document_text = self._document_clean(text_doc.page_content, processing_rule)
            text_doc.page_content = document_text

            # parse document to nodes
            documents = splitter.split_documents([text_doc])
            split_documents = []
            for document in documents:
                if document.page_content is None or not document.page_content.strip():
                    continue
                if document.metadata is not None:
                    doc_id = str(uuid.uuid4())
                    hash = helper.generate_text_hash(document.page_content)
                    document.metadata["doc_id"] = doc_id
                    document.metadata["doc_hash"] = hash

                split_documents.append(document)

            all_documents.extend(split_documents)

        return all_documents

    @staticmethod
    def _document_clean(text: str, processing_rule: DatasetProcessRule) -> str:
        """
        Clean the document text according to the processing rules.
        """
        if processing_rule.mode == "automatic":
            rules = DatasetProcessRule.AUTOMATIC_RULES
        else:
            rules = json.loads(processing_rule.rules) if processing_rule.rules else {}
        document_text = CleanProcessor.clean(text, {"rules": rules})

        return document_text

    @staticmethod
    def format_split_text(text: str) -> list[QAPreviewDetail]:
        regex = r"Q\d+:\s*(.*?)\s*A\d+:\s*([\s\S]*?)(?=Q\d+:|$)"
        matches = re.findall(regex, text, re.UNICODE)
        return [QAPreviewDetail(question=q, answer=re.sub(r"\n\s*", "\n", a.strip())) for q, a in matches if q and a]
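    # Illustrative example (not in the original source): format_split_text parses
    # LLM output shaped like
    #   "Q1: What is RAG?\nA1: Retrieval-augmented generation.\nQ2: ..."
    # into [QAPreviewDetail(question="What is RAG?",
    #                       answer="Retrieval-augmented generation."), ...].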
    def _load(
        self,
        index_processor: BaseIndexProcessor,
        dataset: Dataset,
        dataset_document: DatasetDocument,
        documents: list[Document],
    ) -> None:
        """
        Insert the index and update document/segment status to completed.
        """
        embedding_model_instance = None
        if dataset.indexing_technique == "high_quality":
            embedding_model_instance = self.model_manager.get_model_instance(
                tenant_id=dataset.tenant_id,
                provider=dataset.embedding_model_provider,
                model_type=ModelType.TEXT_EMBEDDING,
                model=dataset.embedding_model,
            )

        # chunk nodes by chunk size
        indexing_start_at = time.perf_counter()
        tokens = 0
        if dataset_document.doc_form != IndexType.PARENT_CHILD_INDEX and dataset.indexing_technique == "economy":
            # create keyword index
            create_keyword_thread = threading.Thread(
                target=self._process_keyword_index,
                args=(current_app._get_current_object(), dataset.id, dataset_document.id, documents),  # type: ignore
            )
            create_keyword_thread.start()

        max_workers = 10
        if dataset.indexing_technique == "high_quality":
            with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
                futures = []

                # Distribute documents into multiple groups based on the hash values of page_content,
                # so that no two threads process the same document and potential database
                # insertion deadlocks are avoided.
                document_groups: list[list[Document]] = [[] for _ in range(max_workers)]
                for document in documents:
                    hash = helper.generate_text_hash(document.page_content)
                    group_index = int(hash, 16) % max_workers
                    document_groups[group_index].append(document)
                for chunk_documents in document_groups:
                    if len(chunk_documents) == 0:
                        continue
                    futures.append(
                        executor.submit(
                            self._process_chunk,
                            current_app._get_current_object(),  # type: ignore
                            index_processor,
                            chunk_documents,
                            dataset,
                            dataset_document,
                            embedding_model_instance,
                        )
                    )

                for future in futures:
                    tokens += future.result()

        if dataset_document.doc_form != IndexType.PARENT_CHILD_INDEX and dataset.indexing_technique == "economy":
            create_keyword_thread.join()

        indexing_end_at = time.perf_counter()

        # update document status to completed
        self._update_document_index_status(
            document_id=dataset_document.id,
            after_indexing_status="completed",
            extra_update_params={
                DatasetDocument.tokens: tokens,
                DatasetDocument.completed_at: naive_utc_now(),
                DatasetDocument.indexing_latency: indexing_end_at - indexing_start_at,
                DatasetDocument.error: None,
            },
        )

    @staticmethod
    def _process_keyword_index(flask_app, dataset_id, document_id, documents):
        with flask_app.app_context():
            dataset = db.session.query(Dataset).filter_by(id=dataset_id).first()
            if not dataset:
                raise ValueError("no dataset found")
            keyword = Keyword(dataset)
            keyword.create(documents)
            if dataset.indexing_technique != "high_quality":
                document_ids = [document.metadata["doc_id"] for document in documents]
                db.session.query(DocumentSegment).where(
                    DocumentSegment.document_id == document_id,
                    DocumentSegment.dataset_id == dataset_id,
                    DocumentSegment.index_node_id.in_(document_ids),
                    DocumentSegment.status == "indexing",
                ).update(
                    {
                        DocumentSegment.status: "completed",
                        DocumentSegment.enabled: True,
                        DocumentSegment.completed_at: naive_utc_now(),
                    }
                )
                db.session.commit()

    def _process_chunk(
        self, flask_app, index_processor, chunk_documents, dataset, dataset_document, embedding_model_instance
    ):
        with flask_app.app_context():
            # check document is paused
            self._check_document_paused_status(dataset_document.id)

            tokens = 0
            if embedding_model_instance:
                page_content_list = [document.page_content for document in chunk_documents]
                tokens += sum(embedding_model_instance.get_text_embedding_num_tokens(page_content_list))

            # load index
            index_processor.load(dataset, chunk_documents, with_keywords=False)

            document_ids = [document.metadata["doc_id"] for document in chunk_documents]
            db.session.query(DocumentSegment).where(
                DocumentSegment.document_id == dataset_document.id,
                DocumentSegment.dataset_id == dataset.id,
                DocumentSegment.index_node_id.in_(document_ids),
                DocumentSegment.status == "indexing",
            ).update(
                {
                    DocumentSegment.status: "completed",
                    DocumentSegment.enabled: True,
                    DocumentSegment.completed_at: naive_utc_now(),
                }
            )
            db.session.commit()

            return tokens

    @staticmethod
    def _check_document_paused_status(document_id: str):
        indexing_cache_key = f"document_{document_id}_is_paused"
        result = redis_client.get(indexing_cache_key)
        if result:
            raise DocumentIsPausedError()

    @staticmethod
    def _update_document_index_status(
        document_id: str, after_indexing_status: str, extra_update_params: Optional[dict] = None
    ) -> None:
        """
        Update the document indexing status.
        """
        count = db.session.query(DatasetDocument).filter_by(id=document_id, is_paused=True).count()
        if count > 0:
            raise DocumentIsPausedError()
        document = db.session.query(DatasetDocument).filter_by(id=document_id).first()
        if not document:
            raise DocumentIsDeletedPausedError()

        update_params = {DatasetDocument.indexing_status: after_indexing_status}

        if extra_update_params:
            update_params.update(extra_update_params)

        db.session.query(DatasetDocument).filter_by(id=document_id).update(update_params)  # type: ignore
        db.session.commit()

    @staticmethod
    def _update_segments_by_document(dataset_document_id: str, update_params: dict) -> None:
        """
        Update the document segments by document id.
        """
        db.session.query(DocumentSegment).filter_by(document_id=dataset_document_id).update(update_params)
        db.session.commit()

    def _transform(
        self,
        index_processor: BaseIndexProcessor,
        dataset: Dataset,
        text_docs: list[Document],
        doc_language: str,
        process_rule: dict,
    ) -> list[Document]:
        # get embedding model instance
        embedding_model_instance = None
        if dataset.indexing_technique == "high_quality":
            if dataset.embedding_model_provider:
                embedding_model_instance = self.model_manager.get_model_instance(
                    tenant_id=dataset.tenant_id,
                    provider=dataset.embedding_model_provider,
                    model_type=ModelType.TEXT_EMBEDDING,
                    model=dataset.embedding_model,
                )
            else:
                embedding_model_instance = self.model_manager.get_default_model_instance(
                    tenant_id=dataset.tenant_id,
                    model_type=ModelType.TEXT_EMBEDDING,
                )

        documents = index_processor.transform(
            text_docs,
            embedding_model_instance=embedding_model_instance,
            process_rule=process_rule,
            tenant_id=dataset.tenant_id,
            doc_language=doc_language,
        )

        return documents

    def _load_segments(self, dataset, dataset_document, documents):
        # save nodes to document segments
        doc_store = DatasetDocumentStore(
            dataset=dataset, user_id=dataset_document.created_by, document_id=dataset_document.id
        )

        # add document segments
        doc_store.add_documents(docs=documents, save_child=dataset_document.doc_form == IndexType.PARENT_CHILD_INDEX)

        # update document status to indexing
        cur_time = naive_utc_now()
        self._update_document_index_status(
            document_id=dataset_document.id,
            after_indexing_status="indexing",
            extra_update_params={
                DatasetDocument.cleaning_completed_at: cur_time,
                DatasetDocument.splitting_completed_at: cur_time,
            },
        )

        # update segment status to indexing
        self._update_segments_by_document(
            dataset_document_id=dataset_document.id,
            update_params={
                DocumentSegment.status: "indexing",
                DocumentSegment.indexing_at: naive_utc_now(),
            },
        )


class DocumentIsPausedError(Exception):
    pass


class DocumentIsDeletedPausedError(Exception):
    pass
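
Usage note (an illustrative sketch, not part of this module): IndexingRunner is normally driven from a background worker once the documents to index have been queued. The import path and the query below are assumptions inferred from this file's own imports.

    import logging

    from core.indexing_runner import DocumentIsPausedError, IndexingRunner
    from extensions.ext_database import db
    from models.dataset import Document as DatasetDocument


    def index_documents(document_ids: list[str]) -> None:
        # Load the queued documents, then hand them to the runner, which extracts,
        # transforms, stores segments and builds the index for each one.
        documents = db.session.query(DatasetDocument).where(DatasetDocument.id.in_(document_ids)).all()
        try:
            IndexingRunner().run(documents)
        except DocumentIsPausedError as e:
            # Pausing is signalled via Redis and surfaces as this exception.
            logging.info(str(e))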