English | [简体中文](./README_zh.md)

# *Deep*Doc

- [1. Introduction](#1)
- [2. Vision](#2)
- [3. Parser](#3)

<a name="1"></a>

## 1. Introduction

With a bunch of documents from various domains, in various formats, and with diverse retrieval requirements,
accurate analysis becomes a very challenging task. *Deep*Doc was born for that purpose.
There are two parts in *Deep*Doc so far: vision and parser.
You can run the following test programs if you are interested in our results of OCR, layout recognition, and TSR.

```bash
python deepdoc/vision/t_ocr.py -h
usage: t_ocr.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR]

options:
  -h, --help            show this help message and exit
  --inputs INPUTS       Directory where to store images or PDFs, or a file path to a single image or PDF
  --output_dir OUTPUT_DIR
                        Directory where to store the output images. Default: './ocr_outputs'
```

```bash
python deepdoc/vision/t_recognizer.py -h
usage: t_recognizer.py [-h] --inputs INPUTS [--output_dir OUTPUT_DIR] [--threshold THRESHOLD] [--mode {layout,tsr}]

options:
  -h, --help            show this help message and exit
  --inputs INPUTS       Directory where to store images or PDFs, or a file path to a single image or PDF
  --output_dir OUTPUT_DIR
                        Directory where to store the output images. Default: './layouts_outputs'
  --threshold THRESHOLD
                        A threshold to filter out detections. Default: 0.5
  --mode {layout,tsr}   Task mode: layout recognition or table structure recognition
```

Our models are served on HuggingFace. If you have trouble downloading HuggingFace models, the following might help:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```
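
If you prefer to do this from Python, the same environment variable can be set in code before any model download runs. This is a minimal sketch of the same setting, assuming `HF_ENDPOINT` is read by the Hugging Face download tooling before the models are fetched, exactly as with the shell command above.

```python
import os

# Point Hugging Face downloads at the mirror before any model is fetched.
# Equivalent to `export HF_ENDPOINT=https://hf-mirror.com` in the shell;
# it must run before any code that downloads the models.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
```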

<a name="2"></a>

## 2. Vision

We use visual information to solve problems the way humans do.

- OCR. Since many documents are presented as images, or can at least be converted to images, OCR is an essential, fundamental, and even universal solution for text extraction.

```bash
python deepdoc/vision/t_ocr.py --inputs=path_to_images_or_pdfs --output_dir=path_to_store_result
```

The inputs can be a directory of images or PDFs, or a single image or PDF.
You can look into the folder 'path_to_store_result', which contains images that visualize the positions of the results
as well as txt files that contain the OCR text.

<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/f25bee3d-aaf7-4102-baf5-d5208361d110" width="900"/>
</div>
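
If you want to feed the recognized text into a downstream pipeline, one simple approach is to read the txt files back from the output directory. This is a minimal sketch assuming the layout described above (txt files written alongside the annotated images); the exact file-naming scheme is an assumption here, not something DeepDoc guarantees.

```python
from pathlib import Path

def collect_ocr_text(output_dir: str) -> dict:
    """Gather the OCR text written by t_ocr.py, one entry per .txt file."""
    texts = {}
    for txt_file in sorted(Path(output_dir).glob("*.txt")):
        texts[txt_file.stem] = txt_file.read_text(encoding="utf-8")
    return texts

if __name__ == "__main__":
    for name, text in collect_ocr_text("path_to_store_result").items():
        print(f"{name}: {len(text)} characters of OCR text")
```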

- Layout recognition. Documents from different domains may have various layouts;
newspapers, magazines, books, and résumés, for example, are distinct in terms of layout.
Only when a machine has an accurate layout analysis can it decide whether text parts are successive,
whether a part needs Table Structure Recognition (TSR), or whether a part is a figure described by a nearby caption.
We have 10 basic layout components, which cover most cases:
  - Text
  - Title
  - Figure
  - Figure caption
  - Table
  - Table caption
  - Header
  - Footer
  - Reference
  - Equation

Try the following command to see the layout detection results.

```bash
python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=layout --output_dir=path_to_store_result
```

The inputs can be a directory of images or PDFs, or a single image or PDF.
You can look into the folder 'path_to_store_result', which contains images that demonstrate the detection results, as shown below:

<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/07e0f625-9b28-43d0-9fbb-5bf586cd286f" width="1000"/>
</div>
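
To make the role of `--threshold` concrete, here is a purely illustrative sketch. It is not DeepDoc's actual output format: it assumes each detection carries a layout type, a confidence score, and a bounding box (these field names are assumptions), and shows the kind of filtering the threshold performs.

```python
# Illustrative only: hypothetical detection records, not DeepDoc's actual output schema.
detections = [
    {"type": "title",          "score": 0.92, "bbox": (40, 30, 560, 70)},
    {"type": "text",           "score": 0.81, "bbox": (40, 90, 560, 400)},
    {"type": "figure caption", "score": 0.12, "bbox": (40, 420, 560, 450)},
]

def filter_detections(dets, threshold=0.5):
    """Drop detections below the confidence threshold,
    which is what --threshold does for t_recognizer.py."""
    return [d for d in dets if d["score"] >= threshold]

# A lower threshold (e.g. 0.2, as in the command above) keeps more, possibly
# noisier, regions; a higher one keeps only confident detections.
print(filter_detections(detections, threshold=0.2))
```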

- Table Structure Recognition (TSR). Data tables are a frequently used structure to present data, including numbers and text.
The structure of a table can be very complex: hierarchical headers, spanning cells, projected row headers, and so on.
Along with TSR, we also reassemble the table content into sentences that an LLM can readily comprehend (a sketch of this idea follows the example below).
We have five labels for the TSR task:
  - Column
  - Row
  - Column header
  - Projected row header
  - Spanning cell

Try the following command to see the table structure recognition results.

```bash
python deepdoc/vision/t_recognizer.py --inputs=path_to_images_or_pdfs --threshold=0.2 --mode=tsr --output_dir=path_to_store_result
```

The inputs can be a directory of images or PDFs, or a single image or PDF.
You can look into the folder 'path_to_store_result', which contains both images and HTML pages that demonstrate the detection results, as shown below:

<div align="center" style="margin-top:20px;margin-bottom:20px;">
<img src="https://github.com/infiniflow/ragflow/assets/12318111/cb24e81b-f2ba-49f3-ac09-883d75606f4c" width="1000"/>
</div>
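
To give a feel for what "reassembling table content into sentences" means, here is a deliberately simplified sketch. It is not DeepDoc's implementation: it assumes a flat table with a single header row and a row label in the first column, and emits one sentence per data cell. The example table and function name are made up for illustration.

```python
def table_to_sentences(rows):
    """Rewrite a simple table as natural-language sentences.

    Assumes the first row holds the column headers and the first column
    labels each row; tables with hierarchical headers, spanning cells, or
    projected row headers are exactly what TSR is designed to handle.
    """
    header, *body = rows
    sentences = []
    for row in body:
        row_label = row[0]
        for col_name, value in zip(header[1:], row[1:]):
            sentences.append(f"The {col_name} of {row_label} is {value}.")
    return sentences

if __name__ == "__main__":
    table = [
        ["Region", "Revenue", "Growth"],
        ["EMEA",   "1.2M",    "8%"],
        ["APAC",   "0.9M",    "12%"],
    ]
    for sentence in table_to_sentences(table):
        print(sentence)
```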

<a name="3"></a>

## 3. Parser

Four kinds of document formats, namely PDF, DOCX, EXCEL, and PPT, have their corresponding parsers.
The most complex one is the PDF parser, owing to PDF's flexibility. The output of the PDF parser includes:

- Text chunks with their own positions in the PDF (page number and rectangular positions).
- Tables with a cropped image from the PDF and contents that have already been translated into natural-language sentences.
- Figures with captions and the text inside the figures.
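
As a rough illustration of how this output might look when consumed programmatically, the sketch below defines hypothetical record types for the three kinds of output. The class and field names are assumptions made for illustration, not DeepDoc's actual data model.

```python
from dataclasses import dataclass

# Hypothetical record types for the three kinds of PDF parser output described above.
# All names and fields are illustrative assumptions, not DeepDoc's actual schema.

@dataclass
class TextChunk:
    text: str
    page_number: int
    bbox: tuple           # (left, top, right, bottom) rectangle on that page

@dataclass
class TableChunk:
    image_path: str       # cropped table image taken from the PDF
    sentences: list       # table content rewritten as natural-language sentences

@dataclass
class FigureChunk:
    image_path: str
    caption: str
    inner_text: str       # text recognized inside the figure

chunk = TextChunk(text="Revenue grew 8% year over year.",
                  page_number=3,
                  bbox=(72.0, 144.0, 520.0, 180.0))
print(chunk)
```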

### Résumé

The résumé is a very complicated kind of document. A résumé, which is composed of unstructured text
with various layouts, can be resolved into structured data made up of nearly a hundred fields.
We haven't open-sourced the parser itself yet; what we open up is the processing method applied after the parsing procedure.