Vitaliy Groshev
7e75b9d778
fix parsing spaces in russian language PDFs (#1987) (#2427)
### What problem does this PR solve?
[#1987](https://github.com/infiniflow/ragflow/issues/1987)
When scanning PDF files character by character, the parser dropped
spaces if the string did not match the regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs those spaces, but it does not match the regex because it uses a
different alphabet. As a result, such PDFs were parsed incorrectly and
were almost unusable as a source. This PR fixes that by adding the
Russian alphabet to the regex.
Other languages that use different alphabets might have the same
problem. I additionally tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816),
and the old `[a-zA-Z...]` regex parses it correctly, with spaces.
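The effect described above can be reproduced with a minimal sketch. The function `join_chars` and both patterns below are hypothetical names written for illustration only; they are not the project's actual code, and the real regex may differ. The sketch assumes a space is kept only when the two characters around it match the pattern.

```python
import re

# Hypothetical patterns for illustration; the project's real regex may differ.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.:;!%]+")
# The ranges "а-я" and "А-Я" do not cover "ё"/"Ё", so they are listed explicitly.
WITH_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.:;!%]+")


def join_chars(chars, pattern):
    """Join single characters, keeping a space only if both neighbours match."""
    out = []
    for i, ch in enumerate(chars):
        if ch != " ":
            out.append(ch)
            continue
        prev_ch = chars[i - 1] if i > 0 else ""
        next_ch = chars[i + 1] if i + 1 < len(chars) else ""
        # Keep the space only when the surrounding pair matches the pattern.
        if prev_ch and next_ch and pattern.fullmatch(prev_ch + next_ch):
            out.append(" ")
    return "".join(out)


russian = list("договор оферты")
print(join_chars(russian, LATIN_ONLY))     # spaces lost
print(join_chars(russian, WITH_CYRILLIC))  # spaces kept
```

With the Latin-only pattern the Russian words run together, while the extended pattern keeps the space. Text in other scripts would presumably need the same kind of extension.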
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
0cb588f7bf
Fix docx parser line bug (#1715)
### What problem does this PR solve?
#1704
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
1 year ago
Jason Lee
ebdd71ce68
fix: When parsing the bold content in PDF, the result is duplicated. (#1729)
### What problem does this PR solve?
_fix: When parsing the bold content in PDF, the result is duplicated._
Details: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
b24abee364
Fix pdfparser content confusion (#1700)
### What problem does this PR solve?
#1407 #1656
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
Kevin Hu
100b3165d8
pypdf2 to pypdf (#1684)
### What problem does this PR solve?
Switches from PyPDF2 to pypdf to address "pypdf and PyPDF2 possible
Infinite Loop when a comment isn't followed by a character" (#59).
### Type of change
- [x] Refactoring
1 year ago
Kevin Hu
d29fd52e14
fix bug about divided by zero (#1482)
### What problem does this PR solve?
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
Yuhao Tsui
7f4c63d102
fix: Delete hardcode (#1464)
### What problem does this PR solve?
After the language of the PDF is detected, a later line hardcodes the
language to Chinese. This PR deletes that hardcoded value.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
2290c2a2f0
fix pdf_parser char content confusion (#1462)
### What problem does this PR solve?
#1407
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
dbb8f7b77b
fix pdf_parser content confusion (#1458)
### What problem does this PR solve?
#1407
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
Zhedong Cen
45853505bb
Fix occasional errors in pdf table recognition (#1277)
### What problem does this PR solve?
Fix occasional errors in pdf table recognition
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
4454ba7a1e
add self-rag (#1070)
### What problem does this PR solve?
#1069
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
1 year ago
Jin Hai
cdea1d0a85
Update readme and add license (#1018)
### What problem does this PR solve?
- Update readme
- Add license
### Type of change
- [x] Documentation Update
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
1 year ago
KevinHuSh
843720f958
fix bug in pdf parser (#986)
### What problem does this PR solve?
#963
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
7eee193956
fix #917 #915 (#946)
### What problem does this PR solve?
#917
#915
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
xinzhuang
3bbdf3b770
Fix bug in computing the 'not concatenating' feature (#896)
### What problem does this PR solve?
When the PDF parser calls the `_naive_vertical_merge` method, one of the
"not concatenating" features is meant to be computed from the difference
between the `layoutno` of `b` and `b_`, but the code actually compares
`b` with `b`. I think this is a bug, so this PR fixes it.
Please check again.
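The kind of mistake described above can be sketched as follows. This is an illustrative reconstruction, not the project's code: the helper names are hypothetical, and the sketch assumes text boxes are dicts with an optional `layoutno` key.

```python
def layout_differs_buggy(b, b_):
    # Compares b with itself, so the result is always False
    # and the feature can never block a merge.
    return b.get("layoutno", 0) != b.get("layoutno", 0)


def layout_differs_fixed(b, b_):
    # Compares the current box b with the following box b_.
    return b.get("layoutno", 0) != b_.get("layoutno", 0)


# Two hypothetical text boxes that belong to different layout regions.
upper = {"text": "Section title", "layoutno": 1}
lower = {"text": "Body text", "layoutno": 2}

print(layout_differs_buggy(upper, lower))  # False
print(layout_differs_fixed(upper, lower))  # True
```

With the self-comparison, boxes from different layout regions are never flagged as distinct, so this signal cannot prevent them from being merged vertically.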
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
99be226c7c
fix coordinate error (#686)
### What problem does this PR solve?
#683
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
cab274f560
remove PyMuPDF (#618)
### What problem does this PR solve?
#613
### Type of change
- [x] Other (please describe):
1 year ago
KevinHuSh
8c07992b6c
refine code (#595)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
d589b0f568
fix exception in pdf parser (#584)
### What problem does this PR solve?
#451
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
9d60a84958
refactor code (#583)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
66f8d35632
Refactor (#537)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
0dfc8ddc0f
enlarge docker memory usage (#501)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
962c66714e
fix divide by zero bug (#447)
### What problem does this PR solve?
#445
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
加帆
39f1feaccb
Bug fix pdf parse index out of range (#440)
### What problem does this PR solve?
Fixes a bug that occurs when parsing some PDF files (#436).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
0499a3f621
rm page number exception for pdf parser (#424)
### What problem does this PR solve?
#423
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
453c29170f
make sure the models will not be loaded twice (#422)
### What problem does this PR solve?
#381
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
a5384446e3
let's load model from local (#163)
1 year ago
KevinHuSh
fd7fcb5baf
apply pep8 formalize (#155)
1 year ago
KevinHuSh
979b3a5b4b
support snapshot download from local (#153)
* support snapshot download from local
* let snapshot download from local
1 year ago
KevinHuSh
da21320b88
fix plainPdf bugs (#152)
1 year ago
KevinHuSh
71fe314955
refine page ranges (#147)
1 year ago
KevinHuSh
f6aee7f230
add use layout or not option (#145)
* add use layout or not option
* trivial
1 year ago
KevinHuSh
6c6b144de2
refine manual parser (#140)
1 year ago
KevinHuSh
6999598101
refine for English corpus (#135)
1 year ago
KevinHuSh
9a843667b3
fix github account login issue (#132)
1 year ago
KevinHuSh
9da671b951
refine manual parser (#131)
1 year ago
KevinHuSh
675a9f8d9a
add dockerfile for cuda environment. Refine table search strategy (#123)
1 year ago
KevinHuSh
8f86ab9f7f
refine pdf parser, add time zone to userinfo (#112)
1 year ago
KevinHuSh
602038ac49
fix task canceling bug (#98)
1 year ago
KevinHuSh
8a57f2afd5
change callback strategy, add timezone to docker (#96)
1 year ago
KevinHuSh
7bfaf0df29
fix position extraction bug (#93)
* fix position extraction bug
* remove delimiter for naive parser
1 year ago
KevinHuSh
685b4d8a95
fix table desc bugs, add positions to chunks (#91)
1 year ago
KevinHuSh
8a726fb04b
solve task execution issues (#90)
1 year ago
KevinHuSh
3d4315c42a
resolve the issue of naive parser (#87)
1 year ago
KevinHuSh
0429107e80
fix user login issue (#85)
1 year ago
KevinHuSh
4568a4b2cb
refine admin initialization (#75)
1 year ago
KevinHuSh
d32322c081
rename vision, add layout and tsr recognizer (#70)
* rename vision, add layout and tsr recognizer
* trivial fixing
1 year ago
KevinHuSh
cacd36c5e1
use onnx models, new deepdoc (#68)
1 year ago
KevinHuSh
a8294f2168
Refine resume parts and fix bugs in retrieval using sql (#66)
1 year ago
KevinHuSh
407b2523b6
remove unused code, separate layout detection out as a new api. Add new rag method 'table' (#55)
1 year ago