Vitaliy Groshev
7e75b9d778
fix parsing spaces in russian language PDFs (#1987) (#2427)
### What problem does this PR solve?
[#1987](https://github.com/infiniflow/ragflow/issues/1987)
When scanning PDF files character by character, the parser dropped
spaces if the string did not match the regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces but does not match the regex, because it uses a different
alphabet. As a result, such PDFs were parsed incorrectly and were almost
unusable as a source. This PR fixes that by adding the Russian alphabet
to the regex.
Other languages that use different alphabets might have the same
problem. I additionally tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816),
and the old [a-zA-Z...] regex parses it correctly, with spaces.
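
A minimal sketch of the behaviour described above, not the parser's actual code. It assumes the space is kept only when the character before it matches the regex; the function name `join_chars` and both patterns are hypothetical, and the full regex is elided in the description, so the character classes here are only illustrative.

```python
import re

# Hypothetical patterns for illustration; the parser's real regex may differ.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.?;:!%]")
LATIN_AND_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-Я,.?;:!%]")


def join_chars(chars, keep_space_after):
    """Concatenate characters, keeping a space only when the
    character before it matches `keep_space_after` (an assumption
    about how the check is applied)."""
    out = []
    for ch in chars:
        if ch == " ":
            if out and keep_space_after.match(out[-1]):
                out.append(" ")
            continue  # otherwise the space is dropped
        out.append(ch)
    return "".join(out)


russian = list("договор оферты")
spanish = list("ayuda escolar")

print(join_chars(russian, LATIN_ONLY))          # space is lost
print(join_chars(russian, LATIN_AND_CYRILLIC))  # space is kept
print(join_chars(spanish, LATIN_ONLY))          # space is kept
```

With the Latin-only pattern the Russian words run together, while the Spanish sample keeps its space because its words end in unaccented Latin letters.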
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
0cb588f7bf
Fix docx parser line bug (#1715)
### What problem does this PR solve?
#1704
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
1 year ago
Jason Lee
ebdd71ce68
fix: When parsing the bold content in PDF, the result is duplicated. (#1729)
### What problem does this PR solve?
_fix: When parsing the bold content in PDF, the result is duplicated._
Details: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
b24abee364
Fix pdfparser content confusion (#1700)
### What problem does this PR solve?
#1407 #1656
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
Kevin Hu
100b3165d8
pypdf2 to pypdf (#1684)
### What problem does this PR solve?
Possible infinite loop in pypdf and PyPDF2 when a comment isn't followed
by a character (#59)
### Type of change
- [x] Refactoring
1 year ago
Kevin Hu
d29fd52e14
fix bug about divided by zero (#1482)
### What problem does this PR solve?
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
Yuhao Tsui
7f4c63d102
fix: Delete hardcode (#1464)
### What problem does this PR solve?
After the language of the PDF is checked, a later line hardcodes the
language to Chinese.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
2290c2a2f0
fix pdf_paser char content confusion (#1462)
### What problem does this PR solve?
#1407
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
H
dbb8f7b77b
fix pdf_parser content confusion (#1458)
### What problem does this PR solve?
#1407
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
Zhedong Cen
45853505bb
Fix occasional errors in pdf table recognition (#1277)
### What problem does this PR solve?
Fix occasional errors in pdf table recognition
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
4454ba7a1e
add self-rag (#1070)
### What problem does this PR solve?
#1069
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
1 year ago
Jin Hai
cdea1d0a85
Update readme and add license (#1018)
### What problem does this PR solve?
- Update readme
- Add license
### Type of change
- [x] Documentation Update
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
1 year ago
KevinHuSh
843720f958
fix bug in pdf parser (#986)
### What problem does this PR solve?
#963
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
7eee193956
fix #917 #915 (#946)
### What problem does this PR solve?
#917
#915
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
xinzhuang
3bbdf3b770
fixbug for computing 'not concating feature' (#896)
### What problem does this PR solve?
When the PDF parser calls the `_naive_vertical_merge` method, a "not
concatenating" feature is computed from the difference between the
`layoutno` of `b` and `b_`, but the code actually compares `b` with `b`.
I think it's a bug, so this PR fixes it. Please check again.
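
A hedged sketch of the kind of mistake described above, not the method's actual code. It assumes `b` and `b_` are dict-like boxes that carry a `layoutno` key; the helper names `layouts_differ` and `layouts_differ_buggy` are hypothetical.

```python
def layouts_differ_buggy(b, b_):
    # Compares b with itself, so the result is always False
    # and the feature never fires.
    return b.get("layoutno", 0) != b.get("layoutno", 0)


def layouts_differ(b, b_):
    # Compares the current box with the next one, as intended.
    return b.get("layoutno", 0) != b_.get("layoutno", 0)


upper = {"text": "First block", "layoutno": 1}
lower = {"text": "Second block", "layoutno": 2}

print(layouts_differ_buggy(upper, lower))
print(layouts_differ(upper, lower))
```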
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
99be226c7c
fix coordinate error (#686)
### What problem does this PR solve?
#683
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
cab274f560
remove PyMuPDF (#618)
### What problem does this PR solve?
#613
### Type of change
- [x] Other (please describe):
1 year ago
KevinHuSh
8c07992b6c
refine code (#595)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
d589b0f568
fix exception in pdf parser (#584)
### What problem does this PR solve?
#451
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
9d60a84958
refactor code (#583)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
66f8d35632
Refactor (#537)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
0dfc8ddc0f
enlarge docker memory usage (#501)
### What problem does this PR solve?
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
962c66714e
fix divide by zero bug (#447)
### What problem does this PR solve?
#445
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
加帆
39f1feaccb
Bug fix pdf parse index out of range (#440)
### What problem does this PR solve?
Fixes a bug that occurs when parsing some PDF files (#436).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
0499a3f621
rm page number exception for pdf parser (#424)
### What problem does this PR solve?
#423
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
1 year ago
KevinHuSh
453c29170f
make sure the models will not be load twice (#422)
### What problem does this PR solve?
#381
### Type of change
- [x] Refactoring
1 year ago
KevinHuSh
a5384446e3
let's load model from local (#163)
1 year ago
KevinHuSh
fd7fcb5baf
apply pep8 formalize (#155)
1 year ago
KevinHuSh
979b3a5b4b
support snapshot download from local (#153)
* support snapshot download from local
* let snapshot download from local
1 year ago
KevinHuSh
da21320b88
fix plainPdf bugs (#152)
1 year ago
KevinHuSh
71fe314955
refine page ranges (#147)
1 year ago
KevinHuSh
f6aee7f230
add use layout or not option (#145)
* add use layout or not option
* trival
1 year ago
KevinHuSh
6c6b144de2
refine manual parser (#140)
1 year ago
KevinHuSh
6999598101
refine for English corpus (#135)
1 year ago
KevinHuSh
9a843667b3
fix github account login issue (#132)
1 year ago
KevinHuSh
9da671b951
refine manul parser (#131)
1 year ago
KevinHuSh
675a9f8d9a
add dockerfile for cuda envirement. Refine table search strategy, (#123)
1 year ago
KevinHuSh
8f86ab9f7f
refine pdf parser, add time zone to userinfo (#112)
1 year ago
KevinHuSh
602038ac49
fix task cancling bug (#98)
1 year ago
KevinHuSh
8a57f2afd5
change callback strategy, add timezone to docker (#96)
1 year ago
KevinHuSh
7bfaf0df29
fix position extraction bug (#93)
* fix position extraction bug
* remove delimiter for naive parser
1 year ago
KevinHuSh
685b4d8a95
fix table desc bugs, add positions to chunks (#91)
1 year ago
KevinHuSh
8a726fb04b
solve task execution issues (#90)
1 year ago
KevinHuSh
3d4315c42a
resolve the issue of naive parser (#87)
1 year ago
KevinHuSh
0429107e80
fix user login issue (#85)
1 year ago
KevinHuSh
4568a4b2cb
refine admin initialization (#75)
1 year ago
KevinHuSh
d32322c081
rename vision, add layour and tsr recognizer (#70)
* rename vision, add layour and tsr recognizer
* trivial fixing
1 year ago
KevinHuSh
cacd36c5e1
use onnx models, new deepdoc (#68)
1 year ago
KevinHuSh
a8294f2168
Refine resume parts and fix bugs in retrival using sql (#66)
1 year ago
KevinHuSh
407b2523b6
remove unused codes, seperate layout detection out as a new api. Add new rag methed 'table' (#55)
1 year ago