Vitaliy Groshev
7e75b9d778
fix parsing spaces in russian language PDFs (#1987) (#2427)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
[#1987](https://github.com/infiniflow/ragflow/issues/1987)
When scanning PDF files character by character, the parser omitted
spaces when the string did not match a regex. Text from [Russian
documents](https://github.com/user-attachments/files/16659706/dogovor_oferta.pdf)
needs spaces, but it did not match the regex because it uses a different
alphabet. As a result, such PDFs were parsed incorrectly and were almost
unusable as a source. This PR fixes that by adding the Russian alphabet to the regex.
Other languages that use different alphabets might have the same
problem. I additionally tested a [PDF in
Spanish](https://www.scusd.edu/sites/main/files/file-attachments/howtohelpyourchildsucceedinschoolspanish.pdf?1338307816),
and the old `[a-zA-Z...]` regex parses it correctly, with spaces.
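
The effect described above can be illustrated with a minimal sketch. It is not the parser's actual code: the two patterns, the `join_chars` helper, the `(text, x0, x1)` tuple layout, and the half-width gap threshold are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, for illustration only (not the project's real regex).
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.:;!%]+")
# "ё"/"Ё" lie outside the а-я / А-Я code point ranges, so they are listed explicitly.
LATIN_AND_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.:;!%]+")


def join_chars(chars, pattern):
    """Join (text, x0, x1) character boxes into a string.

    A space is inserted between two neighbours only when there is a visible
    horizontal gap AND their combined text fully matches `pattern`.
    """
    out = []
    for i, (text, x0, x1) in enumerate(chars):
        out.append(text)
        if i + 1 < len(chars):
            next_text, next_x0, _ = chars[i + 1]
            gap = next_x0 - x1
            width = x1 - x0
            if gap >= width / 2 and pattern.fullmatch(text + next_text):
                out.append(" ")
    return "".join(out)


# "да нет" laid out with a gap between the two words.
chars_ru = [("д", 0, 10), ("а", 10, 20), ("н", 30, 40), ("е", 40, 50), ("т", 50, 60)]
print(join_chars(chars_ru, LATIN_ONLY))          # данет
print(join_chars(chars_ru, LATIN_AND_CYRILLIC))  # да нет
```

With the Latin-only pattern the gap is detected but the characters fail the match, so the words run together. Extending the character class restores the space.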
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

H
0cb588f7bf
Fix docx parser line bug (#1715)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#1704  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com> 
							
						 
1 year ago

Jason Lee
ebdd71ce68
fix: When parsing the bold content in PDF, the result is duplicated. (#1729)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
_Fix: when parsing bold content in a PDF, the result is duplicated._
Details: [When using OCR to recognize Chinese titles, the structure
appears to be
duplicated](https://github.com/infiniflow/ragflow/issues/1718)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

H
b24abee364
Fix pdfparser content confusion (#1700)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#1407  #1656  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

Kevin Hu
100b3165d8
pypdf2 to pypdf (#1684)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
pypdf and PyPDF2 possible Infinite Loop when a comment isn't followed by
a character #59 
### Type of change
- [x] Refactoring 
							
						 
1 year ago

Kevin Hu
d29fd52e14
fix bug about divided by zero (#1482)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

Yuhao Tsui
7f4c63d102
fix: Delete hardcode (#1464)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
After the language of the PDF is detected, a later line hardcodes the
language to Chinese. This PR deletes that hardcoded value.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

H
2290c2a2f0
fix pdf_paser char content confusion (#1462)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#1407  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

H
dbb8f7b77b
fix pdf_parser content confusion (#1458)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#1407  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

Zhedong Cen
45853505bb
Fix occasional errors in pdf table recognition (#1277)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
Fix occasional errors in pdf table recognition
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

KevinHuSh
4454ba7a1e
add self-rag (#1070)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#1069  
### Type of change
- [x] New Feature (non-breaking change which adds functionality) 
							
						 
1 year ago

Jin Hai
cdea1d0a85
Update readme and add license (#1018)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
- Update readme
- Add license
### Type of change
- [x] Documentation Update
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com> 
							
						 
1 year ago

KevinHuSh
843720f958
fix bug in pdf parser (#986)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#963  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

KevinHuSh
7eee193956
fix #917 #915 (#946)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#917  
#915 
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

xinzhuang
3bbdf3b770
fixbug for computing 'not concating feature' (#896)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
When the PDF parser calls the `_naive_vertical_merge` method, it computes a
"not concatenating" feature from the difference between the `layoutno` values
of `b` and `b_`, but the code actually compared `b` with `b`. I think this is a
bug, so this PR fixes it. Please double-check.
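
The kind of mistake described above can be shown with a small sketch. It does not reproduce the body of `_naive_vertical_merge`: the two helper functions and the dict-shaped boxes are hypothetical, and only the names `b`, `b_`, and `layoutno` come from the description.

```python
def layout_differs_buggy(b, b_):
    # Bug: `b` is compared with itself, so the result is always False
    # and the "do not concatenate" signal never fires.
    return b.get("layoutno", 0) != b.get("layoutno", 0)


def layout_differs_fixed(b, b_):
    # Fix: compare the current box `b` with the following box `b_`.
    return b.get("layoutno", 0) != b_.get("layoutno", 0)


# Two text boxes that belong to different layout regions (assumed shape).
upper = {"text": "First column", "layoutno": 1}
lower = {"text": "Second column", "layoutno": 2}

print(layout_differs_buggy(upper, lower))  # False
print(layout_differs_fixed(upper, lower))  # True
```

In the buggy form the feature carries no information, so a merge decision based on it would treat boxes from different layout regions as if they shared one.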
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

KevinHuSh
99be226c7c
fix coordinate error (#686)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#683  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

KevinHuSh
cab274f560
remove PyMuPDF (#618)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#613  
### Type of change
- [x] Other (please describe): 
							
						 
1 year ago

KevinHuSh
8c07992b6c
refine code (#595)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
### Type of change
- [x] Refactoring 
							
						 
1 year ago

KevinHuSh
d589b0f568
fix exception in pdf parser (#584)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#451  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

KevinHuSh
9d60a84958
refactor code (#583)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
### Type of change
- [x] Refactoring 
							
						 
1 year ago

KevinHuSh
66f8d35632
Refactor (#537)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
### Type of change
- [x] Refactoring 
							
						 
1 year ago

KevinHuSh
0dfc8ddc0f
enlarge docker memory usage (#501)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
### Type of change
- [x] Refactoring 
							
						 
1 year ago

KevinHuSh
962c66714e
fix divide by zero bug (#447)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#445  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

加帆
39f1feaccb
Bug fix pdf parse index out of range (#440)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
Fixes a bug that occurs when parsing some PDF files (#436).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
							
						 
1 year ago

KevinHuSh
0499a3f621
rm page number exception for pdf parser (#424)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#423  
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue) 
							
						 
1 year ago

KevinHuSh
453c29170f
make sure the models will not be load twice (#422)
							 
							
							 
							
							
							
							
### What problem does this PR solve?
#381  
### Type of change
- [x] Refactoring 
							
						 
1 year ago

KevinHuSh
a5384446e3
let's load model from local (#163)
							 
							
							
							
						 
1 year ago

KevinHuSh
fd7fcb5baf
apply pep8 formalize (#155)
							 
							
							
							
						 
1 year ago

KevinHuSh
979b3a5b4b
support snapshot download from local (#153)
							 
							
							 
							
							
							
							
* support snapshot download from local
* let snapshot download from local 
							
						 
1 year ago

KevinHuSh
da21320b88
fix plainPdf bugs (#152)
							 
							
							
							
						 
1 year ago

KevinHuSh
71fe314955
refine page ranges (#147)
							 
							
							
							
						 
1 year ago

KevinHuSh
f6aee7f230
add use layout or not option (#145)
							 
							
							 
							
							
							
							
* add use layout or not option
* trival 
							
						 
1 year ago

KevinHuSh
6c6b144de2
refine manual parser (#140)
							 
							
							
							
						 
1 year ago

KevinHuSh
6999598101
refine for English corpus (#135)
							 
							
							
							
						 
1 year ago

KevinHuSh
9a843667b3
fix github account login issue (#132)
							 
							
							
							
						 
1 year ago

KevinHuSh
9da671b951
refine manul parser (#131)
							 
							
							
							
						 
1 year ago

KevinHuSh
675a9f8d9a
add dockerfile for cuda envirement. Refine table search strategy, (#123)
							 
							
							
							
						 
1 year ago

KevinHuSh
8f86ab9f7f
refine pdf parser, add time zone to userinfo (#112)
							 
							
							
							
						 
1 year ago

KevinHuSh
602038ac49
fix task cancling bug (#98)
							 
							
							
							
						 
1 year ago

KevinHuSh
8a57f2afd5
change callback strategy, add timezone to docker (#96)
							 
							
							
							
						 
1 year ago

KevinHuSh
7bfaf0df29
fix position extraction bug (#93)
							 
							
							 
							
							
							
							
* fix position extraction bug
* remove delimiter for naive parser 
							
						 
1 year ago

KevinHuSh
685b4d8a95
fix table desc bugs, add positions to chunks (#91)
							 
							
							
							
						 
1 year ago

KevinHuSh
8a726fb04b
solve task execution issues (#90)
							 
							
							
							
						 
1 year ago

KevinHuSh
3d4315c42a
resolve the issue of naive parser (#87)
							 
							
							
							
						 
1 year ago

KevinHuSh
0429107e80
fix user login issue (#85)
							 
							
							
							
						 
1 year ago

KevinHuSh
4568a4b2cb
refine admin initialization (#75)
							 
							
							
							
						 
1 year ago

KevinHuSh
d32322c081
rename vision, add layour and tsr recognizer (#70)
							 
							
							 
							
							
							
							
* rename vision, add layour and tsr recognizer
* trivial fixing 
							
						 
1 year ago

KevinHuSh
cacd36c5e1
use onnx models, new deepdoc (#68)
							 
							
							
							
						 
1 year ago

KevinHuSh
a8294f2168
Refine resume parts and fix bugs in retrival using sql (#66)
							 
							
							
							
						 
1 year ago

KevinHuSh
407b2523b6
remove unused codes, seperate layout detection out as a new api. Add new rag methed 'table' (#55)
							 
							
							
							
						 
1 year ago