Fix rate limit errors during document parsing (#6413)
### What problem does this PR solve?
When using an online large-model API to extract knowledge graphs for the
knowledge base, frequent rate limit errors were triggered, causing
document parsing to fail. This commit fixes the issue by adding
exponential backoff with jitter to the API calls, which reduces the
frequency of rate limit errors.
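A minimal sketch of the retry pattern, assuming a generic callable and a provider that signals rate limiting via its exception message; the helper name and exception handling are illustrative, not the code added by this PR:
```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry fn with exponential backoff plus jitter when it hits a rate limit.

    Illustrative helper only; the PR applies this pattern inside the
    knowledge-graph extraction calls.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # in practice, catch the provider's RateLimitError
            if attempt == max_retries - 1 or "rate limit" not in str(exc).lower():
                raise
            # exponential backoff: base, 2*base, 4*base, ... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # full jitter spreads retries so concurrent workers do not sync up
            time.sleep(random.uniform(0, delay))
```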
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Feat: Adds hierarchical title path tracking for tables in DOCX documents to improve context association (#6374)
### What problem does this PR solve?
Adds hierarchical title path tracking for tables in DOCX documents to
improve context association. Previously, extracted tables lacked
positional context within the document structure.
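A minimal sketch of the idea using python-docx: walk the document body in order, keep a stack of the headings currently in effect, and tag each table with that title path. The function name and heading-style parsing are assumptions, not the PR's actual implementation:
```python
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph


def tables_with_title_path(path):
    """Yield (title_path, table) pairs, where title_path lists the headings
    in effect at the point where the table appears in the document body."""
    doc = Document(path)
    heading_stack = []  # entries are (level, heading_text)
    for child in doc.element.body.iterchildren():
        if child.tag.endswith('}p'):
            para = Paragraph(child, doc)
            style = para.style.name or ""
            if style.startswith("Heading") and para.text.strip():
                parts = style.split()
                level = int(parts[-1]) if parts[-1].isdigit() else 1
                # drop headings at the same or a deeper level, then push this one
                while heading_stack and heading_stack[-1][0] >= level:
                    heading_stack.pop()
                heading_stack.append((level, para.text.strip()))
        elif child.tag.endswith('}tbl'):
            yield [text for _, text in heading_stack], Table(child, doc)
```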
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Fix: Ollama embeddings interface returning "500 Internal Server Error" (#6350)
### What problem does this PR solve?
Fix the error where the Ollama embeddings interface returns a “500
Internal Server Error” when using models such as xiaobu-embedding-v2 for
embedding.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Call register_scripts when connecting to Redis.
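A hedged sketch of the pattern, assuming a redis-py based wrapper; the class, script, and method names are illustrative, the point being that script registration happens as part of every (re)connect:
```python
import redis


class RedisDB:
    """Illustrative wrapper: re-register Lua scripts whenever we (re)connect,
    so script-based calls do not fail after a Redis restart or failover."""

    def __init__(self, host="localhost", port=6379):
        self.host, self.port = host, port
        self.conn = None
        self.delete_if_equal = None
        self.connect()

    def register_scripts(self):
        # register_script returns a callable bound to this connection
        lua = ("if redis.call('GET', KEYS[1]) == ARGV[1] then "
               "return redis.call('DEL', KEYS[1]) else return 0 end")
        self.delete_if_equal = self.conn.register_script(lua)

    def connect(self):
        self.conn = redis.StrictRedis(host=self.host, port=self.port,
                                      decode_responses=True)
        # the fix: scripts must be registered as part of (re)connecting
        self.register_scripts()
```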
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add fallback for PDF figure parser
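A hedged sketch of the fallback idea; all names here are hypothetical and only illustrate failing over to a plain image chunk when the figure parser is unavailable or raises:
```python
def parse_figure(figure_image, vision_parser=None):
    """Try the vision-based figure parser first; if it is unavailable or
    raises, fall back so PDF parsing still succeeds.
    Hypothetical names, not the PR's actual API."""
    if vision_parser is not None:
        try:
            return vision_parser(figure_image)
        except Exception:
            pass  # fall through to the plain fallback
    # fallback: keep the figure as an image chunk with no extracted text
    return {"type": "figure", "image": figure_image, "text": ""}
```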
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Optimize settings configuration initialization to resolve a MinIO
initialization error caused by using a specific storage backend.
Reproduction scenario:
- Aliyun OSS is used as the backend storage, with the STORAGE_IMPL
  environment variable set to OSS.
- The service_conf.yaml.template configuration file contains the
  OSS-related configuration, while the other storage configurations are
  commented out.
- When the service starts, it still attempts to initialize MinIO
  storage. Since there is no MinIO configuration in
  service_conf.yaml.template, this results in an error due to the
  missing configuration.
Optimization measures:
- Automatically determine the required initialization configuration
  based on the STORAGE_IMPL environment variable.
- Do not initialize configurations for unused resources (see the sketch
  below).
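A minimal sketch of the selection logic, assuming hypothetical factory classes and import paths; it only illustrates picking one backend from STORAGE_IMPL instead of constructing every storage client at startup:
```python
import os


def init_storage():
    """Initialize only the storage backend selected by STORAGE_IMPL.

    Factory names are illustrative; the point is that MinIO is no longer
    constructed when OSS (or another backend) is configured."""
    impl = os.environ.get("STORAGE_IMPL", "MINIO").upper()
    if impl == "OSS":
        from rag.utils.oss_conn import RAGFlowOSS  # hypothetical import path
        return RAGFlowOSS()
    if impl == "MINIO":
        from rag.utils.minio_conn import RAGFlowMinio  # hypothetical import path
        return RAGFlowMinio()
    raise ValueError(f"Unsupported STORAGE_IMPL: {impl}")
```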
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add VLM-boosted PDF parser if VLM is set.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Fix: Add a basic example when the content_tagging example is empty (#6276)
### What problem does this PR solve?
When using an LLM for auto-tagging, if there are no examples, the tag
format generated by the LLM may be wrong, which causes Elasticsearch
insert errors. Adding a basic example avoids this problem.
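A hedged sketch of the fallback, with an illustrative built-in example and prompt formatting (not the PR's actual prompt):
```python
import json

# A minimal built-in example used only when the knowledge base has no
# tagged chunks yet; the exact wording is illustrative.
DEFAULT_TAG_EXAMPLE = [{
    "content": "Quarterly revenue grew 12% driven by cloud subscriptions.",
    "tags": {"finance": 8, "cloud": 5},
}]


def build_tag_examples(examples):
    """Return few-shot examples for the content-tagging prompt, falling back
    to a basic example so the LLM always sees the expected JSON format."""
    if not examples:
        examples = DEFAULT_TAG_EXAMPLE
    return "\n".join(
        f"Content: {ex['content']}\nTags: {json.dumps(ex['tags'])}"
        for ex in examples
    )
```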
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add vision LLM PDF parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
Respect kb_id in Elasticsearch insert, update, and delete operations. (#6105)
### What problem does this PR solve?
Respect kb_id in Elasticsearch insert, update, and delete operations. Close #6066
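A hedged sketch using the 8.x elasticsearch Python client, with illustrative index and field names, showing how a delete can be scoped by both doc_id and kb_id so it never touches another knowledge base's data:
```python
from elasticsearch import Elasticsearch

# connection details are illustrative
es = Elasticsearch("http://localhost:9200")


def delete_chunks(index, doc_id, kb_id):
    """Delete the chunks of one document, scoped by kb_id as well as doc_id."""
    es.delete_by_query(
        index=index,
        query={"bool": {"must": [
            {"term": {"doc_id": doc_id}},
            {"term": {"kb_id": kb_id}},  # the fix: always filter by kb_id
        ]}},
    )
```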
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
…ions
### What problem does this PR solve?
This PR fixes an issue where the application was repeatedly reading the
llm_factories.json file from disk in multiple places, which could lead
to "Too many open files" errors under high load conditions. The fix
centralizes the file reading operation in the settings.py module and
stores the data in a global variable that can be accessed by other
modules.
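A minimal sketch of the caching approach, assuming a module-level variable and a hypothetical file path; the real change lives in settings.py:
```python
import json
import os

# Loaded once per process instead of re-opening the file on every request,
# which avoids "Too many open files" under load. The path is illustrative.
_FACTORIES_PATH = os.path.join(os.path.dirname(__file__), "llm_factories.json")
_FACTORY_LLM_INFOS = None


def get_llm_factories():
    """Return the parsed llm_factories.json, reading the file at most once."""
    global _FACTORY_LLM_INFOS
    if _FACTORY_LLM_INFOS is None:
        with open(_FACTORIES_PATH, encoding="utf-8") as f:
            _FACTORY_LLM_INFOS = json.load(f)
    return _FACTORY_LLM_INFOS
```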
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
Fix: optimize OCR garbage identification to reduce unnecessary filtering (#6027)
### What problem does this PR solve?
Optimize OCR garbage identification to reduce unnecessary filtering.
#5713
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add CSV file parsing support #4552, #5849, #5870
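A hedged sketch of row-wise CSV chunking using the standard csv module; the chunk format is illustrative, not the parser's exact output:
```python
import csv


def parse_csv(path, delimiter=","):
    """Turn each CSV row into a 'header: value' line, one chunk per row.
    Mirrors spreadsheet-style handling; the exact chunking is illustrative."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader, [])
        for row in reader:
            yield "; ".join(f"{h}: {v}" for h, v in zip(header, row))
```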
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Fix: signal.SIGUSR1 and signal.SIGUSR2 can't be used on Windows, so don't bind them in the Windows environment (#5941)
### What problem does this PR solve?
Fix: signal.SIGUSR1 and signal.SIGUSR2 can't be used on Windows, so don't
bind signal.SIGUSR1 and signal.SIGUSR2 in the Windows environment.
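A minimal sketch of the guard, with a placeholder handler; the real handlers are whatever the server binds on POSIX systems:
```python
import signal
import sys


def dump_info(signum, frame):
    # placeholder handler; illustrative only
    print(f"received signal {signum}")


# SIGUSR1 and SIGUSR2 are POSIX-only; binding them on Windows raises an error,
# so the handlers are only registered off Windows.
if sys.platform != "win32":
    signal.signal(signal.SIGUSR1, dump_info)
    signal.signal(signal.SIGUSR2, dump_info)
```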
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: tangyu <1@1.com>
Fix: CoHereRerank not respecting base_url when provided (#5784)
### What problem does this PR solve?
The vLLM provider with a reranking model does not work: since vLLM uses
the [CoHereRerank
provider](https://github.com/infiniflow/ragflow/blob/v0.17.0/rag/llm/__init__.py#L250)
under the hood with a `base_url`, if this URL [is not passed to the Cohere
client](https://github.com/infiniflow/ragflow/blob/v0.17.0/rag/llm/rerank_model.py#L379-L382),
every request ends up on the Cohere SaaS (sending your private API key
in the process) instead of your vLLM instance.
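A hedged sketch of the fix, assuming a cohere SDK version whose `Client` accepts `base_url`; the class and attribute names loosely mirror the linked code but are not copied from it:
```python
import cohere


class CoHereRerank:
    """Sketch: honor base_url when constructing the Cohere client, so a
    vLLM-served reranker is actually called instead of Cohere's SaaS."""

    def __init__(self, key, model_name, base_url=None):
        # base_url=None keeps the default Cohere endpoint; a vLLM URL
        # routes rerank requests to the local instance instead
        self.client = cohere.Client(api_key=key, base_url=base_url)
        self.model_name = model_name
```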
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):