Fix "File contains no valid workbook part" (#9360)
### What problem does this PR solve?
fix "File contains no valid workbook part"
stacktrace:
```
Traceback (most recent call last):
File "/ragflow/deepdoc/parser/excel_parser.py", line 54, in _load_excel_to_workbook
return RAGFlowExcelParser._dataframe_to_workbook(df)
File "/ragflow/deepdoc/parser/excel_parser.py", line 69, in _dataframe_to_workbook
ws.cell(row=row_num, column=col_num, value=value)
File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/worksheet/worksheet.py", line 246, in cell
cell.value = value
File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 218, in value
self._bind_value(value)
File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 197, in _bind_value
value = self.check_string(value)
File "/ragflow/.venv/lib/python3.10/site-packages/openpyxl/cell/cell.py", line 165, in check_string
raise IllegalCharacterError(f"{value} cannot be used in worksheets.")
```
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Add fallback to use 'calamine' parse engine in excel_parser.py (#9374)
### What problem does this PR solve?
add fallback to `calamine` engine when parse error raised using the
default `openpyxl` / `xlrd` engine.
e.g. the following error can be fixed:
```
Traceback (most recent call last):
File "/ragflow/deepdoc/parser/excel_parser.py", line 53, in _load_excel_to_workbook
df = pd.read_excel(file_like_object)
File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel
io = ExcelFile(
File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__
self._reader = self._engines[engine](
File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__
super().__init__(
File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__
self.book = self.load_workbook(self.handles.handle, engine_kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook
return open_workbook(file_contents=data, **engine_kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook
bk = open_workbook_xls(
File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 68, in open_workbook_xls
bk.biff2_8_load(
File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/book.py", line 641, in biff2_8_load
cd.locate_named_stream(UNICODE_LITERAL(qname))
File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 398, in locate_named_stream
result = self._locate_stream(
File "/ragflow/.venv/lib/python3.10/site-packages/xlrd/compdoc.py", line 429, in _locate_stream
raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
```
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Fix: When Excel is a formula, the parsed result is a formula, but cannot be correctly parsed as a value type (#6613)
### What problem does this PR solve?
Fix: When Excel is a formula, the parsed result is a formula, but cannot
be correctly parsed as a value type
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: tangyu <1@1.com>
### What problem does this PR solve?
Add CSV file parsing support #4552, #5849, #5870
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
if *.xls file is too large, .eg >50M, I get error.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
fix chunk method "Table" losing content when the Excel file has multi… (#4123)
…ple sheets
### What problem does this PR solve?
discussed in https://github.com/infiniflow/ragflow/pull/4102
- In excel_parser.py, `total` means the total number of rows in Excel,
but it return in the first iterate, that lead to the wrong `to_page`
- In table.py, it when Excel file has multiple sheets, it will be
divided into multiple parts, every part size is 3000, `data` may be
empty, because it has recorded in the last iterate.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
- Update readme
- Add license
### Type of change
- [x] Documentation Update
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
Split Excel into different chunk
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
Issue link:#[[Link the issue
here](https://github.com/infiniflow/ragflow/issues/196)]
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Breaking Change (fix or feature that could cause existing
functionality not to work as expected)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Test cases
- [ ] Python SDK impacted, Need to update PyPI
- [ ] Other (please describe):