
make language judgement robuster (#3287)

### What problem does this PR solve?



### Type of change

- [x] Performance Improvement
tags/v0.14.0
Kevin Hu, 11 months ago
parent
commit
d88f0d43ea
1 changed file with 2 additions and 1 deletion
rag/nlp/query.py (+2, -1)

             rag_tokenizer.tradi2simp(
                 rag_tokenizer.strQ2B(
                     txt.lower()))).strip()
-        txt = EsQueryer.rmWWW(txt)
 
         if not self.isChinese(txt):
+            txt = EsQueryer.rmWWW(txt)
             tks = rag_tokenizer.tokenize(txt).split(" ")
             tks_w = self.tw.weights(tks)
             tks_w = [(re.sub(r"[ \\\"'^]", "", tk), w) for tk, w in tks_w]
@@
                 return False
             return True
 
+        txt = EsQueryer.rmWWW(txt)
         qs, keywords = [], []
         for tt in self.tw.split(txt)[:256]:  # .split(" "):
             if not tt:
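
The change moves the `rmWWW` call so that `isChinese` sees the query before question words are stripped. A minimal sketch of why that ordering can matter is given here. Note that `rm_www` and `is_chinese` are hypothetical, simplified stand-ins written for illustration only; the real `EsQueryer.rmWWW` and `isChinese` are not shown in this diff and may behave differently.

```python
import re

# Hypothetical stand-ins for illustration; NOT the project's real helpers.
STOPWORDS = re.compile(
    r"\b(what|who|how|which|where|why|is|are|the|a|an|of|to|do|does)\b",
    re.IGNORECASE,
)


def rm_www(txt: str) -> str:
    """Strip common English question/stop words and squeeze whitespace."""
    stripped = STOPWORDS.sub(" ", txt)
    return re.sub(r"\s+", " ", stripped).strip()


def is_chinese(txt: str) -> bool:
    """Toy heuristic (an assumption): very short inputs, or inputs whose
    tokens are mostly not plain Latin words, are treated as Chinese."""
    tokens = txt.split()
    if len(tokens) <= 3:
        return True
    non_latin = sum(1 for t in tokens if not re.fullmatch(r"[a-zA-Z]+", t))
    return non_latin / len(tokens) >= 0.7


def detect_old(txt: str) -> bool:
    """Old order: strip question words first, then judge the language."""
    return is_chinese(rm_www(txt))


def detect_new(txt: str) -> bool:
    """New order: judge the language on the unstripped text."""
    return is_chinese(txt)


query = "what is the capital of france"
print(rm_www(query))      # the stripped query has far fewer tokens
print(detect_old(query))  # judged on the stripped text
print(detect_new(query))  # judged on the full text
```

Under these assumed heuristics, stripping first shrinks an English question to so few tokens that the short-input rule misclassifies it, whereas judging the full text does not. Whether the real helpers fail in the same way is not confirmed by the diff.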
