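What is an analyzer

1. Analyzer overview: an analyzer is a way of processing text. Following predefined rules, it splits the original document into smaller terms; how fine-grained the terms are depends on the analyzer's rules.
A commonly used Chinese analyzer is IK, which comes in two variants that differ in granularity: ik_max_word and ik_smart. For English text, the built-in standard analyzer is commonly used.
ik_max_word: splits the text at the finest granularity and exhausts all possible combinations; suited to Term Query.
ik_smart: splits the text at the coarsest granularity; suited to Phrase queries.
The following requests run each analyzer on the same text: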

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_max_word"
}

Result:
{
  "tokens": [
    {"token": "布", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "布", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
    {"token": "努力学习", "start_offset": 2, "end_offset": 6, "type": "CN_WORD", "position": 2},
    {"token": "努力", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3},
    {"token": "力学", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 4},
    {"token": "学习", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 5},
    {"token": "编程", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 6}
  ]
}

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "ik_smart"
}

Result:
{
  "tokens": [
    {"token": "布", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "布", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
    {"token": "努力学习", "start_offset": 2, "end_offset": 6, "type": "CN_WORD", "position": 2},
    {"token": "编程", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 3}
  ]
}

GET _analyze
{
  "text": ["布布努力学习编程"],
  "analyzer": "standard"
}

Result:
{
  "tokens": [
    {"token": "布", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
    {"token": "布", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
    {"token": "努", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "力", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "学", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4},
    {"token": "习", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "编", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6},
    {"token": "程", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7}
  ]
}

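When analyzers are applied

  1. At index time: fields of type text are analyzed with the analyzer configured in the mapping, and the resulting terms are stored in the inverted index.
  2. At search time: for full-text queries, the search input is analyzed as well, and the resulting terms are matched against the terms produced at index time.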
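
The two moments above can be illustrated with a toy inverted index. This is a minimal Python sketch and not how Elasticsearch is implemented: analyze is a hypothetical stand-in that only lowercases and splits on whitespace, and build_index and search are illustrative names that do not come from the article.

```python
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Index time: analyze each document and record which doc ids contain each term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search time: analyze the query the same way, then look up each resulting term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {1: "Elastic Search Guide", 2: "Learning to Program"}
index = build_index(docs)
print(search(index, "GUIDE"))  # [1]
```

The query GUIDE matches document 1 only because both sides went through the same analysis; searching the raw strings would not match.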

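Components of an analyzer

Tokenizer: defines the tokenization (word-splitting) logic.
Token Filter: processes the individual terms produced by the tokenizer.
Character Filter: preprocesses the raw characters before tokenization.
At analysis time these run in the order character filters, tokenizer, token filters.
Note: analysis does not modify the source data. It only affects the terms stored in the inverted index and the terms derived from the search input.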
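
How the three components fit together can be sketched as a small pipeline. This is a hedged Python illustration rather than Elasticsearch code: the three component functions and the run_analyzer helper are hypothetical names chosen for this example.

```python
def ampersand_char_filter(text):
    """Character filter (hypothetical): rewrite '&' before tokenization."""
    return text.replace("&", " and ")


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text into terms."""
    return text.split()


def lowercase_token_filter(tokens):
    """Token filter: normalize each term after tokenization."""
    return [t.lower() for t in tokens]


def run_analyzer(text, char_filters, tokenizer, token_filters):
    """Apply character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


source = "Tom&Jerry"
tokens = run_analyzer(
    source, [ampersand_char_filter], whitespace_tokenizer, [lowercase_token_filter]
)
print(tokens)  # ['tom', 'and', 'jerry']
print(source)  # Tom&Jerry (the source text itself is not modified)
```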

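Character Filter

Definition: preprocessing applied before tokenization, used to filter out unwanted characters.
There are three kinds of character filter:

HTML tag filter: HTML Strip Character Filter

Strips HTML tags.

html_strip
Parameter: escaped_tags, the HTML tags to keep
"type": "html_strip"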

DELETE test_html_strip_filter
# character filter
PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      }
    }
  }
}

GET test_html_strip_filter/_analyze
{
  "tokenizer": "standard",
  "char_filter": ["my_char_filter"],
  "text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}

Result:
{
  "tokens": [
    {"token": "I'm", "start_offset": 3, "end_offset": 11, "type": "<ALPHANUM>", "position": 0},
    {"token": "so", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 1},
    {"token": "a", "start_offset": 16, "end_offset": 17, "type": "<ALPHANUM>", "position": 2},
    {"token": "happy", "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 3},
    {"token": "a", "start_offset": 25, "end_offset": 26, "type": "<ALPHANUM>", "position": 4}
  ]
}

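The requests above strip every HTML tag from the text except <a>, which is kept because it is listed in escaped_tags. That is why a shows up as a token in the result.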
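
The effect of escaped_tags can be approximated in a few lines of Python. This is only a rough sketch under simplifying assumptions: the real html_strip filter handles many more cases (comments, scripts, line breaks for block-level tags), and the function below is a hypothetical helper that merely shares the filter's name.

```python
import html
import re

# Matches an opening or closing tag and captures the tag name.
TAG_PATTERN = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def html_strip(text, escaped_tags=()):
    """Remove HTML tags except those named in escaped_tags, then decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace_tag(match):
        # Leave kept tags untouched; drop every other tag.
        return match.group(0) if match.group(1).lower() in keep else ""

    return html.unescape(TAG_PATTERN.sub(replace_tag, text))


cleaned = html_strip("<p>I&apos;m so <a>happy</a>!</p>", escaped_tags=["a"])
print(cleaned)  # I'm so <a>happy</a>!
```

The <p> tags are removed and &apos; is decoded, while <a> survives and is later split into tokens by the tokenizer.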

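Character mapping filter: Mapping Character Filter

Replaces specific characters with others, using the key => value rules listed in the filter's mappings parameter in the index settings.
"type": "mapping"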

# Mapping Character Filter
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["臭 => *", "傻 => *", "逼 => *"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个臭傻逼"
}

Result:
{
  "tokens": [
    {"token": "你就是个***", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0}
  ]
}
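
The key => value rules can be mimicked with a short Python function. This is a simplified sketch, not the filter's actual implementation: mapping_char_filter here is a hypothetical helper, the longest-key-first matching is an assumption made for this sketch, and the emoticon rules are invented sample data.

```python
def mapping_char_filter(text, mappings):
    """Apply 'key => value' replacement rules to text, preferring longer keys."""
    rules = {}
    for rule in mappings:
        key, value = rule.split("=>", 1)
        key = key.strip()
        if not key:
            raise ValueError("mapping key must not be empty")
        rules[key] = value.strip()

    # Try longer keys first so the longest match wins at each position.
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


masked = mapping_char_filter("I feel :) today", [":) => happy", ":( => sad"])
print(masked)  # I feel happy today
```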

Regex replacement filter: Pattern Replace Character Filter

"type": "pattern_replace" replaces text that matches a regular expression.

# Pattern Replace Character Filter
# sample phone number: 17611001200
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手机号是17611001200"
}

Result:
{
  "tokens": [
    {"token": "您的手机号是176****1200", "start_offset": 0, "end_offset": 17, "type": "word", "position": 0}
  ]
}
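
The same masking can be tried outside Elasticsearch to check the pattern. The Python sketch below uses the standard re module; mask_phone is a hypothetical helper name, and the replacement is written with Python's \1 and \2 group references where the filter configuration uses $1 and $2.

```python
import re

# Same pattern as in the filter settings: keep the first 3 and last 4 digits.
PHONE_PATTERN = re.compile(r"(\d{3})\d{4}(\d{4})")


def mask_phone(text):
    """Replace the middle four digits of an 11-digit number with asterisks."""
    return PHONE_PATTERN.sub(r"\1****\2", text)


print(mask_phone("您的手机号是17611001200"))  # 您的手机号是176****1200
```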

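Token Filter

Token filters process the terms produced by the tokenizer, for example by converting case, removing stop words, or handling synonyms. Many token filters are built in and cover most day-to-day needs. Third-party and custom token filters are also supported.

Case conversion and stop words with the standard tokenizer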

# convert to uppercase
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": ["www elastic co guide"]
}

# convert to lowercase
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["WWW ELASTIC CO GUIDE"]
}

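Stop words
Stop words are terms that are dropped after tokenization. The list of stop words can be customized.
The stop-word definitions can be found in the analyzer plugin's configuration files.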


GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": ["what are you doing"]
}

[Figure: the IK analyzer's stop words]

Custom stop words

# custom filter
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["www"],
          "ignore_case": true
        }
      }
    }
  }
}

GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_filter"],
  "text": ["What www WWW are you doing"]
}
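
What ignore_case changes can be seen in a small Python sketch of a stop-word filter. It is illustrative only: stop_filter is a hypothetical function, and the input is split on whitespace instead of going through the standard tokenizer.

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    """Drop tokens that appear in stopwords, optionally ignoring case."""
    if ignore_case:
        stop = {word.lower() for word in stopwords}
        return [t for t in tokens if t.lower() not in stop]
    stop = set(stopwords)
    return [t for t in tokens if t not in stop]


tokens = "What www WWW are you doing".split()
print(stop_filter(tokens, ["www"], ignore_case=True))
# ['What', 'are', 'you', 'doing']
print(stop_filter(tokens, ["www"]))
# ['What', 'WWW', 'are', 'you', 'doing']
```

With ignore_case enabled both www and WWW are removed; without it only the exact lowercase form is dropped.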

Synonyms

# synonyms
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["good, nice => excellent"]
        }
      }
    }
  }
}

GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_synonym"],
  "text": ["good"]
}
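
A rule such as good, nice => excellent means that either word on the left is replaced by the word on the right. The Python sketch below shows that idea for single-word terms only. It is an assumption-laden simplification: parse_synonym_rule and synonym_filter are hypothetical names, and rules without => or with multi-word phrases are not handled.

```python
def parse_synonym_rule(rule):
    """Turn 'a, b => c' into a lookup table {'a': ['c'], 'b': ['c']}."""
    left, right = rule.split("=>", 1)
    sources = [word.strip() for word in left.split(",")]
    targets = [word.strip() for word in right.split(",")]
    return {source: targets for source in sources}


def synonym_filter(tokens, rules):
    """Replace each token that has a rule with its target terms; keep the rest."""
    table = {}
    for rule in rules:
        table.update(parse_synonym_rule(rule))
    out = []
    for token in tokens:
        out.extend(table.get(token, [token]))
    return out


print(synonym_filter(["good"], ["good, nice => excellent"]))  # ['excellent']
```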

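Tokenizer

The tokenizer is one of the core components of an analyzer. Its job is tokenization: splitting the original text into fine-grained pieces. Each piece is called a term. A tokenizer can be thought of as a predefined set of splitting rules. Many tokenizers are built in, and the default is standard.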
