lucene_02_IKAnalyzer

Published: 2023-09-06 01:47 | Editor: 胡小海 | Keywords: none

Preface

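Lucene already ships with a number of analyzers, such as StandardAnalyzer and CJKAnalyzer, but none of them segments Chinese text into meaningful words: StandardAnalyzer splits it into single characters, and CJKAnalyzer into overlapping two-character pairs.

That is not surprising, since these analyzers were not written with Chinese in mind. This post introduces a Chinese analyzer, IKAnalyzer. Its segmentation is still not perfect.

For example, it splits the sentence “高富帅，是2012年之后才有的词汇” (roughly: "高富帅 is a word that only appeared after 2012") into the tokens shown in the figure below:

[Figure: IKAnalyzer token output for the sample sentence]

However, it can be tuned through configuration files: you can add new words and filter out words that should not appear in the output, such as “的”, “啊”, and “呀”, i.e. particles and interjections that carry no meaning of their own.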

Configuring the IK analyzer

Step 1: add IK to pom.xml. Note: this analyzer has not been updated since 2012, so it only works with older versions of Lucene. This example uses 4.10.3.

<!-- IK Chinese analyzer -->
<!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
<dependency>
  <groupId>com.janeluo</groupId>
  <artifactId>ikanalyzer</artifactId>
  <version>2012_u6</version>
</dependency>

Complete pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.chen</groupId>
  <artifactId>lucene</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>lucene</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <!-- JUnit 4, declared once; the @Test annotation in the test code needs it -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.6</version>
    </dependency>
    <!-- IK Chinese analyzer -->
    <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
    <dependency>
      <groupId>com.janeluo</groupId>
      <artifactId>ikanalyzer</artifactId>
      <version>2012_u6</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

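Step 2: add the configuration file, the extension dictionary file, and the stop-word file to the resources directory.

IKAnalyzer.cfg.xml is the analyzer's core configuration file. It points to ext.dic (the extension dictionary) and stopword.dic (the stop-word list).

Its contents are as follows: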

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- Configure your own extension dictionary here -->
  <entry key="ext_dict">ext.dic;</entry>
  <!-- Configure your own extension stop-word dictionary here -->
  <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
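Because the file declares the JDK properties DTD, it can be read with the standard `java.util.Properties.loadFromXML` call. The sketch below is only an illustration of how the two entries can be interpreted, not IK's own loading code: the class name `IkConfigSketch`, the helper `readFileList`, and the idea that the trailing `;` marks a semicolon-separated list of files are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper class, not part of IK or Lucene.
class IkConfigSketch {

    // Same content as the IKAnalyzer.cfg.xml shown above, inlined so the sketch is self-contained.
    static final String SAMPLE_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <comment>IK Analyzer extension configuration</comment>\n"
            + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
            + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
            + "</properties>\n";

    // Parses the properties XML and splits the value of one key on ';' (assumed separator).
    static List<String> readFileList(String xml, String key) {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> files = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                files.add(name); // skip the empty piece left by a trailing ';'
            }
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(readFileList(SAMPLE_XML, "ext_dict"));
        System.out.println(readFileList(SAMPLE_XML, "ext_stopwords"));
    }
}
```

In a real project the XML would come from the classpath (for example via `getResourceAsStream`) rather than from an inlined string. Running a check like this is a quick way to confirm that the file is well-formed before blaming the analyzer for ignoring a dictionary.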

Example contents of ext.dic:

高富帅
白富美
java工程师

Example contents of stopword.dic:

我
是
用
的
你
它
他
她
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
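To make the effect of the two files concrete, here is a minimal sketch of dictionary-driven segmentation. It uses plain forward maximum matching, which is a deliberately simplified stand-in and not IK's actual algorithm. The class `DictionarySegmenterSketch` and its `segment` method are hypothetical names, and the tiny dictionaries are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; IK's real segmentation logic is more elaborate.
class DictionarySegmenterSketch {

    // Forward maximum matching: at each position take the longest dictionary word,
    // fall back to a single character, and drop tokens listed as stop words.
    static List<String> segment(String text, Set<String> dictionary, Set<String> stopWords) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a known word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            String token = text.substring(pos, end);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("是", "的"));
        Set<String> base = new HashSet<>(Arrays.asList("词汇"));
        Set<String> extended = new HashSet<>(Arrays.asList("词汇", "高富帅"));

        System.out.println(segment("高富帅是词汇", base, stopWords));     // [高, 富, 帅, 词汇]
        System.out.println(segment("高富帅是词汇", extended, stopWords)); // [高富帅, 词汇]
    }
}
```

Without the extra entry, “高富帅” falls apart into single characters. Once it is in the dictionary it survives as one token, and “是” disappears because it is listed as a stop word. That is the behavior ext.dic and stopword.dic are meant to provide in IK.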

Test code

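import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

// The class name is illustrative; the original post shows only the test method.
public class AnalyzerTest {

    // Inspect the tokens produced by an analyzer
    @Test
    public void testTokenStream() throws Exception {
        // Create the analyzer. Swap in one of the commented-out lines to compare results
        // (SmartChineseAnalyzer additionally needs the lucene-analyzers-smartcn artifact).
        // Analyzer analyzer = new StandardAnalyzer();
        // Analyzer analyzer = new CJKAnalyzer();
        // Analyzer analyzer = new SmartChineseAnalyzer();
        Analyzer analyzer = new IKAnalyzer();

        // Obtain the TokenStream.
        // First argument: field name, any value will do here.
        // Second argument: the text to analyze.
        // TokenStream tokenStream = analyzer.tokenStream("test",
        //         "The Spring Framework provides a comprehensive programming and configuration model.");
        TokenStream tokenStream = analyzer.tokenStream("test",
                "高富帅，是2012年之后才有的词汇");

        // Attribute that exposes the text of each token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that records the start and end offset of each token
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);

        // Reset the stream to the beginning before consuming it
        tokenStream.reset();
        // Iterate over the tokens; incrementToken() returns false at the end of the stream
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            System.out.println("start->" + offsetAttribute.startOffset());
            // The token itself
            System.out.println(charTermAttribute);
            // End offset of the token
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        // Signal end-of-stream, then release resources
        tokenStream.end();
        tokenStream.close();
    }
}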
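The start and end values printed by the test are character positions in the original text, with the end offset pointing one past the last character of the token. The sketch below shows how such offsets map back onto the sample sentence. The class `OffsetSketch`, its methods, and the concrete offset values are assumptions chosen for illustration, not captured IKAnalyzer output.

```java
// Hypothetical helper, independent of Lucene, showing how offsets map back to the text.
class OffsetSketch {

    // Returns the slice of text covered by [startOffset, endOffset).
    static String slice(String text, int startOffset, int endOffset) {
        return text.substring(startOffset, endOffset);
    }

    // Wraps the token at the given offsets in square brackets, e.g. for hit highlighting.
    static String highlight(String text, int startOffset, int endOffset) {
        return text.substring(0, startOffset)
                + "[" + text.substring(startOffset, endOffset) + "]"
                + text.substring(endOffset);
    }

    public static void main(String[] args) {
        String text = "高富帅，是2012年之后才有的词汇";
        // Assumed offsets for three tokens; real analyzer output may differ.
        System.out.println(slice(text, 0, 3));     // 高富帅
        System.out.println(slice(text, 5, 9));     // 2012
        System.out.println(highlight(text, 15, 17)); // 高富帅，是2012年之后才有的[词汇]
    }
}
```

Offsets like these are what make it possible to highlight a matched term in the original text even after stop words have been dropped from the token stream.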


Original article: https://www.cnblogs.com/getchen/p/8676459.html
