Compare commits

57 Commits

- f0059e2c78
- fb4defedb7
- 9ed56a0f41
- 70d40bc3af
- ef2dbeb979
- 872aac8298
- 052e8f476e
- 6cc121e98d
- 32a9369680
- dd7822b6be
- 6ef4798752
- 56f23a9027
- 5ab517079b
- 020f83e665
- ab50f161e6
- 47439fa94b
- c938bf1f2b
- 92cb2a28d6
- f173925dc0
- f9bc7a12fa
- df29bdc4df
- 3ec8076730
- 7149c54de7
- fff131a45a
- 0c6560fe61
- bf3df0e58c
- 4180900415
- eaf58c0f1d
- aba7aeff19
- a234d047cd
- 8d67193774
- 9b731bb002
- 8b04070253
- 4dd4a86b4d
- 59b86a341f
- a45434162d
- 95ba364090
- 0c8992fd80
- 356d9d9ae9
- cd43d4b9ee
- 5068a58e0f
- 107b4ced30
- e6817ecba5
- 0916f3f8bd
- a3705f5753
- cc9f0d7bea
- 7a2515a134
- 2ebf31fed5
- fd35153cce
- 48a0234bb6
- a00336d239
- 32d874b1ba
- 7b0d009d32
- e3695f9192
- 33ac83d619
- fa1451042b
- c2e408473a
38  .github/ISSUE_TEMPLATE/bug_report.md  (vendored, new file)

@@ -0,0 +1,38 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Smartphone (please complete the following information):**
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
@@ -1,3 +1,4 @@
  language: java
+
  jdk:
-   - oraclejdk8
+   - openjdk8
1  CONTRIBUTING.md  (new file)

@@ -0,0 +1 @@
@@ -25,7 +25,7 @@
    ```

4. Configure Solr's `managed-schema` and add the `ik` analyzer, for example:
-   ```console
+   ```xml
    <!-- ik analyzer -->
    <fieldType name="text_ik" class="solr.TextField">
        <analyzer type="index">
90  README.md
@@ -1,19 +1,21 @@
- # ik-analyzer-solr7
+ # ik-analyzer-solr
- ik-analyzer for solr7.x
+ ik-analyzer for solr 7.x-8.x

<!-- Badges section here. -->
[](https://search.maven.org/search?q=g:com.github.magese%20AND%20a:ik-analyzer&core=gav)
- [](https://github.com/magese/ik-analyzer-solr7/releases)
+ [](https://github.com/magese/ik-analyzer-solr/releases)
[](./LICENSE)
- [](https://travis-ci.org/magese/ik-analyzer-solr7)
+ [](https://travis-ci.org/magese/ik-analyzer-solr)

- [](https://github.com/magese/ik-analyzer-solr7/fork)
+ [](https://github.com/magese/ik-analyzer-solr/network/members)
- [](https://github.com/magese/ik-analyzer-solr7/stargazers)
+ [](https://github.com/magese/ik-analyzer-solr/stargazers)
<!-- /Badges section end. -->

## Introduction
- #### Supports the latest solr 7;
+ **Supports the latest solr 7 & 8;**
- #### Extends the original IK dictionary:
+ **Extends the original IK dictionary:**

| Tokenizer | Dictionary size | Last updated |
| :------: | :------: | :------: |
| ik | 275,000 | 2012 |
@@ -21,23 +23,27 @@ ik-analyzer for solr7.x
| word | 642,000 | 2014 |
| jieba | 584,000 | 2012 |
| jcesg | 166,000 | 2018 |
- | sougou dictionary | 1,152,000 | 2018 |
+ | sougou dictionary | 1,152,000 | 2020 |
- #### Consolidating the above dictionaries gives about 1,885,000 entries;
- #### Added dynamic dictionary loading: new dictionaries are loaded without restarting the solr service.
- * The original author of IKAnalyzer is Lin Liangyi (林良益) <linliangyi2007@gmail.com>; the project site is <http://code.google.com/p/ik-analyzer>
- * The dynamic loading feature is adapted from a blog post by [@星火燎原智勇](http://www.cnblogs.com/liang1101/articles/6395016.html), whose GitHub account is [@liang68](https://github.com/liang68)
+ **Consolidating the above dictionaries gives about 1,871,000 entries;**
+ **Added dynamic dictionary loading: new dictionaries are loaded without restarting the solr service.**
+ > <small>To turn off the default main dictionary, set `use_main_dict` to `false` in the `IKAnalyzer.cfg.xml` configuration file.</small>
+ > * The original author of IKAnalyzer is Lin Liangyi (林良益) <linliangyi2007@gmail.com>; the project site is <http://code.google.com/p/ik-analyzer>
+ > * The dynamic loading feature is adapted from a blog post by [@星火燎原智勇](http://www.cnblogs.com/liang1101/articles/6395016.html), whose GitHub account is [@liang68](https://github.com/liang68)

## Usage
- * jar download: [](https://search.maven.org/remotecontent?filepath=com/github/magese/ik-analyzer/7.7.0/ik-analyzer-7.7.0.jar)
+ * jar download: [](https://search.maven.org/remotecontent?filepath=com/github/magese/ik-analyzer/8.5.0/ik-analyzer-8.5.0.jar)
* Older versions: [](https://search.maven.org/search?q=g:com.github.magese%20AND%20a:ik-analyzer&core=gav)

- ```console
+ ```xml
  <!-- Maven repository -->
  <dependency>
      <groupId>com.github.magese</groupId>
      <artifactId>ik-analyzer</artifactId>
-     <version>7.7.0</version>
+     <version>8.5.0</version>
  </dependency>
  ```
@@ -57,7 +63,7 @@ ik-analyzer for solr7.x
    ```

3. Configure Solr's `managed-schema` and add the `ik` analyzer, for example:
-   ```console
+   ```xml
    <!-- ik analyzer -->
    <fieldType name="text_ik" class="solr.TextField">
        <analyzer type="index">
@@ -75,8 +81,16 @@ ik-analyzer for solr7.x



- 5. `ik.conf` file description:
+ 5. `IKAnalyzer.cfg.xml` configuration file description:
-    ```console
+    | Name | Type | Description | Default |
+    | ------ | ------ | ------ | ------ |
+    | use_main_dict | boolean | whether to use the default main dictionary | true |
+    | ext_dict | String | extension dictionary file names, multiple separated by semicolons | ext.dic; |
+    | ext_stopwords | String | stop-word dictionary file names, multiple separated by semicolons | stopword.dic; |
+
+ 6. `ik.conf` file description:
+    ```properties
     files=dynamicdic.txt
     lastupdate=0
     ```
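The options in the table above come together in the analyzer's configuration file. The following is a minimal sketch of an `IKAnalyzer.cfg.xml`, assuming the standard Java XML-properties layout; the `ext.dic` and `stopword.dic` file names are the illustrative defaults from the table, not required names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- load the bundled main dictionary (set to false to disable it) -->
    <entry key="use_main_dict">true</entry>
    <!-- extension dictionaries, semicolon-separated -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- stop-word dictionaries, semicolon-separated -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
```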
@@ -84,32 +98,48 @@ ik-analyzer for solr7.x
1. `files` is the list of dynamic dictionaries; multiple dictionary files may be configured, separated by commas. The default dynamic dictionary is `dynamicdic.txt`;
2. `lastupdate` defaults to `0`; increment it by 1 after every change to a dynamic dictionary, otherwise the new words will not be loaded into memory. <s>`lastupdate` is an `int` and does not support timestamps; if you want timestamps, change the `int` in the source to `long`;</s> `2018-08-23`: `lastUpdate` has been changed to `long` in the source, so timestamps can now be used.

- 6. `dynamicdic.txt` is the dynamic dictionary
+ 7. `dynamicdic.txt` is the dynamic dictionary

   Words configured in this file are loaded into memory without restarting the service.
   Lines starting with `#` are treated as comments and are not loaded into memory.
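Because `lastUpdate` is now a `long`, a millisecond epoch timestamp is a valid value for `lastupdate`. The following standalone sketch (the `IkConfSketch` class and its `parseLastUpdate` helper are hypothetical, not part of ik-analyzer) shows `ik.conf` being read as a Java properties file with a timestamp that would overflow an `int`:

```java
import java.io.StringReader;
import java.util.Properties;

public class IkConfSketch {
    // Read the lastupdate value from ik.conf-style properties text.
    // Parsed as long, so epoch-millisecond timestamps fit.
    static long parseLastUpdate(String conf) throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(conf));
        return Long.parseLong(p.getProperty("lastupdate", "0").trim());
    }

    public static void main(String[] args) throws Exception {
        String conf = "files=dynamicdic.txt\nlastupdate=1545782400000\n";
        long ts = parseLastUpdate(conf);
        // a millisecond timestamp exceeds Integer.MAX_VALUE but fits in a long
        System.out.println(ts > Integer.MAX_VALUE);
    }
}
```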
## Changelog
- - `2019-02-15:` upgraded lucene to `7.7.0`
- - `2018-12-26:`
+ - **2021-12-23:** upgraded lucene to `8.5.0`
+ - **2021-03-22:** upgraded lucene to `8.4.0`
+ - **2020-12-30:**
+   - upgraded lucene to `8.3.1`
+   - updated the dictionaries
+ - **2019-11-12:**
+   - upgraded lucene to `8.3.0`
+   - added the `use_main_dict` option to `IKAnalyzer.cfg.xml`, to configure whether the default main dictionary is loaded
+ - **2019-09-27:** upgraded lucene to `8.2.0`
+ - **2019-07-11:** upgraded lucene to `8.1.1`
+ - **2019-05-27:**
+   - upgraded lucene to `8.1.0`
+   - de-duplicated parts of the original dictionary
+   - added the latest 2019 sougou trending-word dictionary, about 20k entries
+ - **2019-05-15:** upgraded lucene to `8.0.0`, with Solr 8 support
+ - **2019-03-01:** upgraded lucene to `7.7.1`
+ - **2019-02-15:** upgraded lucene to `7.7.0`
+ - **2018-12-26:**
  - upgraded lucene to `7.6.0`
  - solr-cloud compatible: the dynamic dictionary config file and the dynamic dictionaries can be managed by `zookeeper`
  - dynamic dictionaries now support comments; lines starting with `#` are treated as comments
- **2018-12-04:** reorganized and updated the dictionary list `magese.dic`
- **2018-10-10:** upgraded lucene to `7.5.0`
- **2018-09-03:** improved comments and log output; removed some Chinese output to avoid mojibake across character sets; the hashcode of the invoked inform method is now printed
- **2018-08-23:**
  - improved the code comments for dynamic dictionary updates;
  - changed the lastUpdate property in the ik.conf configuration file to `long`, so timestamps are now supported
- **2018-08-13:** updated the maven repository address
- **2018-08-01:** removed the default extension words and stop words
- **2018-07-23:** upgraded lucene to `7.4.0`


## 感谢 Thanks

- [](https://www.jetbrains.com/?from=ik-analyzer-solr7)
+ [](https://www.jetbrains.com/?from=ik-analyzer-solr)
[](https://www.java.com)
22  pom.xml
@@ -4,29 +4,22 @@
    <groupId>com.github.magese</groupId>
    <artifactId>ik-analyzer</artifactId>
-   <version>7.7.0</version>
+   <version>8.5.0</version>
    <packaging>jar</packaging>

-   <name>ik-analyzer-solr7</name>
+   <name>ik-analyzer-solr</name>
    <url>http://code.google.com/p/ik-analyzer/</url>
-   <description>IK-Analyzer for solr7.7</description>
+   <description>IK-Analyzer for solr 7-8</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-       <lucene.version>7.7.0</lucene.version>
+       <lucene.version>8.5.0</lucene.version>
        <javac.src.version>1.8</javac.src.version>
        <javac.target.version>1.8</javac.target.version>
        <maven.compiler.plugin.version>3.3</maven.compiler.plugin.version>
    </properties>

    <dependencies>
-       <dependency>
-           <groupId>junit</groupId>
-           <artifactId>junit</artifactId>
-           <version>4.11</version>
-           <scope>test</scope>
-       </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>

@@ -55,9 +48,9 @@
    </licenses>
    <scm>
        <tag>master</tag>
-       <url>https://github.com/magese/ik-analyzer-solr7</url>
+       <url>https://github.com/magese/ik-analyzer-solr</url>
-       <connection>scm:git:git@github.com:magese/ik-analyzer-solr7.git</connection>
+       <connection>scm:git:git@github.com:magese/ik-analyzer-solr.git</connection>
-       <developerConnection>scm:git:git@github.com:magese/ik-analyzer-solr7.git</developerConnection>
+       <developerConnection>scm:git:git@github.com:magese/ik-analyzer-solr.git</developerConnection>
    </scm>
    <developers>
        <developer>

@@ -152,4 +145,3 @@
        </profile>
    </profiles>
</project>
-
@@ -1,6 +1,6 @@
/*
- * IK 中文分词 版本 7.7
+ * IK 中文分词 版本 8.5.0
- * IK Analyzer release 7.7
+ * IK Analyzer release 8.5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with

@@ -21,8 +21,8 @@
 * 版权声明 2012,乌龙茶工作室
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
- * 7.7版本 由 Magese (magese@live.cn) 更新
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
 *
 */
package org.wltea.analyzer.cfg;

@@ -50,6 +50,19 @@ public interface Configuration {
     */
    void setUseSmart(boolean useSmart);

+   /**
+    * 获取是否使用主词典
+    *
+    * @return = true 默认加载主词典, = false 不加载主词典
+    */
+   boolean useMainDict();
+
+   /**
+    * 设置是否使用主词典
+    *
+    * @param useMainDic = true 默认加载主词典, = false 不加载主词典
+    */
+   void setUseMainDict(boolean useMainDic);

    /**
     * 获取主词典路径

@@ -63,7 +76,7 @@ public interface Configuration {
     *
     * @return String 量词词典路径
     */
-   String getQuantifierDicionary();
+   String getQuantifierDictionary();

    /**
     * 获取扩展字典配置路径
@@ -1,6 +1,6 @@
/*
- * IK 中文分词 版本 7.7
+ * IK 中文分词 版本 8.5.0
- * IK Analyzer release 7.7
+ * IK Analyzer release 8.5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with

@@ -21,8 +21,8 @@
 * 版权声明 2012,乌龙茶工作室
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
- * 7.7版本 由 Magese (magese@live.cn) 更新
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
 *
 */
package org.wltea.analyzer.cfg;

@@ -41,24 +41,28 @@ public class DefaultConfig implements Configuration {
    /*
     * 分词器默认字典路径
     */
-   private static final String PATH_DIC_MAIN = "dict/magese.dic";
+   private static final String PATH_DIC_MAIN = "dict/main_dic_2020.dic";
    private static final String PATH_DIC_QUANTIFIER = "dict/quantifier.dic";

    /*
     * 分词器配置文件路径
     */
    private static final String FILE_NAME = "IKAnalyzer.cfg.xml";
+   // 配置属性——是否使用主词典
+   private static final String USE_MAIN = "use_main_dict";
    // 配置属性——扩展字典
    private static final String EXT_DICT = "ext_dict";
    // 配置属性——扩展停止词典
    private static final String EXT_STOP = "ext_stopwords";

-   private Properties props;
+   private final Properties props;
-   /*
-    * 是否使用smart方式分词
-    */
+   // 是否使用smart方式分词
    private boolean useSmart;
+   // 是否加载主词典
+   private boolean useMainDict = true;

    /**
     * 返回单例
     *

@@ -100,10 +104,33 @@ public class DefaultConfig implements Configuration {
     *
     * @param useSmart =true ,分词器使用智能切分策略, =false则使用细粒度切分
     */
+   @Override
    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

+   /**
+    * 获取是否使用主词典
+    *
+    * @return = true 默认加载主词典, = false 不加载主词典
+    */
+   public boolean useMainDict() {
+       String useMainDictCfg = props.getProperty(USE_MAIN);
+       if (useMainDictCfg != null && useMainDictCfg.trim().length() > 0)
+           setUseMainDict(Boolean.parseBoolean(useMainDictCfg));
+       return useMainDict;
+   }
+
+   /**
+    * 设置是否使用主词典
+    *
+    * @param useMainDict = true 默认加载主词典, = false 不加载主词典
+    */
+   @Override
+   public void setUseMainDict(boolean useMainDict) {
+       this.useMainDict = useMainDict;
+   }

    /**
     * 获取主词典路径
     *

@@ -118,7 +145,7 @@ public class DefaultConfig implements Configuration {
     *
     * @return String 量词词典路径
     */
-   public String getQuantifierDicionary() {
+   public String getQuantifierDictionary() {
        return PATH_DIC_QUANTIFIER;
    }

@@ -142,7 +169,6 @@ public class DefaultConfig implements Configuration {
        return extDictFiles;
    }
-
    /**
     * 获取扩展停止词典配置路径
     *
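The `use_main_dict` lookup above defaults to loading the main dictionary and only overrides that when the property is present and non-blank. A standalone sketch of the same resolution logic (the `UseMainDictSketch` class is hypothetical, not part of ik-analyzer; the property-handling mirrors `DefaultConfig.useMainDict()`):

```java
import java.util.Properties;

public class UseMainDictSketch {
    // Resolve the use_main_dict flag the way DefaultConfig does:
    // default true; overridden only by a non-blank property value.
    static boolean resolve(Properties props) {
        boolean useMainDict = true; // default: load the main dictionary
        String cfg = props.getProperty("use_main_dict");
        if (cfg != null && cfg.trim().length() > 0) {
            // Boolean.parseBoolean is true only for the string "true" (case-insensitive)
            useMainDict = Boolean.parseBoolean(cfg);
        }
        return useMainDict;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        System.out.println(resolve(p));           // property absent: default kept
        p.setProperty("use_main_dict", "false");
        System.out.println(resolve(p));           // explicit false wins
    }
}
```

Note that a blank value (e.g. whitespace only) is ignored, so the default of `true` survives it.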
@@ -1,6 +1,6 @@
/*
- * IK 中文分词 版本 7.7
+ * IK 中文分词 版本 8.5.0
- * IK Analyzer release 7.7
+ * IK Analyzer release 8.5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with

@@ -21,62 +21,58 @@
 * 版权声明 2012,乌龙茶工作室
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
- * 7.7版本 由 Magese (magese@live.cn) 更新
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
 *
 */
package org.wltea.analyzer.core;

- import java.io.IOException;
- import java.io.Reader;
- import java.util.HashMap;
- import java.util.HashSet;
- import java.util.LinkedList;
- import java.util.Map;
- import java.util.Set;
-
import org.wltea.analyzer.cfg.Configuration;
import org.wltea.analyzer.dic.Dictionary;
+
+ import java.io.IOException;
+ import java.io.Reader;
+ import java.util.*;

/**
 * 分词器上下文状态
 */
class AnalyzeContext {

    // 默认缓冲区大小
    private static final int BUFF_SIZE = 4096;
    // 缓冲区耗尽的临界值
    private static final int BUFF_EXHAUST_CRITICAL = 100;

    // 字符窜读取缓冲
    private char[] segmentBuff;
    // 字符类型数组
    private int[] charTypes;

    // 记录Reader内已分析的字串总长度
    // 在分多段分析词元时,该变量累计当前的segmentBuff相对于reader起始位置的位移
    private int buffOffset;
    // 当前缓冲区位置指针
    private int cursor;
    // 最近一次读入的,可处理的字串长度
    private int available;

    // 子分词器锁
    // 该集合非空,说明有子分词器在占用segmentBuff
-   private Set<String> buffLocker;
+   private final Set<String> buffLocker;

    // 原始分词结果集合,未经歧义处理
    private QuickSortSet orgLexemes;
    // LexemePath位置索引表
-   private Map<Integer, LexemePath> pathMap;
+   private final Map<Integer, LexemePath> pathMap;
    // 最终分词结果集
-   private LinkedList<Lexeme> results;
+   private final LinkedList<Lexeme> results;

    // 分词器配置项
-   private Configuration cfg;
+   private final Configuration cfg;

    AnalyzeContext(Configuration cfg) {
        this.cfg = cfg;

@@ -117,21 +113,21 @@ class AnalyzeContext {
    int fillBuffer(Reader reader) throws IOException {
        int readCount = 0;
        if (this.buffOffset == 0) {
            // 首次读取reader
            readCount = reader.read(segmentBuff);
        } else {
            int offset = this.available - this.cursor;
            if (offset > 0) {
                // 最近一次读取的>最近一次处理的,将未处理的字串拷贝到segmentBuff头部
                System.arraycopy(this.segmentBuff, this.cursor, this.segmentBuff, 0, offset);
                readCount = offset;
            }
            // 继续读取reader ,以onceReadIn - onceAnalyzed为起始位置,继续填充segmentBuff剩余的部分
            readCount += reader.read(this.segmentBuff, offset, BUFF_SIZE - offset);
        }
        // 记录最后一次从Reader中读入的可用字符长度
        this.available = readCount;
        // 重置当前指针
        this.cursor = 0;
        return readCount;
    }
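The refill strategy in `fillBuffer` can be exercised in isolation: the unprocessed tail `[cursor, available)` is moved to the front of the buffer with `System.arraycopy`, then the remainder is filled from the `Reader`. The following standalone sketch (the `RefillSketch` class is hypothetical; it reproduces only the copy-then-read step, not the whole method) shows the effect on a small buffer:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class RefillSketch {
    // Copy the unprocessed tail [cursor, available) to the buffer head,
    // then fill the rest of the buffer from the reader.
    static int refill(Reader reader, char[] buff, int cursor, int available) throws IOException {
        int readCount = 0;
        int offset = available - cursor;
        if (offset > 0) {
            // keep the unprocessed chars at the head of the buffer
            System.arraycopy(buff, cursor, buff, 0, offset);
            readCount = offset;
        }
        readCount += reader.read(buff, offset, buff.length - offset);
        return readCount; // the new 'available' count
    }

    public static void main(String[] args) throws IOException {
        char[] buff = "abcdef".toCharArray();
        // pretend chars 0..3 were already processed; "ef" is the unprocessed tail
        int available = refill(new StringReader("ghij"), buff, 4, 6);
        System.out.println(new String(buff, 0, available)); // efghij
    }
}
```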
@@ -254,36 +250,36 @@ class AnalyzeContext {
     */
    void outputToResult() {
        int index = 0;
-       for (; index <= this.cursor; ) {
+       while (index <= this.cursor) {
            // 跳过非CJK字符
            if (CharacterUtil.CHAR_USELESS == this.charTypes[index]) {
                index++;
                continue;
            }
            // 从pathMap找出对应index位置的LexemePath
            LexemePath path = this.pathMap.get(index);
            if (path != null) {
                // 输出LexemePath中的lexeme到results集合
                Lexeme l = path.pollFirst();
                while (l != null) {
                    this.results.add(l);
                    // 将index移至lexeme后
                    index = l.getBegin() + l.getLength();
                    l = path.pollFirst();
                    if (l != null) {
                        // 输出path内部,词元间遗漏的单字
                        for (; index < l.getBegin(); index++) {
                            this.outputSingleCJK(index);
                        }
                    }
                }
            } else {// pathMap中找不到index对应的LexemePath
                // 单字输出
                this.outputSingleCJK(index);
                index++;
            }
        }
        // 清空当前的Map
        this.pathMap.clear();
    }

@@ -308,16 +304,16 @@ class AnalyzeContext {
     * 同时处理合并
     */
    Lexeme getNextLexeme() {
        // 从结果集取出,并移除第一个Lexme
        Lexeme result = this.results.pollFirst();
        while (result != null) {
            // 数量词合并
            this.compound(result);
            if (Dictionary.getSingleton().isStopWord(this.segmentBuff, result.getBegin(), result.getLength())) {
                // 是停止词继续取列表的下一个
                result = this.results.pollFirst();
            } else {
                // 不是停止词, 生成lexeme的词元文本,输出
                result.setLexemeText(String.valueOf(segmentBuff, result.getBegin(), result.getLength()));
                break;
            }
        }

@@ -347,35 +343,37 @@ class AnalyzeContext {
        if (!this.cfg.useSmart()) {
            return;
        }
        // 数量词合并处理
        if (!this.results.isEmpty()) {

            if (Lexeme.TYPE_ARABIC == result.getLexemeType()) {
                Lexeme nextLexeme = this.results.peekFirst();
                boolean appendOk = false;
-               if (Lexeme.TYPE_CNUM == nextLexeme.getLexemeType()) {
-                   // 合并英文数词+中文数词
-                   appendOk = result.append(nextLexeme, Lexeme.TYPE_CNUM);
-               } else if (Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()) {
-                   // 合并英文数词+中文量词
-                   appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
-               }
+               if (nextLexeme != null) {
+                   if (Lexeme.TYPE_CNUM == nextLexeme.getLexemeType()) {
+                       // 合并英文数词+中文数词
+                       appendOk = result.append(nextLexeme, Lexeme.TYPE_CNUM);
+                   } else if (Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()) {
+                       // 合并英文数词+中文量词
+                       appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
+                   }
+               }
                if (appendOk) {
                    // 弹出
                    this.results.pollFirst();
                }
            }

            // 可能存在第二轮合并
            if (Lexeme.TYPE_CNUM == result.getLexemeType() && !this.results.isEmpty()) {
                Lexeme nextLexeme = this.results.peekFirst();
                boolean appendOk = false;
                if (Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()) {
                    // 合并中文数词+中文量词
                    appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
                }
                if (appendOk) {
                    // 弹出
                    this.results.pollFirst();
                }
            }
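The null check added around `nextLexeme` above is a defensive fix: `LinkedList.peekFirst()` returns `null` rather than throwing when the list is empty, so calling `getLexemeType()` on its result without a guard can raise a `NullPointerException`. A standalone illustration of that contract:

```java
import java.util.LinkedList;

public class PeekFirstSketch {
    public static void main(String[] args) {
        LinkedList<String> results = new LinkedList<>();
        // peekFirst on an empty list returns null instead of throwing,
        // so callers must guard before dereferencing the result
        String next = results.peekFirst();
        System.out.println(next == null);
    }
}
```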
@@ -1,6 +1,6 @@
/*
- * IK 中文分词 版本 7.7
+ * IK 中文分词 版本 8.5.0
- * IK Analyzer release 7.7
+ * IK Analyzer release 8.5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with

@@ -21,108 +21,107 @@
 * 版权声明 2012,乌龙茶工作室
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
- * 7.7版本 由 Magese (magese@live.cn) 更新
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
 *
 */
package org.wltea.analyzer.core;

- import java.util.LinkedList;
- import java.util.List;
-
import org.wltea.analyzer.dic.Dictionary;
import org.wltea.analyzer.dic.Hit;
+
+ import java.util.LinkedList;
+ import java.util.List;

/**
 * 中文-日韩文子分词器
 */
class CJKSegmenter implements ISegmenter {

    // 子分词器标签
    private static final String SEGMENTER_NAME = "CJK_SEGMENTER";
    // 待处理的分词hit队列
-   private List<Hit> tmpHits;
+   private final List<Hit> tmpHits;

    CJKSegmenter() {
        this.tmpHits = new LinkedList<>();
    }

    /* (non-Javadoc)
     * @see org.wltea.analyzer.core.ISegmenter#analyze(org.wltea.analyzer.core.AnalyzeContext)
     */
    public void analyze(AnalyzeContext context) {
        if (CharacterUtil.CHAR_USELESS != context.getCurrentCharType()) {

            // 优先处理tmpHits中的hit
            if (!this.tmpHits.isEmpty()) {
                // 处理词段队列
                Hit[] tmpArray = this.tmpHits.toArray(new Hit[0]);
                for (Hit hit : tmpArray) {
                    hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                    if (hit.isMatch()) {
                        // 输出当前的词
                        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(), context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_CNWORD);
                        context.addLexeme(newLexeme);

                        if (!hit.isPrefix()) {// 不是词前缀,hit不需要继续匹配,移除
                            this.tmpHits.remove(hit);
                        }

                    } else if (hit.isUnmatch()) {
                        // hit不是词,移除
                        this.tmpHits.remove(hit);
                    }
                }
            }

            // *********************************
            // 再对当前指针位置的字符进行单字匹配
            Hit singleCharHit = Dictionary.getSingleton().matchInMainDict(context.getSegmentBuff(), context.getCursor(), 1);
-           if (singleCharHit.isMatch()) {// 首字成词
-               // 输出当前的词
-               Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
-               context.addLexeme(newLexeme);
-
-               // 同时也是词前缀
-               if (singleCharHit.isPrefix()) {
-                   // 前缀匹配则放入hit列表
-                   this.tmpHits.add(singleCharHit);
+           // 首字为词前缀
+           if (singleCharHit.isMatch()) {
+               // 输出当前的词
+               Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_CNWORD);
|
||||||
}
|
context.addLexeme(newLexeme);
|
||||||
}else if(singleCharHit.isPrefix()){//首字为词前缀
|
}
|
||||||
//前缀匹配则放入hit列表
|
|
||||||
this.tmpHits.add(singleCharHit);
|
// 前缀匹配则放入hit列表
|
||||||
}
|
if (singleCharHit.isPrefix()) {
|
||||||
|
// 前缀匹配则放入hit列表
|
||||||
|
this.tmpHits.add(singleCharHit);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
}else{
|
} else {
|
||||||
//遇到CHAR_USELESS字符
|
// 遇到CHAR_USELESS字符
|
||||||
//清空队列
|
// 清空队列
|
||||||
this.tmpHits.clear();
|
this.tmpHits.clear();
|
||||||
}
|
}
|
||||||
|
|
||||||
//判断缓冲区是否已经读完
|
// 判断缓冲区是否已经读完
|
||||||
if(context.isBufferConsumed()){
|
if (context.isBufferConsumed()) {
|
||||||
//清空队列
|
// 清空队列
|
||||||
this.tmpHits.clear();
|
this.tmpHits.clear();
|
||||||
}
|
}
|
||||||
|
|
||||||
//判断是否锁定缓冲区
|
// 判断是否锁定缓冲区
|
||||||
if(this.tmpHits.size() == 0){
|
if (this.tmpHits.size() == 0) {
|
||||||
context.unlockBuffer(SEGMENTER_NAME);
|
context.unlockBuffer(SEGMENTER_NAME);
|
||||||
|
|
||||||
}else{
|
} else {
|
||||||
context.lockBuffer(SEGMENTER_NAME);
|
context.lockBuffer(SEGMENTER_NAME);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/* (non-Javadoc)
|
/* (non-Javadoc)
|
||||||
* @see org.wltea.analyzer.core.ISegmenter#reset()
|
* @see org.wltea.analyzer.core.ISegmenter#reset()
|
||||||
*/
|
*/
|
||||||
public void reset() {
|
public void reset() {
|
||||||
//清空队列
|
// 清空队列
|
||||||
this.tmpHits.clear();
|
this.tmpHits.clear();
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
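The tmpHits queue above performs incremental longest-match scanning: each cursor step extends every pending dictionary hit by one character, emits completed words, and drops dead prefixes, while also probing the current character as a fresh single-character hit. A self-contained sketch of that idea, with a toy trie standing in for the real `Dictionary` (all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixMatchSketch {
    // Toy trie node; IK's Dictionary/DictSegment follows the same shape.
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    static Node build(String... words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.isWord = true;
        }
        return root;
    }

    // One left-to-right pass; (starts, nodes) plays the role of tmpHits.
    static List<String> segment(Node root, String text) {
        List<String> out = new ArrayList<>();
        List<Integer> starts = new ArrayList<>(); // start index of each pending hit
        List<Node> nodes = new ArrayList<>();     // trie node reached by each pending hit
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Extend pending hits; drop the ones that fall off the trie.
            for (int k = nodes.size() - 1; k >= 0; k--) {
                Node nxt = nodes.get(k).next.get(c);
                if (nxt == null) {
                    nodes.remove(k);
                    starts.remove(k);
                    continue;
                }
                nodes.set(k, nxt);
                if (nxt.isWord) out.add(text.substring(starts.get(k), i + 1));
                if (nxt.next.isEmpty()) { // no longer a prefix of anything
                    nodes.remove(k);
                    starts.remove(k);
                }
            }
            // Probe the current character as a fresh single-character hit.
            Node single = root.next.get(c);
            if (single != null) {
                if (single.isWord) out.add(text.substring(i, i + 1));
                if (!single.next.isEmpty()) {
                    nodes.add(single);
                    starts.add(i);
                }
            }
        }
        return out;
    }
}
```

Matching 中华人民 against the words 中华, 华人 and 中华人民 emits all three overlapping words — the fine-grained behavior that the arbitrator later has to reconcile.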
CN_QuantifierSegmenter.java — header version strings bumped from 7.7 to 8.5.0; ChnNumberChars is renamed to the constant CHN_NUMBER_CHARS and made final, countHits becomes final, spacing is normalized, and the single-character classifier match is split into separate isMatch()/isPrefix() checks. Resulting 8.5.0 source (elided context marked "...", comments translated from the Chinese):

package org.wltea.analyzer.core;

...
import java.util.Set;

/**
 * Sub-segmenter for Chinese numeral-classifier compounds
 */
class CN_QuantifierSegmenter implements ISegmenter {

    // Sub-segmenter tag
    private static final String SEGMENTER_NAME = "QUAN_SEGMENTER";

    private static final Set<Character> CHN_NUMBER_CHARS = new HashSet<>();

    static {
        // Chinese numerals
        String chn_Num = "一二两三四五六七八九十零壹贰叁肆伍陆柒捌玖拾百千万亿拾佰仟萬億兆卅廿";
        char[] ca = chn_Num.toCharArray();
        for (char nChar : ca) {
            CHN_NUMBER_CHARS.add(nChar);
        }
    }

    /*
     * Start position of the lexeme; doubles as the segmenter's state flag:
     * when start > -1 the segmenter is currently consuming characters.
     */
    private int nStart;
    /*
     * End position of the lexeme: records the end of the last
     * well-formed numeral seen so far.
     */
    private int nEnd;

    // Queue of classifier hits awaiting further matching
    private final List<Hit> countHits;

    CN_QuantifierSegmenter() {
        nStart = -1;
        nEnd = -1;
        this.countHits = new LinkedList<>();
    }

    /**
     * Segmentation
     */
    public void analyze(AnalyzeContext context) {
        // Process Chinese numerals
        this.processCNumber(context);
        // Process Chinese classifiers
        this.processCount(context);

        // Lock the buffer while work is pending, otherwise release it
        if (this.nStart == -1 && this.nEnd == -1 && countHits.isEmpty()) {
            context.unlockBuffer(SEGMENTER_NAME);
        } else {
            context.lockBuffer(SEGMENTER_NAME);
        }
    }

    /**
     * Reset the segmenter state
     */
    public void reset() {
        nStart = -1;
        nEnd = -1;
        countHits.clear();
    }

    /**
     * Process numerals
     */
    private void processCNumber(AnalyzeContext context) {
        if (nStart == -1 && nEnd == -1) {// initial state
            if (CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()
                    && CHN_NUMBER_CHARS.contains(context.getCurrentChar())) {
                // Record the numeral's start and end positions
                nStart = context.getCursor();
                nEnd = context.getCursor();
            }
        } else {// already consuming a numeral
            if (CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()
                    && CHN_NUMBER_CHARS.contains(context.getCurrentChar())) {
                // Extend the numeral's end position
                nEnd = context.getCursor();
            } else {
                // Emit the numeral
                this.outputNumLexeme(context);
                // Reset the head and tail pointers
                nStart = -1;
                nEnd = -1;
            }
        }

        // Buffer fully consumed but a numeral is still pending
        if (context.isBufferConsumed()) {
            if (nStart != -1 && nEnd != -1) {
                // Emit the numeral
                outputNumLexeme(context);
                // Reset the head and tail pointers
                nStart = -1;
                nEnd = -1;
            }
        }
    }

    /**
     * Process Chinese classifiers
     *
     * @param context content to process
     */
    private void processCount(AnalyzeContext context) {
        // Decide whether a classifier scan is needed at all
        if (!this.needCountScan(context)) {
            return;
        }

        if (CharacterUtil.CHAR_CHINESE == context.getCurrentCharType()) {

            // First process the hits already queued in countHits
            if (!this.countHits.isEmpty()) {
                // Process the pending word segments
                Hit[] tmpArray = this.countHits.toArray(new Hit[0]);
                for (Hit hit : tmpArray) {
                    hit = Dictionary.getSingleton().matchWithHit(context.getSegmentBuff(), context.getCursor(), hit);
                    if (hit.isMatch()) {
                        // Emit the matched classifier
                        Lexeme newLexeme = new Lexeme(context.getBufferOffset(), hit.getBegin(), context.getCursor() - hit.getBegin() + 1, Lexeme.TYPE_COUNT);
                        context.addLexeme(newLexeme);

                        if (!hit.isPrefix()) {// not a word prefix: no further matching needed, remove it
                            this.countHits.remove(hit);
                        }

                    } else if (hit.isUnmatch()) {
                        // the hit is not a word, remove it
                        this.countHits.remove(hit);
                    }
                }
            }

            // *********************************
            // Then match the single character at the current cursor position
            Hit singleCharHit = Dictionary.getSingleton().matchInQuantifierDict(context.getSegmentBuff(), context.getCursor(), 1);
            if (singleCharHit.isMatch()) {
                // The character is a classifier on its own: emit it
                Lexeme newLexeme = new Lexeme(context.getBufferOffset(), context.getCursor(), 1, Lexeme.TYPE_COUNT);
                context.addLexeme(newLexeme);
            }
            // The character is a classifier prefix: queue it for further matching
            if (singleCharHit.isPrefix()) {
                this.countHits.add(singleCharHit);
            }

        } else {
            // Not a Chinese character: discard unfinished classifiers
            this.countHits.clear();
        }

        // Buffer fully consumed: discard unfinished classifiers
        if (context.isBufferConsumed()) {
            this.countHits.clear();
        }
    }

    /**
     * Decide whether a classifier scan is needed
     */
    private boolean needCountScan(AnalyzeContext context) {
        if ((nStart != -1 && nEnd != -1) || !countHits.isEmpty()) {
            // Currently processing a numeral or a classifier
            return true;
        } else {
            // Otherwise scan only if a numeral ends right at the cursor
            if (!context.getOrgLexemes().isEmpty()) {
                Lexeme l = context.getOrgLexemes().peekLast();
                if (Lexeme.TYPE_CNUM == l.getLexemeType() || Lexeme.TYPE_ARABIC == l.getLexemeType()) {
                    return l.getBegin() + l.getLength() == context.getCursor();
                }
            }
        }
        return false;
    }

    /**
     * Append a numeral lexeme to the result set
     *
     * @param context context to emit into
     */
    private void outputNumLexeme(AnalyzeContext context) {
        if (nStart > -1 && nEnd > -1) {
            // Emit the numeral
            Lexeme newLexeme = new Lexeme(context.getBufferOffset(), nStart, nEnd - nStart + 1, Lexeme.TYPE_CNUM);
            context.addLexeme(newLexeme);
        }
    }

}
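processCNumber above is a small run-length state machine: nStart/nEnd delimit the current maximal run of numeral characters, which is flushed when a non-numeral arrives or the buffer ends. The same logic over a plain string, with a reduced numeral set for brevity (an illustrative sketch, not the library API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CNumberRuns {
    // Reduced numeral set; the real segmenter also includes the formal/banker's forms.
    static final Set<Character> NUM = new HashSet<>();
    static {
        for (char c : "一二两三四五六七八九十零百千万亿".toCharArray()) NUM.add(c);
    }

    // Return every maximal run of Chinese numeral characters in text.
    static List<String> runs(String text) {
        List<String> out = new ArrayList<>();
        int start = -1;                                  // like nStart: -1 means "not in a run"
        for (int i = 0; i < text.length(); i++) {
            if (NUM.contains(text.charAt(i))) {
                if (start == -1) start = i;              // open a run
            } else if (start != -1) {
                out.add(text.substring(start, i));       // flush on the first non-numeral
                start = -1;
            }
        }
        if (start != -1) out.add(text.substring(start)); // flush at end of buffer
        return out;
    }
}
```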
CharacterUtil.java — header version strings bumped from 7.7 to 8.5.0; brace spacing normalized and Javadoc reformatted. Resulting 8.5.0 source (comments translated from the Chinese):

package org.wltea.analyzer.core;

/**
 * Character-set identification utility
 */
class CharacterUtil {

    static final int CHAR_USELESS = 0;

    static final int CHAR_ARABIC = 0X00000001;

    static final int CHAR_ENGLISH = 0X00000002;

    static final int CHAR_CHINESE = 0X00000004;

    static final int CHAR_OTHER_CJK = 0X00000008;

    /**
     * Identify the character type
     *
     * @param input character to classify
     * @return int one of the type constants defined by CharacterUtil
     */
    static int identifyCharType(char input) {
        if (input >= '0' && input <= '9') {
            return CHAR_ARABIC;

        } else if ((input >= 'a' && input <= 'z')
                || (input >= 'A' && input <= 'Z')) {
            return CHAR_ENGLISH;

        } else {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(input);

            if (ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                    || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A) {
                // currently known Chinese character blocks
                return CHAR_CHINESE;

            } else if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS // full-width digits and CJK forms
                    // Korean blocks
                    || ub == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || ub == Character.UnicodeBlock.HANGUL_JAMO
                    || ub == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                    // Japanese blocks
                    || ub == Character.UnicodeBlock.HIRAGANA // hiragana
                    || ub == Character.UnicodeBlock.KATAKANA // katakana
                    || ub == Character.UnicodeBlock.KATAKANA_PHONETIC_EXTENSIONS) {
                return CHAR_OTHER_CJK;
            }
        }
        // everything else is left unprocessed
        return CHAR_USELESS;
    }

    /**
     * Normalize a character (full-width to half-width, upper case to lower case)
     *
     * @param input character to convert
     * @return char
     */
    static char regularize(char input) {
        if (input == 12288) {
            input = (char) 32;

        } else if (input > 65280 && input < 65375) {
            input = (char) (input - 65248);

        } else if (input >= 'A' && input <= 'Z') {
            input += 32;
        }
        return input;
    }
}
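regularize works because the full-width ASCII forms U+FF01–U+FF5E sit exactly 0xFEE0 (65248) code points above their ASCII counterparts, and the ideographic space U+3000 (12288) maps to the ASCII space. A standalone re-implementation of the same three rules for experimentation:

```java
class RegularizeDemo {
    // Same normalization rules as CharacterUtil.regularize above.
    static char regularize(char input) {
        if (input == 12288) {
            return (char) 32;                  // ideographic space U+3000 -> ASCII space
        } else if (input > 65280 && input < 65375) {
            return (char) (input - 65248);     // full-width form -> ASCII counterpart
        } else if (input >= 'A' && input <= 'Z') {
            return (char) (input + 32);        // upper case -> lower case
        }
        return input;
    }
}
```

Note that the branches are mutually exclusive, so a full-width uppercase letter comes out as a half-width uppercase letter in a single pass, not as lowercase.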
IKArbitrator.java — header version strings bumped from 7.7 to 8.5.0; the empty constructor is collapsed to one line and comment spacing normalized. The visible hunks cover process(), judge() and the start of forwardPath(); this file's diff is truncated, with elided context marked "...". Resulting 8.5.0 source of the shown hunks (comments translated from the Chinese):

package org.wltea.analyzer.core;

...

class IKArbitrator {

    IKArbitrator() {}

    /**
     * Ambiguity resolution for segmentation
     ...
     */
        LexemePath crossPath = new LexemePath();
        while (orgLexeme != null) {
            if (!crossPath.addCrossLexeme(orgLexeme)) {
                // orgLexeme does not intersect crossPath: close out the current path
                if (crossPath.size() == 1 || !useSmart) {
                    // crossPath is unambiguous, or ambiguity handling is off:
                    // output it directly
                    context.addLexemePath(crossPath);
                } else {
                    // Resolve the ambiguity inside the current crossPath
                    QuickSortSet.Cell headCell = crossPath.getHead();
                    LexemePath judgeResult = this.judge(headCell);
                    // Output the resolved result
                    context.addLexemePath(judgeResult);
                }

                // Start a new crossPath containing orgLexeme
                crossPath = new LexemePath();
                crossPath.addCrossLexeme(orgLexeme);
            }
        ...
        }

        // Handle the final path
        if (crossPath.size() == 1 || !useSmart) {
            // crossPath is unambiguous, or ambiguity handling is off:
            // output it directly
            context.addLexemePath(crossPath);
        } else {
            // Resolve the ambiguity inside the current crossPath
            QuickSortSet.Cell headCell = crossPath.getHead();
            LexemePath judgeResult = this.judge(headCell);
            // Output the resolved result
            context.addLexemePath(judgeResult);
        }
    }

    /**
     ...
     * @param lexemeCell head of the ambiguous path's linked list
     */
    private LexemePath judge(QuickSortSet.Cell lexemeCell) {
        // Candidate path set
        TreeSet<LexemePath> pathOptions = new TreeSet<>();
        // Current candidate path
        LexemePath option = new LexemePath();

        // Traverse crossPath once, returning the stack of conflicting lexemes
        Stack<QuickSortSet.Cell> lexemeStack = this.forwardPath(lexemeCell, option);

        // The current lexeme chain may not be ideal: add it as a candidate
        pathOptions.add(option.copy());

        // Handle the ambiguous lexemes
        QuickSortSet.Cell c;
        while (!lexemeStack.isEmpty()) {
            c = lexemeStack.pop();
            // Roll the lexeme chain back
            this.backPath(c.getLexeme(), option);
            // Recurse from the ambiguous position to generate an alternative
            this.forwardPath(c, option);
            pathOptions.add(option.copy());
        }

        // Return the best candidate in the set
        return pathOptions.first();

    }

    /**
     * Walk forward, adding lexemes to build an unambiguous combination
     */
    private Stack<QuickSortSet.Cell> forwardPath(QuickSortSet.Cell lexemeCell, LexemePath option) {
        // Stack of conflicting lexemes
        Stack<QuickSortSet.Cell> conflictStack = new Stack<>();
        QuickSortSet.Cell c = lexemeCell;
        // Iterate over the lexeme linked list
        while (c != null && c.getLexeme() != null) {
            if (!option.addNotCrossLexeme(c.getLexeme())) {
                // The lexeme crosses an existing one: push it onto the conflict stack
                conflictStack.push(c);
            }
            c = c.getNext();
        ...
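judge() above explores alternatives by backtracking: it builds one conflict-free path, then pops each conflicting lexeme, rolls the path back, regrows it, and keeps the best of all collected candidates. A toy version of that search over [begin,end) spans, with the selection rule simplified to "maximize covered length, then prefer fewer pieces" (the real LexemePath comparator has more tie-breakers; all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ArbitrateSketch {
    // A candidate word as a [begin, end) interval over the text.
    record Span(int begin, int end) {}

    // Enumerate every conflict-free subset and keep the best one.
    static List<Span> judge(List<Span> candidates) {
        List<List<Span>> options = new ArrayList<>();
        enumerate(candidates, 0, new ArrayList<>(), options);
        return Collections.max(options, Comparator
                .comparingInt((List<Span> p) -> p.stream().mapToInt(s -> s.end() - s.begin()).sum())
                .thenComparingInt(p -> -p.size()));
    }

    private static void enumerate(List<Span> spans, int i, List<Span> cur, List<List<Span>> out) {
        if (i == spans.size()) {
            out.add(new ArrayList<>(cur));
            return;
        }
        Span s = spans.get(i);
        boolean crosses = cur.stream().anyMatch(t -> s.begin() < t.end() && t.begin() < s.end());
        if (!crosses) {                        // branch 1: take s
            cur.add(s);
            enumerate(spans, i + 1, cur, out);
            cur.remove(cur.size() - 1);
        }
        enumerate(spans, i + 1, cur, out);     // branch 2: skip s
    }
}
```

For the overlapping candidates 中华 [0,2), 华人 [1,3) and 人民 [2,4), the non-crossing pair {中华, 人民} covers the most text and wins, mirroring how smart mode discards 华人.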
@ -1,6 +1,6 @@
|
|||||||
/*
|
/*
|
||||||
* IK 中文分词 版本 7.7
|
* IK 中文分词 版本 8.5.0
|
||||||
* IK Analyzer release 7.7
|
* IK Analyzer release 8.5.0
|
||||||
*
|
*
|
||||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
* contributor license agreements. See the NOTICE file distributed with
|
* contributor license agreements. See the NOTICE file distributed with
|
||||||
@ -21,35 +21,45 @@
|
|||||||
* 版权声明 2012,乌龙茶工作室
|
* 版权声明 2012,乌龙茶工作室
|
||||||
* provided by Linliangyi and copyright 2012 by Oolong studio
|
* provided by Linliangyi and copyright 2012 by Oolong studio
|
||||||
*
|
*
|
||||||
* 7.7版本 由 Magese (magese@live.cn) 更新
|
* 8.5.0版本 由 Magese (magese@live.cn) 更新
|
||||||
* release 7.7 update by Magese(magese@live.cn)
|
* release 8.3.1 update by Magese(magese@live.cn)
|
||||||
*
|
*
|
||||||
*/
|
*/
|
||||||
package org.wltea.analyzer.core;
|
package org.wltea.analyzer.core;
|
||||||
|
|
||||||
|
import org.wltea.analyzer.cfg.Configuration;
|
||||||
|
import org.wltea.analyzer.cfg.DefaultConfig;
|
||||||
|
import org.wltea.analyzer.dic.Dictionary;
|
||||||
|
|
||||||
import java.io.IOException;
|
import java.io.IOException;
|
||||||
import java.io.Reader;
|
import java.io.Reader;
|
||||||
import java.util.ArrayList;
|
import java.util.ArrayList;
|
||||||
import java.util.List;
|
import java.util.List;
|
||||||
|
|
||||||
import org.wltea.analyzer.cfg.Configuration;
|
|
||||||
import org.wltea.analyzer.cfg.DefaultConfig;
|
|
||||||
import org.wltea.analyzer.dic.Dictionary;
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* IK分词器主类
|
* IK分词器主类
|
||||||
*/
|
*/
|
||||||
public final class IKSegmenter {
|
public final class IKSegmenter {
|
||||||
|
|
||||||
//字符窜reader
|
/**
|
||||||
|
* 字符窜reader
|
||||||
|
*/
|
||||||
private Reader input;
|
private Reader input;
|
||||||
//分词器配置项
|
/**
|
||||||
private Configuration cfg;
|
* 分词器配置项
|
||||||
//分词器上下文
|
*/
|
||||||
|
private final Configuration cfg;
|
||||||
|
/**
|
||||||
|
* 分词器上下文
|
||||||
|
*/
|
||||||
private AnalyzeContext context;
|
private AnalyzeContext context;
|
||||||
//分词处理器列表
|
/**
|
||||||
|
* 分词处理器列表
|
||||||
|
*/
|
||||||
private List<ISegmenter> segmenters;
|
private List<ISegmenter> segmenters;
|
||||||
//分词歧义裁决器
|
/**
|
||||||
|
* 分词歧义裁决器
|
||||||
|
*/
|
||||||
private IKArbitrator arbitrator;
|
private IKArbitrator arbitrator;
|
||||||
|
|
||||||
|
|
||||||
@ -58,7 +68,6 @@ public final class IKSegmenter {
|
|||||||
*
|
*
|
||||||
* @param input 读取流
|
* @param input 读取流
|
||||||
* @param useSmart 为true,使用智能分词策略
|
* @param useSmart 为true,使用智能分词策略
|
||||||
* <p>
|
|
||||||
* 非智能分词:细粒度输出所有可能的切分结果
|
* 非智能分词:细粒度输出所有可能的切分结果
|
||||||
* 智能分词: 合并数词和量词,对分词结果进行歧义判断
|
* 智能分词: 合并数词和量词,对分词结果进行歧义判断
|
||||||
*/
|
*/
|
||||||
@ -86,13 +95,13 @@ public final class IKSegmenter {
|
|||||||
* 初始化
|
* 初始化
|
||||||
*/
|
*/
|
||||||
private void init() {
|
private void init() {
|
||||||
//初始化词典单例
|
// 初始化词典单例
|
||||||
Dictionary.initial(this.cfg);
|
Dictionary.initial(this.cfg);
|
||||||
//初始化分词上下文
|
// 初始化分词上下文
|
||||||
this.context = new AnalyzeContext(this.cfg);
|
this.context = new AnalyzeContext(this.cfg);
|
||||||
//加载子分词器
|
// 加载子分词器
|
||||||
this.segmenters = this.loadSegmenters();
|
this.segmenters = this.loadSegmenters();
|
||||||
//加载歧义裁决器
|
// 加载歧义裁决器
|
||||||
this.arbitrator = new IKArbitrator();
|
this.arbitrator = new IKArbitrator();
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -103,11 +112,11 @@ public final class IKSegmenter {
|
|||||||
*/
|
*/
|
||||||
private List<ISegmenter> loadSegmenters() {
|
private List<ISegmenter> loadSegmenters() {
|
||||||
List<ISegmenter> segmenters = new ArrayList<>(4);
|
List<ISegmenter> segmenters = new ArrayList<>(4);
|
||||||
//处理字母的子分词器
|
// 处理字母的子分词器
|
||||||
         segmenters.add(new LetterSegmenter());
-        //处理中文数量词的子分词器
+        // 处理中文数量词的子分词器
         segmenters.add(new CN_QuantifierSegmenter());
-        //处理中文词的子分词器
+        // 处理中文词的子分词器
         segmenters.add(new CJKSegmenter());
         return segmenters;
     }
@@ -127,34 +136,34 @@ public final class IKSegmenter {
          */
        int available = context.fillBuffer(this.input);
        if (available <= 0) {
-            //reader已经读完
+            // reader已经读完
            context.reset();
            return null;

        } else {
-            //初始化指针
+            // 初始化指针
            context.initCursor();
            do {
-                //遍历子分词器
+                // 遍历子分词器
                for (ISegmenter segmenter : segmenters) {
                    segmenter.analyze(context);
                }
-                //字符缓冲区接近读完,需要读入新的字符
+                // 字符缓冲区接近读完,需要读入新的字符
                if (context.needRefillBuffer()) {
                    break;
                }
-                //向前移动指针
+                // 向前移动指针
            } while (context.moveCursor());
-            //重置子分词器,为下轮循环进行初始化
+            // 重置子分词器,为下轮循环进行初始化
            for (ISegmenter segmenter : segmenters) {
                segmenter.reset();
            }
        }
-        //对分词进行歧义处理
+        // 对分词进行歧义处理
        this.arbitrator.process(context, this.cfg.useSmart());
-        //将分词结果输出到结果集,并处理未切分的单个CJK字符
+        // 将分词结果输出到结果集,并处理未切分的单个CJK字符
        context.outputToResult();
-        //记录本次分词的缓冲区位移
+        // 记录本次分词的缓冲区位移
        context.markBufferOffset();
    }
    return l;
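The loop above drives every sub-segmenter over each cursor position of the buffer, then resets them all before the next round. A simplified, self-contained sketch of that dispatch pattern follows; the `MiniSegmenter` interface and `DigitRun` class are illustrative stand-ins, not the project's API:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the IKSegmenter dispatch loop: every segmenter sees
// every cursor position, then all segmenters are reset afterwards.
public class DispatchSketch {
    interface MiniSegmenter {
        void analyze(char c, int cursor, List<String> out);
        void reset();
    }

    // Toy segmenter: emits maximal runs of ASCII digits.
    static class DigitRun implements MiniSegmenter {
        private final StringBuilder run = new StringBuilder();

        public void analyze(char c, int cursor, List<String> out) {
            if (Character.isDigit(c)) {
                run.append(c);
            } else if (run.length() > 0) {
                out.add(run.toString());
                run.setLength(0);
            }
        }

        public void reset() {
            run.setLength(0);
        }
    }

    static List<String> segment(String buffer, List<MiniSegmenter> segmenters) {
        List<String> out = new ArrayList<>();
        for (int cursor = 0; cursor < buffer.length(); cursor++) {
            // every sub-segmenter analyzes every position
            for (MiniSegmenter s : segmenters) {
                s.analyze(buffer.charAt(cursor), cursor, out);
            }
        }
        // reset all sub-segmenters for the next buffer
        for (MiniSegmenter s : segmenters) {
            s.reset();
        }
        return out;
    }

    public static void main(String[] args) {
        List<MiniSegmenter> segs = new ArrayList<>();
        segs.add(new DigitRun());
        System.out.println(segment("ab12cd345x", segs)); // [12, 345]
    }
}
```

The real `IKSegmenter` additionally locks the buffer while any segmenter is mid-token and refills it between rounds; this sketch keeps only the dispatch/reset skeleton.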
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,29 +21,29 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.core;


 /**
- *
  * 子分词器接口
  */
 interface ISegmenter {

     /**
      * 从分析器读取下一个可能分解的词元对象
+     *
      * @param context 分词算法上下文
      */
     void analyze(AnalyzeContext context);


     /**
      * 重置子分析器状态
      */
     void reset();

 }
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.core;
@@ -34,14 +34,18 @@ import java.util.Arrays;
  */
 class LetterSegmenter implements ISegmenter {

-    //子分词器标签
+    /**
+     * 子分词器标签
+     */
     private static final String SEGMENTER_NAME = "LETTER_SEGMENTER";
-    //链接符号
+    /**
+     * 链接符号
+     */
     private static final char[] Letter_Connector = new char[]{'#', '&', '+', '-', '.', '@', '_'};
-    //数字符号
+    /**
+     * 数字符号
+     */
     private static final char[] Num_Connector = new char[]{',', '.'};

     /*
      * 词元的开始位置,
      * 同时作为子分词器状态标识
@@ -53,22 +57,18 @@ class LetterSegmenter implements ISegmenter {
      * end记录的是在词元中最后一个出现的Letter但非Sign_Connector的字符的位置
      */
     private int end;

     /*
      * 字母起始位置
      */
     private int englishStart;

     /*
      * 字母结束位置
      */
     private int englishEnd;

     /*
      * 阿拉伯数字起始位置
      */
     private int arabicStart;

     /*
      * 阿拉伯数字结束位置
      */
@@ -91,18 +91,18 @@ class LetterSegmenter implements ISegmenter {
      */
     public void analyze(AnalyzeContext context) {
         boolean bufferLockFlag;
-        //处理英文字母
+        // 处理英文字母
         bufferLockFlag = this.processEnglishLetter(context);
-        //处理阿拉伯字母
+        // 处理阿拉伯字母
         bufferLockFlag = this.processArabicLetter(context) || bufferLockFlag;
-        //处理混合字母(这个要放最后处理,可以通过QuickSortSet排除重复)
+        // 处理混合字母(这个要放最后处理,可以通过QuickSortSet排除重复)
         bufferLockFlag = this.processMixLetter(context) || bufferLockFlag;

-        //判断是否锁定缓冲区
+        // 判断是否锁定缓冲区
         if (bufferLockFlag) {
             context.lockBuffer(SEGMENTER_NAME);
         } else {
-            //对缓冲区解锁
+            // 对缓冲区解锁
             context.unlockBuffer(SEGMENTER_NAME);
         }
     }
@@ -128,26 +128,26 @@ class LetterSegmenter implements ISegmenter {
     private boolean processMixLetter(AnalyzeContext context) {
         boolean needLock;

-        if (this.start == -1) {//当前的分词器尚未开始处理字符
+        if (this.start == -1) {// 当前的分词器尚未开始处理字符
             if (CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()
                     || CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()) {
-                //记录起始指针的位置,标明分词器进入处理状态
+                // 记录起始指针的位置,标明分词器进入处理状态
                 this.start = context.getCursor();
                 this.end = start;
             }

-        } else {//当前的分词器正在处理字符
+        } else {// 当前的分词器正在处理字符
             if (CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()
                     || CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()) {
-                //记录下可能的结束位置
+                // 记录下可能的结束位置
                 this.end = context.getCursor();

             } else if (CharacterUtil.CHAR_USELESS == context.getCurrentCharType()
                     && this.isLetterConnector(context.getCurrentChar())) {
-                //记录下可能的结束位置
+                // 记录下可能的结束位置
                 this.end = context.getCursor();
             } else {
-                //遇到非Letter字符,输出词元
+                // 遇到非Letter字符,输出词元
                 Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.start, this.end - this.start + 1, Lexeme.TYPE_LETTER);
                 context.addLexeme(newLexeme);
                 this.start = -1;
@@ -155,10 +155,10 @@ class LetterSegmenter implements ISegmenter {
             }
         }

-        //判断缓冲区是否已经读完
+        // 判断缓冲区是否已经读完
         if (context.isBufferConsumed()) {
             if (this.start != -1 && this.end != -1) {
-                //缓冲以读完,输出词元
+                // 缓冲以读完,输出词元
                 Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.start, this.end - this.start + 1, Lexeme.TYPE_LETTER);
                 context.addLexeme(newLexeme);
                 this.start = -1;
@@ -166,7 +166,7 @@ class LetterSegmenter implements ISegmenter {
             }
         }

-        //判断是否锁定缓冲区
+        // 判断是否锁定缓冲区
         needLock = this.start != -1 || this.end != -1;
         return needLock;
     }
@@ -179,18 +179,18 @@ class LetterSegmenter implements ISegmenter {
     private boolean processEnglishLetter(AnalyzeContext context) {
         boolean needLock;

-        if (this.englishStart == -1) {//当前的分词器尚未开始处理英文字符
+        if (this.englishStart == -1) {// 当前的分词器尚未开始处理英文字符
             if (CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()) {
-                //记录起始指针的位置,标明分词器进入处理状态
+                // 记录起始指针的位置,标明分词器进入处理状态
                 this.englishStart = context.getCursor();
                 this.englishEnd = this.englishStart;
             }
-        } else {//当前的分词器正在处理英文字符
+        } else {// 当前的分词器正在处理英文字符
             if (CharacterUtil.CHAR_ENGLISH == context.getCurrentCharType()) {
-                //记录当前指针位置为结束位置
+                // 记录当前指针位置为结束位置
                 this.englishEnd = context.getCursor();
             } else {
-                //遇到非English字符,输出词元
+                // 遇到非English字符,输出词元
                 Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.englishStart, this.englishEnd - this.englishStart + 1, Lexeme.TYPE_ENGLISH);
                 context.addLexeme(newLexeme);
                 this.englishStart = -1;
@@ -198,10 +198,10 @@ class LetterSegmenter implements ISegmenter {
             }
         }

-        //判断缓冲区是否已经读完
+        // 判断缓冲区是否已经读完
         if (context.isBufferConsumed()) {
             if (this.englishStart != -1 && this.englishEnd != -1) {
-                //缓冲以读完,输出词元
+                // 缓冲以读完,输出词元
                 Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.englishStart, this.englishEnd - this.englishStart + 1, Lexeme.TYPE_ENGLISH);
                 context.addLexeme(newLexeme);
                 this.englishStart = -1;
@@ -209,7 +209,7 @@ class LetterSegmenter implements ISegmenter {
             }
         }

-        //判断是否锁定缓冲区
+        // 判断是否锁定缓冲区
         needLock = this.englishStart != -1 || this.englishEnd != -1;
         return needLock;
     }
@@ -222,21 +222,21 @@ class LetterSegmenter implements ISegmenter {
     private boolean processArabicLetter(AnalyzeContext context) {
         boolean needLock;

-        if (this.arabicStart == -1) {//当前的分词器尚未开始处理数字字符
+        if (this.arabicStart == -1) {// 当前的分词器尚未开始处理数字字符
             if (CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()) {
-                //记录起始指针的位置,标明分词器进入处理状态
+                // 记录起始指针的位置,标明分词器进入处理状态
                 this.arabicStart = context.getCursor();
                 this.arabicEnd = this.arabicStart;
             }
-        } else {//当前的分词器正在处理数字字符
+        } else {// 当前的分词器正在处理数字字符
             if (CharacterUtil.CHAR_ARABIC == context.getCurrentCharType()) {
-                //记录当前指针位置为结束位置
+                // 记录当前指针位置为结束位置
                 this.arabicEnd = context.getCursor();
             }/* else if (CharacterUtil.CHAR_USELESS == context.getCurrentCharType()
                     && this.isNumConnector(context.getCurrentChar())) {
-                //不输出数字,但不标记结束
+                // 不输出数字,但不标记结束
             }*/ else {
-                ////遇到非Arabic字符,输出词元
+                // //遇到非Arabic字符,输出词元
                 Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.arabicStart, this.arabicEnd - this.arabicStart + 1, Lexeme.TYPE_ARABIC);
                 context.addLexeme(newLexeme);
                 this.arabicStart = -1;
@@ -244,10 +244,10 @@ class LetterSegmenter implements ISegmenter {
             }
         }

-        //判断缓冲区是否已经读完
+        // 判断缓冲区是否已经读完
         if (context.isBufferConsumed()) {
             if (this.arabicStart != -1 && this.arabicEnd != -1) {
-                //生成已切分的词元
+                // 生成已切分的词元
                 Lexeme newLexeme = new Lexeme(context.getBufferOffset(), this.arabicStart, this.arabicEnd - this.arabicStart + 1, Lexeme.TYPE_ARABIC);
                 context.addLexeme(newLexeme);
                 this.arabicStart = -1;
@@ -255,7 +255,7 @@ class LetterSegmenter implements ISegmenter {
             }
         }

-        //判断是否锁定缓冲区
+        // 判断是否锁定缓冲区
         needLock = this.arabicStart != -1 || this.arabicEnd != -1;
         return needLock;
     }
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.core;
@@ -31,242 +31,278 @@ package org.wltea.analyzer.core;
  * IK词元对象
  */
 @SuppressWarnings("unused")
-public class Lexeme implements Comparable<Lexeme>{
-    //英文
+public class Lexeme implements Comparable<Lexeme> {
+    /**
+     * 英文
+     */
     static final int TYPE_ENGLISH = 1;
-    //数字
+    /**
+     * 数字
+     */
     static final int TYPE_ARABIC = 2;
-    //英文数字混合
+    /**
+     * 英文数字混合
+     */
     static final int TYPE_LETTER = 3;
-    //中文词元
+    /**
+     * 中文词元
+     */
     static final int TYPE_CNWORD = 4;
-    //中文单字
+    /**
+     * 中文单字
+     */
     static final int TYPE_CNCHAR = 64;
-    //日韩文字
+    /**
+     * 日韩文字
+     */
     static final int TYPE_OTHER_CJK = 8;
-    //中文数词
+    /**
+     * 中文数词
+     */
     static final int TYPE_CNUM = 16;
-    //中文量词
+    /**
+     * 中文量词
+     */
     static final int TYPE_COUNT = 32;
-    //中文数量词
+    /**
+     * 中文数量词
+     */
     static final int TYPE_CQUAN = 48;
-
-    //词元的起始位移
+    /**
+     * 词元的起始位移
+     */
     private int offset;
-    //词元的相对起始位置
+    /**
+     * 词元的相对起始位置
+     */
     private int begin;
-    //词元的长度
+    /**
+     * 词元的长度
+     */
     private int length;
-    //词元文本
+    /**
+     * 词元文本
+     */
     private String lexemeText;
-    //词元类型
+    /**
+     * 词元类型
+     */
     private int lexemeType;


-    public Lexeme(int offset , int begin , int length , int lexemeType){
+    public Lexeme(int offset, int begin, int length, int lexemeType) {
         this.offset = offset;
         this.begin = begin;
-        if(length < 0){
+        if (length < 0) {
             throw new IllegalArgumentException("length < 0");
         }
         this.length = length;
         this.lexemeType = lexemeType;
     }

     /*
      * 判断词元相等算法
      * 起始位置偏移、起始位置、终止位置相同
      * @see java.lang.Object#equals(Object o)
      */
-    public boolean equals(Object o){
-        if(o == null){
+    public boolean equals(Object o) {
+        if (o == null) {
             return false;
         }

-        if(this == o){
+        if (this == o) {
             return true;
         }

-        if(o instanceof Lexeme){
-            Lexeme other = (Lexeme)o;
+        if (o instanceof Lexeme) {
+            Lexeme other = (Lexeme) o;
             return this.offset == other.getOffset()
                     && this.begin == other.getBegin()
                     && this.length == other.getLength();
-        }else{
+        } else {
             return false;
         }
     }

     /*
      * 词元哈希编码算法
      * @see java.lang.Object#hashCode()
      */
-    public int hashCode(){
+    public int hashCode() {
         int absBegin = getBeginPosition();
         int absEnd = getEndPosition();
         return (absBegin * 37) + (absEnd * 31) + ((absBegin * absEnd) % getLength()) * 11;
     }

     /*
      * 词元在排序集合中的比较算法
      * @see java.lang.Comparable#compareTo(java.lang.Object)
      */
     public int compareTo(Lexeme other) {
-        //起始位置优先
-        if(this.begin < other.getBegin()){
+        // 起始位置优先
+        if (this.begin < other.getBegin()) {
             return -1;
-        }else if(this.begin == other.getBegin()){
-            //词元长度优先
-            //this.length < other.getLength()
+        } else if (this.begin == other.getBegin()) {
+            // 词元长度优先
+            // this.length < other.getLength()
             return Integer.compare(other.getLength(), this.length);

-        }else{//this.begin > other.getBegin()
+        } else {
             return 1;
         }
     }

     private int getOffset() {
         return offset;
     }

     public void setOffset(int offset) {
         this.offset = offset;
     }

     int getBegin() {
         return begin;
     }
+
     /**
      * 获取词元在文本中的起始位置
+     *
      * @return int
      */
-    public int getBeginPosition(){
+    public int getBeginPosition() {
         return offset + begin;
     }

     public void setBegin(int begin) {
         this.begin = begin;
     }

     /**
      * 获取词元在文本中的结束位置
+     *
      * @return int
      */
-    public int getEndPosition(){
+    public int getEndPosition() {
         return offset + begin + length;
     }

     /**
      * 获取词元的字符长度
+     *
      * @return int
      */
-    public int getLength(){
+    public int getLength() {
         return this.length;
     }

     public void setLength(int length) {
-        if(this.length < 0){
+        if (this.length < 0) {
             throw new IllegalArgumentException("length < 0");
         }
         this.length = length;
     }

     /**
      * 获取词元的文本内容
+     *
      * @return String
      */
     public String getLexemeText() {
-        if(lexemeText == null){
+        if (lexemeText == null) {
             return "";
         }
         return lexemeText;
     }

     void setLexemeText(String lexemeText) {
-        if(lexemeText == null){
+        if (lexemeText == null) {
             this.lexemeText = "";
             this.length = 0;
-        }else{
+        } else {
             this.lexemeText = lexemeText;
             this.length = lexemeText.length();
         }
     }

     /**
      * 获取词元类型
+     *
      * @return int
      */
     int getLexemeType() {
         return lexemeType;
     }

     /**
      * 获取词元类型标示字符串
+     *
      * @return String
      */
-    public String getLexemeTypeString(){
-        switch(lexemeType) {
+    public String getLexemeTypeString() {
+        switch (lexemeType) {

-            case TYPE_ENGLISH :
+            case TYPE_ENGLISH:
                 return "ENGLISH";

-            case TYPE_ARABIC :
+            case TYPE_ARABIC:
                 return "ARABIC";

-            case TYPE_LETTER :
+            case TYPE_LETTER:
                 return "LETTER";

-            case TYPE_CNWORD :
+            case TYPE_CNWORD:
                 return "CN_WORD";

-            case TYPE_CNCHAR :
+            case TYPE_CNCHAR:
                 return "CN_CHAR";

-            case TYPE_OTHER_CJK :
+            case TYPE_OTHER_CJK:
                 return "OTHER_CJK";

-            case TYPE_COUNT :
+            case TYPE_COUNT:
                 return "COUNT";

-            case TYPE_CNUM :
+            case TYPE_CNUM:
                 return "TYPE_CNUM";

             case TYPE_CQUAN:
                 return "TYPE_CQUAN";

-            default :
-                return "UNKONW";
+            default:
+                return "UNKNOWN";
         }
     }

     public void setLexemeType(int lexemeType) {
         this.lexemeType = lexemeType;
     }

     /**
      * 合并两个相邻的词元
+     *
      * @return boolean 词元是否成功合并
      */
-    boolean append(Lexeme l, int lexemeType){
-        if(l != null && this.getEndPosition() == l.getBeginPosition()){
+    boolean append(Lexeme l, int lexemeType) {
+        if (l != null && this.getEndPosition() == l.getBeginPosition()) {
             this.length += l.getLength();
             this.lexemeType = lexemeType;
             return true;
-        }else {
+        } else {
             return false;
         }
     }

     /**
+     * ToString 方法
      *
+     * @return 字符串输出
      */
-    public String toString(){
+    public String toString() {
         return this.getBeginPosition() + "-" + this.getEndPosition() +
                 " : " + this.lexemeText + " : \t" +
                 this.getLexemeTypeString();
     }


 }
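`Lexeme.compareTo` above orders lexemes by start position ascending and, when two lexemes start at the same position, puts the longer one first via `Integer.compare(other.getLength(), this.length)`. A minimal standalone sketch of that ordering rule; the `Span` class here is an illustrative stand-in, not the project's `Lexeme`:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative stand-in for the Lexeme ordering: begin ascending,
// then longer length first when begins are equal.
class Span implements Comparable<Span> {
    final int begin;
    final int length;

    Span(int begin, int length) {
        this.begin = begin;
        this.length = length;
    }

    @Override
    public int compareTo(Span other) {
        if (this.begin < other.begin) {
            return -1;
        } else if (this.begin == other.begin) {
            // longer span sorts first, mirroring Integer.compare(other.getLength(), this.length)
            return Integer.compare(other.length, this.length);
        } else {
            return 1;
        }
    }

    @Override
    public String toString() {
        return begin + "+" + length;
    }
}

public class SpanOrderDemo {
    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(2, 1));
        spans.add(new Span(0, 1));
        spans.add(new Span(0, 3));
        Collections.sort(spans);
        // begin ascending; at begin 0 the longer span (0+3) comes first
        System.out.println(spans); // [0+3, 0+1, 2+1]
    }
}
```

Preferring the longer lexeme on ties is what lets the `QuickSortSet` keep a maximal match ahead of its prefixes during disambiguation.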
@ -1,6 +1,6 @@
|
|||||||
/*
|
/*
|
||||||
* IK 中文分词 版本 7.7
|
* IK 中文分词 版本 8.5.0
|
||||||
* IK Analyzer release 7.7
|
* IK Analyzer release 8.5.0
|
||||||
*
|
*
|
||||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
* contributor license agreements. See the NOTICE file distributed with
|
* contributor license agreements. See the NOTICE file distributed with
|
||||||
@ -21,8 +21,8 @@
|
|||||||
* 版权声明 2012,乌龙茶工作室
|
* 版权声明 2012,乌龙茶工作室
|
||||||
* provided by Linliangyi and copyright 2012 by Oolong studio
|
* provided by Linliangyi and copyright 2012 by Oolong studio
|
||||||
*
|
*
|
||||||
* 7.7版本 由 Magese (magese@live.cn) 更新
|
* 8.5.0版本 由 Magese (magese@live.cn) 更新
|
||||||
* release 7.7 update by Magese(magese@live.cn)
|
* release 8.5.0 update by Magese(magese@live.cn)
|
||||||
*
|
*
|
||||||
*/
|
*/
|
||||||
package org.wltea.analyzer.core;
|
package org.wltea.analyzer.core;
|
||||||
@ -34,11 +34,17 @@ package org.wltea.analyzer.core;
|
|||||||
@SuppressWarnings("unused")
|
@SuppressWarnings("unused")
|
||||||
class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
||||||
|
|
||||||
//起始位置
|
/**
|
||||||
|
* 起始位置
|
||||||
|
*/
|
||||||
private int pathBegin;
|
private int pathBegin;
|
||||||
//结束
|
/**
|
||||||
|
* 结束
|
||||||
|
*/
|
||||||
private int pathEnd;
|
private int pathEnd;
|
||||||
//词元链的有效字符长度
|
/**
|
||||||
|
* 词元链的有效字符长度
|
||||||
|
*/
|
||||||
private int payloadLength;
|
private int payloadLength;
|
||||||
|
|
||||||
LexemePath() {
|
LexemePath() {
|
||||||
@ -100,7 +106,6 @@ class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
|||||||
|
|
||||||
/**
|
/**
|
||||||
* 移除尾部的Lexeme
|
* 移除尾部的Lexeme
|
||||||
*
|
|
||||||
*/
|
*/
|
||||||
void removeTail() {
|
void removeTail() {
|
||||||
Lexeme tail = this.pollLast();
|
Lexeme tail = this.pollLast();
|
||||||
@ -117,7 +122,6 @@ class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
|||||||
|
|
||||||
/**
|
/**
|
||||||
* 检测词元位置交叉(有歧义的切分)
|
* 检测词元位置交叉(有歧义的切分)
|
||||||
*
|
|
||||||
*/
|
*/
|
||||||
boolean checkCross(Lexeme lexeme) {
|
boolean checkCross(Lexeme lexeme) {
|
||||||
return (lexeme.getBegin() >= this.pathBegin && lexeme.getBegin() < this.pathEnd)
|
return (lexeme.getBegin() >= this.pathBegin && lexeme.getBegin() < this.pathEnd)
|
||||||
@ -141,7 +145,6 @@ class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
|||||||
|
|
||||||
/**
|
/**
|
||||||
* 获取LexemePath的路径长度
|
* 获取LexemePath的路径长度
|
||||||
*
|
|
||||||
*/
|
*/
|
||||||
private int getPathLength() {
|
private int getPathLength() {
|
||||||
return this.pathEnd - this.pathBegin;
|
return this.pathEnd - this.pathBegin;
|
||||||
@ -150,7 +153,6 @@ class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
|||||||
|
|
||||||
/**
|
/**
|
||||||
* X权重(词元长度积)
|
* X权重(词元长度积)
|
||||||
*
|
|
||||||
*/
|
*/
|
||||||
private int getXWeight() {
|
private int getXWeight() {
|
||||||
int product = 1;
|
int product = 1;
|
||||||
@ -191,48 +193,48 @@ class LexemePath extends QuickSortSet implements Comparable<LexemePath> {
|
|||||||
}
|
}
|
||||||
|
|
||||||
public int compareTo(LexemePath o) {
|
public int compareTo(LexemePath o) {
|
||||||
//比较有效文本长度
|
// 比较有效文本长度
|
||||||
if (this.payloadLength > o.payloadLength) {
|
if (this.payloadLength > o.payloadLength) {
|
||||||
return -1;
|
return -1;
|
||||||
} else if (this.payloadLength < o.payloadLength) {
|
 } else if (this.payloadLength < o.payloadLength) {
     return 1;
-} else {
-    // compare lexeme count; fewer is better
-    if (this.size() < o.size()) {
-        return -1;
-    } else if (this.size() > o.size()) {
-        return 1;
-    } else {
-        // a larger path span is better
-        if (this.getPathLength() > o.getPathLength()) {
-            return -1;
-        } else if (this.getPathLength() < o.getPathLength()) {
-            return 1;
-        } else {
-            // statistically, reverse segmentation outperforms forward segmentation, so later positions take priority
-            if (this.pathEnd > o.pathEnd) {
-                return -1;
-            } else if (pathEnd < o.pathEnd) {
-                return 1;
-            } else {
-                // more even lexeme lengths are better
-                if (this.getXWeight() > o.getXWeight()) {
-                    return -1;
-                } else if (this.getXWeight() < o.getXWeight()) {
-                    return 1;
-                } else {
-                    // compare lexeme position weight
-                    if (this.getPWeight() > o.getPWeight()) {
-                        return -1;
-                    } else if (this.getPWeight() < o.getPWeight()) {
-                        return 1;
-                    }
-                }
-            }
-        }
-    }
-}
+}
+
+// compare lexeme count; fewer is better
+if (this.size() < o.size()) {
+    return -1;
+} else if (this.size() > o.size()) {
+    return 1;
+}
+
+// a larger path span is better
+if (this.getPathLength() > o.getPathLength()) {
+    return -1;
+} else if (this.getPathLength() < o.getPathLength()) {
+    return 1;
+}
+
+// statistically, reverse segmentation outperforms forward segmentation, so later positions take priority
+if (this.pathEnd > o.pathEnd) {
+    return -1;
+} else if (pathEnd < o.pathEnd) {
+    return 1;
+}
+
+// more even lexeme lengths are better
+if (this.getXWeight() > o.getXWeight()) {
+    return -1;
+} else if (this.getXWeight() < o.getXWeight()) {
+    return 1;
+}
+
+// compare lexeme position weight
+if (this.getPWeight() > o.getPWeight()) {
+    return -1;
+} else if (this.getPWeight() < o.getPWeight()) {
+    return 1;
+}
+
 return 0;
 }
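The flattened comparator above applies the same ordering as the nested original. As a standalone illustration, the tie-breaker chain can be sketched with a hypothetical `Path` class whose plain fields stand in for LexemePath's accessors (names and types here are assumptions, not the project's API):

```java
// Hypothetical stand-in for LexemePath: a chain of tie-breakers, earlier
// criteria dominate, later ones only decide ties.
class Path implements Comparable<Path> {
    int payloadLength, size, pathLength, pathEnd;
    double xWeight, pWeight;

    Path(int payloadLength, int size, int pathLength, int pathEnd, double xWeight, double pWeight) {
        this.payloadLength = payloadLength;
        this.size = size;
        this.pathLength = pathLength;
        this.pathEnd = pathEnd;
        this.xWeight = xWeight;
        this.pWeight = pWeight;
    }

    @Override
    public int compareTo(Path o) {
        // longer payload first
        if (this.payloadLength > o.payloadLength) return -1;
        if (this.payloadLength < o.payloadLength) return 1;
        // fewer lexemes first
        if (this.size < o.size) return -1;
        if (this.size > o.size) return 1;
        // larger path span first
        if (this.pathLength > o.pathLength) return -1;
        if (this.pathLength < o.pathLength) return 1;
        // later end position first (reverse-segmentation bias)
        if (this.pathEnd > o.pathEnd) return -1;
        if (this.pathEnd < o.pathEnd) return 1;
        // more even lexeme lengths first
        if (this.xWeight > o.xWeight) return -1;
        if (this.xWeight < o.xWeight) return 1;
        // higher position weight first
        if (this.pWeight > o.pWeight) return -1;
        if (this.pWeight < o.pWeight) return 1;
        return 0;
    }
}
```

The early-return shape makes it obvious that each criterion is only consulted when every earlier one tied, which is exactly what the original nested `else` blocks encoded.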
@@ -1,6 +1,6 @@
 /*
- * IK Chinese word segmentation, version 7.7
- * IK Analyzer release 7.7
+ * IK Chinese word segmentation, version 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,188 +21,196 @@
  * Copyright notice 2012, Oolong Studio
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * version 7.7 updated by Magese (magese@live.cn)
- * release 7.7 update by Magese(magese@live.cn)
+ * version 8.2.0 updated by Magese (magese@live.cn)
+ * release 8.2.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.core;

 /**
- * quick-sort set of Lexems dedicated to the IK segmenter
+ * quick-sort set of Lexemes dedicated to the IK segmenter
  */
 class QuickSortSet {
-    // head of the linked list
+    /**
+     * head of the linked list
+     */
     private Cell head;
-    // tail of the linked list
+    /**
+     * tail of the linked list
+     */
     private Cell tail;
-    // actual size of the linked list
+    /**
+     * actual size of the linked list
+     */
     private int size;

     QuickSortSet() {
         this.size = 0;
     }

     /**
      * add a lexeme to the linked-list set
      */
     void addLexeme(Lexeme lexeme) {
         Cell newCell = new Cell(lexeme);
         if (this.size == 0) {
             this.head = newCell;
             this.tail = newCell;
             this.size++;

         } else {
-            /*if(this.tail.compareTo(newCell) == 0){// lexeme equals the tail lexeme; not added to the set
-
-            }else */if (this.tail.compareTo(newCell) < 0) {// append the lexeme at the tail
+            if (this.tail.compareTo(newCell) < 0) {
+                // append the lexeme at the tail of the list
                 this.tail.next = newCell;
                 newCell.prev = this.tail;
                 this.tail = newCell;
                 this.size++;

-            } else if (this.head.compareTo(newCell) > 0) {// attach the lexeme at the head
+            } else if (this.head.compareTo(newCell) > 0) {
+                // attach the lexeme at the head of the list
                 this.head.prev = newCell;
                 newCell.next = this.head;
                 this.head = newCell;
                 this.size++;

             } else {
                 // walk backwards from the tail
                 Cell index = this.tail;
                 while (index != null && index.compareTo(newCell) > 0) {
                     index = index.prev;
                 }
-                /*if(index.compareTo(newCell) == 0){// lexeme duplicates one already in the set; not added
-
-                }else */if ((index != null ? index.compareTo(newCell) : 1) < 0) {// insert the lexeme inside the list
+
+                // insert the lexeme at some position inside the list
+                if ((index != null ? index.compareTo(newCell) : 1) < 0) {
                     newCell.prev = index;
                     newCell.next = index.next;
                     index.next.prev = newCell;
                     index.next = newCell;
                     this.size++;
                 }
             }
         }
     }

     /**
      * return the head element of the list
      */
     Lexeme peekFirst() {
         if (this.head != null) {
             return this.head.lexeme;
         }
         return null;
     }

     /**
      * remove and return the first element of the set
      *
      * @return Lexeme
      */
     Lexeme pollFirst() {
         if (this.size == 1) {
             Lexeme first = this.head.lexeme;
             this.head = null;
             this.tail = null;
             this.size--;
             return first;
         } else if (this.size > 1) {
             Lexeme first = this.head.lexeme;
             this.head = this.head.next;
             this.size--;
             return first;
         } else {
             return null;
         }
     }

     /**
      * return the tail element of the list
      */
     Lexeme peekLast() {
         if (this.tail != null) {
             return this.tail.lexeme;
         }
         return null;
     }

     /**
      * remove and return the last element of the set
      *
      * @return Lexeme
      */
     Lexeme pollLast() {
         if (this.size == 1) {
             Lexeme last = this.head.lexeme;
             this.head = null;
             this.tail = null;
             this.size--;
             return last;
         } else if (this.size > 1) {
             Lexeme last = this.tail.lexeme;
             this.tail = this.tail.prev;
             this.size--;
             return last;
         } else {
             return null;
         }
     }

     /**
      * return the size of the set
      */
     int size() {
         return this.size;
     }

     /**
      * check whether the set is empty
      */
     boolean isEmpty() {
         return this.size == 0;
     }

     /**
      * return the head of the lexeme chain
      */
     Cell getHead() {
         return this.head;
     }

     /*
-     * IK Chinese word segmentation, version 7.0
-     * IK Analyzer release 7.0
+     * IK Chinese word segmentation, version 8.5.0
+     * IK Analyzer release 8.5.0
      * update by Magese(magese@live.cn)
      */
     @SuppressWarnings("unused")
-    class Cell implements Comparable<Cell> {
+    static class Cell implements Comparable<Cell> {
         private Cell prev;
         private Cell next;
-        private Lexeme lexeme;
+        private final Lexeme lexeme;

         Cell(Lexeme lexeme) {
             if (lexeme == null) {
                 throw new IllegalArgumentException("lexeme must not be null");
             }
             this.lexeme = lexeme;
         }

         public int compareTo(Cell o) {
             return this.lexeme.compareTo(o.lexeme);
         }

         public Cell getPrev() {
             return this.prev;
         }

         Cell getNext() {
             return this.next;
         }

         public Lexeme getLexeme() {
             return this.lexeme;
         }
     }
 }
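`addLexeme` keeps the cells ordered by checking the tail, then the head, then walking back from the tail to find the insertion point, silently dropping duplicates. A minimal self-contained sketch of that insertion strategy, using a hypothetical `SortedLinkedList` of ints in place of Lexeme cells:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of QuickSortSet's insertion on a sorted doubly linked list.
// Duplicates are dropped, mirroring the set semantics of the original.
class SortedLinkedList {
    static class Node {
        Node prev, next;
        final int value;
        Node(int value) { this.value = value; }
    }

    private Node head, tail;
    private int size;

    void add(int value) {
        Node n = new Node(value);
        if (size == 0) {
            head = tail = n;
            size++;
        } else if (tail.value < value) {
            // append at the tail
            tail.next = n;
            n.prev = tail;
            tail = n;
            size++;
        } else if (head.value > value) {
            // prepend at the head
            head.prev = n;
            n.next = head;
            head = n;
            size++;
        } else {
            // walk backwards from the tail to the last node <= value
            Node index = tail;
            while (index != null && index.value > value) {
                index = index.prev;
            }
            // insert only if strictly greater than that node (duplicates dropped)
            if (index != null && index.value < value) {
                n.prev = index;
                n.next = index.next;
                index.next.prev = n;
                index.next = n;
                size++;
            }
        }
    }

    List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Node c = head; c != null; c = c.next) out.add(c.value);
        return out;
    }
}
```

The tail-first check matters because lexemes mostly arrive in increasing offset order, so the common case is an O(1) append rather than a backward walk.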
@@ -1,6 +1,6 @@
 /*
- * IK Chinese word segmentation, version 7.7
- * IK Analyzer release 7.7
+ * IK Chinese word segmentation, version 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * Copyright notice 2012, Oolong Studio
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * version 7.7 updated by Magese (magese@live.cn)
- * release 7.7 update by Magese(magese@live.cn)
+ * version 8.5.0 updated by Magese (magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.dic;
@@ -37,24 +37,38 @@ import java.util.Map;
 @SuppressWarnings("unused")
 class DictSegment implements Comparable<DictSegment> {

-    // shared dictionary table storing Chinese characters
+    /**
+     * shared dictionary table storing Chinese characters
+     */
     private static final Map<Character, Character> charMap = new HashMap<>(16, 0.95f);
-    // upper limit on array size
+    /**
+     * upper limit on array size
+     */
     private static final int ARRAY_LENGTH_LIMIT = 3;


-    // Map storage structure
-    private Map<Character, DictSegment> childrenMap;
-    // array storage structure
-    private DictSegment[] childrenArray;
+    /**
+     * Map storage structure
+     */
+    private volatile Map<Character, DictSegment> childrenMap;
+    /**
+     * array storage structure
+     */
+    private volatile DictSegment[] childrenArray;


-    // character stored at the current node
-    private Character nodeChar;
-    // number of Segments stored at the current node
-    // storeSize <= ARRAY_LENGTH_LIMIT: use array storage; storeSize > ARRAY_LENGTH_LIMIT: use Map storage
+    /**
+     * character stored at the current node
+     */
+    private final Character nodeChar;
+    /**
+     * number of Segments stored at the current node
+     * storeSize <= ARRAY_LENGTH_LIMIT: use array storage; storeSize > ARRAY_LENGTH_LIMIT: use Map storage
+     */
     private int storeSize = 0;
-    // state of this DictSegment: default 0; 1 means the path from the root to this node forms a word
+    /**
+     * state of this DictSegment: default 0; 1 means the path from the root to this node forms a word
+     */
     private int nodeState = 0;
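The fields above encode DictSegment's hybrid child storage: a branch holds its children in a small array while `storeSize <= ARRAY_LENGTH_LIMIT`, and migrates to a Map keyed by character once it grows past the limit. A simplified sketch of that switch-over, with a hypothetical `TrieNode` (storing only the child keys) standing in for the real class:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the array-then-map storage strategy. ARRAY_LENGTH_LIMIT and
// the node layout are simplified stand-ins for DictSegment.
class TrieNode {
    private static final int ARRAY_LENGTH_LIMIT = 3;

    private Character[] keysArray = new Character[0];   // array storage (sorted)
    private Map<Character, Character> keysMap;          // map storage, null until needed
    private int storeSize = 0;

    void addChild(char c) {
        if (contains(c)) return;
        if (storeSize < ARRAY_LENGTH_LIMIT) {
            // still small: grow the sorted array by one
            Character[] grown = Arrays.copyOf(keysArray, storeSize + 1);
            grown[storeSize] = c;
            Arrays.sort(grown);
            keysArray = grown;
        } else {
            // past the limit: migrate the array into a map once, then use it
            if (keysMap == null) {
                keysMap = new HashMap<>();
                for (Character k : keysArray) keysMap.put(k, k);
            }
            keysMap.put(c, c);
        }
        storeSize++;
    }

    boolean contains(char c) {
        if (keysMap != null) return keysMap.containsKey(c);
        return Arrays.binarySearch(keysArray, c) >= 0;
    }

    int size() { return storeSize; }
}
```

The payoff is memory: most trie nodes have only a handful of children, so a tiny sorted array plus binary search beats allocating a HashMap per node, while deep, fan-out-heavy nodes still get O(1) lookup.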
@@ -1,6 +1,6 @@
 /*
- * IK Chinese word segmentation, version 7.7
- * IK Analyzer release 7.7
+ * IK Chinese word segmentation, version 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * Copyright notice 2012, Oolong Studio
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * version 7.7 updated by Magese (magese@live.cn)
- * release 7.7 update by Magese(magese@live.cn)
+ * version 8.5.0 updated by Magese (magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.dic;

+import org.wltea.analyzer.cfg.Configuration;
+import org.wltea.analyzer.cfg.DefaultConfig;
+
 import java.io.*;
 import java.nio.charset.StandardCharsets;
 import java.util.Collection;
 import java.util.List;
-
-import org.wltea.analyzer.cfg.Configuration;
-import org.wltea.analyzer.cfg.DefaultConfig;

 /**
  * dictionary manager class, singleton pattern
  */
@@ -44,7 +44,7 @@ public class Dictionary {
     /*
      * dictionary singleton instance
      */
-    private static Dictionary singleton;
+    private static volatile Dictionary singleton;

     /*
      * main dictionary object
@@ -63,7 +63,7 @@ public class Dictionary {
     /**
      * configuration object
      */
-    private Configuration cfg;
+    private final Configuration cfg;

     /**
      * private constructor; prevents external instantiation of this class
@@ -226,31 +226,25 @@ public class Dictionary {
     private void loadMainDict() {
         // create the main dictionary instance
         _MainDict = new DictSegment((char) 0);
-        // read the main dictionary file
-        InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getMainDictionary());
-        if (is == null) {
-            throw new RuntimeException("Main Dictionary not found!!!");
-        }
-
-        try {
-            BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 512);
-            String theWord;
-            do {
-                theWord = br.readLine();
-                if (theWord != null && !"".equals(theWord.trim())) {
-                    _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
-                }
-            } while (theWord != null);
-
-        } catch (IOException ioe) {
-            System.err.println("Main Dictionary loading exception.");
-            ioe.printStackTrace();
-
-        } finally {
+        // check whether the main dictionary should be loaded
+        if (cfg.useMainDict()) {
+            // read the main dictionary file
+            InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getMainDictionary());
+            if (is == null) {
+                throw new RuntimeException("Main Dictionary not found!!!");
+            }
             try {
-                is.close();
-            } catch (IOException e) {
-                e.printStackTrace();
+                readDict(is, _MainDict);
+            } catch (IOException ioe) {
+                System.err.println("Main Dictionary loading exception.");
+                ioe.printStackTrace();
+            } finally {
+                try {
+                    is.close();
+                } catch (IOException e) {
+                    e.printStackTrace();
+                }
             }
         }
         // load the extension dictionaries
@@ -274,17 +268,7 @@ public class Dictionary {
             continue;
         }
         try {
-            BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 512);
-            String theWord;
-            do {
-                theWord = br.readLine();
-                if (theWord != null && !"".equals(theWord.trim())) {
-                    // load extension dictionary entries into the in-memory main dictionary
-                    // System.out.println(theWord);
-                    _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
-                }
-            } while (theWord != null);
-
+            readDict(is, _MainDict);
         } catch (IOException ioe) {
             System.err.println("Extension Dictionary loading exception.");
             ioe.printStackTrace();
@@ -319,17 +303,7 @@ public class Dictionary {
             continue;
         }
         try {
-            BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 512);
-            String theWord;
-            do {
-                theWord = br.readLine();
-                if (theWord != null && !"".equals(theWord.trim())) {
-                    // System.out.println(theWord);
-                    // load extension stopword entries into memory
-                    _StopWordDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
-                }
-            } while (theWord != null);
-
+            readDict(is, _StopWordDict);
         } catch (IOException ioe) {
             System.err.println("Extension Stop word Dictionary loading exception.");
             ioe.printStackTrace();
@@ -352,20 +326,12 @@ public class Dictionary {
         // create the quantifier dictionary instance
         _QuantifierDict = new DictSegment((char) 0);
         // read the quantifier dictionary file
-        InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getQuantifierDicionary());
+        InputStream is = this.getClass().getClassLoader().getResourceAsStream(cfg.getQuantifierDictionary());
         if (is == null) {
             throw new RuntimeException("Quantifier Dictionary not found!!!");
         }
         try {
-            BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 512);
-            String theWord;
-            do {
-                theWord = br.readLine();
-                if (theWord != null && !"".equals(theWord.trim())) {
-                    _QuantifierDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
-                }
-            } while (theWord != null);
-
+            readDict(is, _QuantifierDict);
         } catch (IOException ioe) {
             System.err.println("Quantifier Dictionary loading exception.");
             ioe.printStackTrace();
@@ -379,4 +345,21 @@ public class Dictionary {
         }
     }
+
+    /**
+     * read a dictionary file into the dictionary trie
+     *
+     * @param is          file input stream
+     * @param dictSegment dictionary trie segment
+     * @throws IOException read error
+     */
+    private void readDict(InputStream is, DictSegment dictSegment) throws IOException {
+        BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 512);
+        String theWord;
+        do {
+            theWord = br.readLine();
+            if (theWord != null && !"".equals(theWord.trim())) {
+                dictSegment.fillSegment(theWord.trim().toLowerCase().toCharArray());
+            }
+        } while (theWord != null);
+    }
 }
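The refactor collapses three copies of the read loop into the single `readDict` helper above: read a UTF-8 word list line by line, trim, lowercase, and skip blanks. A self-contained sketch of the same loop, with a plain `Set` and a hypothetical `DictLoader` class standing in for the DictSegment trie:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Sketch of the readDict loop: one word per line, trimmed, lowercased,
// blank lines skipped. A Set replaces the trie for illustration.
class DictLoader {
    static Set<String> readDict(InputStream is) throws IOException {
        Set<String> dict = new HashSet<>();
        BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8), 512);
        String theWord;
        do {
            theWord = br.readLine();
            if (theWord != null && !"".equals(theWord.trim())) {
                dict.add(theWord.trim().toLowerCase());
            }
        } while (theWord != null);
        return dict;
    }
}
```

Extracting the loop means the main, extension, stopword, and quantifier dictionaries all share one tested code path, and the `getQuantifierDicionary` → `getQuantifierDictionary` spelling fix in the same diff only had to happen at the call site.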
@@ -1,6 +1,6 @@
 /*
- * IK Chinese word segmentation, version 7.7
- * IK Analyzer release 7.7
+ * IK Chinese word segmentation, version 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * Copyright notice 2012, Oolong Studio
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * version 7.7 updated by Magese (magese@live.cn)
- * release 7.7 update by Magese(magese@live.cn)
+ * version 8.5.0 updated by Magese (magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.dic;
@@ -32,24 +32,33 @@ package org.wltea.analyzer.dic;
  */
 @SuppressWarnings("unused")
 public class Hit {
-    // Hit does not match
+    /**
+     * Hit does not match
+     */
     private static final int UNMATCH = 0x00000000;
-    // Hit matches exactly
+    /**
+     * Hit matches exactly
+     */
     private static final int MATCH = 0x00000001;
-    // Hit matches a prefix
+    /**
+     * Hit matches a prefix
+     */
     private static final int PREFIX = 0x00000010;


-    // current state of this Hit; unmatched by default
+    /**
+     * current state of this Hit; unmatched by default
+     */
     private int hitState = UNMATCH;
-    // dictionary branch node currently matched during dictionary matching
+    /**
+     * dictionary branch node currently matched during dictionary matching
+     */
     private DictSegment matchedDictSegment;
-    /*
+    /**
      * start position of the word segment
      */
     private int begin;
-    /*
+    /**
      * end position of the word segment
      */
     private int end;
@@ -86,9 +95,7 @@ public class Hit {
     public boolean isUnmatch() {
         return this.hitState == UNMATCH;
     }
-    /**
-     *
-     */
+
     void setUnmatch() {
         this.hitState = UNMATCH;
     }
@ -1,6 +1,6 @@
|
|||||||
/*
|
/*
|
||||||
* IK 中文分词 版本 7.7
|
* IK 中文分词 版本 8.5.0
|
||||||
* IK Analyzer release 7.7
|
* IK Analyzer release 8.5.0
|
||||||
*
|
*
|
||||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
* contributor license agreements. See the NOTICE file distributed with
|
* contributor license agreements. See the NOTICE file distributed with
|
||||||
@ -21,8 +21,8 @@
|
|||||||
* 版权声明 2012,乌龙茶工作室
|
* 版权声明 2012,乌龙茶工作室
|
||||||
* provided by Linliangyi and copyright 2012 by Oolong studio
|
* provided by Linliangyi and copyright 2012 by Oolong studio
|
||||||
*
|
*
|
||||||
* 7.7版本 由 Magese (magese@live.cn) 更新
|
* 8.5.0版本 由 Magese (magese@live.cn) 更新
|
||||||
* release 7.7 update by Magese(magese@live.cn)
|
* release 8.5.0 update by Magese(magese@live.cn)
|
||||||
*
|
*
|
||||||
*/
|
*/
|
||||||
package org.wltea.analyzer.lucene;
|
package org.wltea.analyzer.lucene;
|
||||||
@ -34,44 +34,40 @@ import org.apache.lucene.analysis.Tokenizer;
|
|||||||
* IK分词器,Lucene Analyzer接口实现
|
* IK分词器,Lucene Analyzer接口实现
|
||||||
*/
|
*/
|
||||||
@SuppressWarnings("unused")
|
@SuppressWarnings("unused")
|
||||||
public final class IKAnalyzer extends Analyzer{
|
public final class IKAnalyzer extends Analyzer {
|
||||||
|
|
||||||
private boolean useSmart;
|
private final boolean useSmart;
|
||||||
|
|
||||||
private boolean useSmart() {
|
private boolean useSmart() {
|
||||||
return useSmart;
|
return useSmart;
|
||||||
}
|
}
|
||||||
|
|
||||||
public void setUseSmart(boolean useSmart) {
|
|
||||||
this.useSmart = useSmart;
|
|
||||||
}
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* IK分词器Lucene Analyzer接口实现类
|
* IK分词器Lucene Analyzer接口实现类
|
||||||
*
|
* 默认细粒度切分算法
|
||||||
* 默认细粒度切分算法
|
*/
|
||||||
*/
|
public IKAnalyzer() {
|
||||||
public IKAnalyzer(){
|
this(false);
|
||||||
this(false);
|
}
|
||||||
}
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* IK分词器Lucene Analyzer接口实现类
|
* IK分词器Lucene Analyzer接口实现类
|
||||||
*
|
*
|
||||||
* @param useSmart 当为true时,分词器进行智能切分
|
* @param useSmart 当为true时,分词器进行智能切分
|
||||||
*/
|
*/
|
||||||
public IKAnalyzer(boolean useSmart){
|
public IKAnalyzer(boolean useSmart) {
|
||||||
super();
|
super();
|
||||||
this.useSmart = useSmart;
|
this.useSmart = useSmart;
|
||||||
}
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* 重载Analyzer接口,构造分词组件
|
* 重载Analyzer接口,构造分词组件
|
||||||
*/
|
*/
|
||||||
@Override
|
@Override
|
||||||
protected TokenStreamComponents createComponents(String fieldName) {
|
protected TokenStreamComponents createComponents(String fieldName) {
|
||||||
Tokenizer _IKTokenizer = new IKTokenizer(this.useSmart());
|
Tokenizer _IKTokenizer = new IKTokenizer(this.useSmart());
|
||||||
return new TokenStreamComponents(_IKTokenizer);
|
return new TokenStreamComponents(_IKTokenizer);
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
@@ -1,6 +1,6 @@
 /*
- * IK Chinese word segmentation, version 7.7
- * IK Analyzer release 7.7
+ * IK Chinese word segmentation, version 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * Copyright notice 2012, Oolong Studio
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * version 7.7 updated by Magese (magese@live.cn)
- * release 7.7 update by Magese(magese@live.cn)
+ * version 8.5.0 updated by Magese (magese@live.cn)
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.lucene;
@@ -39,92 +39,102 @@ import java.io.IOException;

 /**
  * IK analyzer, Lucene Tokenizer adapter class
- * compatible with Lucene 4.0
  */
-@SuppressWarnings("unused")
+@SuppressWarnings({"unused", "FinalMethodInFinalClass"})
 public final class IKTokenizer extends Tokenizer {

-    // IK segmenter implementation
+    /**
+     * IK segmenter implementation
+     */
     private IKSegmenter _IKImplement;

-    // term text attribute
+    /**
+     * term text attribute
+     */
     private CharTermAttribute termAtt;
-    // term offset attribute
+    /**
+     * term offset attribute
+     */
     private OffsetAttribute offsetAtt;
-    // term type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
+    /**
+     * term type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
+     */
     private TypeAttribute typeAtt;
-    // end position of the last lexeme
+    /**
+     * end position of the last lexeme
+     */
     private int endPosition;

     /**
      * Lucene 7.6 Tokenizer adapter constructor
      */
     public IKTokenizer() {
         this(false);
     }

     IKTokenizer(boolean useSmart) {
         super();
         init(useSmart);
     }

     public IKTokenizer(AttributeFactory factory) {
         this(factory, false);
     }

     IKTokenizer(AttributeFactory factory, boolean useSmart) {
         super(factory);
         init(useSmart);
     }

     private void init(boolean useSmart) {
         offsetAtt = addAttribute(OffsetAttribute.class);
         termAtt = addAttribute(CharTermAttribute.class);
         typeAtt = addAttribute(TypeAttribute.class);
-        _IKImplement = new IKSegmenter(input , useSmart);
+        _IKImplement = new IKSegmenter(input, useSmart);
     }

-    /* (non-Javadoc)
+    /*
+     * (non-Javadoc)
      * @see org.apache.lucene.analysis.TokenStream#incrementToken()
      */
     @Override
     public boolean incrementToken() throws IOException {
         // clear all term attributes
         clearAttributes();
         Lexeme nextLexeme = _IKImplement.next();
-        if(nextLexeme != null){
+        if (nextLexeme != null) {
             // convert the Lexeme into Attributes
             // set the term text
             termAtt.append(nextLexeme.getLexemeText());
             // set the term length
             termAtt.setLength(nextLexeme.getLength());
             // set the term offsets
             offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
             // record the end position of this token
             endPosition = nextLexeme.getEndPosition();
             // record the term type
             typeAtt.setType(nextLexeme.getLexemeTypeString());
             // return true to signal another token is available
             return true;
         }
         // return false to signal the token output is finished
         return false;
     }

     /*
      * (non-Javadoc)
      * @see org.apache.lucene.analysis.Tokenizer#reset(java.io.Reader)
      */
     @Override
     public void reset() throws IOException {
         super.reset();
         _IKImplement.reset(input);
     }

     @Override
     public final void end() {
         // set final offset
         int finalOffset = correctOffset(this.endPosition);
         offsetAtt.setOffset(finalOffset, finalOffset);
     }
 }
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.lucene;
@@ -44,6 +44,8 @@ import java.nio.charset.StandardCharsets;
 import java.util.*;
 
 /**
+ * 分词器工厂类
+ *
  * @author <a href="magese@live.cn">Magese</a>
  */
 public class IKTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware, UpdateThread.UpdateJob {
@@ -74,7 +76,7 @@ public class IKTokenizerFactory extends TokenizerFactory implements ResourceLoad
      */
     @Override
     public void inform(ResourceLoader resourceLoader) throws IOException {
-        System.out.println(String.format("IKTokenizerFactory " + this.hashCode() + " inform conf: %s", getConf()));
+        System.out.printf("IKTokenizerFactory " + this.hashCode() + " inform conf: %s%n", getConf());
         this.loader = resourceLoader;
         update();
         if ((getConf() != null) && (!getConf().trim().isEmpty())) {
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.lucene;
@@ -35,7 +35,7 @@ import java.util.Vector;
  */
 public class UpdateThread implements Runnable {
     private static final long INTERVAL = 30000L; // 循环等待时间
-    private Vector<UpdateJob> filterFactorys; // 更新任务集合
+    private final Vector<UpdateJob> filterFactorys; // 更新任务集合
 
     /**
      * 私有化构造器,阻止外部进行实例化
@@ -51,7 +51,7 @@ public class UpdateThread implements Runnable {
      * 静态内部类,实现线程安全单例模式
      */
     private static class Builder {
-        private static UpdateThread singleton = new UpdateThread();
+        private static final UpdateThread singleton = new UpdateThread();
     }
 
     /**
@@ -81,6 +81,7 @@ public class UpdateThread implements Runnable {
             //noinspection InfiniteLoopStatement
             while (true) {
                 try {
+                    //noinspection BusyWait
                     Thread.sleep(INTERVAL);
                 } catch (InterruptedException e) {
                     e.printStackTrace();
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.query;
@@ -46,11 +46,11 @@ import java.util.Stack;
 public class IKQueryExpressionParser {
 
-    private List<Element> elements = new ArrayList<>();
+    private final List<Element> elements = new ArrayList<>();
 
-    private Stack<Query> querys = new Stack<>();
+    private final Stack<Query> querys = new Stack<>();
 
-    private Stack<Element> operates = new Stack<>();
+    private final Stack<Element> operates = new Stack<>();
 
     /**
      * 解析查询表达式,生成Lucene Query对象
@@ -61,9 +61,9 @@ public class IKQueryExpressionParser {
         Query lucenceQuery = null;
         if (expression != null && !"".equals(expression.trim())) {
             try {
-                //文法解析
+                // 文法解析
                 this.splitElements(expression);
-                //语法解析
+                // 语法解析
                 this.parseSyntax();
                 if (this.querys.size() == 1) {
                     lucenceQuery = this.querys.pop();
@@ -87,263 +87,263 @@ public class IKQueryExpressionParser {
         if (expression == null) {
             return;
         }
-        Element curretElement = null;
+        Element currentElement = null;
 
         char[] expChars = expression.toCharArray();
         for (char expChar : expChars) {
             switch (expChar) {
                 case '&':
-                    if (curretElement == null) {
-                        curretElement = new Element();
-                        curretElement.type = '&';
-                        curretElement.append(expChar);
-                    } else if (curretElement.type == '&') {
-                        curretElement.append(expChar);
-                        this.elements.add(curretElement);
-                        curretElement = null;
-                    } else if (curretElement.type == '\'') {
-                        curretElement.append(expChar);
+                    if (currentElement == null) {
+                        currentElement = new Element();
+                        currentElement.type = '&';
+                        currentElement.append(expChar);
+                    } else if (currentElement.type == '&') {
+                        currentElement.append(expChar);
+                        this.elements.add(currentElement);
+                        currentElement = null;
+                    } else if (currentElement.type == '\'') {
+                        currentElement.append(expChar);
                     } else {
-                        this.elements.add(curretElement);
-                        curretElement = new Element();
-                        curretElement.type = '&';
-                        curretElement.append(expChar);
+                        this.elements.add(currentElement);
+                        currentElement = new Element();
+                        currentElement.type = '&';
+                        currentElement.append(expChar);
                     }
                     break;
 
                 case '|':
-                    if (curretElement == null) {
-                        curretElement = new Element();
-                        curretElement.type = '|';
-                        curretElement.append(expChar);
-                    } else if (curretElement.type == '|') {
-                        curretElement.append(expChar);
-                        this.elements.add(curretElement);
-                        curretElement = null;
-                    } else if (curretElement.type == '\'') {
-                        curretElement.append(expChar);
+                    if (currentElement == null) {
+                        currentElement = new Element();
+                        currentElement.type = '|';
+                        currentElement.append(expChar);
+                    } else if (currentElement.type == '|') {
+                        currentElement.append(expChar);
+                        this.elements.add(currentElement);
+                        currentElement = null;
+                    } else if (currentElement.type == '\'') {
+                        currentElement.append(expChar);
                     } else {
-                        this.elements.add(curretElement);
-                        curretElement = new Element();
-                        curretElement.type = '|';
-                        curretElement.append(expChar);
+                        this.elements.add(currentElement);
+                        currentElement = new Element();
+                        currentElement.type = '|';
+                        currentElement.append(expChar);
                     }
                     break;
 
                 case '-':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = '-';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = '-';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case '(':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = '(';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = '(';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case ')':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = ')';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = ')';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case ':':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = ':';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = ':';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case '=':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = '=';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = '=';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case ' ':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                         } else {
-                            this.elements.add(curretElement);
-                            curretElement = null;
+                            this.elements.add(currentElement);
+                            currentElement = null;
                         }
                     }
 
                     break;
 
                 case '\'':
-                    if (curretElement == null) {
-                        curretElement = new Element();
-                        curretElement.type = '\'';
+                    if (currentElement == null) {
+                        currentElement = new Element();
+                        currentElement.type = '\'';
 
-                    } else if (curretElement.type == '\'') {
-                        this.elements.add(curretElement);
-                        curretElement = null;
+                    } else if (currentElement.type == '\'') {
+                        this.elements.add(currentElement);
+                        currentElement = null;
 
                     } else {
-                        this.elements.add(curretElement);
-                        curretElement = new Element();
-                        curretElement.type = '\'';
+                        this.elements.add(currentElement);
+                        currentElement = new Element();
+                        currentElement.type = '\'';
 
                     }
                     break;
 
                 case '[':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = '[';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = '[';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case ']':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = ']';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = ']';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
 
                     break;
 
                 case '{':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = '{';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = '{';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
                     break;
 
                 case '}':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = '}';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = '}';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
 
                     break;
                 case ',':
-                    if (curretElement != null) {
-                        if (curretElement.type == '\'') {
-                            curretElement.append(expChar);
+                    if (currentElement != null) {
+                        if (currentElement.type == '\'') {
+                            currentElement.append(expChar);
                             continue;
                         } else {
-                            this.elements.add(curretElement);
+                            this.elements.add(currentElement);
                         }
                     }
-                    curretElement = new Element();
-                    curretElement.type = ',';
-                    curretElement.append(expChar);
-                    this.elements.add(curretElement);
-                    curretElement = null;
+                    currentElement = new Element();
+                    currentElement.type = ',';
+                    currentElement.append(expChar);
+                    this.elements.add(currentElement);
+                    currentElement = null;
 
                     break;
 
                 default:
-                    if (curretElement == null) {
-                        curretElement = new Element();
-                        curretElement.type = 'F';
-                        curretElement.append(expChar);
+                    if (currentElement == null) {
+                        currentElement = new Element();
+                        currentElement.type = 'F';
+                        currentElement.append(expChar);
 
-                    } else if (curretElement.type == 'F') {
-                        curretElement.append(expChar);
+                    } else if (currentElement.type == 'F') {
+                        currentElement.append(expChar);
 
-                    } else if (curretElement.type == '\'') {
-                        curretElement.append(expChar);
+                    } else if (currentElement.type == '\'') {
+                        currentElement.append(expChar);
 
                     } else {
-                        this.elements.add(curretElement);
-                        curretElement = new Element();
-                        curretElement.type = 'F';
-                        curretElement.append(expChar);
+                        this.elements.add(currentElement);
+                        currentElement = new Element();
+                        currentElement.type = 'F';
+                        currentElement.append(expChar);
                     }
             }
         }
 
-        if (curretElement != null) {
-            this.elements.add(curretElement);
+        if (currentElement != null) {
+            this.elements.add(currentElement);
         }
     }
 
@@ -359,7 +359,7 @@ public class IKQueryExpressionParser {
             throw new IllegalStateException("表达式异常: = 或 : 号丢失");
         }
         Element e3 = this.elements.get(i + 2);
-        //处理 = 和 : 运算
+        // 处理 = 和 : 运算
         if ('\'' == e3.type) {
             i += 2;
             if ('=' == e2.type) {
@@ -367,14 +367,14 @@ public class IKQueryExpressionParser {
                 this.querys.push(tQuery);
             } else {
                 String keyword = e3.toString();
-                //SWMCQuery Here
+                // SWMCQuery Here
                 Query _SWMCQuery = SWMCQueryBuilder.create(e.toString(), keyword);
                 this.querys.push(_SWMCQuery);
             }
 
         } else if ('[' == e3.type || '{' == e3.type) {
             i += 2;
-            //处理 [] 和 {}
+            // 处理 [] 和 {}
             LinkedList<Element> eQueue = new LinkedList<>();
             eQueue.add(e3);
             for (i++; i < this.elements.size(); i++) {
@@ -384,7 +384,7 @@ public class IKQueryExpressionParser {
                     break;
                 }
             }
-            //翻译RangeQuery
+            // 翻译RangeQuery
             Query rangeQuery = this.toTermRangeQuery(e, eQueue);
             this.querys.push(rangeQuery);
         } else {
@@ -475,10 +475,10 @@ public class IKQueryExpressionParser {
             }
 
         } else {
-            //q1 instanceof TermQuery
-            //q1 instanceof TermRangeQuery
-            //q1 instanceof PhraseQuery
-            //others
+            // q1 instanceof TermQuery
+            // q1 instanceof TermRangeQuery
+            // q1 instanceof PhraseQuery
+            // others
             resultQuery.add(q1, Occur.MUST);
         }
     }
@@ -496,10 +496,10 @@ public class IKQueryExpressionParser {
             }
 
         } else {
-            //q1 instanceof TermQuery
-            //q1 instanceof TermRangeQuery
-            //q1 instanceof PhraseQuery
-            //others
+            // q1 instanceof TermQuery
+            // q1 instanceof TermRangeQuery
+            // q1 instanceof PhraseQuery
+            // others
             resultQuery.add(q2, Occur.MUST);
         }
     }
@@ -518,10 +518,10 @@ public class IKQueryExpressionParser {
             }
 
         } else {
-            //q1 instanceof TermQuery
-            //q1 instanceof TermRangeQuery
-            //q1 instanceof PhraseQuery
-            //others
+            // q1 instanceof TermQuery
+            // q1 instanceof TermRangeQuery
+            // q1 instanceof PhraseQuery
+            // others
             resultQuery.add(q1, Occur.SHOULD);
         }
     }
@@ -538,10 +538,10 @@ public class IKQueryExpressionParser {
                 resultQuery.add(q2, Occur.SHOULD);
             }
         } else {
-            //q2 instanceof TermQuery
-            //q2 instanceof TermRangeQuery
-            //q2 instanceof PhraseQuery
-            //others
+            // q2 instanceof TermQuery
+            // q2 instanceof TermRangeQuery
+            // q2 instanceof PhraseQuery
+            // others
             resultQuery.add(q2, Occur.SHOULD);
 
         }
@@ -563,10 +563,10 @@ public class IKQueryExpressionParser {
             }
 
         } else {
-            //q1 instanceof TermQuery
-            //q1 instanceof TermRangeQuery
-            //q1 instanceof PhraseQuery
-            //others
+            // q1 instanceof TermQuery
+            // q1 instanceof TermRangeQuery
+            // q1 instanceof PhraseQuery
+            // others
             resultQuery.add(q1, Occur.MUST);
         }
 
@@ -584,7 +584,7 @@ public class IKQueryExpressionParser {
         boolean includeLast;
         String firstValue;
         String lastValue = null;
-        //检查第一个元素是否是[或者{
+        // 检查第一个元素是否是[或者{
         Element first = elements.getFirst();
         if ('[' == first.type) {
             includeFirst = true;
@@ -593,7 +593,7 @@ public class IKQueryExpressionParser {
         } else {
             throw new IllegalStateException("表达式异常");
         }
-        //检查最后一个元素是否是]或者}
+        // 检查最后一个元素是否是]或者}
         Element last = elements.getLast();
         if (']' == last.type) {
             includeLast = true;
@@ -605,7 +605,7 @@ public class IKQueryExpressionParser {
         if (elements.size() < 4 || elements.size() > 5) {
             throw new IllegalStateException("表达式异常, RangeQuery 错误");
         }
-        //读出中间部分
+        // 读出中间部分
         Element e2 = elements.get(1);
         if ('\'' == e2.type) {
             firstValue = e2.toString();
@@ -673,7 +673,7 @@ public class IKQueryExpressionParser {
      * @author linliangyi
      * May 20, 2010
      */
-    private class Element {
+    private static class Element {
         char type = 0;
         StringBuffer eleTextBuff;
 
@@ -692,11 +692,9 @@ public class IKQueryExpressionParser {
 
     public static void main(String[] args) {
         IKQueryExpressionParser parser = new IKQueryExpressionParser();
-        //String ikQueryExp = "newsTitle:'的两款《魔兽世界》插件Bigfoot和月光宝盒'";
         String ikQueryExp = "(id='ABcdRf' && date:{'20010101','20110101'} && keyword:'魔兽中国') || (content:'KSHT-KSH-A001-18' || ulr='www.ik.com') - name:'林良益'";
         Query result = parser.parseExp(ikQueryExp);
         System.out.println(result);
 
     }
 
 }
@@ -1,6 +1,6 @@
 /*
- * IK 中文分词 版本 7.7
- * IK Analyzer release 7.7
+ * IK 中文分词 版本 8.5.0
+ * IK Analyzer release 8.5.0
  *
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,8 +21,8 @@
  * 版权声明 2012,乌龙茶工作室
  * provided by Linliangyi and copyright 2012 by Oolong studio
  *
- * 7.7版本 由 Magese (magese@live.cn) 更新
- * release 7.7 update by Magese(magese@live.cn)
+ * 8.5.0版本 由 Magese (magese@live.cn) 更新
+ * release 8.5.0 update by Magese(magese@live.cn)
  *
  */
 package org.wltea.analyzer.query;
@@ -45,6 +45,7 @@ import java.util.List;
  *
  * @author linliangyi
  */
+@SuppressWarnings("unused")
 class SWMCQueryBuilder {
 
     /**
@@ -56,9 +57,9 @@ class SWMCQueryBuilder {
         if (fieldName == null || keywords == null) {
             throw new IllegalArgumentException("参数 fieldName 、 keywords 不能为null.");
         }
-        //1.对keywords进行分词处理
+        // 1.对keywords进行分词处理
         List<Lexeme> lexemes = doAnalyze(keywords);
-        //2.根据分词结果,生成SWMCQuery
+        // 2.根据分词结果,生成SWMCQuery
         return getSWMCQuery(fieldName, lexemes);
     }
 
@@ -84,20 +85,20 @@ class SWMCQueryBuilder {
      * 根据分词结果生成SWMC搜索
      */
     private static Query getSWMCQuery(String fieldName, List<Lexeme> lexemes) {
-        //构造SWMC的查询表达式
+        // 构造SWMC的查询表达式
         StringBuilder keywordBuffer = new StringBuilder();
-        //精简的SWMC的查询表达式
+        // 精简的SWMC的查询表达式
         StringBuilder keywordBuffer_Short = new StringBuilder();
-        //记录最后词元长度
+        // 记录最后词元长度
         int lastLexemeLength = 0;
-        //记录最后词元结束位置
+        // 记录最后词元结束位置
         int lastLexemeEnd = -1;
 
         int shortCount = 0;
         int totalCount = 0;
         for (Lexeme l : lexemes) {
             totalCount += l.getLength();
-            //精简表达式
+            // 精简表达式
             if (l.getLength() > 1) {
                 keywordBuffer_Short.append(' ').append(l.getLexemeText());
                 shortCount += l.getLength();
@@ -106,7 +107,7 @@ class SWMCQueryBuilder {
             if (lastLexemeLength == 0) {
                 keywordBuffer.append(l.getLexemeText());
             } else if (lastLexemeLength == 1 && l.getLength() == 1
-                    && lastLexemeEnd == l.getBeginPosition()) {//单字位置相邻,长度为一,合并)
+                    && lastLexemeEnd == l.getBeginPosition()) {// 单字位置相邻,长度为一,合并)
                 keywordBuffer.append(l.getLexemeText());
             } else {
                 keywordBuffer.append(' ').append(l.getLexemeText());
@@ -116,10 +117,10 @@ class SWMCQueryBuilder {
             lastLexemeEnd = l.getEndPosition();
         }
 
-        //借助lucene queryparser 生成SWMC Query
+        // 借助lucene queryparser 生成SWMC Query
         QueryParser qp = new QueryParser(fieldName, new StandardAnalyzer());
+        qp.setAutoGeneratePhraseQueries(false);
         qp.setDefaultOperator(QueryParser.AND_OPERATOR);
-        qp.setAutoGeneratePhraseQueries(true);
 
         if ((shortCount * 1.0f / totalCount) > 0.5f) {
             try {
@@ -1,86 +0,0 @@ (IKAnalzyerDemo.java removed; its former content follows)
/*
 * IK Chinese word segmentation, version 7.7
 * IK Analyzer release 7.7
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 * Release 7.7 updated by Magese (magese@live.cn)
 * release 7.7 update by Magese(magese@live.cn)
 *
 */
package org.wltea.analyzer.sample;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;
import java.io.StringReader;

/**
 * Demo of word segmentation with IKAnalyzer.
 * 2012-10-22
 */
public class IKAnalzyerDemo {

    public static void main(String[] args) {
        // Build the IK analyzer in smart segmentation mode
        Analyzer analyzer = new IKAnalyzer(true);

        // Obtain a Lucene TokenStream
        TokenStream ts = null;
        try {
            ts = analyzer.tokenStream("myfield", new StringReader("这是一个中文分词的例子,你可以直接运行它!IKAnalyer can analysis english text too"));
            // Token offset attribute
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            // Token text attribute
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            // Token type attribute
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);

            // Reset the TokenStream (resets the StringReader)
            ts.reset();
            // Iterate over the segmentation results
            while (ts.incrementToken()) {
                System.out.println(offset.startOffset() + " - " + offset.endOffset() + " : " + term.toString() + " | " + type.type());
            }
            // End the TokenStream (closes the StringReader)
            ts.end(); // Perform end-of-stream operations, e.g. set the final offset.

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release all of the TokenStream's resources
            if (ts != null) {
                try {
                    ts.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
@@ -1,137 +0,0 @@ (LuceneIndexAndSearchDemo.java removed; its former content follows)
/*
 * IK Chinese word segmentation, version 7.7
 * IK Analyzer release 7.7
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Lin Liangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 * provided by Linliangyi and copyright 2012 by Oolong studio
 *
 * Release 7.7 updated by Magese (magese@live.cn)
 * release 7.7 update by Magese(magese@live.cn)
 *
 */
package org.wltea.analyzer.sample;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

import java.io.IOException;

/**
 * Demo of Lucene indexing and search with IKAnalyzer.
 * 2012-3-2
 * <p>
 * Written against the Lucene 4.0 API.
 */
public class LuceneIndexAndSearchDemo {

    /**
     * Simulation:
     * build an index containing a single record, then search it.
     */
    public static void main(String[] args) {
        // Field name of the Lucene Document
        String fieldName = "text";
        // Content to index
        String text = "IK Analyzer是一个结合词典分词和文法分词的中文分词开源工具包。它使用了全新的正向迭代最细粒度切分算法。";

        // Instantiate the IKAnalyzer in smart mode
        Analyzer analyzer = new IKAnalyzer(true);

        Directory directory = null;
        IndexWriter iwriter;
        IndexReader ireader = null;
        IndexSearcher isearcher;
        try {
            // Create the in-memory index
            directory = new RAMDirectory();

            // Configure the IndexWriterConfig
            IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
            iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
            iwriter = new IndexWriter(directory, iwConfig);
            // Write the index
            Document doc = new Document();
            doc.add(new StringField("ID", "10000", Field.Store.YES));
            doc.add(new TextField(fieldName, text, Field.Store.YES));
            iwriter.addDocument(doc);
            iwriter.close();

            // Search phase **********************************
            // Instantiate the searcher
            ireader = DirectoryReader.open(directory);
            isearcher = new IndexSearcher(ireader);

            String keyword = "中文分词工具包";
            // Build the Query object with the QueryParser
            QueryParser qp = new QueryParser(fieldName, analyzer);
            qp.setDefaultOperator(QueryParser.AND_OPERATOR);
            Query query = qp.parse(keyword);
            System.out.println("Query = " + query);

            // Retrieve the 5 highest-scoring records
            TopDocs topDocs = isearcher.search(query, 5);
            System.out.println("命中:" + topDocs.totalHits);
            // Print the results
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (int i = 0; i < topDocs.totalHits; i++) {
                Document targetDoc = isearcher.doc(scoreDocs[i].doc);
                System.out.println("内容:" + targetDoc.toString());
            }

        } catch (ParseException | IOException e) {
            e.printStackTrace();
        } finally {
            if (ireader != null) {
                try {
                    ireader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (directory != null) {
                try {
                    directory.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
@@ -1,11 +1,11 @@ IKAnalyzer.cfg.xml
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
 <properties>
     <comment>IK Analyzer extension configuration</comment>
-    <!-- Users can configure their own extension dictionaries here -->
+    <!-- Whether to load the bundled main dictionary -->
+    <entry key="use_main_dict">true</entry>
+    <!-- Your own extension dictionaries, multiple files separated by ';' -->
     <entry key="ext_dict">ext.dic;</entry>
-
-    <!-- Users can configure their own extension stop-word dictionary here -->
+    <!-- Your own extension stop-word dictionaries, multiple files separated by ';' -->
     <entry key="ext_stopwords">stopword.dic;</entry>
-
 </properties>
(File diff suppressed because it is too large.)
@@ -1,3 +1,3 @@
-Wed Aug 01 11:21:30 CST 2018
+Wed Aug 01 00:00:00 CST 2021
 files=dynamicdic.txt
 lastupdate=0
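The state file above is in plain `java.util.Properties` key=value format. A minimal sketch of reading it — the class name is invented for illustration, and the meaning of `lastupdate` as a reload timestamp (0 = never reloaded) is an assumption, not something the diff confirms:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class DynamicDicState {

    // Parses key=value state text like the file shown above.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties p = parse("files=dynamicdic.txt\nlastupdate=0\n");
        // 'files' names the dynamic dictionary file(s); 'lastupdate' is assumed
        // to be a reload timestamp, with 0 meaning "never reloaded".
        System.out.println("files = " + p.getProperty("files"));
        System.out.println("lastupdate = " + Long.parseLong(p.getProperty("lastupdate", "0")));
    }
}
```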