[Original] Chinese Text Classification with fastText (5) – 编码无悔 / Intent & Focused

For the full collection of articles in this series, see here.

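The model training and prediction described earlier were done with the fastText executable. fastText also provides a Python interface, so the same functionality can be implemented in Python. With a small amount of data, doing text classification on a single machine is no problem. My data set is fairly large, though: tens of GB of text. Loading the model and running prediction on a single machine uses too many resources, and it is slow.
Parallel work like this is best handed to a Map-Reduce job. However, installing the fastText Python package on the Hadoop cluster was not an option, so I had to find a way to load a fastText model from Java and run predictions in parallel inside an M-R job.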

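✓ Choosing a library
A search turns up quite a few "Java versions" of fastText. JFastText, for example, is a Java wrapper around fastText, while FastText4j is a fastText implementation written entirely in Kotlin & Java. There are others, which I did not look into.
Here is how FastText4j describes itself:

● Implemented 100% in Kotlin & Java

● A clean API

● Compatible with the official pre-trained models

● Provides all the APIs, including train, test, and so on

● Supports its own model storage format, which can use MMAP to load large models quickly

I was sold, and tried it right away.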

✓ Adding the FastText4j dependency to a Maven project

<dependency>
  <groupId>com.mayabot.mynlp</groupId>
  <artifactId>fastText4j</artifactId>
  <version>3.1.7</version>
</dependency>

Now it can be used in code.
Source: https://www.codelast.com/
✓ Training a model

// Train a text classification model with FastText4j and save it as a single file
File trainFile = new File("/home/codelast/labeled-data_train");
InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.softmax);
inputArgs.setLr(1.0);
inputArgs.setEpoch(25);
inputArgs.setWordNgrams(2);

FastText model = FastText.trainSupervised(trainFile, inputArgs);
model.saveModelToSingleFile(new File("/home/codelast/model"));

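The training parameters, including lr, epoch, and wordNgrams, mean the same as in the original fastText. Unlike fastText, which by default produces two model files (.bin and .vec), FastText4j can write a single model file with saveModelToSingleFile(). If you call saveModel() instead, it writes four files into a directory (and all four are required when loading a model saved this way):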

args.bin

dict.bin

input.matrix

output.matrix

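If you want to load the model in a Java Map-Reduce job and ship it to the tasks through the distributed cache, a single file is by far the most convenient. So I strongly recommend saving the model as a single file.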
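The reason this matters is that each map task should load the model once and then reuse it for every record. Below is a minimal sketch of that load-once, predict-per-record lifecycle using only the Java standard library. It is not Hadoop code: `ClassifyTask`, `setup()`, `map()`, and `run()` are hypothetical names that only imitate a mapper's lifecycle, and the "model" is a plain `Function` standing in for a loaded FastText4j model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a mapper: setup() runs once per task,
// map() runs once per input record.
class ClassifyTask {
    // Counts how many times the (expensive) model load actually happened.
    int loadCount = 0;

    private final Supplier<Function<List<String>, String>> modelLoader;
    private Function<List<String>, String> model;

    ClassifyTask(Supplier<Function<List<String>, String>> modelLoader) {
        this.modelLoader = modelLoader;
    }

    // Load the model once, before any record is processed.
    void setup() {
        model = modelLoader.get();
        loadCount++;
    }

    // Classify one pre-segmented, space-separated line; emit "label<TAB>text".
    String map(String line) {
        List<String> tokens = Arrays.asList(line.trim().split("\\s+"));
        return model.apply(tokens) + "\t" + line;
    }

    // Drive the task over a batch of records, the way a framework would.
    List<String> run(List<String> lines) {
        setup();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(map(line));
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy "model": a placeholder rule, NOT a real classifier.
        ClassifyTask task = new ClassifyTask(
            () -> tokens -> tokens.contains("足球") ? "__label__体育" : "__label__社会");
        for (String row : task.run(Arrays.asList("昨晚 足球 比赛", "全民 节能 意识"))) {
            System.out.println(row);
        }
    }
}
```

In a real job the supplier would presumably read the single model file from the task's local copy of the distributed cache, which is why one file is easier to handle than a directory of four.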
✓ Loading the model and testing it

// Load the model
FastText model = FastText.Companion.loadModelFromSingleFile(new File("/home/codelast/model"));
System.out.println("load model done, will do test...");
// Evaluate the model on the validation set
model.test(new File("/home/codelast/labeled-data_valid"), 1, 0, true);

Output:

load model done, will do test...

F1-Score : 0.953652 Precision : 0.949348 Recall : 0.957996 __label__娱乐

F1-Score : 0.704064 Precision : 0.702055 Recall : 0.706085 __label__社会

F1-Score : 0.929426 Precision : 0.917355 Recall : 0.941818 __label__历史

F1-Score : 0.784775 Precision : 0.784232 Recall : 0.785319 __label__时政

F1-Score : 0.969314 Precision : 0.967568 Recall : 0.971067 __label__汽车

F1-Score : 0.910314 Precision : 0.914414 Recall : 0.906250 __label__时尚

F1-Score : 0.899281 Precision : 0.903614 Recall : 0.894988 __label__健康

F1-Score : 0.929919 Precision : 0.905512 Recall : 0.955679 __label__美食

F1-Score : 0.908136 Precision : 0.894057 Recall : 0.922667 __label__军事

F1-Score : 0.967391 Precision : 0.975342 Recall : 0.959569 __label__体育

F1-Score : 0.907618 Precision : 0.915033 Recall : 0.900322 __label__育儿

F1-Score : 0.782895 Precision : 0.760383 Recall : 0.806780 __label__情感

F1-Score : 0.863946 Precision : 0.866894 Recall : 0.861017 __label__财经

F1-Score : 0.905188 Precision : 0.920000 Recall : 0.890845 __label__教育

F1-Score : 0.781431 Precision : 0.792157 Recall : 0.770992 __label__文化

F1-Score : 0.892495 Precision : 0.894309 Recall : 0.890688 __label__游戏

F1-Score : 0.830882 Precision : 0.801418 Recall : 0.862595 __label__科技

F1-Score : 0.795455 Precision : 0.781250 Recall : 0.810185 __label__旅游

F1-Score : 0.843537 Precision : 0.826667 Recall : 0.861111 __label__动漫

F1-Score : 0.960961 Precision : 0.969697 Recall : 0.952381 __label__占卜

F1-Score : 0.915361 Precision : 0.912500 Recall : 0.918239 __label__数码

F1-Score : 0.553191 Precision : 0.601852 Recall : 0.511811 __label__搞笑

F1-Score : 0.788104 Precision : 0.834646 Recall : 0.746479 __label__农林牧副渔

F1-Score : 0.797048 Precision : 0.830769 Recall : 0.765957 __label__科学

F1-Score : 0.788462 Precision : 0.828283 Recall : 0.752294 __label__家居

F1-Score : 0.831579 Precision : 0.877778 Recall : 0.790000 __label__房产

F1-Score : 0.674286 Precision : 0.710843 Recall : 0.641304 __label__生活方式

F1-Score : 0.908108 Precision : 0.933333 Recall : 0.884211 __label__宠物

F1-Score : 0.546667 Precision : 0.546667 Recall : 0.546667 __label__宗教

F1-Score : 0.706767 Precision : 0.671429 Recall : 0.746032 __label__职场

F1-Score : 0.951220 Precision : 0.928571 Recall : 0.975000 __label__天气

F1-Score : 0.666667 Precision : 0.909091 Recall : 0.526316 __label__摄影

F1-Score : 0.707692 Precision : 0.718750 Recall : 0.696970 __label__法律

F1-Score : 0.750000 Precision : 1.000000 Recall : 0.600000 __label__彩票

F1-Score : 0.333333 Precision : 1.000000 Recall : 0.200000 __label__移民

F1-Score : 0.000000 Precision : -------- Recall : 0.000000 __label__生活百科

N 10703

P@1 0.870

R@1 0.870
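As a sanity check on the output above, the F1-Score column can be recomputed from the Precision and Recall columns using the standard definition F1 = 2PR / (P + R). The sketch below is my own helper, not part of FastText4j; the class and method names are hypothetical, and returning 0 when both inputs are 0 is an assumption about how to handle that edge case.

```java
class F1Check {
    // Harmonic mean of precision and recall; treated as 0 when both are 0.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Precision/recall pairs copied from the test output.
        System.out.printf("%.6f%n", f1(0.949348, 0.957996)); // __label__娱乐
        System.out.printf("%.6f%n", f1(1.000000, 0.200000)); // __label__移民
    }
}
```

To within rounding, these two calls should reproduce the reported F1 values of 0.953652 and 0.333333. The second case shows why a label with perfect precision can still score poorly: recall of 0.2 drags the harmonic mean down.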

✓ Predicting the label of a piece of text

// Predict the label of an already word-segmented string
List<ScoreLabelPair> result = model.predict(Arrays.asList("人民网 辽宁 频道 人民网 沈阳 月 10 日电 日前 进一步 增强 全民 节能 意识".split(" ")), 1, 0);
System.out.println(result.get(0).getLabel());

Output:

__label__社会

Note that the text here should already be word-segmented, space-separated, and cleaned.
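A plain split(" ") leaves empty tokens behind when the input has leading, trailing, or repeated spaces. Here is a small sketch of a more forgiving way to turn a segmented line into a token list. `TokenInput` and `toTokens` are hypothetical names of my own, and the helper only handles ASCII whitespace; it does not do word segmentation or cleaning.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TokenInput {
    // Split an already word-segmented line on runs of ASCII whitespace,
    // ignoring leading and trailing whitespace.
    static List<String> toTokens(String segmented) {
        String trimmed = segmented.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Extra spaces and a tab do not produce empty tokens.
        System.out.println(toTokens("  人民网  辽宁\t频道 "));
    }
}
```

The resulting list has the same shape as the one passed to predict() in the example, so it could be used in place of the inline Arrays.asList(... .split(" ")) call.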
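✓ Compressing the model
If a model file is too large, it may not fit in the distributed cache, so model compression is a very useful feature. Take my model as an example: it is close to 900 MB, and after compression it shrinks to a little over 100 MB, while Precision and Recall barely get worse. Well worth it.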

// Compress the model and save it. Loading a compressed model saves memory
FastText qmodel = model.quantize(2, false, false);
qmodel.saveModelToSingleFile(new File("/home/codelast/model_compressed"));

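Saving the compressed model is a one-time operation; from then on, load this compressed model instead.

✓ Afterword
By using FastText4j to run text classification in parallel inside a Map-Reduce job, I sped the classification task up many times over, to the point where it is practical to use.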

➤➤ Copyright notice ➤➤
Reposts must credit the source: codelast.com