<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>FastText4j &#8211; 编码无悔 /  Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/fasttext4j/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>The Road to Optimization</description>
	<lastBuildDate>Wed, 29 Jul 2020 12:06:34 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>[Original] Chinese Text Classification with fastText (5)</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8-fasttext-%e5%81%9a%e4%b8%ad%e6%96%87%e6%96%87%e6%9c%ac%e5%88%86%e7%b1%bb5/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8-fasttext-%e5%81%9a%e4%b8%ad%e6%96%87%e6%96%87%e6%9c%ac%e5%88%86%e7%b1%bb5/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 29 Jul 2020 09:48:07 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[fastText]]></category>
		<category><![CDATA[FastText4j]]></category>
		<category><![CDATA[中文]]></category>
		<category><![CDATA[文本分类]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=12840</guid>

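					<description><![CDATA[<p>For the full list of articles in this series, see <a href="https://www.codelast.com/?p=12856" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p>The training and prediction steps described so far were done with the fastText executable. fastText also provides a Python interface, so the same thing can be done in Python. With a small dataset, running text classification on a single machine is no problem. My dataset, however, is large: tens of GB of text. Loading the model and predicting on a single machine uses too many resources and is slow.<br />
Parallel work like this is a natural fit for a Map-Reduce job. However, installing the fastText Python package on the Hadoop cluster was not an option, so I had to find a way to load a fastText model from Java and run predictions in parallel inside an M-R job.<br />
<span id="more-12840"></span><br />
<span style="background-color: rgb(255, 255, 0);">✓</span>&#160;Choosing a library<br />
A web search turns up several &#8220;Java versions&#8221; of fastText, for example <a href="https://github.com/vinhkhuc/JFastText" rel="noopener noreferrer" target="_blank"><span style="background-color:#fff0f5;">JFastText</span></a>, a Java wrapper around fastText, and&#160;<a href="https://github.com/mayabot/mynlp/tree/master/fastText4j" rel="noopener noreferrer" target="_blank"><span style="background-color:#fff0f5;">FastText4j</span></a>, a fastText implementation written entirely in Kotlin &#38; Java. There are others that I did not investigate.<br />
Here is how FastText4j describes itself:</p>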
<blockquote>
<div>
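		● 100% implemented in Kotlin &#38; Java</div>
<div>
		● A clean API</div>
<div>
		● Compatible with the official pre-trained models</div>
<div>
		● Provides all the APIs, including train and test</div>
<div>
		● Supports its own model storage format, which allows large models to be loaded quickly via MMAP</div>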
</blockquote>
<div>
	I was sold and tried it right away.
<p>	<span style="background-color: rgb(255, 255, 0);">✓</span>&#160;Adding the FastText4j dependency to a Maven project</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono',monospace;font-size:12.0pt;">
<span style="color:#e8bf6a;">&#60;dependency&#62;
</span><span style="color:#e8bf6a;">  &#60;groupId&#62;</span>com.mayabot.mynlp</pre></div>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8-fasttext-%e5%81%9a%e4%b8%ad%e6%96%87%e6%96%87%e6%9c%ac%e5%88%86%e7%b1%bb5/" class="read-more">Read More </a>]]></description>
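										<content:encoded><![CDATA[<p>For the full list of articles in this series, see <a href="https://www.codelast.com/?p=12856" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p>The training and prediction steps described so far were done with the fastText executable. fastText also provides a Python interface, so the same thing can be done in Python. With a small dataset, running text classification on a single machine is no problem. My dataset, however, is large: tens of GB of text. Loading the model and predicting on a single machine uses too many resources and is slow.<br />
Parallel work like this is a natural fit for a Map-Reduce job. However, installing the fastText Python package on the Hadoop cluster was not an option, so I had to find a way to load a fastText model from Java and run predictions in parallel inside an M-R job.<br />
<span id="more-12840"></span><br />
<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Choosing a library<br />
A web search turns up several &ldquo;Java versions&rdquo; of fastText, for example <a href="https://github.com/vinhkhuc/JFastText" rel="noopener noreferrer" target="_blank"><span style="background-color:#fff0f5;">JFastText</span></a>, a Java wrapper around fastText, and&nbsp;<a href="https://github.com/mayabot/mynlp/tree/master/fastText4j" rel="noopener noreferrer" target="_blank"><span style="background-color:#fff0f5;">FastText4j</span></a>, a fastText implementation written entirely in Kotlin &amp; Java. There are others that I did not investigate.<br />
Here is how FastText4j describes itself:</p>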
<blockquote>
<div>
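		● 100% implemented in Kotlin &amp; Java</div>
<div>
		● A clean API</div>
<div>
		● Compatible with the official pre-trained models</div>
<div>
		● Provides all the APIs, including train and test</div>
<div>
		● Supports its own model storage format, which allows large models to be loaded quickly via MMAP</div>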
</blockquote>
<div>
	I was sold and tried it right away.</p>
<p>	<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Adding the FastText4j dependency to a Maven project</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono',monospace;font-size:12.0pt;">
<span style="color:#e8bf6a;">&lt;dependency&gt;
</span><span style="color:#e8bf6a;">  &lt;groupId&gt;</span>com.mayabot.mynlp<span style="color:#e8bf6a;">&lt;/groupId&gt;
</span><span style="color:#e8bf6a;">  &lt;artifactId&gt;</span>fastText4j<span style="color:#e8bf6a;">&lt;/artifactId&gt;
</span><span style="color:#e8bf6a;">  &lt;version&gt;</span>3.1.7<span style="color:#e8bf6a;">&lt;/version&gt;
</span><span style="color:#e8bf6a;">&lt;/dependency&gt;</span></pre>
<p>	With that in place, the library can be used from code.<br />
	<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Training a model</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono',monospace;font-size:12.0pt;">
<span style="color:#808080;">// Train a text classification model with FastText4j and save it as a single file
</span>File trainFile = <span style="color:#cc7832;">new </span>File(<span style="color:#6a8759;">&quot;/home/codelast/labeled-data_train&quot;</span>)<span style="color:#cc7832;">;
</span>InputArgs inputArgs = <span style="color:#cc7832;">new </span>InputArgs()<span style="color:#cc7832;">;
</span>inputArgs.setLoss(LossName.<span style="color:#9876aa;font-style:italic;">softmax</span>)<span style="color:#cc7832;">;
</span>inputArgs.setLr(<span style="color:#6897bb;">1.0</span>)<span style="color:#cc7832;">;
</span>inputArgs.setEpoch(<span style="color:#6897bb;">25</span>)<span style="color:#cc7832;">;
</span>inputArgs.setWordNgrams(<span style="color:#6897bb;">2</span>)<span style="color:#cc7832;">;
</span>
FastText model = FastText.<span style="font-style:italic;">trainSupervised</span>(trainFile<span style="color:#cc7832;">, </span>inputArgs)<span style="color:#cc7832;">;
</span>model.saveModelToSingleFile(<span style="color:#cc7832;">new </span>File(<span style="color:#6a8759;">&quot;/home/codelast/model&quot;</span>))<span style="color:#cc7832;">;</span></pre>
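<p>	The training parameters, including lr, epoch, and wordNgrams, have the same meaning as in the original fastText. Unlike fastText, which by default produces two model files (.bin &amp; .vec), FastText4j can write a single model file via the saveModelToSingleFile() method. The saveModel() method instead writes four files into a directory (in that case, all four files are required when loading the model):</p>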
<blockquote>
<div>
			args.bin</div>
<div>
			dict.bin</div>
<div>
			input.matrix</div>
<div>
			output.matrix</div>
</blockquote>
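<p>	Since a directory-style model is only loadable when all four files are present, it can help to check the directory before handing it to the loader. The following is a minimal sketch using only the Java standard library. The class name <code>ModelDirCheck</code> and the method <code>missingFiles</code> are hypothetical helpers made up for illustration; they are not part of FastText4j.</p>

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ModelDirCheck {
    // File names taken from the list above; saveModel() is described as writing all four.
    static final String[] REQUIRED = {"args.bin", "dict.bin", "input.matrix", "output.matrix"};

    /** Returns the names of the required model files that are missing from dir. */
    static List<String> missingFiles(Path dir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        // Demo on a temporary directory that contains only two of the four files.
        Path dir = Files.createTempDirectory("model-dir");
        Files.createFile(dir.resolve("args.bin"));
        Files.createFile(dir.resolve("dict.bin"));
        System.out.println(missingFiles(dir)); // prints [input.matrix, output.matrix]
    }
}
```

<p>	An empty result means the directory is complete. With a single-file model this check is unnecessary.</p>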
<div>
		To load the model inside a Java Map-Reduce job and distribute it through the distributed cache, a single file is by far the most convenient. I therefore strongly recommend saving the model as a single file.<br />
		<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Loading the model and evaluating it</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono',monospace;font-size:12.0pt;">
<span style="color:#808080;">// Load the model
</span>FastText model = FastText.<span style="color:#9876aa;font-style:italic;">Companion</span>.loadModelFromSingleFile(<span style="color:#cc7832;">new </span>File(<span style="color:#6a8759;">&quot;/home/codelast/model&quot;</span>))<span style="color:#cc7832;">;
</span>System.<span style="color:#9876aa;font-style:italic;">out</span>.println(<span style="color:#6a8759;">&quot;load model done, will do test...&quot;</span>)<span style="color:#cc7832;">;
</span><span style="color:#808080;">// Evaluate the model on the validation file
</span>model.test(<span style="color:#cc7832;">new </span>File(<span style="color:#6a8759;">&quot;/home/codelast/labeled-data_valid&quot;</span>)<span style="color:#cc7832;">, </span><span style="color:#6897bb;">1</span><span style="color:#cc7832;">, </span><span style="color:#6897bb;">0</span><span style="color:#cc7832;">, true</span>)<span style="color:#cc7832;">;</span></pre>
<p>		Output:</p>
<blockquote>
<div>
				load model done, will do test...</div>
<div>
				F1-Score : 0.953652 Precision : 0.949348 Recall : 0.957996&nbsp; __label__娱乐</div>
<div>
				F1-Score : 0.704064 Precision : 0.702055 Recall : 0.706085&nbsp; __label__社会</div>
<div>
				F1-Score : 0.929426 Precision : 0.917355 Recall : 0.941818&nbsp; __label__历史</div>
<div>
				F1-Score : 0.784775 Precision : 0.784232 Recall : 0.785319&nbsp; __label__时政</div>
<div>
				F1-Score : 0.969314 Precision : 0.967568 Recall : 0.971067&nbsp; __label__汽车</div>
<div>
				F1-Score : 0.910314 Precision : 0.914414 Recall : 0.906250&nbsp; __label__时尚</div>
<div>
				F1-Score : 0.899281 Precision : 0.903614 Recall : 0.894988&nbsp; __label__健康</div>
<div>
				F1-Score : 0.929919 Precision : 0.905512 Recall : 0.955679&nbsp; __label__美食</div>
<div>
				F1-Score : 0.908136 Precision : 0.894057 Recall : 0.922667&nbsp; __label__军事</div>
<div>
				F1-Score : 0.967391 Precision : 0.975342 Recall : 0.959569&nbsp; __label__体育</div>
<div>
				F1-Score : 0.907618 Precision : 0.915033 Recall : 0.900322&nbsp; __label__育儿</div>
<div>
				F1-Score : 0.782895 Precision : 0.760383 Recall : 0.806780&nbsp; __label__情感</div>
<div>
				F1-Score : 0.863946 Precision : 0.866894 Recall : 0.861017&nbsp; __label__财经</div>
<div>
				F1-Score : 0.905188 Precision : 0.920000 Recall : 0.890845&nbsp; __label__教育</div>
<div>
				F1-Score : 0.781431 Precision : 0.792157 Recall : 0.770992&nbsp; __label__文化</div>
<div>
				F1-Score : 0.892495 Precision : 0.894309 Recall : 0.890688&nbsp; __label__游戏</div>
<div>
				F1-Score : 0.830882 Precision : 0.801418 Recall : 0.862595&nbsp; __label__科技</div>
<div>
				F1-Score : 0.795455 Precision : 0.781250 Recall : 0.810185&nbsp; __label__旅游</div>
<div>
				F1-Score : 0.843537 Precision : 0.826667 Recall : 0.861111&nbsp; __label__动漫</div>
<div>
				F1-Score : 0.960961 Precision : 0.969697 Recall : 0.952381&nbsp; __label__占卜</div>
<div>
				F1-Score : 0.915361 Precision : 0.912500 Recall : 0.918239&nbsp; __label__数码</div>
<div>
				F1-Score : 0.553191 Precision : 0.601852 Recall : 0.511811&nbsp; __label__搞笑</div>
<div>
				F1-Score : 0.788104 Precision : 0.834646 Recall : 0.746479&nbsp; __label__农林牧副渔</div>
<div>
				F1-Score : 0.797048 Precision : 0.830769 Recall : 0.765957&nbsp; __label__科学</div>
<div>
				F1-Score : 0.788462 Precision : 0.828283 Recall : 0.752294&nbsp; __label__家居</div>
<div>
				F1-Score : 0.831579 Precision : 0.877778 Recall : 0.790000&nbsp; __label__房产</div>
<div>
				F1-Score : 0.674286 Precision : 0.710843 Recall : 0.641304&nbsp; __label__生活方式</div>
<div>
				F1-Score : 0.908108 Precision : 0.933333 Recall : 0.884211&nbsp; __label__宠物</div>
<div>
				F1-Score : 0.546667 Precision : 0.546667 Recall : 0.546667&nbsp; __label__宗教</div>
<div>
				F1-Score : 0.706767 Precision : 0.671429 Recall : 0.746032&nbsp; __label__职场</div>
<div>
				F1-Score : 0.951220 Precision : 0.928571 Recall : 0.975000&nbsp; __label__天气</div>
<div>
				F1-Score : 0.666667 Precision : 0.909091 Recall : 0.526316&nbsp; __label__摄影</div>
<div>
				F1-Score : 0.707692 Precision : 0.718750 Recall : 0.696970&nbsp; __label__法律</div>
<div>
				F1-Score : 0.750000 Precision : 1.000000 Recall : 0.600000&nbsp; __label__彩票</div>
<div>
				F1-Score : 0.333333 Precision : 1.000000 Recall : 0.200000&nbsp; __label__移民</div>
<div>
				F1-Score : 0.000000 Precision : -------- Recall : 0.000000&nbsp; __label__生活百科</div>
<div>
				N<span style="white-space:pre"> </span>10703</div>
<div>
				P@1<span style="white-space:pre"> </span>0.870</div>
<div>
				R@1<span style="white-space:pre"> </span>0.870</div>
</blockquote>
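<p>		To read the per-label rows above: F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for two rows of the output using only the Java standard library. The class <code>F1Check</code> is a hypothetical helper, not part of FastText4j. Treating the <code>--------</code> precision as "this label was never predicted" (so precision is undefined) is my assumption about how the report is produced.</p>

```java
import java.util.Locale;

public class F1Check {
    /** Precision = TP / (TP + FP); NaN when the label was never predicted. */
    static double precision(long tp, long fp) {
        return (tp + fp) == 0 ? Double.NaN : (double) tp / (tp + fp);
    }

    /** Recall = TP / (TP + FN); NaN when the label never occurs in the data. */
    static double recall(long tp, long fn) {
        return (tp + fn) == 0 ? Double.NaN : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall; 0 when either is undefined or both are 0. */
    static double f1(double p, double r) {
        if (Double.isNaN(p) || Double.isNaN(r) || p + r == 0.0) {
            return 0.0;
        }
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // First row of the output above (precision 0.949348, recall 0.957996).
        System.out.println(String.format(Locale.ROOT, "%.6f", f1(0.949348, 0.957996))); // 0.953652
        // A row with precision 1.000000 and recall 0.600000.
        System.out.println(String.format(Locale.ROOT, "%.6f", f1(1.0, 0.6)));           // 0.750000
        // Undefined precision (assumed meaning of "--------") with recall 0.
        System.out.println(String.format(Locale.ROOT, "%.6f", f1(Double.NaN, 0.0)));    // 0.000000
    }
}
```

<p>		Labels with high precision but low recall, such as the rows with precision 1.000000, are ones the model rarely predicts but gets right when it does.</p>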
<p>		<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Predicting the label of a piece of text</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono',monospace;font-size:12.0pt;">
<span style="color:#808080;">// Predict the label of a pre-segmented string
</span>List&lt;ScoreLabelPair&gt; result = model.predict(Arrays.<span style="font-style:italic;">asList</span>(<span style="color:#6a8759;">&quot;</span><span style="color:#6a8759;font-family:'DejaVu Sans Mono',monospace;">人民网 辽宁 频道 人民网 沈阳 月</span><span style="color:#6a8759;"> 10 </span><span style="color:#6a8759;font-family:'DejaVu Sans Mono',monospace;">日电 日前 进一步 增强 全民 节能 意识</span><span style="color:#6a8759;">&quot;</span>.split(<span style="color:#6a8759;">&quot; &quot;</span>))<span style="color:#cc7832;">, </span><span style="color:#6897bb;">1</span><span style="color:#cc7832;">, </span><span style="color:#6897bb;">0</span>)<span style="color:#cc7832;">;
</span>System.<span style="color:#9876aa;font-style:italic;">out</span>.println(result.get(<span style="color:#6897bb;">0</span>).getLabel())<span style="color:#cc7832;">;</span></pre>
<p>		Output:</p>
<blockquote>
<p>
				__label__社会</p>
</blockquote>
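<p>		The predict() call above takes a list of tokens, built there with <code>split(" ")</code>. Splitting on a single space produces empty tokens when the input has doubled spaces, tabs, or full-width spaces, which are common in Chinese text. A slightly more defensive sketch, standard library only, might look like this. The class <code>TokenPrep</code> and the method <code>toTokens</code> are hypothetical names for illustration, and the sketch assumes the text has already been segmented into words upstream.</p>

```java
import java.util.ArrayList;
import java.util.List;

public class TokenPrep {
    /**
     * Splits an already-segmented line on runs of whitespace and drops empty tokens.
     * The (?U) flag makes \s Unicode-aware, so the full-width space U+3000 also counts.
     */
    static List<String> toTokens(String segmentedLine) {
        List<String> tokens = new ArrayList<>();
        for (String token : segmentedLine.split("(?U)\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Doubled space and a full-width space between tokens.
        System.out.println(toTokens("人民网  辽宁 频道\u3000沈阳")); // prints [人民网, 辽宁, 频道, 沈阳]
    }
}
```

<p>		The resulting list can be passed wherever the <code>Arrays.asList(...split(" "))</code> expression is used above.</p>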
<p>		Note that the input text here should be cleaned, already segmented into words, and space-delimited.<br />
		<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Compressing the model<br />
		If a model file is too large, it may not fit into the distributed cache, so being able to shrink the model is very useful. My model, for example, was close to 900 MB; after compression it drops to a little over 100 MB, while Precision &amp; Recall barely get worse. Well worth it.</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono',monospace;font-size:12.0pt;">
<span style="color:#808080;">// Compress the model and save it. Loading the compressed model saves memory
</span>FastText qmodel = model.quantize(<span style="color:#6897bb;">2</span><span style="color:#cc7832;">, false, false</span>)<span style="color:#cc7832;">;
</span>qmodel.saveModelToSingleFile(<span style="color:#cc7832;">new </span>File(<span style="color:#6a8759;">&quot;/home/codelast/model_compressed&quot;</span>))<span style="color:#cc7832;">;</span></pre>
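<p>		Compressing and saving the model is a one-time operation. From then on, just load the compressed model.</p>
<p>		<span style="background-color: rgb(255, 255, 0);">✓</span>&nbsp;Afterword<br />
		Running text classification in parallel inside a&nbsp;Map-Reduce job with FastText4j made the task many times faster and brought it up to a practical level.</div>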
</div>
<p>
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
When reposting, please credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan the QR code with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8-fasttext-%e5%81%9a%e4%b8%ad%e6%96%87%e6%96%87%e6%9c%ac%e5%88%86%e7%b1%bb5/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
