<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SequenceFile &#8211; 编码无悔 /  Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/sequencefile/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>The Road to Optimization</description>
	<lastBuildDate>Sat, 18 Nov 2023 15:11:36 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>[Original] How to Output Compressed SequenceFiles with Apache Pig</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 23 Jul 2015 17:02:49 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Original]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[SequenceFile]]></category>
		<category><![CDATA[compressed file output]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=8499</guid>

					<description><![CDATA[<div>
	For more Apache Pig tutorials, click <a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.
<p>	SequenceFile is a binary file format provided by the Hadoop API; it serializes data to a file as &#60;key,value&#62; pairs.</p></div>
<div>
	To read this kind of data with Apache Pig, you can use the SequenceFileLoader in PiggyBank. I have not used it myself, but it should work fine.</div>
<div>
<span id="more-8499"></span></div>
<div>
	However, if the key or value stored in your SequenceFile is of type <span style="color:#b22222;">ThriftWritable</span>, loading and storing that data with Pig is not so easy.</div>
<div>
	Fortunately, Twitter has already done this work for us. Its open-source <a href="https://github.com/twitter/elephant-bird/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Elephant Bird</span></a> project makes this easy.</div>
<div>
	Elephant Bird's <span style="color:#0000ff;">SequenceFileLoader</span> and <span style="color:#0000ff;">SequenceFileStorage</span> are built for exactly this purpose.</div>
<div>
	&#160;</div>
<div>
	For example, data can be loaded like this:</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;xxx&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&#160;com.twitter.elephantbird.pig.load.SequenceFileLoader(
&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&#160;com.codelast.elephantbird.pig.util.BooleanWritableConverter&#39;</span>,
&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&#160;com.twitter.elephantbird.pig.util.ThriftWritableConverter&#160;com.codelast.MyThriftClass&#39;</span></code></pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/" class="read-more">Read More </a></section>]]></description>
										<content:encoded><![CDATA[<div>
	For more Apache Pig tutorials, click <a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.
<p>	SequenceFile is a binary file format provided by the Hadoop API; it serializes data to a file as &lt;key,value&gt; pairs.</p></div>
<div>
	To read this kind of data with Apache Pig, you can use the SequenceFileLoader in PiggyBank. I have not used it myself, but it should work fine.</div>
<div>
<span id="more-8499"></span></div>
<div>
	However, if the key or value stored in your SequenceFile is of type <span style="color:#b22222;">ThriftWritable</span>, loading and storing that data with Pig is not so easy.</div>
<div>
	Fortunately, Twitter has already done this work for us. Its open-source <a href="https://github.com/twitter/elephant-bird/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Elephant Bird</span></a> project makes this easy.</div>
<div>
	Elephant Bird's <span style="color:#0000ff;">SequenceFileLoader</span> and <span style="color:#0000ff;">SequenceFileStorage</span> are built for exactly this purpose.</div>
<div>
	&nbsp;</div>
<div>
	For example, data can be loaded like this:</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;xxx&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;com.twitter.elephantbird.pig.load.SequenceFileLoader(
&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&nbsp;com.codelast.elephantbird.pig.util.BooleanWritableConverter&#39;</span>,
&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&nbsp;com.twitter.elephantbird.pig.util.ThriftWritableConverter&nbsp;com.codelast.MyThriftClass&#39;</span>);</code></pre>
</section>
<div>
	Here, the key of this SequenceFile is of type BooleanWritable and the value is of type ThriftWritable. The corresponding Thrift class is MyThriftClass, a user-defined Thrift class.</div>
<div>
	<span style="color: rgb(255, 255, 255);">Source: </span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></div>
<div>
	Storing data works like this:</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">STORE&nbsp;B&nbsp;INTO&nbsp;&#39;xxx&#39;&nbsp;USING&nbsp;com.twitter.elephantbird.pig.store.SequenceFileStorage(
&nbsp;&#39;-c&nbsp;com.codelast.elephantbird.pig.util.BooleanWritableConverter&#39;,
&nbsp;&#39;-c&nbsp;com.twitter.elephantbird.pig.util.ThriftWritableConverter&nbsp;com.codelast.MyThriftClass&#39;);
</code></pre>
</section>
<div>
	The key and value settings are the same as described above.</div>
<div>
	&nbsp;</div>
<div>
	With this, you can both load and store SequenceFiles.</div>
<div>
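	However, you will find that a SequenceFile written this way is uncompressed, so the files are relatively large. How can the output be compressed?</div>
<div>
	The answer is to add the following statements to the Pig script:</div>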
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SET</span>&nbsp;output.compression.enabled&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;true&#39;</span>;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SET</span>&nbsp;mapreduce.output.fileoutputformat.compress.type&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;BLOCK&#39;</span>;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SET</span>&nbsp;output.compression.codec&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;org.apache.hadoop.io.compress.DefaultCodec&#39;</span>;
</code></pre>
</section>
<div>
	This makes the output SequenceFile BLOCK-compressed, using the default compression codec.
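<div>
	One way to confirm the result is to look at the header of an output file. The sketch below is not part of the original post. It assumes the version 6 SequenceFile header layout (magic bytes <code>SEQ</code>, a version byte, the key and value class names, two boolean flags for compression and block compression, then the codec class name when compression is on). The function names <code>read_vint</code>, <code>read_text</code>, and <code>read_sequencefile_header</code> are illustrative only. It reads the header and nothing else; it does not decompress any records.</div>

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer from a binary stream."""
    first = struct.unpack("b", stream.read(1))[0]  # signed byte
    if first >= -112:
        return first  # small values are stored in a single byte
    negative = first < -120
    extra = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(extra), "big")
    return ~value if negative else value


def read_text(stream):
    """Read a length-prefixed UTF-8 string (vint length, then the bytes)."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_sequencefile_header(stream):
    """Return the class names and compression settings from the header."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: bad magic bytes")
    version = stream.read(1)[0]
    if version != 6:
        # Assumption: only the version 6 layout is handled in this sketch.
        raise ValueError("unsupported SequenceFile version: %d" % version)
    key_class = read_text(stream)
    value_class = read_text(stream)
    compressed = stream.read(1)[0] != 0
    block_compressed = stream.read(1)[0] != 0
    codec = read_text(stream) if compressed else None
    return {
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


if __name__ == "__main__":
    # Synthetic header built in memory, for demonstration only.
    def encode_text(s):
        raw = s.encode("utf-8")
        return bytes([len(raw)]) + raw  # single-byte length, valid below 128

    header = (
        b"SEQ\x06"
        + encode_text("org.apache.hadoop.io.BooleanWritable")
        + encode_text("org.apache.hadoop.io.Text")
        + b"\x01\x01"
        + encode_text("org.apache.hadoop.io.compress.DefaultCodec")
    )
    print(read_sequencefile_header(io.BytesIO(header)))
```

<div>
	To check real output, copy one of the files written by the STORE statement to local disk, open it in binary mode (<code>"rb"</code>), and pass the file object to <code>read_sequencefile_header</code>. With the settings above, both flags should be true and the codec should be DefaultCodec.</div>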
<p style="font-size: 16px; margin: 5px 0px; clear: both; font-family: sans-serif;">
		<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
		Reproduction must credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
		</p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
