<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>apache pig &#8211; 编码无悔 /  Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/apache-pig/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>最优化之路</description>
	<lastBuildDate>Tue, 05 Dec 2023 04:01:08 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>[原创] 在Apache Pig中把数据按指定字段分组，每组取时间最新的一条记录</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%95%b0%e6%8d%ae%e6%8c%89%e6%8c%87%e5%ae%9a%e5%ad%97%e6%ae%b5%e5%88%86%e7%bb%84%ef%bc%8c%e6%af%8f%e7%bb%84%e5%8f%96%e6%97%b6%e9%97%b4/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%95%b0%e6%8d%ae%e6%8c%89%e6%8c%87%e5%ae%9a%e5%ad%97%e6%ae%b5%e5%88%86%e7%bb%84%ef%bc%8c%e6%af%8f%e7%bb%84%e5%8f%96%e6%97%b6%e9%97%b4/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 15 Nov 2023 08:15:25 +0000</pubDate>
				<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[GROUP]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=13967</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>用Apache Pig处理大数据时，经常会有这种需求：把输入数据按指定的字段group，并且每个group内只输出时间最新的一条记录。<br />
<span id="more-13967"></span><br />
举个例子。有数据文件 input.txt ：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">10&#160;&#160;&#160;&#160;&#160;&#160;a&#160;&#160;&#160;&#160;&#160;&#160;&#160;1,2,3
9&#160;&#160;&#160;&#160;&#160;&#160;&#160;b&#160;&#160;&#160;&#160;&#160;&#160;&#160;1,2
8&#160;&#160;&#160;&#160;&#160;&#160;&#160;a&#160;&#160;&#160;&#160;&#160;&#160;&#160;2,3,4
13&#160;&#160;&#160;&#160;&#160;&#160;a&#160;&#160;&#160;&#160;&#160;&#160;&#160;1,2,3,4
6&#160;&#160;&#160;&#160;&#160;&#160;&#160;b&#160;&#160;&#160;&#160;&#160;&#160;&#160;1
</code></pre>
</section>
<p>该数据的三个字段分别代表：<span style="background-color: rgb(255, 255, 255); color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre;">time（时间戳），userId（用户id），userInterest（用户兴趣id）<br />
现在，要找出每个用户时间最新的</span><span style="color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">userInterest，</span><span style="color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">应该怎么做？</span><br />
<span style="color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">即：对用户 a，最新的时间戳是13，</span><span style="color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">userInterest是1,2,3,4；对用户 b，最新的时间戳是9，</span><span style="color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">userInterest是1,2。</span><br />
<span style="color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">直接上代码：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;input.txt&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&#160;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">time</span>:&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>,&#160;userId:&#160;chararray,&#160;userInterest:&#160;chararray);
A&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;time,&#160;userId,&#160;userInterest;
B&#160;=&#160;GROUP&#160;A&#160;BY&#160;userId;
<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">--&#160;每个userId取时间最新的一条记录</span>
C&#160;=&#160;FOREACH&#160;B&#160;{
&#160;&#160;&#160;&#160;SORTED&#160;=&#160;ORDER&#160;A&#160;BY&#160;time&#160;DESC;
&#160;&#160;&#160;&#160;ONE_RECORD&#160;=&#160;LIMIT&#160;SORTED&#160;1;
&#160;&#160;&#160;&#160;GENERATE&#160;FLATTEN(ONE_RECORD);
};
DUMP&#160;C;
</code></pre>
</section>
<p>
在嵌套的FOREACH语句中，首先用ORDER BY对同一个group内的数据进行了降序排序，再用LIMIT取一条记录，由于是按time降序排序，因此LIMIT 1取到的就是时间戳最大的那条记录，即时间最新的记录。<br />
<span style="font-size: 16px; color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; white-space: pre; background-color: rgb(255, 255, 255);">输出：</span></p>
<blockquote>
<div>
		(13,a,1,2,3,4)</div>
<div>
		(9,b,1,2)</div>
</blockquote>
<p><span style="font-size: 16px; color: rgb(59, 59, 59); font-family: &#34;Droid Sans Mono&#34;, &#34;monospace&#34;, monospace; white-space: pre; background-color: rgb(255, 255, 255);">可见这个结果和我们前面人工判断出来的正确结果一致。</span></p>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%95%b0%e6%8d%ae%e6%8c%89%e6%8c%87%e5%ae%9a%e5%ad%97%e6%ae%b5%e5%88%86%e7%bb%84%ef%bc%8c%e6%af%8f%e7%bb%84%e5%8f%96%e6%97%b6%e9%97%b4/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>用Apache Pig处理大数据时，经常会有这种需求：把输入数据按指定的字段group，并且每个group内只输出时间最新的一条记录。<br />
<span id="more-13967"></span><br />
举个例子。有数据文件 input.txt ：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1,2,3
9&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1,2
8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2,3,4
13&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1,2,3,4
6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1
</code></pre>
</section>
<p>该数据的三个字段分别代表：<span style="background-color: rgb(255, 255, 255); color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre;">time（时间戳），userId（用户id），userInterest（用户兴趣id）<br />
现在，要找出每个用户时间最新的</span><span style="color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">userInterest，</span><span style="color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">应该怎么做？</span><br />
<span style="color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">即：对用户 a，最新的时间戳是13，</span><span style="color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">userInterest是1,2,3,4；对用户 b，最新的时间戳是9，</span><span style="color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">userInterest是1,2。</span><br />
<span style="color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; font-size: 16px; white-space: pre; background-color: rgb(255, 255, 255);">直接上代码：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;input.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">time</span>:&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>,&nbsp;userId:&nbsp;chararray,&nbsp;userInterest:&nbsp;chararray);
A&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;time,&nbsp;userId,&nbsp;userInterest;
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;userId;
<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">--&nbsp;每个userId取时间最新的一条记录</span>
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;SORTED&nbsp;=&nbsp;ORDER&nbsp;A&nbsp;BY&nbsp;time&nbsp;DESC;
&nbsp;&nbsp;&nbsp;&nbsp;ONE_RECORD&nbsp;=&nbsp;LIMIT&nbsp;SORTED&nbsp;1;
&nbsp;&nbsp;&nbsp;&nbsp;GENERATE&nbsp;FLATTEN(ONE_RECORD);
};
DUMP&nbsp;C;
</code></pre>
</section>
<p>
在嵌套的FOREACH语句中，首先用ORDER BY对同一个group内的数据进行了降序排序，再用LIMIT取一条记录，由于是按time降序排序，因此LIMIT 1取到的就是时间戳最大的那条记录，即时间最新的记录。<br />
<span style="font-size: 16px; color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; white-space: pre; background-color: rgb(255, 255, 255);">输出：</span></p>
<blockquote>
<div>
		(13,a,1,2,3,4)</div>
<div>
		(9,b,1,2)</div>
</blockquote>
<p><span style="font-size: 16px; color: rgb(59, 59, 59); font-family: &quot;Droid Sans Mono&quot;, &quot;monospace&quot;, monospace; white-space: pre; background-color: rgb(255, 255, 255);">可见这个结果和我们前面人工判断出来的正确结果一致。</span></p>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：<br />
<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%95%b0%e6%8d%ae%e6%8c%89%e6%8c%87%e5%ae%9a%e5%ad%97%e6%ae%b5%e5%88%86%e7%bb%84%ef%bc%8c%e6%af%8f%e7%bb%84%e5%8f%96%e6%97%b6%e9%97%b4/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 在Apache Pig中把时间字符串转换成时间戳</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%97%b6%e9%97%b4%e5%ad%97%e7%ac%a6%e4%b8%b2%e8%bd%ac%e6%8d%a2%e6%88%90%e6%97%b6%e9%97%b4%e6%88%b3/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%97%b6%e9%97%b4%e5%ad%97%e7%ac%a6%e4%b8%b2%e8%bd%ac%e6%8d%a2%e6%88%90%e6%97%b6%e9%97%b4%e6%88%b3/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 12 Oct 2023 09:37:25 +0000</pubDate>
				<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[时间字符串]]></category>
		<category><![CDATA[时间戳]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=13959</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" target="_blank" rel="noopener"><span style="background-color:#ffa07a;">这里</span></a>。</p>
<p>在Apache Pig中，怎样把 <span style="color:#ff0000;">2023-10-11_10:57:56</span> 这种格式的时间字符串，转成整型的时间戳？<br />
话不多说，直接上代码。<br />
假设输入数据文件 1.txt，其格式是一行一个时间字符串。<br />
<span id="more-13959"></span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&#160;(dt:&#160;chararray);
A&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;ToDate(dt,&#160;&#39;yyyy-MM-dd_HH:mm:ss&#39;)&#160;AS&#160;date;
B&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;ToUnixTime(date)&#160;AS&#160;ts;
DUMP&#160;B;
</code></pre>
</section>
<p>
输出结果形如：</p>
<blockquote>
<p>
		1696993076</p>
</blockquote>
<p>
可见，这样得到的时间戳单位是&#8220;秒&#8221;。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&#160;版权声明&#160;<span style="color: rgb(255, 0, 0);">➤➤</span>&#160;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&#160;<br />
感谢关注我的微信公众号（微信扫一扫）：<br />
<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" />&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%97%b6%e9%97%b4%e5%ad%97%e7%ac%a6%e4%b8%b2%e8%bd%ac%e6%8d%a2%e6%88%90%e6%97%b6%e9%97%b4%e6%88%b3/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" target="_blank" rel="noopener"><span style="background-color:#ffa07a;">这里</span></a>。</p>
<p>在Apache Pig中，怎样把 <span style="color:#ff0000;">2023-10-11_10:57:56</span> 这种格式的时间字符串，转成整型的时间戳？<br />
话不多说，直接上代码。<br />
假设输入数据文件 1.txt，其格式是一行一个时间字符串。<br />
<span id="more-13959"></span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(dt:&nbsp;chararray);
A&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;ToDate(dt,&nbsp;&#39;yyyy-MM-dd_HH:mm:ss&#39;)&nbsp;AS&nbsp;date;
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;ToUnixTime(date)&nbsp;AS&nbsp;ts;
DUMP&nbsp;B;
</code></pre>
</section>
<p>
输出结果形如：</p>
<blockquote>
<p>
		1696993076</p>
</blockquote>
<p>
可见，这样得到的时间戳单位是&ldquo;秒&rdquo;。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：<br />
<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%9c%a8apache-pig%e4%b8%ad%e6%8a%8a%e6%97%b6%e9%97%b4%e5%ad%97%e7%ac%a6%e4%b8%b2%e8%bd%ac%e6%8d%a2%e6%88%90%e6%97%b6%e9%97%b4%e6%88%b3/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] Apache Pig 加载/解析/读取 XML数据</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig-%e5%8a%a0%e8%bd%bd-%e8%a7%a3%e6%9e%90-%e8%af%bb%e5%8f%96-xml%e6%95%b0%e6%8d%ae/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig-%e5%8a%a0%e8%bd%bd-%e8%a7%a3%e6%9e%90-%e8%af%bb%e5%8f%96-xml%e6%95%b0%e6%8d%ae/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 18 Jan 2023 03:42:19 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[piggybank]]></category>
		<category><![CDATA[XMLLoader]]></category>
		<category><![CDATA[解析XML]]></category>
		<category><![CDATA[读取XML]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=13817</guid>

					<description><![CDATA[<div style="text-align: center;">
	<img decoding="async" alt="XML logo" src="https://www.codelast.com/wp-content/uploads/ckfinder/images/xml.png" style="width: 200px; height: 200px;" /></div>
<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>用Apache Pig加载/解析/读取XML数据不是一个常见的需求，但有时为了方便我们又不得不用，其实Pig的UDF库piggybank已经帮大家做好了这个工作了：只需要使用其中的<a href="https://pig.apache.org/docs/r0.17.0/api/org/apache/pig/piggybank/storage/XMLLoader.html" rel="noopener" target="_blank"><span style="background-color:#ffa07a;">XMLLoader</span></a>，就可以方便地实现XML文件解析。<br />
<span id="more-13817"></span><br />
假设有如下XML文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="xml language-xml hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">&#60;?xml&#160;version=&#34;1.0&#34;&#160;encoding=&#34;UTF-8&#34;?&#62;
</span>
<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">users</span>&#62;</span>
&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&#62;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&#62;</span>http://www.codelast.com/user/profile/159.html<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&#62;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&#62;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&#62;</span>159<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&#62;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&#62;</span>李四<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&#62;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&#62;</span>
&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&#62;</span>
&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&#62;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&#60;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&#62;</span>http://www.codelast.com/user/profile/220.html</code></pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig-%e5%8a%a0%e8%bd%bd-%e8%a7%a3%e6%9e%90-%e8%af%bb%e5%8f%96-xml%e6%95%b0%e6%8d%ae/" class="read-more">Read More </a></section>]]></description>
										<content:encoded><![CDATA[<div style="text-align: center;">
	<img decoding="async" alt="XML logo" src="https://www.codelast.com/wp-content/uploads/ckfinder/images/xml.png" style="width: 200px; height: 200px;" /></div>
<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>用Apache Pig加载/解析/读取XML数据不是一个常见的需求，但有时为了方便我们又不得不用，其实Pig的UDF库piggybank已经帮大家做好了这个工作了：只需要使用其中的<a href="https://pig.apache.org/docs/r0.17.0/api/org/apache/pig/piggybank/storage/XMLLoader.html" rel="noopener" target="_blank"><span style="background-color:#ffa07a;">XMLLoader</span></a>，就可以方便地实现XML文件解析。<br />
<span id="more-13817"></span><br />
假设有如下XML文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="xml language-xml hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">&lt;?xml&nbsp;version=&quot;1.0&quot;&nbsp;encoding=&quot;UTF-8&quot;?&gt;
</span>
<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">users</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span>http://www.codelast.com/user/profile/159.html<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span>159<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span>李四<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span>http://www.codelast.com/user/profile/220.html<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span>220<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span>王五<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;
</span>
<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">users</span>&gt;</span></code></pre>
</section>
<p>首先我们欣赏一下这个XML文件，它用于存储多个用户的信息，&lt;users&gt;&lt;/users&gt; 标签把所有用户信息包裹在里面，&lt;user&gt;&lt;/user&gt; 标签则对应一个用户的信息。在每一个用户信息中，不仅有 profileUrl 这种第一层的属性，还有 data/id，data/name 这种更深层的属性，我们在下面的代码中，会把这些字段全部提取出来。<br />
这个XML虽然结构简单，但层次分明，通过解析这个XML文件，我们可以引申出对复杂XML文件的读取方法。<br />
这里需要特别说明一下，通常情况下存储在磁盘上的大数据都不会是这种&ldquo;格式化过&rdquo;的数据格式（为了方便大家观看，我把XML格式化成了上面的样子），而是一种紧凑的格式，例如可能是下面这个样子：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="xml language-xml hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">&lt;?xml&nbsp;version=&quot;1.0&quot;&nbsp;encoding=&quot;UTF-8&quot;?&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">users</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span>http://www.codelast.com/user/profile/159.html<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span>159<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span>李四<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span>http://www.codelast.com/user/profile/220.html<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">profileUrl</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span>220<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">id</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span>王五<span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">data</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">user</span>&gt;</span><span class="hljs-tag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">&lt;/<span class="hljs-name" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">users</span>&gt;</span></code></pre>
</section>
<p>但是对piggybank的XMLLoader来说，上面的各种格式并不会影响它对XML内容的解析，这一点很好。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
话不多说，直接上代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">--&nbsp;注册piggybank的jar包，视实际情况修改此路径</span>
REGISTER&nbsp;/xxx/piggybank.jar;

DEFINE&nbsp;XPath&nbsp;org.apache.pig.piggybank.evaluation.xml.XPath();

A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;users.xml&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;org.apache.pig.piggybank.storage.XMLLoader(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;user&#39;</span>)&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(x:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;XPath(x,&nbsp;&#39;user/profileUrl&#39;)&nbsp;AS&nbsp;profileUrl,&nbsp;XPath(x,&nbsp;&#39;user/data/id&#39;)&nbsp;AS&nbsp;id,&nbsp;XPath(x,&nbsp;&#39;user/data/name&#39;)&nbsp;AS&nbsp;name;
DUMP&nbsp;B;
</code></pre>
</section>
<p>代码解读：<br />
➤ XMLLoader 的参数&quot;user&quot;，应该是你的XML文件中，有多个标签的那个名字，例如，在上面的XML中，含有多个 user，但是只有一个 users，所以我们这里填的应该是 user<br />
➤ AS (x: chararray) 中的 &quot;x&quot; 其实可以是任意名称，我们下面只用这个名字来解析其下的字段，所以这个名字不重要。<br />
➤ 在 XPath() 中我们可以通过XML的层级关系来取出指定的字段，一看就明白。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
上面的代码输出：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(http:<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">//www.codelast.com/user/profile/159.html,159,李四)</span>
(http:<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">//www.codelast.com/user/profile/220.html,220,王五)</span></code></pre>
</section>
<p>也就是我们提取的3个字段。</p>
<p>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
	感谢关注我的微信公众号（微信扫一扫）：<br />
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig-%e5%8a%a0%e8%bd%bd-%e8%a7%a3%e6%9e%90-%e8%af%bb%e5%8f%96-xml%e6%95%b0%e6%8d%ae/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] Apache Pig如何按数据分组保存到不同的子目录中(MultiStorage)</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e5%a6%82%e4%bd%95%e6%8c%89%e6%95%b0%e6%8d%ae%e5%88%86%e7%bb%84%e4%bf%9d%e5%ad%98%e5%88%b0%e4%b8%8d%e5%90%8c%e7%9a%84%e5%ad%90%e7%9b%ae%e5%bd%95%e4%b8%admultistorage/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e5%a6%82%e4%bd%95%e6%8c%89%e6%95%b0%e6%8d%ae%e5%88%86%e7%bb%84%e4%bf%9d%e5%ad%98%e5%88%b0%e4%b8%8d%e5%90%8c%e7%9a%84%e5%ad%90%e7%9b%ae%e5%bd%95%e4%b8%admultistorage/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sun, 06 Nov 2022 06:30:06 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[MultiStorage]]></category>
		<category><![CDATA[多目录]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=13628</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>用Apache Pig进行数据处理的时候，我们通常会在最后把处理结果保存到一个HDFS目录下：</p>
<blockquote>
<p>
		STORE result INTO &#39;/my_output_dir&#39;;</p>
</blockquote>
<p>这是最常见的情况。<br />
但是，如果我们想根据某个字段，把数据分成多组，分别存储在多个目录下呢？举个可能不恰当的例子，就有点像我们先把数据按某个字段分组：</p>
<blockquote>
<p>
		GROUP data BY field;</p>
</blockquote>
<p>再把各个group的数据分别存储在不同的目录下一样。<br />
<span id="more-13628"></span><br />
现在来看一个实例。<br />
假设有如下数据（用 tab&#160;分隔的四列分别为：人员类型的id，人员类型的描述，姓名，爱好）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<table style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; width: 100%;">
<thead style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px;">
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				type</th>
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				desc</th>
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				name</th>
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				hobby</th>
</tr>
</thead>
<tbody style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border: 0px;">
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				2</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				学生</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				陈玉</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				篮球</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				3</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				老师</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				王强</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				足球</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				1</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				保安</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				许勤</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				下棋</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				2</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				学生</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				范雨</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				跑步</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				2</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				学生</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				李林</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				游泳</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				1</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				保安</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				涂欣</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				看书</td>
</tr>
</tbody>
</table>
</section>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e5%a6%82%e4%bd%95%e6%8c%89%e6%95%b0%e6%8d%ae%e5%88%86%e7%bb%84%e4%bf%9d%e5%ad%98%e5%88%b0%e4%b8%8d%e5%90%8c%e7%9a%84%e5%ad%90%e7%9b%ae%e5%bd%95%e4%b8%admultistorage/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>用Apache Pig进行数据处理的时候，我们通常会在最后把处理结果保存到一个HDFS目录下：</p>
<blockquote>
<p>
		STORE result INTO &#39;/my_output_dir&#39;;</p>
</blockquote>
<p>这是最常见的情况。<br />
但是，如果我们想根据某个字段，把数据分成多组，分别存储在多个目录下呢？举个可能不恰当的例子，就有点像我们先把数据按某个字段分组：</p>
<blockquote>
<p>
		GROUP data BY field;</p>
</blockquote>
<p>再把各个group的数据分别存储在不同的目录下一样。<br />
<span id="more-13628"></span><br />
现在来看一个实例。<br />
假设有如下数据（用 tab&nbsp;分隔的四列分别为：人员类型的id，人员类型的描述，姓名，爱好）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<table style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; width: 100%;">
<thead style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px;">
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				type</th>
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				desc</th>
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				name</th>
<th style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em; text-align: left; background-color: rgb(240, 240, 240);">
				hobby</th>
</tr>
</thead>
<tbody style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border: 0px;">
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				2</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				学生</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				陈玉</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				篮球</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				3</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				老师</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				王强</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				足球</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				1</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				保安</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				许勤</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				下棋</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				2</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				学生</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				范雨</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				跑步</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				2</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				学生</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				李林</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				游泳</td>
</tr>
<tr style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; border-width: 1px 0px 0px; border-right-style: initial; border-bottom-style: initial; border-left-style: initial; border-right-color: initial; border-bottom-color: initial; border-left-color: initial; border-image: initial; border-top-style: solid; border-top-color: rgb(204, 204, 204); background-color: white;">
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				1</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				保安</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				涂欣</td>
<td style="color: inherit; line-height: inherit; margin: 0px; font-size: 1em; border-style: solid; border-color: rgb(204, 204, 204); padding: 0.5em 1em;">
				看书</td>
</tr>
</tbody>
</table>
</section>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
现在我们想按第1列 type，把这份数据处理之后分别存储到不同的目录下，即 type 2（学生）的数据保存到目录&quot;2&quot;里，type 3（老师）的数据保存到目录&quot;3&quot;里，依此类推。<br />
一个最简单也最笨的方法是： 在Pig中把数据 SPLIT(拆分) 成多份，再分别 STORE(存储) 到指定的目录下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">type</span>:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">desc</span>:&nbsp;chararray,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>:&nbsp;chararray,&nbsp;hobby:&nbsp;chararray);
SPLIT&nbsp;A&nbsp;INTO&nbsp;A1&nbsp;IF&nbsp;(type&nbsp;==&nbsp;1),&nbsp;A2&nbsp;IF&nbsp;(type&nbsp;==&nbsp;2),&nbsp;A3&nbsp;IF&nbsp;(type&nbsp;==&nbsp;3);

STORE&nbsp;A1&nbsp;INTO&nbsp;&#39;/my_output_dir/1&#39;;
STORE&nbsp;A2&nbsp;INTO&nbsp;&#39;/my_output_dir/2&#39;;
STORE&nbsp;A3&nbsp;INTO&nbsp;&#39;/my_output_dir/3&#39;;
</code></pre>
</section>
<p>
代码很直观，但当 type 有太多种数值的时候，比如说type有100种数值，难道你打算写100个 STORE 语句吗？会疯。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
更好的办法是利用piggybank里的一个UDF <a href="https://pig.apache.org/docs/latest/api/org/apache/pig/piggybank/storage/MultiStorage.html" rel="noopener" target="_blank">MultiStorage</a>来实现同样的功能：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">type</span>:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">desc</span>:&nbsp;chararray,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>:&nbsp;chararray,&nbsp;hobby:&nbsp;chararray);
STORE&nbsp;A&nbsp;INTO&nbsp;&#39;/my_output_dir&#39;&nbsp;USING&nbsp;org.apache.pig.piggybank.storage.MultiStorage(&#39;/my_output_dir&#39;,&nbsp;&#39;0&#39;,&nbsp;&#39;none&#39;,&nbsp;&#39;\t&#39;,&nbsp;&#39;true&#39;);
</code></pre>
</section>
<p>看，只要一行代码就代替了无数的 SPLIT INTO&nbsp;以及 STORE INTO&nbsp;语句，多么优雅。<br />
MultiStorage的参数含义分别是：</p>
<blockquote>
<p>
		public MultiStorage(String <span style="color:#0000ff;">parentPathStr</span>, String <span style="color:#0000ff;">splitFieldIndex</span>, String <span style="color:#0000ff;">compression</span>, String <span style="color:#0000ff;">fieldDel</span>, String <span style="color:#0000ff;">isRemoveKeys</span>)</p>
</blockquote>
<p>parentPathStr：输出文件的父目录。事实上这个参数不起作用，真正起作用的输出路径是在 &quot;INTO&quot;&nbsp;关键字后面的那个路径里指定的，但开发团队为了向前兼容，保留了这个参数，因此，把这个参数里的路径，设置成和 &quot;INTO&quot;&nbsp;后面的路径一样即可。<br />
splitFieldIndex：使用数据里的第几个字段来拆分数据存储到不同的子目录下（从0开始）。我这里设置成0，是表示我使用数据里的 type&nbsp;字段来拆分数据。<br />
compression：输出数据的压缩方式，可以是&nbsp;&#39;bz2&#39;，&#39;bz&#39;，&#39;gz&#39;，&#39;none&#39;这几种取值，其中 none&nbsp;表示无压缩。<br />
fieldDel：输出数据多个字段的分隔符。<br />
isRemoveKeys：输出数据是否移除掉用于拆分数据的字段。我这里设置成 true&nbsp;表示在输出数据中不会包含 type，因为我只想拿 type&nbsp;用作输出子目录名，不想让它出现在输出数据中。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
最后我们看一下执行结果。生成了3个目录，每个 type&nbsp;一个目录：</p>
<blockquote>
<p>
		/my_output_dir/1<br />
		/my_output_dir/2<br />
		/my_output_dir/3<br />
		/my_output_dir/_SUCCESS</p>
</blockquote>
<p>再来看一下子目录&quot;1&quot;里的文件：</p>
<blockquote>
<p>
		/my_output_dir/1/1-0,000</p>
</blockquote>
<p>其文件内容是：</p>
<blockquote>
<p>
		保安&nbsp; &nbsp; 许勤&nbsp; &nbsp; 下棋<br />
		保安&nbsp; &nbsp; 涂欣&nbsp; &nbsp; 看书</p>
</blockquote>
<p>这里面全是 type 1&nbsp;的数据。可见 MultiStorage 确实把数据按照 type&nbsp;拆分并存储在了不同的子目录下。</p>
<p>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
	感谢关注我的微信公众号（微信扫一扫）：<br />
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e5%a6%82%e4%bd%95%e6%8c%89%e6%95%b0%e6%8d%ae%e5%88%86%e7%bb%84%e4%bf%9d%e5%ad%98%e5%88%b0%e4%b8%8d%e5%90%8c%e7%9a%84%e5%ad%90%e7%9b%ae%e5%bd%95%e4%b8%admultistorage/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] Apache Pig解析JSON数据</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e8%a7%a3%e6%9e%90json%e6%95%b0%e6%8d%ae/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e8%a7%a3%e6%9e%90json%e6%95%b0%e6%8d%ae/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Tue, 04 Jan 2022 18:20:14 +0000</pubDate>
				<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[json]]></category>
		<category><![CDATA[Parse]]></category>
		<category><![CDATA[解析]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=13406</guid>

					<description><![CDATA[<div style="text-align: center;">
	<a href="https://www.codelast.com/" rel="noopener" target="_blank"><img decoding="async" alt="JSON" src="//www.codelast.com/wp-content/uploads/2022/01/json_logo.png" style="width: 300px; height: 68px;" /></a></div>
<p>
查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>在大数据处理领域，JSON格式的数据非常常见，然而用Apache Pig读取JSON并正确取出其中的字段我觉得并不算方便（在某些情况下很容易写错），所以总结一下几个常见的JSON loader/UDF的用法。</p>
<p>假设有数据文件 1.txt，内容是一行JSON（为了简单，这里以一行为例）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="json language-json hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;items&#34;</span>:[{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;id&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;111&#34;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;name&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;aaa&#34;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;extra&#34;</span>:{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;k&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;ttt&#34;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;v&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;uuu&#34;</span>}},{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;id&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;222&#34;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;name&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;bbb&#34;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;extra&#34;</span>:{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;k&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;rrr&#34;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;v&#34;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;sss&#34;</span>}}]}
</code></pre>
</section>
<p><span id="more-13406"></span><br />
为了便于观看，这里把它格式化如下（注意，Pig是按行读取数据，这里格式化只是为了写这篇文章便于展示，实际存储在文件里仍然是一行）： </p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="json language-json hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">{
&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;items&#34;</span>:&#160;[
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;{
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;id&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;111&#34;</span>,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;name&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;aaa&#34;</span>,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;extra&#34;</span>:&#160;{
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;k&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;ttt&#34;</span>,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;v&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;uuu&#34;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;{
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;id&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;222&#34;</span>,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;name&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;bbb&#34;</span>,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;extra&#34;</span>:&#160;{
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;k&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;rrr&#34;</span>,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&#34;v&#34;</span>:&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;sss&#34;</span>
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;]
}
</code></pre>
</section>
<p>
我们的目的是正确解析这个JSON。</p>
<p>➤&#160;来自于<a href="https://github.com/twitter/elephant-bird" rel="noopener" target="_blank">Elephant Bird</a>的 com.twitter.elephantbird.pig.piggybank.JsonStringToMap()&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e8%a7%a3%e6%9e%90json%e6%95%b0%e6%8d%ae/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<div style="text-align: center;">
	<a href="https://www.codelast.com/" rel="noopener" target="_blank"><img decoding="async" alt="JSON" src="//www.codelast.com/wp-content/uploads/2022/01/json_logo.png" style="width: 300px; height: 68px;" /></a></div>
<p>
查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>在大数据处理领域，JSON格式的数据非常常见，然而用Apache Pig读取JSON并正确取出其中的字段我觉得并不算方便（在某些情况下很容易写错），所以总结一下几个常见的JSON loader/UDF的用法。</p>
<p>假设有数据文件 1.txt，内容是一行JSON（为了简单，这里以一行为例）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="json language-json hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;items&quot;</span>:[{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;id&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;111&quot;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;name&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;aaa&quot;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;extra&quot;</span>:{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;k&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;ttt&quot;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;v&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;uuu&quot;</span>}},{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;id&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;222&quot;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;name&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;bbb&quot;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;extra&quot;</span>:{<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;k&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;rrr&quot;</span>,<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;v&quot;</span>:<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;sss&quot;</span>}}]}
</code></pre>
</section>
<p><span id="more-13406"></span><br />
为了便于观看，这里把它格式化如下（注意，Pig是按行读取数据，这里格式化只是为了写这篇文章便于展示，实际存储在文件里仍然是一行）： </p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="json language-json hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;items&quot;</span>:&nbsp;[
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;id&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;111&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;name&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;aaa&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;extra&quot;</span>:&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;k&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;ttt&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;v&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;uuu&quot;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;},
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;id&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;222&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;name&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;bbb&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;extra&quot;</span>:&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;k&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;rrr&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;v&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;sss&quot;</span>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;]
}
</code></pre>
</section>
<p>
我们的目的是正确解析这个JSON。</p>
<p>➤&nbsp;来自于<a href="https://github.com/twitter/elephant-bird" rel="noopener" target="_blank">Elephant Bird</a>的 com.twitter.elephantbird.pig.piggybank.JsonStringToMap()<br />
其实它不是一个JSON loader而是一个UDF，它的作用是把JSON字符串转成一个map。由于JSON结构中都是key:value形式的数据，所以把JSON字符串转成map之后就可以取出指定字段的内容。<br />
这个UDF对于只有&ldquo;一层&rdquo;的JSON非常好用，例如（格式化后）长下面这个样子的JSON：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="json language-json hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;id&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;111&quot;</span>,
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-attr" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">&quot;name&quot;</span>:&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;aaa&quot;</span>
}
</code></pre>
</section>
<p>此时，可以像下面这样取到字段 id&nbsp;及 name 的值：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">DEFINE&nbsp;Json2Map&nbsp;com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

A&nbsp;=&nbsp;LOAD&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;AS&nbsp;(jsonStr:&nbsp;chararray);
B&nbsp;=&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">FOREACH&nbsp;A&nbsp;GENERATE&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">Json2Map</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(jsonStr)</span>&nbsp;AS&nbsp;jsonObj</span>;
C&nbsp;=&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">FOREACH&nbsp;B&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">GENERATE</span>&nbsp;<span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(chararray)</span>&nbsp;jsonObj#&#39;id&#39;&nbsp;AS&nbsp;id,&nbsp;<span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(chararray)</span>&nbsp;jsonObj#&#39;name&#39;&nbsp;AS&nbsp;name</span>;
DUMP&nbsp;C;
</code></pre>
</section>
<p>在这里，JSON被转成了一个map对象，用 <span style="color:#ff0000;">#&#39;id&#39;</span>&nbsp;的方式可以取出名为&nbsp;id&nbsp;的key对应的value。<br />
可见这个UDF在某些场合下用起来很方便，但对本文一开始举例的那种嵌套有多层的JSON就不好处理了。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
➤ Pig内置的 JsonLoader<br />
对付本文开头的那种JSON，我们可以使用Pig内置的&nbsp;JsonLoader&nbsp;来加载数据，使用它唯一的坏处就是需要定义JSON的结构：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;LOAD&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">JsonLoader</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;items:&nbsp;{(id:chararray,&nbsp;name:chararray,&nbsp;extra:(k:chararray,&nbsp;v:chararray))}&#39;</span>)</span></span>;
B&nbsp;=&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">FOREACH&nbsp;A&nbsp;GENERATE&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">FLATTEN</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(items)</span></span>;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;id,&nbsp;name,&nbsp;FLATTEN(extra)&nbsp;AS&nbsp;(k,&nbsp;v);
D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;id,&nbsp;name,&nbsp;k,&nbsp;v;
DUMP&nbsp;D;
</code></pre>
</section>
<p>我们仔细看一下本文开头的JSON，它的 <span style="color:#ff0000;">items</span>&nbsp;是一个数组，对应到Pig里就是一个bag（bag里的每个元素是一个tuple），因此用JsonLoader定义JSON的结构时，<span style="color:#ff0000;">items</span>&nbsp;是 <span style="color:#0000ff;">items: {()}</span>&nbsp;这样一种结构。<br />
而JSON中的&nbsp;<span style="color:#ff0000;">extra</span>&nbsp;字段对应的并不是一个数组，所以它对应到Pig里就是一个tuple，因此在JsonLoader中定义结构时&nbsp;<span style="color:#ff0000;">extra</span>&nbsp;是&nbsp;<span style="color:#0000ff;">extra:(k:chararray, v:chararray)</span> 这样的形式。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
注意：JsonLoader看似好用，然而这里<span style="background-color:#ffff00;">有一个巨坑</span>&mdash;&mdash;JsonLoader中定义的字段名和JSON字符串中的字段名并不对应，它只跟顺序相关！<br />
例如，你代码这样写：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;LOAD&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">JsonLoader</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;items:&nbsp;{(id:chararray,&nbsp;invalidField:chararray,&nbsp;extra:(k:chararray,&nbsp;v:chararray))}&#39;</span>)</span></span>;
B&nbsp;=&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">FOREACH&nbsp;A&nbsp;GENERATE&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">FLATTEN</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(items)</span></span>;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;id,&nbsp;invalidField,&nbsp;FLATTEN(extra)&nbsp;AS&nbsp;(k,&nbsp;v);
D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;id,&nbsp;invalidField,&nbsp;k,&nbsp;v;
DUMP&nbsp;D;
</code></pre>
</section>
<p>竟然能正确输出：</p>
<blockquote>
<div>
		(111,aaa,ttt,uuu)</div>
<div>
		(222,bbb,rrr,sss)</div>
</blockquote>
<p>在上面的代码中，<span style="color:#ff0000;">invalidField</span>&nbsp;是一个在JSON字符串中根本不存在的字段名，然而它却能被JsonLoader正确识别，那是因为JsonLoader是按字段的顺序来识别的，JSON字符串中的第2个字段 <span style="color:#0000ff;">name</span>，被JsonLoader映射到了&nbsp;<span style="color: rgb(255, 0, 0);">invalidField&nbsp;</span>这个字段上。<br />
也正是因为这个原因，如果你的JSON字符串的各个字段顺序是不恒定的，就不能用这种方法来读取！<br />
并且，在JsonLoader中定义JSON的结构时要非常小心，如果不小心写错了顺序，那么肯定会得到错误的值，例如，你这样写：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;LOAD&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">JsonLoader</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;items:&nbsp;{(name:chararray,&nbsp;id:chararray,&nbsp;extra:(k:chararray,&nbsp;v:chararray))}&#39;</span>)</span></span>;
</code></pre>
</section>
<p>你以为JsonLoader会自动把JSON字符串中的&nbsp;<span style="color:#ff0000;">id</span>&nbsp;字段映射到你定义的第二个字段（id:chararray）上吗？根本不会！它会把它映射到 name:chararray&nbsp;上，因为在JSON字符串中，id&nbsp;字段是排在第1位的字段，而在JsonLoader中，name:chararray&nbsp;也是排第1位的字段，因此它俩就映射上了。<br />
所以，如果你有JSON数据的控制权，我建议在生成JSON数据的时候，固定生成字段的顺序。例如，如果是使用 Jackson的ObjectMapper生成JSON，那么可以通过&nbsp;@JsonPropertyOrder&nbsp;标注来控制JSON字段的顺序。<br />
另外，如果你定义的结构中漏了一个字段，很可能会造成解析不成功，导致程序逻辑不能按设想的执行！</p>
<p>最后，这仅仅是一个嵌套层数非常少的JSON，如果层数非常多的话，会导致你代码很难写，这个时候还是用JAVA写Map-Reduce&nbsp;job去处理吧。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：<br />
<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e8%a7%a3%e6%9e%90json%e6%95%b0%e6%8d%ae/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 使用Apache DataFu中的Coalesce()简化Apache Pig的三元运算符</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8apache-datafu%e4%b8%ad%e7%9a%84coalesce%e7%ae%80%e5%8c%96apache-pig%e7%9a%84%e4%b8%89%e5%85%83%e8%bf%90%e7%ae%97%e7%ac%a6/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8apache-datafu%e4%b8%ad%e7%9a%84coalesce%e7%ae%80%e5%8c%96apache-pig%e7%9a%84%e4%b8%89%e5%85%83%e8%bf%90%e7%ae%97%e7%ac%a6/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Fri, 04 Dec 2020 03:15:31 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[Apache DataFu]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[Coalesce]]></category>
		<category><![CDATA[UDF]]></category>
		<category><![CDATA[三元运算符]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=13139</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>来看这个例子。有数据文件 1.txt，内容为：</p>
<blockquote>
<div>
		a<span style="color:#ff0000;">[\t]</span><span style="color: rgb(255, 0, 0);">[\t]</span>c</div>
<div>
		<span style="color: rgb(255, 0, 0);">[\t]</span>f<span style="color: rgb(255, 0, 0);">[\t]</span>g</div>
<div>
		h<span style="color: rgb(255, 0, 0);">[\t]</span>k<span style="color: rgb(255, 0, 0);">[\t]</span><br />
		<span style="color: rgb(255, 0, 0);">[\t]</span><span style="color: rgb(255, 0, 0);">[\t]</span><span style="color: rgb(255, 0, 0);">[\t]</span></div>
</blockquote>
<div>
	其中 <span style="color:#ff0000;">[\t]</span>&#160;表示制表符(tab)，并不是真的在文件中写了&#160;<span style="color: rgb(255, 0, 0);">[\t]</span>。<br />
	在Pig命令行交互模式下加载这个文件，并把其中为空（NULL）的列替换成一些数字：
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&#160;(col1:&#160;chararray,&#160;col2:&#160;chararray,&#160;col3:&#160;chararray);
B&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;(col1&#160;IS&#160;NOT&#160;NULL&#160;?&#160;col1&#160;:&#160;&#39;1&#39;)&#160;AS&#160;col1,&#160;(col2&#160;IS&#160;NOT&#160;NULL&#160;?&#160;col2&#160;:&#160;&#39;2&#39;)&#160;AS&#160;col2,&#160;(col3&#160;IS&#160;NOT&#160;NULL&#160;?&#160;col3&#160;:&#160;&#39;3&#39;)&#160;AS&#160;col3;
DUMP&#160;B;
</code></pre>
</section>
<p>输出：</p>
<blockquote>
<div>
			(a,2,c)</div>
<div>
			(1,f,g)</div>
<div>
			(h,k,3)</div>
<div>
			(1,2,3)</div>
</blockquote>
<p>	代码非常简单：如果第一列col1为空则替换为1，如果第二列为空则替换为2，如果第三列为空则替换为3。<br />
	这里使用了三元运算符 <span style="color:#0000ff;">?</span></p></div>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8apache-datafu%e4%b8%ad%e7%9a%84coalesce%e7%ae%80%e5%8c%96apache-pig%e7%9a%84%e4%b8%89%e5%85%83%e8%bf%90%e7%ae%97%e7%ac%a6/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>来看这个例子。有数据文件 1.txt，内容为：</p>
<blockquote>
<div>
		a<span style="color:#ff0000;">[\t]</span><span style="color: rgb(255, 0, 0);">[\t]</span>c</div>
<div>
		<span style="color: rgb(255, 0, 0);">[\t]</span>f<span style="color: rgb(255, 0, 0);">[\t]</span>g</div>
<div>
		h<span style="color: rgb(255, 0, 0);">[\t]</span>k<span style="color: rgb(255, 0, 0);">[\t]</span><br />
		<span style="color: rgb(255, 0, 0);">[\t]</span><span style="color: rgb(255, 0, 0);">[\t]</span><span style="color: rgb(255, 0, 0);">[\t]</span></div>
</blockquote>
<div>
	其中 <span style="color:#ff0000;">[\t]</span>&nbsp;表示制表符(tab)，并不是真的在文件中写了&nbsp;<span style="color: rgb(255, 0, 0);">[\t]</span>。<br />
	在Pig命令行交互模式下加载这个文件，并把其中为空（NULL）的列替换成一些数字：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;(col1&nbsp;IS&nbsp;NOT&nbsp;NULL&nbsp;?&nbsp;col1&nbsp;:&nbsp;&#39;1&#39;)&nbsp;AS&nbsp;col1,&nbsp;(col2&nbsp;IS&nbsp;NOT&nbsp;NULL&nbsp;?&nbsp;col2&nbsp;:&nbsp;&#39;2&#39;)&nbsp;AS&nbsp;col2,&nbsp;(col3&nbsp;IS&nbsp;NOT&nbsp;NULL&nbsp;?&nbsp;col3&nbsp;:&nbsp;&#39;3&#39;)&nbsp;AS&nbsp;col3;
DUMP&nbsp;B;
</code></pre>
</section>
<p>输出：</p>
<blockquote>
<div>
			(a,2,c)</div>
<div>
			(1,f,g)</div>
<div>
			(h,k,3)</div>
<div>
			(1,2,3)</div>
</blockquote>
<p>	代码非常简单：如果第一列col1为空则替换为1，如果第二列为空则替换为2，如果第三列为空则替换为3。<br />
	这里使用了三元运算符 <span style="color:#0000ff;">? :</span> 来做这个判断，写法非常丑陋。<br />
	<span id="more-13139"></span><br />
	如果要实现更麻烦的逻辑，&ldquo;取第一列、第二列、第三列中，第1个不为空的列作为结果，如果全为空则取888&rdquo;：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;(col1&nbsp;IS&nbsp;NOT&nbsp;NULL&nbsp;?&nbsp;col1&nbsp;:&nbsp;(col2&nbsp;IS&nbsp;NOT&nbsp;NULL&nbsp;?&nbsp;col2&nbsp;:&nbsp;(col3&nbsp;IS&nbsp;NOT&nbsp;NULL&nbsp;?&nbsp;col3&nbsp;:&nbsp;&#39;888&#39;)))&nbsp;AS&nbsp;col;
DUMP&nbsp;B;
</code></pre>
</section>
<p>输出：</p>
<blockquote>
<div>
			(a)</div>
<div>
			(f)</div>
<div>
			(h)</div>
<div>
			(888)</div>
</blockquote>
<p>	这代码是真&bull;丑出天际。<br />
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	这时，我们就可以使用<a href="https://datafu.apache.org/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Apache DataFu</span></a>中的&nbsp;<a href="https://github.com/apache/datafu/blob/master/datafu-pig/src/main/java/datafu/pig/util/Coalesce.java" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Coalesce()</span></a>&nbsp;函数来简化代码。</div>
<blockquote>
<div>
		Apache DataFu&trade;是用于在Hadoop中处理大规模数据的库的集合。</div>
</blockquote>
<div>
	Coalesce()的功能：</div>
<blockquote>
<div>
		返回一个元组(tuple)的第一个非空值，就像SQL中的Coalesce一样。</div>
</blockquote>
<div>
	简化上面的第一段代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">DEFINE&nbsp;Coalesce&nbsp;datafu.pig.util.Coalesce();

A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;Coalesce(col1,&nbsp;&#39;1&#39;)&nbsp;AS&nbsp;col1,&nbsp;Coalesce(col2,&nbsp;&#39;2&#39;)&nbsp;AS&nbsp;col2,&nbsp;Coalesce(col3,&nbsp;&#39;3&#39;)&nbsp;AS&nbsp;col3;
DUMP&nbsp;B;
</code></pre>
</section>
<p>输出结果完全一样。<br />
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	简化上面的第二段代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">DEFINE&nbsp;Coalesce&nbsp;datafu.pig.util.Coalesce();

A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;Coalesce(col1,&nbsp;col2,&nbsp;col3,&nbsp;&#39;888&#39;)&nbsp;AS&nbsp;col;
DUMP&nbsp;B;
</code></pre>
</section>
<p>三元运算符嵌套的层数越多，使用Coalesce能简化得就越多。<br />
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
	感谢关注我的微信公众号（微信扫一扫）：<br />
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
</div>
<div>
	&nbsp;</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e4%bd%bf%e7%94%a8apache-datafu%e4%b8%ad%e7%9a%84coalesce%e7%ae%80%e5%8c%96apache-pig%e7%9a%84%e4%b8%89%e5%85%83%e8%bf%90%e7%ae%97%e7%ac%a6/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 如何减少map-only的Pig job的输出文件数</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%87%8f%e5%b0%91map-only%e7%9a%84pig-job%e7%9a%84%e8%be%93%e5%87%ba%e6%96%87%e4%bb%b6%e6%95%b0/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%87%8f%e5%b0%91map-only%e7%9a%84pig-job%e7%9a%84%e8%be%93%e5%87%ba%e6%96%87%e4%bb%b6%e6%95%b0/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sun, 11 Aug 2019 06:17:39 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[Pig Latin]]></category>
		<category><![CDATA[pig教程]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=10494</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>如果一个Pig job是map-only的job，并且其输入文件数很多的话，那么输出的文件数也会同样多，此时，如果每个文件大小又比较小的话，长久下去就会对Haodoop NameNode造成很大压力。我们可以通过给Pig&#160;job添加一个reduce过程来减少输出文件数。<br />
<span id="more-10494"></span><br />
话不多说，直接上代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">D&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;/user/abc/xxx&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&#160;(col1:&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>,&#160;col2:&#160;chararray);&#160;&#160;
E&#160;=&#160;FOREACH&#160;(GROUP&#160;D&#160;BY&#160;RANDOM()&#160;PARALLEL&#160;100)&#160;GENERATE&#160;FLATTEN(D);&#160;
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
其中：<br />
加载到D里的数据就是文件数量巨大，并且每个文件大小都比较小的一份数据。<br />
<span style="color:#0000ff;">PARALLEL</span>后的100可以视情况更改，设置为100则输出文件数为100。<br />
这里的实现逻辑是：添加一个reduce过程，其shuffle&#160;key是由Pig的 <span style="color:#0000ff;">RANDOM()</span>&#160;函数生成的一个随机数，因此不会出现让大量数据聚集在同一个reducer里的情况，这样做是试图使输出数据尽可能均匀地分布在100个输出文件里。</p>
<p>但实际上你会发现，实际输出的文件大小并不均衡，比如有些300多MB大，有些只有100多MB大，这是因为 RANDOM()&#160;函数输出的是一个 [0.0, 1.0)&#160;范围内的数，这个小数作为shuffle key的话，很难说会有什么后果，因此我们把它放大1000倍再转为int，就可以避免这个问题，真正让输出数据的大小很平均：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; letter-spacing: 0px; color: rgb(62, 62, 62); line-height: 1.6; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="padding: 0px; font-size: inherit; width: 800.656px; margin-top: 0px; margin-bottom: 0px; color: inherit; line-height: inherit;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">&#160;E&#160;=&#160;FOREACH&#160;(GROUP&#160;D&#160;BY&#160;((int) (RANDOM() * 1000))&#160;PARALLEL&#160;100)&#160;GENERATE&#160;FLATTEN(D);&#160;
</code></pre>
</section>
<p>
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&#160;版权声明&#160;<span style="color: rgb(255, 0, 0);">➤➤</span>&#160;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%87%8f%e5%b0%91map-only%e7%9a%84pig-job%e7%9a%84%e8%be%93%e5%87%ba%e6%96%87%e4%bb%b6%e6%95%b0/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>如果一个Pig job是map-only的job，并且其输入文件数很多的话，那么输出的文件数也会同样多，此时，如果每个文件大小又比较小的话，长久下去就会对Haodoop NameNode造成很大压力。我们可以通过给Pig&nbsp;job添加一个reduce过程来减少输出文件数。<br />
<span id="more-10494"></span><br />
话不多说，直接上代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">D&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;/user/abc/xxx&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>,&nbsp;col2:&nbsp;chararray);&nbsp;&nbsp;
E&nbsp;=&nbsp;FOREACH&nbsp;(GROUP&nbsp;D&nbsp;BY&nbsp;RANDOM()&nbsp;PARALLEL&nbsp;100)&nbsp;GENERATE&nbsp;FLATTEN(D);&nbsp;
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
其中：<br />
加载到D里的数据就是文件数量巨大，并且每个文件大小都比较小的一份数据。<br />
<span style="color:#0000ff;">PARALLEL</span>后的100可以视情况更改，设置为100则输出文件数为100。<br />
这里的实现逻辑是：添加一个reduce过程，其shuffle&nbsp;key是由Pig的 <span style="color:#0000ff;">RANDOM()</span>&nbsp;函数生成的一个随机数，因此不会出现让大量数据聚集在同一个reducer里的情况，这样做是试图使输出数据尽可能均匀地分布在100个输出文件里。</p>
<p>但实际上你会发现，实际输出的文件大小并不均衡，比如有些300多MB大，有些只有100多MB大，这是因为 RANDOM()&nbsp;函数输出的是一个 [0.0, 1.0)&nbsp;范围内的数，这个小数作为shuffle key的话，很难说会有什么后果，因此我们把它放大1000倍再转为int，就可以避免这个问题，真正让输出数据的大小很平均：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; letter-spacing: 0px; color: rgb(62, 62, 62); line-height: 1.6; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="padding: 0px; font-size: inherit; width: 800.656px; margin-top: 0px; margin-bottom: 0px; color: inherit; line-height: inherit;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">&nbsp;E&nbsp;=&nbsp;FOREACH&nbsp;(GROUP&nbsp;D&nbsp;BY&nbsp;((int) (RANDOM() * 1000))&nbsp;PARALLEL&nbsp;100)&nbsp;GENERATE&nbsp;FLATTEN(D);&nbsp;
</code></pre>
</section>
<p>
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%87%8f%e5%b0%91map-only%e7%9a%84pig-job%e7%9a%84%e8%be%93%e5%87%ba%e6%96%87%e4%bb%b6%e6%95%b0/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] Apache Pig问题：Encountered IOException. org.apache.pig.tools.parameters.ParseException: Encountered &quot;&quot;</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e9%97%ae%e9%a2%98%ef%bc%9aencountered-ioexception-org-apache-pig-tools-parameters-parseexception-encountered/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e9%97%ae%e9%a2%98%ef%bc%9aencountered-ioexception-org-apache-pig-tools-parameters-parseexception-encountered/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 25 Jul 2019 03:13:17 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[Encountered EOF]]></category>
		<category><![CDATA[教程]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=10465</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>运行Pig脚本时报错：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">Error&#160;before&#160;Pig&#160;is&#160;launched
<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">----------------------------</span>
ERROR&#160;2997:&#160;Encountered&#160;IOException.&#160;org.apache.pig.tools.parameters.ParseException:&#160;Encountered&#160;&#34;&#60;EOF&#62;&#34;&#160;at&#160;line&#160;1,&#160;column&#160;8.
Was&#160;expecting&#160;one&#160;of:
&#160;&#160;&#160;&#160;&#60;IDENTIFIER&#62;&#160;...
&#160;&#160;&#160;&#160;&#60;OTHER&#62;&#160;...
&#160;&#160;&#160;&#160;&#60;LITERAL&#62;&#160;...
&#160;&#160;&#160;&#160;&#60;SHELLCMD&#62;&#160;...
java.io.IOException:&#160;org.apache.pig.tools.parameters.ParseException:&#160;Encountered&#160;&#34;&#60;EOF&#62;&#34;&#160;at&#160;line&#160;1,&#160;column&#160;8.
Was&#160;expecting&#160;one&#160;of:
&#160;&#160;&#160;&#160;&#60;IDENTIFIER&#62;&#160;...
&#160;&#160;&#160;&#160;&#60;OTHER&#62;&#160;...
&#160;&#160;&#160;&#160;&#60;LITERAL&#62;&#160;...
&#160;&#160;&#160;&#160;&#60;SHELLCMD&#62;&#160;...

&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;org.apache.pig.impl.PigContext.doParamSubstitution(PigContext.java:408)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;org.apache.pig.Main.runParamPreprocessor(Main.java:783)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;org.apache.pig.Main.run(Main.java:446)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;org.apache.pig.Main.main(Main.java:158)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;sun.reflect.NativeMethodAccessorImpl.invoke0(Native&#160;Method)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;java.lang.reflect.Method.invoke(Method.java:497)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;org.apache.hadoop.util.RunJar.run(RunJar.java:221)
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;at&#160;org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused&#160;by:&#160;org.apache.pig.tools.parameters.ParseException:&#160;Encountered&#160;&#34;&#60;EOF&#62;&#34;&#160;at&#160;line&#160;1,&#160;column&#160;8.
</code></pre>
</section>
<div>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	这个问题有可能有多种原因，比如某行漏写了语句结尾的分号。这里我遇到的是另一个原因：调用该Pig脚本的shell脚本，用 -p &#34;xxx=$X&#34; 这种形式传参时，参数为空，修正参数为空的问题即可解决。<br />
	<span id="more-10465"></span><br />
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&#160;版权声明&#160;<span style="color: rgb(255, 0, 0);">➤➤</span>&#160;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u></div>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e9%97%ae%e9%a2%98%ef%bc%9aencountered-ioexception-org-apache-pig-tools-parameters-parseexception-encountered/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>运行Pig脚本时报错：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">Error&nbsp;before&nbsp;Pig&nbsp;is&nbsp;launched
<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">----------------------------</span>
ERROR&nbsp;2997:&nbsp;Encountered&nbsp;IOException.&nbsp;org.apache.pig.tools.parameters.ParseException:&nbsp;Encountered&nbsp;&quot;&lt;EOF&gt;&quot;&nbsp;at&nbsp;line&nbsp;1,&nbsp;column&nbsp;8.
Was&nbsp;expecting&nbsp;one&nbsp;of:
&nbsp;&nbsp;&nbsp;&nbsp;&lt;IDENTIFIER&gt;&nbsp;...
&nbsp;&nbsp;&nbsp;&nbsp;&lt;OTHER&gt;&nbsp;...
&nbsp;&nbsp;&nbsp;&nbsp;&lt;LITERAL&gt;&nbsp;...
&nbsp;&nbsp;&nbsp;&nbsp;&lt;SHELLCMD&gt;&nbsp;...
java.io.IOException:&nbsp;org.apache.pig.tools.parameters.ParseException:&nbsp;Encountered&nbsp;&quot;&lt;EOF&gt;&quot;&nbsp;at&nbsp;line&nbsp;1,&nbsp;column&nbsp;8.
Was&nbsp;expecting&nbsp;one&nbsp;of:
&nbsp;&nbsp;&nbsp;&nbsp;&lt;IDENTIFIER&gt;&nbsp;...
&nbsp;&nbsp;&nbsp;&nbsp;&lt;OTHER&gt;&nbsp;...
&nbsp;&nbsp;&nbsp;&nbsp;&lt;LITERAL&gt;&nbsp;...
&nbsp;&nbsp;&nbsp;&nbsp;&lt;SHELLCMD&gt;&nbsp;...

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;org.apache.pig.impl.PigContext.doParamSubstitution(PigContext.java:408)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;org.apache.pig.Main.runParamPreprocessor(Main.java:783)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;org.apache.pig.Main.run(Main.java:446)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;org.apache.pig.Main.main(Main.java:158)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;sun.reflect.NativeMethodAccessorImpl.invoke0(Native&nbsp;Method)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;java.lang.reflect.Method.invoke(Method.java:497)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;org.apache.hadoop.util.RunJar.run(RunJar.java:221)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;at&nbsp;org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused&nbsp;by:&nbsp;org.apache.pig.tools.parameters.ParseException:&nbsp;Encountered&nbsp;&quot;&lt;EOF&gt;&quot;&nbsp;at&nbsp;line&nbsp;1,&nbsp;column&nbsp;8.
</code></pre>
</section>
<div>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	这个问题有可能有多种原因，比如某行漏写了语句结尾的分号。这里我遇到的是另一个原因：调用该Pig脚本的shell脚本，用 -p &quot;xxx=$X&quot; 这种形式传参时，参数为空，修正参数为空的问题即可解决。<br />
	<span id="more-10465"></span><br />
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
	感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /><br />
		以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-apache-pig%e9%97%ae%e9%a2%98%ef%bc%9aencountered-ioexception-org-apache-pig-tools-parameters-parseexception-encountered/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 如何在Apache Pig中判断一个bag中是否包含特定的元素</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%9c%a8apache-pig%e4%b8%ad%e5%88%a4%e6%96%ad%e4%b8%80%e4%b8%aabag%e4%b8%ad%e6%98%af%e5%90%a6%e5%8c%85%e5%90%ab%e7%89%b9%e5%ae%9a%e7%9a%84%e5%85%83%e7%b4%a0/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%9c%a8apache-pig%e4%b8%ad%e5%88%a4%e6%96%ad%e4%b8%80%e4%b8%aabag%e4%b8%ad%e6%98%af%e5%90%a6%e5%8c%85%e5%90%ab%e7%89%b9%e5%ae%9a%e7%9a%84%e5%85%83%e7%b4%a0/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Fri, 05 Aug 2016 08:47:00 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[bag包含指定元素]]></category>
		<category><![CDATA[check if an element is present in a bag]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=8875</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><span style="color:#b22222;">In Pig Latin, how to check if an element is present in a bag?</span></p>
<p>假设一个bag是由 int 元素组成的（可以理解为一个list），那么，如何判断这个bag中是否包含指定的元素（例如 5）呢？<br />
如果你看过Pig的doc，就知道它并没有自带这样一个函数，可以输入一个bag，以及另一个值作为参数，然后输出1或0来表示bag是否包含这个元素。<br />
所以，我们该如何实现这个功能？<br />
<span id="more-8875"></span><br />
现在我就以一个实际的例子来说明这个问题。<br />
假设我们有数据文件 1.txt：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[codelast@&#160;~]$&#160;cat&#160;1.txt&#160;
a&#160;&#160;&#160;&#160;{(1),(2),(3),(5),(6)}
b&#160;&#160;&#160;&#160;{(1)}
c&#160;&#160;&#160;&#160;{(1),(2)}
d&#160;&#160;&#160;&#160;{(1),(3),(5)}
3&#160;&#160;&#160;&#160;{(1),(2),(5),(6)}
</code></pre>
</section>
<p>一共有两列，用 \t 分隔。其中，第一列是一个字符串，第二列样子很怪，它之所以写成那样，是为了可以在Pig读入的时候直接加载为一个bag，在这里，你可以把第二列理解为一个list，以第一行为例，这个list包含的元素就是1、2、3、5、6。<br />
如果我想判断每一行数据的第二列中，是否包含 5 这个元素，代码该怎么写？<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%9c%a8apache-pig%e4%b8%ad%e5%88%a4%e6%96%ad%e4%b8%80%e4%b8%aabag%e4%b8%ad%e6%98%af%e5%90%a6%e5%8c%85%e5%90%ab%e7%89%b9%e5%ae%9a%e7%9a%84%e5%85%83%e7%b4%a0/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><span style="color:#b22222;">In Pig Latin, how to check if an element is present in a bag?</span></p>
<p>假设一个bag是由 int 元素组成的（可以理解为一个list），那么，如何判断这个bag中是否包含指定的元素（例如 5）呢？<br />
如果你看过Pig的doc，就知道它并没有自带这样一个函数，可以输入一个bag，以及另一个值作为参数，然后输出1或0来表示bag是否包含这个元素。<br />
所以，我们该如何实现这个功能？<br />
<span id="more-8875"></span><br />
现在我就以一个实际的例子来说明这个问题。<br />
假设我们有数据文件 1.txt：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[codelast@&nbsp;~]$&nbsp;cat&nbsp;1.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;{(1),(2),(3),(5),(6)}
b&nbsp;&nbsp;&nbsp;&nbsp;{(1)}
c&nbsp;&nbsp;&nbsp;&nbsp;{(1),(2)}
d&nbsp;&nbsp;&nbsp;&nbsp;{(1),(3),(5)}
3&nbsp;&nbsp;&nbsp;&nbsp;{(1),(2),(5),(6)}
</code></pre>
</section>
<p>一共有两列，用 \t 分隔。其中，第一列是一个字符串，第二列样子很怪，它之所以写成那样，是为了可以在Pig读入的时候直接加载为一个bag，在这里，你可以把第二列理解为一个list，以第一行为例，这个list包含的元素就是1、2、3、5、6。<br />
如果我想判断每一行数据的第二列中，是否包含 5 这个元素，代码该怎么写？<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></p>
<ul>
<li>
		<span style="background-color:#00ff00;">方法一</span></li>
</ul>
<p>我们不妨直接来看看正确的实现方法：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>:&nbsp;chararray,&nbsp;aList:bag{(item:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>)});
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;{
&nbsp;&nbsp;FILTERED_LIST&nbsp;=&nbsp;FILTER&nbsp;aList&nbsp;BY&nbsp;($0&nbsp;==&nbsp;5);
&nbsp;&nbsp;GENERATE
&nbsp;&nbsp;(IsEmpty(FILTERED_LIST)&nbsp;?&nbsp;&#39;not-contain&#39;&nbsp;:&nbsp;&#39;contain&#39;)&nbsp;AS&nbsp;flag,
&nbsp;&nbsp;name;
}
DUMP&nbsp;B;
</code></pre>
</section>
<p>这段Pig代码的输出是：</p>
<blockquote>
<div>
		(contain,a)</div>
<div>
		(not-contain,b)</div>
<div>
		(not-contain,c)</div>
<div>
		(contain,d)</div>
<div>
		(contain,e)</div>
</blockquote>
<p>可以看到，第一行（a）是contain（包含5），第二行（b）是not-contain（不包含5），第三行（c）是not-contain（不包含5），等等。<br />
从输入数据上我们可以很容易地判断出来，这个输出结果是完全正确的。所以，上面的Pig代码是怎么做到的呢？<br />
且听我一行行分析下来。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a><br />
第1行是加载数据，&nbsp;<span style="color:#0000ff;">aList:bag{(item:int)} </span>这个形式有点怪，但它是由我们的输入数据 1.txt 里的格式决定的，这样做我们就能把第二列加载成一个bag，这个bag里有N个tuple，每个tuple里有一个int元素。<br />
第2到第7行是用了一个<a href="http://pig.apache.org/docs/r0.12.0/basic.html#foreach" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">嵌套的FOREACH</span></a>来实现&ldquo;判断一个bag中是否包含指定元素&rdquo;的功能，这一句：</p>
<blockquote>
<p>
		FILTERED_LIST = FILTER aList BY ($0 == 5);</p>
</blockquote>
<p>会先filter得到包含 5 这个元素的bag，然后下面的这一句：</p>
<blockquote>
<p>
		(IsEmpty(FILTERED_LIST) ? &#39;not-contain&#39; : &#39;contain&#39;) AS flag,</p>
</blockquote>
<p>会根根据filter得到的bag是否为空，来输出一个字符串，表示当前数据行是&ldquo;不包含&rdquo;还是&ldquo;包含&rdquo;指定的元素5。<br />
但我觉得有很多人一定有疑问：为什么 FILTER 那一句可以得到包含5的bag？ <span style="color:#0000ff;">$0 == 5</span> 是个什么鬼？其实，这就是查找&ldquo;包含5的bag&rdquo;，千万不要认为 <span style="color:#0000ff;">$0</span>&nbsp;只表示bag的第一个元素！大家可以看看<a href="http://pig.apache.org/docs/r0.12.0/basic.html#foreach" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">这个</span></a>Pig文档，然而它里面也没有对 $0 做具体的解释，大家就强行理解一下吧。<br />
所有又有人会问，那我能不能不用嵌套的FOREACH，直接像下面这样：</p>
<blockquote>
<p>
		B = FILTER A BY (aList.$0 == 5);</p>
</blockquote>
<p>来得到那些包含5的记录呢？答案是不行，这样写语法都是错的，一试便知。<br />
关于方法一，大家也可以参考<a href="http://stackoverflow.com/questions/26390220/check-if-an-element-is-present-in-a-bag" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">这个</span></a>stackoverflow的讨论（但它里面其实是有一些错误的）。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></p>
<ul>
<li>
		<span style="background-color:#00ff00;">方法二</span></li>
</ul>
<p>方法一会让第一次接触的同学有些费解。那么方法二就非常直接明了了。思路是：把bag FLATTEN（展开）出来，每一行数据展开成N行，然后第二列就变成了一个标量（int），就可以不用嵌套的FOREACH，也可以FILTER出来包含5的记录：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>:&nbsp;chararray,&nbsp;aList:bag{(item:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>)});
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;name,&nbsp;FLATTEN(aList)&nbsp;AS&nbsp;item;
C&nbsp;=&nbsp;FILTER&nbsp;B&nbsp;BY&nbsp;(item&nbsp;==&nbsp;5);
DUMP&nbsp;C;
</code></pre>
</section>
<p>输出结果：</p>
<blockquote>
<div>
		(a,5)</div>
<div>
		(d,5)</div>
<div>
		(e,5)</div>
</blockquote>
<p>可见结果是正确的。这里输出的只是bag中包含5的那些记录，如果要找到不包含5的那些记录，可以拿这个输出结果与原始数据做OUTER JOIN，但这显然比方法一麻烦，而且事实上它也不如方法一高效。所以，可以这么说，这算是&ldquo;看上去很容易理解&rdquo;的代价吧。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></p>
<ul>
<li>
		<span style="background-color:#00ff00;">方法三</span></li>
</ul>
<p>所以有没有一种方法，它既写起来简单，看起来又容易理解呢？那就是用<span style="color:#0000ff;">UDF</span>啦。但UDF终归还是要自己写的，这个工作量我们没有把它计算在内，所以，这算是方法三的劣势。</p>
<p>假设我们要编写的UDF名为<span style="color:#0000ff;">BagContains</span>，它接受两个参数，第一个参数是bag，第二个参数是我们要在bag中查找的值（例如5），当bag中包含指定元素时返回1，否则返回0。<br />
根据这个定义，我们来看看UDF的用法：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">REGISTER&nbsp;&#39;/home/codelast/my-pig-lib.jar&#39;;

DEFINE&nbsp;BagContains&nbsp;com.codelast.BagContains();

A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">name</span>:&nbsp;chararray,&nbsp;aList:bag{(item:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>)});
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;name,&nbsp;((BagContains(aList,&nbsp;5)&nbsp;==&nbsp;1)&nbsp;?&nbsp;&#39;contain&#39;&nbsp;:&nbsp;&#39;not-contain&#39;)&nbsp;AS&nbsp;flag;
DUMP&nbsp;B;
</code></pre>
</section>
<p>其中，REGISTER 那行引入的jar包，是我编写的UDF编译生成的jar包的路径。<br />
这段代码的输出结果：</p>
<blockquote>
<div>
		(a,contain)</div>
<div>
		(b,not-contain)</div>
<div>
		(c,not-contain)</div>
<div>
		(d,contain)</div>
<div>
		(e,contain)</div>
</blockquote>
<div>
	这个结果与方法一实际上是一样的。<br />
	使用了UDF的代码是如此清晰易懂，但是话说回来，这个UDF该怎么写呢？<br />
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></div>
<p>一言不合直接上UDF的代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">package</span>&nbsp;com.codelast;

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.EvalFunc;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.data.DataBag;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.data.DataType;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.data.Tuple;

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;java.io.IOException;

<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">/**
&nbsp;*&nbsp;Check&nbsp;whether&nbsp;a&nbsp;list&nbsp;contains&nbsp;a&nbsp;specified&nbsp;item.
&nbsp;*&nbsp;Usage:&nbsp;suppose&nbsp;aList&nbsp;is&nbsp;a&nbsp;bag&nbsp;contains&nbsp;int&nbsp;items,&nbsp;then
&nbsp;*&nbsp;ContainsItem(aList,&nbsp;5)
&nbsp;*&nbsp;will&nbsp;return&nbsp;1(the&nbsp;list&nbsp;contains&nbsp;5)&nbsp;or&nbsp;0(the&nbsp;list&nbsp;doesn&#39;t&nbsp;contains&nbsp;5)
&nbsp;*
&nbsp;*&nbsp;<span class="hljs-doctag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">@author</span>&nbsp;Darran&nbsp;Zhang&nbsp;@&nbsp;codelast.com
&nbsp;*/</span>
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-class" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">class</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">BagContains</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">extends</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">EvalFunc</span>&lt;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">Integer</span>&gt;&nbsp;</span>{
&nbsp;&nbsp;<span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">@Override</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Integer&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">exec</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(Tuple&nbsp;input)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">if</span>&nbsp;(input&nbsp;==&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>&nbsp;||&nbsp;input.size()&nbsp;==&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>&nbsp;||&nbsp;input.get(<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>)&nbsp;==&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>;
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;DataBag&nbsp;inputBag&nbsp;=&nbsp;DataType.toBag(input.get(<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>));
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>&nbsp;item2Find&nbsp;=&nbsp;DataType.toInteger(input.get(<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">1</span>));
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">for</span>&nbsp;(Tuple&nbsp;entry&nbsp;:&nbsp;inputBag)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>&nbsp;item&nbsp;=&nbsp;DataType.toInteger(entry.get(<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>));
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">if</span>&nbsp;(item&nbsp;==&nbsp;item2Find)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">1</span>;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>;
&nbsp;&nbsp;}
}
</code></pre>
</section>
<p>在这里，我就不解释每一行代码了，Pig UDF（JAVA）的编写有一个比较类似的套路，很多简单的UDF都可以用上面的格式稍微改改得到。</p>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e5%9c%a8apache-pig%e4%b8%ad%e5%88%a4%e6%96%ad%e4%b8%80%e4%b8%aabag%e4%b8%ad%e6%98%af%e5%90%a6%e5%8c%85%e5%90%ab%e7%89%b9%e5%ae%9a%e7%9a%84%e5%85%83%e7%b4%a0/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 如何用Apache Pig输出压缩格式的SequenceFile</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 23 Jul 2015 17:02:49 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[SequenceFile]]></category>
		<category><![CDATA[输出压缩文件]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=8499</guid>

					<description><![CDATA[<div>
	查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。
<p>	SequenceFile是Hadoop API提供的一种二进制文件，它将数据以&#60;key,value&#62;的形式序列化到文件中。</p></div>
<div>
	如果你要用Apache Pig读取这种类型的数据，可以用 PiggyBank 中的SequenceFileLoader&#8212;&#8212;我没有用过，但肯定是没问题的。</div>
<div>
<span id="more-8499"></span></div>
<div>
	但是，如果你保存在SequenceFile中的key或value是<span style="color:#b22222;">ThriftWritable</span>类型的数据，那么，要用Pig来 load ＆ store 这种数据，就不那么容易了。</div>
<div>
	幸好我们有Twitter，它已经帮我们做好了这个工作。利用其开源的 <a href="https://github.com/twitter/elephant-bird/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Elephant Bird</span></a>，我们可以轻松做到这一点。</div>
<div>
	Elephant Bird 中的 <span style="color:#0000ff;">SequenceFileLoader</span> 以及 <span style="color:#0000ff;">SequenceFileStorage</span> 就是用来干这个的。</div>
<div>
	&#160;</div>
<div>
	例如，load数据的做法是：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;xxx&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&#160;com.twitter.elephantbird.pig.load.SequenceFileLoader(
&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&#160;com.codelast.elephantbird.pig.util.BooleanWritableConverter&#39;</span>,
&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&#160;com.twitter.elephantbird.pig.util.ThriftWritableConverter&#160;com.codelast.MyThriftClass&#39;</span></code></pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/" class="read-more">Read More </a></section>]]></description>
										<content:encoded><![CDATA[<div>
	查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>	SequenceFile是Hadoop API提供的一种二进制文件，它将数据以&lt;key,value&gt;的形式序列化到文件中。</p></div>
<div>
	如果你要用Apache Pig读取这种类型的数据，可以用 PiggyBank 中的SequenceFileLoader&mdash;&mdash;我没有用过，但肯定是没问题的。</div>
<div>
<span id="more-8499"></span></div>
<div>
	但是，如果你保存在SequenceFile中的key或value是<span style="color:#b22222;">ThriftWritable</span>类型的数据，那么，要用Pig来 load ＆ store 这种数据，就不那么容易了。</div>
<div>
	幸好我们有Twitter，它已经帮我们做好了这个工作。利用其开源的 <a href="https://github.com/twitter/elephant-bird/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Elephant Bird</span></a>，我们可以轻松做到这一点。</div>
<div>
	Elephant Bird 中的 <span style="color:#0000ff;">SequenceFileLoader</span> 以及 <span style="color:#0000ff;">SequenceFileStorage</span> 就是用来干这个的。</div>
<div>
	&nbsp;</div>
<div>
	例如，load数据的做法是：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;xxx&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;com.twitter.elephantbird.pig.load.SequenceFileLoader(
&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&nbsp;com.codelast.elephantbird.pig.util.BooleanWritableConverter&#39;</span>,
&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-c&nbsp;com.twitter.elephantbird.pig.util.ThriftWritableConverter&nbsp;com.codelast.MyThriftClass&#39;</span>);</code></pre>
</section>
<div>
	其中，这份SequenceFile的key是BooleanWritable类型，value是ThriftWritable类型，它对应的Thrift类是MyThriftClass，这是一个自定义的Thrift class。</div>
<div>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></div>
<div>
	store 数据的做法是：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">STORE&nbsp;B&nbsp;INTO&nbsp;&#39;xxx&#39;&nbsp;USING&nbsp;com.twitter.elephantbird.pig.store.SequenceFileStorage(
&nbsp;&#39;-c&nbsp;com.codelast.elephantbird.pig.util.BooleanWritableConverter&#39;,
&nbsp;&#39;-c&nbsp;com.twitter.elephantbird.pig.util.ThriftWritableConverter&nbsp;com.codelast.MyThriftClass&#39;);
</code></pre>
</section>
<div>
	其中，对key和value的说明和上面一样。</div>
<div>
	&nbsp;</div>
<div>
	这样，就可以实现加载以及存储SequenceFile了。</div>
<div>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a></div>
<div>
	但是你会发现，这样输出的SequenceFile是未压缩的，所以文件体积比较大。如果要压缩，该怎么做呢？</div>
<div>
	答案就是在Pig脚本中添加以下几句话就OK了：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SET</span>&nbsp;output.compression.enabled&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;true&#39;</span>;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SET</span>&nbsp;mapreduce.output.fileoutputformat.compress.type&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;BLOCK&#39;</span>;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SET</span>&nbsp;output.compression.codec&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;org.apache.hadoop.io.compress.DefaultCodec&#39;</span>;
</code></pre>
</section>
<div>
	这会使得输出的SequenceFile是BLOCK压缩类型，默认压缩编码的文件。</p>
<p style="font-size: 16px; margin: 5px 0px; clear: both; font-family: sans-serif;">
		<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
		<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
		转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
		感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /><br />
		以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%a6%82%e4%bd%95%e7%94%a8apache-pig%e8%be%93%e5%87%ba%e5%8e%8b%e7%bc%a9%e6%a0%bc%e5%bc%8f%e7%9a%84sequencefile/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创]Apache Pig的一些基础概念及用法总结（2）</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e5%8f%8a%e7%94%a8%e6%b3%95%e6%80%bb%e7%bb%93%ef%bc%882%ef%bc%89/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e5%8f%8a%e7%94%a8%e6%b3%95%e6%80%bb%e7%bb%93%ef%bc%882%ef%bc%89/#comments</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 05 Apr 2012 06:45:06 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[Pig Latin]]></category>
		<category><![CDATA[pig tutorial in Chinese]]></category>
		<category><![CDATA[pig中文教程]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=4611</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><span style="background-color: rgb(0, 255, 0);">▶▶</span>&#160;LIMIT操作并不会减少读入的数据量<br />
如果你只需要输出一个小数据集，通常你可以使用LIMIT来实现，例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&#160;(col1:&#160;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&#160;col2:&#160;chararray);
B&#160;=&#160;LIMIT&#160;A&#160;5;
DUMP&#160;B;
</code></pre>
</section>
<p>Pig会只加载5条记录，就不再读取其他的记录了吗？答案是：不会。Pig将读取数据文件中的所有记录，然后再从中挑5条。这是Pig可以做优化、却没有做的一点。<br />
<span style="color:#800080;">【更新】</span>Pig 0.10已经有了这功能了：</p>
<blockquote>
<div>
		Push Limit into Loader</div>
<div>
		Pig optimizes limit query by pushing limit automatically to the loader, thus requiring only a fraction of the entire input to be scanned.</div></blockquote>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e5%8f%8a%e7%94%a8%e6%b3%95%e6%80%bb%e7%bb%93%ef%bc%882%ef%bc%89/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;LIMIT操作并不会减少读入的数据量<br />
如果你只需要输出一个小数据集，通常你可以使用LIMIT来实现，例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
B&nbsp;=&nbsp;LIMIT&nbsp;A&nbsp;5;
DUMP&nbsp;B;
</code></pre>
</section>
<p>Pig会只加载5条记录，就不再读取其他的记录了吗？答案是：不会。Pig将读取数据文件中的所有记录，然后再从中挑5条。这是Pig可以做优化、却没有做的一点。<br />
<span style="color:#800080;">【更新】</span>Pig 0.10已经有了这功能了：</p>
<blockquote>
<div>
		Push Limit into Loader</div>
<div>
		Pig optimizes limit query by pushing limit automatically to the loader, thus requiring only a fraction of the entire input to be scanned.</div>
</blockquote>
<div>
	按我的理解，上面这段话的含义是：Pig将LIMIT查询自动优化到loader中，这样就只会扫描整个输入数据集的一部分（而不是全部）。</div>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;使用UDF不一定要在Pig脚本中REGISTER，也可以在命令行指定<br />
大家知道，使用UDF需要在Pig脚本中REGISTER该UDF的jar包，但你可能不知道，你也可以不在Pig脚本中REGISTER它，而是通过命令行指定：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">pig&nbsp;-Dpig.additional.jars=/home/codelast/a.jar:/home/codelast/b.jar:/home/codelast/c.jar&nbsp;test.pig
</code></pre>
</section>
<p>以上命令告诉了我们几件事：<br />
<strong><span style="color:#ff0000;">①</span></strong>我们让Pig执行了test.pig脚本；<br />
<strong><span style="color:#ff0000;">②</span></strong>我们向Pig传入了&ldquo;pig.additional.jars&rdquo;这样一个参数，此参数的作用相当于在Pig脚本中REGISTER jar包；<br />
<strong><span style="color:#ff0000;">③</span></strong>如果你要REGISTER多个jar包，只需像上面的例子一样，用分号(:)把多个jar包路径隔开即可；<br />
<strong><span style="color:#ff0000;">④</span></strong>test.pig必须写在最后，而不能写成&ldquo;<span style="color:#0000ff;">pig test.pig -Dpig.additional.jars=XXX</span>&rdquo;这样，否则Pig直接报错：</p>
<blockquote>
<p>
		ERROR 2999: Unexpected internal error. Encountered unexpected arguments on command line - please check the command line.</p>
</blockquote>
<p>当然，为了可维护性好，你最好把REGISTER jar包写在Pig脚本中，不要通过命令行传入。<br />
<span id="more-4611"></span><br />
<span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;使用ORDER排序时，null会比所有值都小<br />
用ORDER按一个字段排序，如果该字段的所有值中有null，那么null会比其他值都小。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;如何按指定的几个字段来去重<br />
去重，即去除重复的记录。通常，我们使用DISTINCT来去除整行重复的记录，但是，如果我们只想用几个字段来去重，怎么做？<br />
假设有以下数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;1.txt&nbsp;
1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;&nbsp;uoip
1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;&nbsp;jklm
9&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;sdfa
8&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;dddd
9&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;qqqq
8&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;sfew
</code></pre>
</section>
<p>我们要按第1、2、3、4个字段来去重，也就是说，去重结果应为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;&nbsp;uoip
9&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;sdfa
8&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;dddd
</code></pre>
</section>
<p>那么，我们可以这样做：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray,&nbsp;col4:&nbsp;chararray,&nbsp;col5:&nbsp;chararray);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col1,&nbsp;col2,&nbsp;col3,&nbsp;col4);
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;D&nbsp;=&nbsp;LIMIT&nbsp;A&nbsp;1;
&nbsp;&nbsp;&nbsp;&nbsp;GENERATE&nbsp;FLATTEN(D);
};
DUMP&nbsp;C;
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,2,3,4,uoip)
(8,8,8,8,dddd)
(9,7,5,3,sdfa)
</code></pre>
</section>
<p>代码很简单，就是利用了GROUP时会自动对group的key去重的功能，这里不用多解释大家应该也能看懂。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;如何设置Pig job的名字，使得在Hadoop jobtracker中可以清晰地识别出来<br />
在Pig脚本中的一开始处，写上这一句：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;job.name&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;This&nbsp;is&nbsp;my&nbsp;job&#39;</span>;
</code></pre>
</section>
<p>将使得Pig job name被设置为&ldquo;This is my job&rdquo;，从而在Hadoop jobtracker的web界面中可以很容易地找到你的job。如果不设置的话，其名字将显示为&ldquo;<span style="color:#b22222;">PigLatin:DefaultJobName</span>&rdquo;。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;&ldquo;<span style="color:#800080;">scalar has more than one row in the output</span>&rdquo;错误的一个原因<br />
遇到了这个错误？我来演示一下如何复现这个错误。<br />
假设有两个文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;a.txt&nbsp;
1&nbsp;&nbsp;&nbsp;&nbsp;2
3&nbsp;&nbsp;&nbsp;&nbsp;4
[root@localhost&nbsp;~]$&nbsp;cat&nbsp;b.txt&nbsp;
3&nbsp;&nbsp;&nbsp;&nbsp;4
5&nbsp;&nbsp;&nbsp;&nbsp;6
</code></pre>
</section>
<p>现在我们来做一个JOIN：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;b.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1,&nbsp;B&nbsp;BY&nbsp;col1;
D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;A.col1;
DUMP&nbsp;D;
</code></pre>
</section>
<p>这段代码是必然会fail的，错误提示为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">org.apache.pig.backend.executionengine.ExecException:&nbsp;ERROR&nbsp;0:&nbsp;Scalar&nbsp;has&nbsp;more&nbsp;than&nbsp;one&nbsp;row&nbsp;in&nbsp;the&nbsp;output.&nbsp;1st&nbsp;:&nbsp;(1,2),&nbsp;2nd&nbsp;:(3,4)
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
乍一看，似乎代码简单得一点问题都没有啊？其实仔细一看，&ldquo;<span style="color:#ff0000;">A.col1</span>&rdquo;的写法根本就是错误的，应该写成&ldquo;<span style="color:#0000ff;">A::col1</span>&rdquo;才对，因为你只要 DESCRIBE 一下 C 的schema就明白了：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">C:&nbsp;{A::col1:&nbsp;int,A::col2:&nbsp;int,B::col1:&nbsp;int,B::col2:&nbsp;int}
</code></pre>
</section>
<p>Pig的这个错误提示得很不直观，在<a href="https://issues.apache.org/jira/browse/PIG-2134" rel="noopener noreferrer" target="_blank">这个链接</a>中也有人提到过了。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;如何输出LZO压缩格式的文本文件<br />
借助于elephant-bird，可以轻易完成这个工作。<br />
<span style="color:#ff0000;">方法1:</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;＝<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;input&#39;</span>;
STORE&nbsp;A&nbsp;INTO&nbsp;&#39;output&#39;&nbsp;USING&nbsp;com.twitter.elephantbird.pig.store.LzoPigStorage();
</code></pre>
</section>
<p>结果就会得到一堆名称类似于&ldquo;<span style="color:#0000ff;">part-m-00000.lzo</span>&rdquo;的文件。<br />
注意以上省略了一堆的&ldquo;<span style="color:#b22222;">REGISTER XXX.jar</span>&rdquo;代码，你需要自己添加上你的jar包路径。<br />
<span style="color:#ff0000;">方法2：</span><br />
在Pig脚本的最前面添加两句话：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;mapred.output.compression.codec&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;com.hadoop.compression.lzo.LzopCodec&#39;</span>;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;mapred.output.compress&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;true&#39;</span>;
</code></pre>
</section>
<p>STORE的时候（不需要USING...）会自动将文件保存为LZO格式。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a><br />
有人说，那加载LZO压缩的文本文件呢？很简单：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;output&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;com.twitter.elephantbird.pig.store.LzoPigStorage(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;,&#39;</span>);
</code></pre>
</section>
<p>这表示指定了分隔符为逗号，如果不想指定，省略括号中的内容即可。</p>
<p><span style="background-color:#00ff00;"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;<span style="font-family: Ubuntu;">如何输出 gz 及 bz2 压缩格式的文件</span><br />
首先请看<a href="http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">这个</span></a>链接的说明。摘录一段话：</p>
<blockquote>
<div>
		Compression</div>
<div>
		&nbsp;</div>
<div>
		Storing to a directory whose name ends in &quot;.bz2&quot; or &quot;.gz&quot; or &quot;.lzo&quot; (if you have installed support for LZO compression in Hadoop) will automatically use the corresponding compression codec.</div>
<div>
		output.compression.enabled and output.compression.codec job properties also work.</div>
<div>
		Loading from directories ending in .bz2 or .bz works automatically; other compression formats are not auto-detected on loading.</div>
</blockquote>
<div>
	说得简单点就是：当你把保存的目录名设置为以 .bz2 或 .gz 结尾时，输出的文件就自动会被压缩为对应的文件格式。<br />
	因此，我就有了下面这两段极其简单的示例代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">--压缩率稍低</span>
A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>;
STORE&nbsp;A&nbsp;INTO&nbsp;&#39;z.gz&#39;;

<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">--压缩率较高</span>
A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>;
STORE&nbsp;A&nbsp;INTO&nbsp;&#39;z.bz2&#39;;
</code></pre>
</section>
<p>正如上面的注释所示，bz2 的压缩率比 gz 高。<br />
	最后生成的目录下，文件名类似于：</p>
<blockquote>
<div>
			part-m-00000.gz</div>
<div>
			part-m-00001.gz</div>
<div>
			part-m-00002.gz</div>
</blockquote>
</div>
<p>如果是 bz2，则后缀名为 .bz2。</p>
<p>有人可能觉得这种通过目录名实现的方式不直观，那么你也可以在Pig脚本中指定输入文件的压缩格式。下面的例子演示了如何输出gzip格式的压缩文件（此时，目录名就不用以.gz结尾了）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;output.compression.enabled&nbsp;<span class="hljs-literal" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">true</span>;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;output.compression.codec&nbsp;org.apache.hadoop.io.compress.GzipCodec;
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;<span style="font-family: Ubuntu;">类似于Java字符串的contains方法，在Pig中怎么做<br />
例如，要判断一个字符串&ldquo;abc&rdquo;是否包含&ldquo;bc&rdquo;，怎么做？<br />
有一个方法，利用Pig内置的</span>INDEXOF函数来实现：</p>
<blockquote>
<p>
		<span style="color:#0000ff;">INDEXOF(string, &#39;character&#39;, startIndex)</span></p>
<div>
		string: The string to be searched.</div>
<div>
		&#39;character&#39;: The character being searched for, in quotes.</div>
<div>
		startIndex: The index from which to begin the forward search. The string index begins with zero (0).</div>
</blockquote>
<div>
	即：string 相当于上面所说的字符串&ldquo;abc&rdquo;，&#39;character&#39;相当于上面所说的字符串&ldquo;bc&rdquo;，startIndex设置为0即可，表示从字符串&ldquo;abc&rdquo;的头开始搜索。<br />
	如果能查到字符串，则返回值为搜索到的字符串的起始索引位置；如果查不到字符串，则返回值为-1。<br />
	这里给一个例子。输入数据为：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost]&nbsp;~$&nbsp;cat&nbsp;a.txt&nbsp;
abc&nbsp;&nbsp;&nbsp;&nbsp;123
</code></pre>
</section>
<p><span style="font-family: Ubuntu;">Pig脚本：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;INDEXOF(col1,&nbsp;&#39;abc&#39;,&nbsp;0),&nbsp;INDEXOF(col2,&nbsp;&#39;9&#39;,&nbsp;0);
DUMP&nbsp;B;
</code></pre>
</section>
<p><span style="font-family: Ubuntu;">输出结果：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(0,-1)
</code></pre>
</section>
<p><span style="font-family: Ubuntu;">可见，查不到的字符串，其返回值为-1。</span></p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;<span style="font-family: Ubuntu;">类型为long的字段做FILTER的时候，数字没有加&ldquo;L&rdquo;可能会导致结果错误！<br />
我遇到的一个真实问题，下面的代码：</span></p>
<div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;/path/to/my/data&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;MyLOADER;
B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;(fieldId&nbsp;==&nbsp;3125694275663397348L&nbsp;OR&nbsp;fieldId&nbsp;==&nbsp;3125694275741155094L);
</code></pre>
</section>
</div>
<p><span style="font-family: Ubuntu;">对LOAD进来的数据进行了一个简单的FILTER，其中，<span style="color:#0000ff;">fieldId</span> 是一个类型为 long 的字段，这样写没有任何问题。但是如果把FILTER条件里的数字后面的&ldquo;<span style="color:#ff0000;">L</span>&rdquo;去掉，Pig不会报错，但FILTER语句可能就会失效。所以在写字面数字的时候，一定要为long类型后面加上&ldquo;<span style="color:#ff0000;">L</span>&rdquo;！</span></p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;<span style="font-family: Ubuntu;">加载Sequence file不能当作纯文本文件来加载</span><br />
假设Sequence file是 TAB制表符&nbsp;分隔的两列数据，正确加载它可以这样写：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;xxx&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;com.twitter.elephantbird.pig.load.SequenceFileLoader();
<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">--&nbsp;SequenceFileLoader加载出来的数据，自动会加上schema：{key:&nbsp;chararray,value:&nbsp;chararray}</span>
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;key&nbsp;AS&nbsp;col1,&nbsp;value&nbsp;AS&nbsp;col2;
C&nbsp;=&nbsp;LIMIT&nbsp;B&nbsp;10;
DUMP&nbsp;C;
</code></pre>
</section>
<p>如果你把它当作未压缩的纯文本文件来加载（不使用USING XXX），则会发现job不报错，但实际上读出来的数据都是乱码。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：<br />
<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e5%8f%8a%e7%94%a8%e6%b3%95%e6%80%bb%e7%bb%93%ef%bc%882%ef%bc%89/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>[原创]使用Apache Pig时应该注意/避免的操作或事项</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b%e4%bd%bf%e7%94%a8apache-pig%e6%97%b6%e5%ba%94%e8%af%a5%e6%b3%a8%e6%84%8f%e9%81%bf%e5%85%8d%e7%9a%84%e6%93%8d%e4%bd%9c/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b%e4%bd%bf%e7%94%a8apache-pig%e6%97%b6%e5%ba%94%e8%af%a5%e6%b3%a8%e6%84%8f%e9%81%bf%e5%85%8d%e7%9a%84%e6%93%8d%e4%bd%9c/#comments</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Mon, 26 Mar 2012 18:23:09 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[中文教程]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=4577</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="http://pig.apache.org/" rel="noopener noreferrer" target="_blank">Apache Pig</a>是用来处理大规模数据的高级查询语言，配合Hadoop使用，可以在处理海量数据时达到事半功倍的效果，比使用Java，C++等语言编写大规模数据处理程序的难度要小N倍，实现同样的效果的代码量也小N倍。</p>
<p>本文基于以下环境：<br />
pig 0.8.1<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0); ">（1）</span>CROSS操作<br />
由于求交叉积可能会导致结果数据量暴增，因此，CROSS操作是一个&#8220;昂贵&#8221;的操作，可能会耗费Hadoop集群较多的资源，使用的时候需要评估一下数据量的大小。<br />
<span id="more-4577"></span><br />
<span style="background-color:#00ff00;">（2）</span>JOIN操作的顺序<br />
如教程《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》中的第(6)条所写，当JOIN的各数据集分布严重不均时，你最好考虑一下JOIN的顺序，可以对你的job效率提升有帮助。</p>
<p><span style="background-color: rgb(0, 255, 0); ">（3）</span>FLATTEN一个空的bag将不会输出任何结果</p>
<div>
	FLATTEN操作本来会将嵌套展开，生成更多行的结果，但如果被展开的bag是空的，则一行记录也不会生成，这与你想像的可能有点不同：至少也应该生成一行结果吧？不会的，就是一行也不会生成。<br />
	在这里我用一个实例来说明。假设有数据文件 1.txt：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&#160;~]$&#160;cat&#160;1.txt
1&#160;&#160;&#160;&#160;2&#160;&#160;&#160;{(a,b),(c,d)}
77&#160;&#160;&#160;&#160;88&#160;&#160;{(p,q),(r,s)}
123&#160;&#160;&#160;&#160;555&#160;{(u,w),(q,t)}
</code></pre>
</section>
<p>有三列，它们之间是以TAB分隔的。<br />
以下代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&#160;(col1:&#160;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&#160;col2:&#160;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&#160;col3:&#160;bag{t:&#160;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">first</span>:&#160;chararray,&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">second</span>:&#160;chararray)});
B&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;col1,&#160;col2,&#160;FLATTEN(col3);
DUMP&#160;B;
</code></pre>
</section>
<p>得到的结果是显而易见的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,2,a,b)
(1,2,c,d)
(77,88,p,q)
(77,88,r,s)
(123,555,u,w)
(123,555,q,t)
</code></pre>
</section>
<p>可见记录被解嵌套了。<br />
但是，如果第三列的bag是空的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&#160;~]$&#160;cat&#160;1.txt</code></pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b%e4%bd%bf%e7%94%a8apache-pig%e6%97%b6%e5%ba%94%e8%af%a5%e6%b3%a8%e6%84%8f%e9%81%bf%e5%85%8d%e7%9a%84%e6%93%8d%e4%bd%9c/" class="read-more">Read More </a></section>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="http://pig.apache.org/" rel="noopener noreferrer" target="_blank">Apache Pig</a>是用来处理大规模数据的高级查询语言，配合Hadoop使用，可以在处理海量数据时达到事半功倍的效果，比使用Java，C++等语言编写大规模数据处理程序的难度要小N倍，实现同样的效果的代码量也小N倍。</p>
<p>本文基于以下环境：<br />
pig 0.8.1<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0); ">（1）</span>CROSS操作<br />
由于求交叉积可能会导致结果数据量暴增，因此，CROSS操作是一个&ldquo;昂贵&rdquo;的操作，可能会耗费Hadoop集群较多的资源，使用的时候需要评估一下数据量的大小。<br />
<span id="more-4577"></span><br />
<span style="background-color:#00ff00;">（2）</span>JOIN操作的顺序<br />
如教程《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》中的第(6)条所写，当JOIN的各数据集分布严重不均时，你最好考虑一下JOIN的顺序，可以对你的job效率提升有帮助。</p>
<p><span style="background-color: rgb(0, 255, 0); ">（3）</span>FLATTEN一个空的bag将不会输出任何结果</p>
<div>
	FLATTEN操作本来会将嵌套展开，生成更多行的结果，但如果被展开的bag是空的，则一行记录也不会生成，这与你想像的可能有点不同：至少也应该生成一行结果吧？不会的，就是一行也不会生成。<br />
	在这里我用一个实例来说明。假设有数据文件 1.txt：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;1.txt
1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;{(a,b),(c,d)}
77&nbsp;&nbsp;&nbsp;&nbsp;88&nbsp;&nbsp;{(p,q),(r,s)}
123&nbsp;&nbsp;&nbsp;&nbsp;555&nbsp;{(u,w),(q,t)}
</code></pre>
</section>
<p>有三列，它们之间是以TAB分隔的。<br />
以下代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:&nbsp;bag{t:&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">first</span>:&nbsp;chararray,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">second</span>:&nbsp;chararray)});
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1,&nbsp;col2,&nbsp;FLATTEN(col3);
DUMP&nbsp;B;
</code></pre>
</section>
<p>得到的结果是显而易见的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,2,a,b)
(1,2,c,d)
(77,88,p,q)
(77,88,r,s)
(123,555,u,w)
(123,555,q,t)
</code></pre>
</section>
<p>可见记录被解嵌套了。<br />
但是，如果第三列的bag是空的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;1.txt
1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;{}
77&nbsp;&nbsp;&nbsp;&nbsp;88&nbsp;&nbsp;{}
123&nbsp;&nbsp;&nbsp;&nbsp;555&nbsp;{}
</code></pre>
</section>
<p>那么，与上面同样的代码将什么也不会输出（你可以自己试验一下）。<br />
这一点需要特别注意，所以一般来说，你在FLATTEN一个bag之前，需要判断一下它是否是空的（IsEmpty），如果你需要在FLATTEN的结果中标记空的那些bag，那么你就需要自己在FLATTEN之前将空的bag替换为自己指定的内容。</p>
<p><span style="background-color: rgb(0, 255, 0); ">（4）</span>SAMPLE的结果数量是不确定的<br />
SAMPLE操作符可以对一个关系（relation）进行取样，得到其一定百分比的数据（例如随机取其中10%的数据），但是，这并不保证对同一个relation进行同样比例的SAMPLE，得到的tuple的数量就是相同的。例如，对一个有千万行的数据集，SAMPLE 0.1的结果，可能第一次会得到100万行，重做一次却得到了101万行（这里只是举一个例子，具体的数字是未知的）。<br />
我的试验结果可以肯定地告诉大家：我拿一个含上亿条记录的数据集SAMPLE 0.1两次的结果相差了4万多条记录（指的是数据条数）。</p>
<p><span style="background-color: rgb(0, 255, 0); ">（5）</span>输出几个简单数据的job，没必要单独跑，使用UNION整合在一个Pig脚本中即可<br />
假设有几个Pig job，它们输出的都是一行数据（当然，一行可能有多列），那么没必要单独跑几个job再得到所有结果，你可以用UNION把它们整合放在一个Pig脚本中，例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A1&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">col</span>:&nbsp;chararray);
A2&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">col</span>:&nbsp;chararray);
A2&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;P.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">col</span>:&nbsp;chararray);

B1&nbsp;=&nbsp;GROUP&nbsp;A1&nbsp;ALL;
C1&nbsp;=&nbsp;FOREACH&nbsp;B1&nbsp;GENERATE&nbsp;COUNT(A1);

B2&nbsp;=&nbsp;GROUP&nbsp;A2&nbsp;ALL;
C2&nbsp;=&nbsp;FOREACH&nbsp;B2&nbsp;GENERATE&nbsp;COUNT(A2);

B3&nbsp;=&nbsp;GROUP&nbsp;A3&nbsp;ALL;
C3&nbsp;=&nbsp;FOREACH&nbsp;B3&nbsp;GENERATE&nbsp;COUNT(A3);

U&nbsp;=&nbsp;UNION&nbsp;C1,&nbsp;C2,&nbsp;C3;

STORE&nbsp;U&nbsp;INTO&nbsp;&#39;res&#39;;
</code></pre>
</section>
<p>有人说，为什么不把A1，A2，A3使用通配符一起加载？答：这里我假设了一种非常简单的情况：三个数据文件都只有一列，而实际中，可能三个文件完全有不同的格式，而且后面的处理针对每个job也是不同的（在这里为简单起见才写成相同的），因此，单独加载有时候是必要的。<br />
这样做之后，会输出3个文件，每个文件中有一个数。比跑3个Pig job方便。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color:#00ff00;">（6）</span>用ORDER排序时，Pig并不遵守&ldquo;相同的key的记录会被发送到同一个partition&rdquo;的惯例<br />
处理海量数据时，我们常常会遇到这样一种情况：某些key的数据远远多于其他key的数据。例如，我们要分析用户的web访问日志，会发现用户访问Google的次数远远多于其他网站的次数。<br />
所以，如果我们要按&ldquo;访问的网站&rdquo;这个字段来GROUP或者ORDER的话，就会造成落入某些reducer的数据远远多于其他reducer（GROUP、ORDER都会触发reduce过程）&mdash;&mdash;注意，这里说&ldquo;某些reducer&rdquo;，是因为在这个例子中，不仅访问Google是个大户，可能还有其他的网站访问大户。<br />
由于这些reducer需要处理的数据量特别大，也就会导致所需的时间特别长，从而整个job所需的总时间特别长。为了解决这一弊端，Pig使用了一种聪明的方法：先对需要ORDER的数据进行采样，获知其key分布情况，然后根据此分布构造一个可以均衡全体数据的partitioner，从而将数据比较均匀地送到N个reducer上。Pig的这种算法是很有效的，它使得各个reducer之间的执行时间相差不大。<br />
正因为Pig做了这样的工作，所以，前面例子中所说的对Google的访问记录可能会被送到多个reducer中，它们有相同的key，却没有被送到同一reducer中，这没有遵守MapReduce的惯例。如果你的数据处理流程需要遵守此惯例，那么就不能用Pig的ORDER来排序。<br />
同时，正因为Pig在ORDER时需要对输入数据进行采样，所以，ORDER的时候Pig会为你的数据流程添加一个额外的轻量job来完成采样工作&mdash;&mdash;从Pig在控制台输出的信息中，你可以看到一个Feature为&ldquo;<span style="color:#0000cd;">SAMPLER</span>&rdquo;的job，这个job就是采样用的。</p>
<p><span style="background-color: rgb(0, 255, 0); ">（7）</span>关系操作符(Relational Operator)只能对关系(relation)进行操作，而不能对表达式(expression)进行操作吗？<br />
有人说，从这一句的陈述来看，它根本就是废话&mdash;&mdash;关系操作符当然是操作关系的啊！<br />
不过，答案是：不一定。例如，DISTINCT 是一个关系操作符，但是它却可以对表达式进行操作！<br />
我拿一个例子来说明这个问题。有以下数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;1.txt&nbsp;
1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3
2&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3
2&nbsp;&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;&nbsp;7
8&nbsp;&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;&nbsp;3
1&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7
</code></pre>
</section>
<p>有以下Pig代码（没什么计算上的意义，就是为了演示用）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;col3;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;E&nbsp;=&nbsp;DISTINCT&nbsp;A.col3;
&nbsp;&nbsp;&nbsp;&nbsp;GENERATE&nbsp;group,&nbsp;COUNT(E);
};
DUMP&nbsp;C;
</code></pre>
</section>
<p>完全可以成功执行，结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(3,1)
(7,1)
</code></pre>
</section>
<div>
	从这段代码能看出什么？首先，A.col3 是一个表达式(expression)，而不是一个关系(relation)，但是，DISTINCT 这个操作符却应用在了它上面，这说明，关系操作符完全是有可能应用在表达式上的。<br />
	某些书/资料上有一种说法是，上面的代码是有语法错误的，你必须用以下的代码来替代：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;col3;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;D&nbsp;=&nbsp;A.col3;
&nbsp;&nbsp;&nbsp;&nbsp;E&nbsp;=&nbsp;DISTINCT&nbsp;D;
&nbsp;&nbsp;&nbsp;&nbsp;GENERATE&nbsp;group,&nbsp;COUNT(E);
};
DUMP&nbsp;C;
</code></pre>
</section>
<p>这段代码确实没有错误，它先通过 A.col3 这个表达式，生成了一个关系(relation) D，再将 DISTINCT 关系操作符应用于其上，这样就避开了所谓的&ldquo;关系操作符只能操作关系&rdquo;的限制。但事实上，我们完全没有必要这样做，正如上面的代码的试验结果，你完全可以直接用 DISTINCT A.col3，并不会出错。<br />
所以说，书本也有坑爹的时候，不可全信。</p>
<p><span style="background-color:#00ff00;">（8）</span>嵌套的 FOREACH 中的代码是串行执行的<br />
嵌套的 FOREACH 语句可以完成复杂的操作，例如在嵌套的 FOREACH 中对bag进行排序等。FOREACH会被并行执行，但是对嵌套在其中的子语句来说，它们却是串行执行的。</p>
<p><span style="background-color: rgb(0, 255, 0); ">（9）</span>PARALLEL只对强制产生reduce过程的操作符有效<br />
通过PARALLEL，你可以控制Pig程序的并行度（说白了就是控制reducer的数量）。PARALLEL可以附加在任何关系操作符上，但是它只对reduce端的并行度起控制作用，因为MapReduce不允许用户自己控制map端的并行度，它只允许用户控制reduce端的并行度，所以，这就是PARALLEL只对强制产生reduce过程的操作符有效的原因了。<br />
另外，在本地模式（pig -x local）下，PARALLEL是不起任何作用的（被忽略），因为在本地模式下所有操作都是串行执行的。</p>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b%e4%bd%bf%e7%94%a8apache-pig%e6%97%b6%e5%ba%94%e8%af%a5%e6%b3%a8%e6%84%8f%e9%81%bf%e5%85%8d%e7%9a%84%e6%93%8d%e4%bd%9c/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>[原创]Apache Pig中文教程（进阶）</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e4%b8%ad%e6%96%87%e6%95%99%e7%a8%8b%ef%bc%88%e8%bf%9b%e9%98%b6%ef%bc%89/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e4%b8%ad%e6%96%87%e6%95%99%e7%a8%8b%ef%bc%88%e8%bf%9b%e9%98%b6%ef%bc%89/#comments</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 15 Mar 2012 08:36:25 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[Pig Latin]]></category>
		<category><![CDATA[pig tutorial in Chinese]]></category>
		<category><![CDATA[pig教程]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=4249</guid>

					<description><![CDATA[<p>
本文包含Apache Pig的一些进阶技巧及用法小结。如要学习基础教程，请查看我写的<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color:#afeeee;">【其他几篇文章】</span></a>。<br />
本文的大量实例都是作者Darran Zhang（website: codelast.com）在工作、学习中总结的经验或解决的问题，并且添加了较为详尽的说明及注解，此外，作者还在不断地添加本文的内容，希望能帮助一部分人。</p>
<p><a href="http://pig.apache.org/" rel="noopener noreferrer" target="_blank">Apache Pig</a>是用来处理大规模数据的高级查询语言，配合Hadoop使用，可以在处理海量数据时达到事半功倍的效果，比使用Java，C++等语言编写大规模数据处理程序的难度要小N倍，实现同样的效果的代码量也小N倍。<br />
<span id="more-4249"></span><br />
本文基于以下环境：<br />
pig 0.8.1<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color:#00ff00;">✔</span> 如何编写及使用自定义函数（UDF）<br />
首先给出一个链接：<a href="http://pig.apache.org/docs/r0.8.1/api/" rel="noopener noreferrer" target="_blank">Pig 0.8.1 API</a>，还有<a href="http://pig.apache.org/docs/r0.8.1/udf.html" rel="noopener noreferrer" target="_blank">Pig UDF Manual</a>。这两个文档能提供很多有用的参考。<br />
自定义函数有何用？这里以一个极其简单的例子来说明一下。<br />
假设你有如下数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&#160;pig]$&#160;cat&#160;a.txt&#160;
uidk&#160;&#160;&#160;&#160;12&#160;&#160;3
hfd&#160;&#160;&#160;&#160;132&#160;99
bbN&#160;&#160;&#160;&#160;463&#160;231
UFD&#160;&#160;&#160;&#160;13&#160;&#160;10
</code></pre>
</section>
<p>现在你要将第二列的值先+500，再-300，然后再&#247;2.6，那么我们可以这样写：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&#62;&#160;A&#160;=&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&#160;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&#160;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>(col1:chararray,&#160;col2:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&#160;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&#62;&#160;B&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;col1,&#160;(col2&#160;+&#160;500&#160;-&#160;300)/2.6,&#160;col3;
grunt&#62;&#160;DUMP&#160;B;
(uidk,81.53846153846153,3)
(hfd,127.6923076923077,99)
(bbN,255.0,231)
(UFD,81.92307692307692,10)
</code></pre>
</section>
<p>我们看到，对第二列进行了 (col2 + 500 - 300)/2.6 这样的计算。麻烦不？或许这点小意思没什么。但是，如果有比这复杂得多的处理，每次你需要输入多少pig代码呢？我们希望有这样一个函数，可以让第二行pig代码简化如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&#62;&#160;B&#160;=&#160;FOREACH&#160;A&#160;GENERATE&#160;col1,&#160;com.codelast.MyUDF(col2),&#160;col3;</code></pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e4%b8%ad%e6%96%87%e6%95%99%e7%a8%8b%ef%bc%88%e8%bf%9b%e9%98%b6%ef%bc%89/" class="read-more">Read More </a></section>]]></description>
										<content:encoded><![CDATA[<p>
本文包含Apache Pig的一些进阶技巧及用法小结。如要学习基础教程，请查看我写的<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color:#afeeee;">【其他几篇文章】</span></a>。<br />
本文的大量实例都是作者Darran Zhang（website: codelast.com）在工作、学习中总结的经验或解决的问题，并且添加了较为详尽的说明及注解，此外，作者还在不断地添加本文的内容，希望能帮助一部分人。</p>
<p><a href="http://pig.apache.org/" rel="noopener noreferrer" target="_blank">Apache Pig</a>是用来处理大规模数据的高级查询语言，配合Hadoop使用，可以在处理海量数据时达到事半功倍的效果，比使用Java，C++等语言编写大规模数据处理程序的难度要小N倍，实现同样的效果的代码量也小N倍。<br />
<span id="more-4249"></span><br />
本文基于以下环境：<br />
pig 0.8.1<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color:#00ff00;"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span> 如何编写及使用自定义函数（UDF）<br />
首先给出一个链接：<a href="http://pig.apache.org/docs/r0.8.1/api/" rel="noopener noreferrer" target="_blank">Pig 0.8.1 API</a>，还有<a href="http://pig.apache.org/docs/r0.8.1/udf.html" rel="noopener noreferrer" target="_blank">Pig UDF Manual</a>。这两个文档能提供很多有用的参考。<br />
自定义函数有何用？这里以一个极其简单的例子来说明一下。<br />
假设你有如下数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
uidk&nbsp;&nbsp;&nbsp;&nbsp;12&nbsp;&nbsp;3
hfd&nbsp;&nbsp;&nbsp;&nbsp;132&nbsp;99
bbN&nbsp;&nbsp;&nbsp;&nbsp;463&nbsp;231
UFD&nbsp;&nbsp;&nbsp;&nbsp;13&nbsp;&nbsp;10
</code></pre>
</section>
<p>现在你要将第二列的值先+500，再-300，然后再&divide;2.6，那么我们可以这样写：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>(col1:chararray,&nbsp;col2:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1,&nbsp;(col2&nbsp;+&nbsp;500&nbsp;-&nbsp;300)/2.6,&nbsp;col3;
grunt&gt;&nbsp;DUMP&nbsp;B;
(uidk,81.53846153846153,3)
(hfd,127.6923076923077,99)
(bbN,255.0,231)
(UFD,81.92307692307692,10)
</code></pre>
</section>
<p>我们看到，对第二列进行了 (col2 + 500 - 300)/2.6 这样的计算。麻烦不？或许这点小意思没什么。但是，如果有比这复杂得多的处理，每次你需要输入多少pig代码呢？我们希望有这样一个函数，可以让第二行pig代码简化如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1,&nbsp;com.codelast.MyUDF(col2),&nbsp;col3;
</code></pre>
</section>
<p>这样的话，对于我们经常使用的操作，岂不是很方便？<br />
pig的UDF（user-defined function）就是拿来做这个的。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
下面，就以IntelliJ这个IDE为例（其实用什么IDE倒无所谓，大同小异吧），说明我们如何实现这样一个功能。<br />
<strong><span style="color: rgb(255, 0, 0); ">①</span></strong>新建一个新工程，在工程下创建&ldquo;lib&rdquo;目录，然后把pig安装包中的&ldquo;pig-0.8.1-core.jar&rdquo;文件放置到此lib目录下，然后在&ldquo;Project Structure&rarr;Libraries&rdquo;下添加（点击&ldquo;+&rdquo;号）一个库，就命名为&ldquo;lib&rdquo;，然后点击右侧的&ldquo;Attach Classes&rdquo;按钮，选择pig-0.8.1-core.jar文件，再点击下方的&ldquo;Apply&rdquo;按钮应用此更改。这样做之后，你就可以在IDE的编辑器中实现输入代码时看到智能提示了。<br />
此外，你还需要用同样的方法，将一堆Hadoop的jar包添加到工程中，包括以下文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">hadoop-XXX-ant.jar
hadoop-XXX-core.jar
hadoop-XXX-examples.jar
hadoop-XXX-test.jar
hadoop-XXX-tools.jar
</code></pre>
</section>
<p>其中，XXX是版本号。<br />
如果没有这些文件，你在编译jar包的时候会报错。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color: rgb(255, 0, 0); ">②</span></strong>跟我一起，在工程目录下的 src/com/coldelast/ 目录下创建Java源代码文件&nbsp;MyUDF.java，其内容如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">package</span>&nbsp;com.codelast;

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;java.io.IOException;

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.EvalFunc;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.data.Tuple;

<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">/**
&nbsp;*&nbsp;Author:&nbsp;Darran&nbsp;Zhang&nbsp;@&nbsp;codelast.com
&nbsp;*&nbsp;Date:&nbsp;2011-09-29
&nbsp;*/</span>

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-class" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">class</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">MyUDF</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">extends</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">EvalFunc</span>&lt;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">Double</span>&gt;&nbsp;</span>{

&nbsp;&nbsp;<span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">@Override</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Double&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">exec</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(Tuple&nbsp;input)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">if</span>&nbsp;(input&nbsp;==&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>&nbsp;||&nbsp;input.size()&nbsp;==&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>;
&nbsp;&nbsp;&nbsp;&nbsp;}

&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">try</span>&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Double&nbsp;val&nbsp;=&nbsp;(Double)&nbsp;input.get(<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;val&nbsp;=&nbsp;(val&nbsp;+&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">500</span>&nbsp;-&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">300</span>)&nbsp;/&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">2.6</span>;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;val;
&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">catch</span>&nbsp;(Exception&nbsp;e)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">throw</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">new</span>&nbsp;IOException(e.getMessage());
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;}
}
</code></pre>
</section>
<p>在上面的代码中，input.get(0)是获取UDF的第一个参数（可以向UDF传入多个参数）；同理，如果你的UDF接受两个参数（例如一个求和的UDF），那么input.get(1)可以取到第二个参数。<br />
然后编写build.xml（相当于C++里面的Makefile），用ant来编译、打包此工程&mdash;&mdash;这里就不把冗长的build.xml写上来了，而且这也不是关键，没有太多意义。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color: rgb(255, 0, 0); ">③</span></strong>假定编译、打包得到的jar包名为cl.jar，我们到这里几乎已经完成了大部分工作。下面就看看如何在pig中调用我们刚编写的自定义函数了。</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;REGISTER&nbsp;cl.jar;
grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>(col1:chararray,&nbsp;col2:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;&nbsp;&nbsp;&nbsp;
grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1,&nbsp;com.codelast.MyUDF(col2),&nbsp;col3;
grunt&gt;&nbsp;DUMP&nbsp;B;
(uidk,81.53846153846153,3)
(hfd,127.6923076923077,99)
(bbN,255.0,231)
(UFD,81.92307692307692,10)
</code></pre>
</section>
<p>注：第一句是注册你编写的UDF，使用前必须先注册。<br />
从结果可见，我们实现了预定的效果。<br />
UDF大有用途！<br />
注意：对如果你的UDF返回一个标量类型（类似于我上面的例子），那么pig就可以使用反射（reflection）来识别出返回类型。如果你的UDF返回的是一个包（bag）或一个元组（tuple），并且你希望pig能理解包（bag）或元组（tuple）的内容的话，那么你就要实现outputSchema方法，否则后果很不好。具体可看<a href="http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html#eval_func_schema" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 0, 0); ">这个链接</span></a>的说明。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;怎样自己写一个UDF中的加载函数(load function)<br />
<strong><span style="color: rgb(255, 0, 0); ">①</span></strong>加载函数(load function)是干什么的？<br />
先举一个很简单的例子，来说明load function的作用。<br />
假设有如下数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;a.txt</span>
1,2,3
a,b,c
9,5,7
</code></pre>
</section>
<p>我们知道，pig默认是以tab作为分隔符来加载数据的，所以，如果你没有指定分隔符的话，将使得每一行都被认为只有一个字段：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;$0;
grunt&gt;&nbsp;DUMP&nbsp;B;
(1,2,3)
(a,b,c)
(9,5,7)
</code></pre>
</section>
<p>而我们想要以逗号作为分隔符，则应该使用pig内置函数<span style="color: rgb(255, 0, 0); ">PigStorage</span>：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">using</span>&nbsp;PigStorage(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;,&#39;</span>);
</code></pre>
</section>
<p>这样的话，我们再用上面的方法DUMP B，得到的结果就是：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1)
(a)
(9)
</code></pre>
</section>
<p>这个例子实在太简单了，在这里，PigStorage这个函数就是一个加载函数(load function)。<br />
定义：</p>
<blockquote>
<div>
		Load/Store Functions</div>
<div>
		&nbsp;</div>
<div>
		These user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output but that does not have to be the case.</div>
</blockquote>
<p>即：加载函数定义了数据如何流入和流出pig。一般来说，同一函数即处理输入数据，又处理输出数据，但并不是必须要这样。<br />
有了这个定义，就很好理解加载函数的作用了。再举个例子：你在磁盘上保存了只有你自己知道怎么读取其格式的数据（例如，数据是按一定规则加密过的，只有你知道如何解密成明文），那么，你想用pig来处理这些数据，把它们转换成一个个字段的明文时，你就必须要有这样一个加载函数(load function)，来进行LOAD数据时的转换工作。这就是加载函数(load function)的作用。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color: rgb(255, 0, 0); ">②</span></strong>知道了load function是干嘛的，现在怎么写一个load function？如果你看的是这个链接的UDF手册：<a href="http://wiki.apache.org/pig/UDFManual" rel="noopener noreferrer" target="_blank">Pig Wiki UDF Manual</a>中，会发现它是这样说的&mdash;&mdash;<br />
加载函数必须要实现&nbsp;<span style="color: rgb(0, 0, 255); ">LoadFunc</span>&nbsp;接口，这个接口类似于下面的样子：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-class" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">interface</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">LoadFunc</span>&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">void</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">bindTo</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(String&nbsp;fileName,&nbsp;BufferedPositionedInputStream&nbsp;is,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>&nbsp;offset,&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>&nbsp;end)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException</span>;
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Tuple&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">getNext</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">()</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException</span>;
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">//&nbsp;conversion&nbsp;functions</span>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Integer&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">bytesToInteger</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">byte</span>[]&nbsp;b)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException</span>;
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Long&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">bytesToLong</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">byte</span>[]&nbsp;b)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException</span>;
&nbsp;&nbsp;&nbsp;&nbsp;......
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">void</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">fieldsToRead</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(Schema&nbsp;schema)</span></span>;
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Schema&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">determineSchema</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(String&nbsp;fileName,&nbsp;ExecType&nbsp;execType,&nbsp;DataStorage&nbsp;storage)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException</span>;
}
</code></pre>
</section>
<p>其中：</p>
<ul>
<li>
		bindTo函数在pig任务开始处理数据之前被调用一次，它试图将函数与输入数据关联起来。</li>
<li>
		getNext函数读取输入的数据流并构造下一个元组（tuple）。当完成数据处理时该函数会返回null，当该函数无法处理输入的元组（tuple）时它会抛出一个IOException异常。</li>
<li>
		接下来就是一批转换函数，例如bytesToInteger，bytesToLong等。这些函数的作用是将数据从bytearray转换成要求的类型。</li>
<li>
		fieldsToRead函数被保留作未来使用，应被留空。</li>
<li>
		determineSchema函数对不同的loader应有不同的实现：对返回真实数据类型（而不是返回bytearray字段）的loader，必须要实现该函数；其他类型的loader只要将determineSchema函数返回null就可以了。</li>
</ul>
<p>但是，如果你在IDE中import了pig 0.8.1的jar包&ldquo;pig-0.8.1-core.jar&rdquo;，会发现 LoadFunc 根本不是一个接口（interface），而是一个抽象类（abstract class），并且要实现的函数也与该文档中所说的不一致。因此，只能说是文档过时了。<br />
所以，要看文档的话，还是要看这个<a href="http://pig.apache.org/docs/r0.8.1/udf.html" rel="noopener noreferrer" target="_blank">Pig UDF Manual</a>，这里面的内容才是对的。<br />
同时，我也推荐另外一个关于Load/Store Function的链接：<a href="http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html" rel="noopener noreferrer" target="_blank">《Programming Pig》Chapter 11. Writing Load and Store Functions</a>。这本书很好很强大。</p>
<p><strong><span style="color: rgb(255, 0, 0); ">③</span></strong>开始写一个loader。我们现在写一个①中所描述的、可以按逗号分隔符加载数据文件的loader&mdash;&mdash;PigStorage已经有这个功能了，不过为了演示loader是怎么写出来的，这里还是用这个功能来说明。<br />
代码如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">package</span>&nbsp;com.codelast.udf.pig;

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.hadoop.io.Text;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.hadoop.mapreduce.*;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.*;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.backend.executionengine.ExecException;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.*;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;org.apache.pig.data.*;

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;java.io.IOException;
<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">import</span>&nbsp;java.util.*;

<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">/**
&nbsp;*&nbsp;A&nbsp;loader&nbsp;class&nbsp;of&nbsp;pig.
&nbsp;*
&nbsp;*&nbsp;<span class="hljs-doctag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">@author</span>&nbsp;Darran&nbsp;Zhang&nbsp;(codelast.com)
&nbsp;*&nbsp;<span class="hljs-doctag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">@version</span>&nbsp;11-10-11
&nbsp;*&nbsp;<span class="hljs-doctag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">@declaration</span>&nbsp;These&nbsp;codes&nbsp;are&nbsp;only&nbsp;for&nbsp;non-commercial&nbsp;use,&nbsp;and&nbsp;are&nbsp;distributed&nbsp;on&nbsp;an&nbsp;&quot;AS&nbsp;IS&quot;&nbsp;basis,&nbsp;WITHOUT&nbsp;WARRANTY&nbsp;OF&nbsp;ANY&nbsp;KIND,&nbsp;either&nbsp;express&nbsp;or&nbsp;implied.
&nbsp;*&nbsp;You&nbsp;must&nbsp;not&nbsp;remove&nbsp;this&nbsp;declaration&nbsp;at&nbsp;any&nbsp;time.
&nbsp;*/</span>

<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-class" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">class</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">MyLoader</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">extends</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">LoadFunc</span>&nbsp;</span>{
&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">protected</span>&nbsp;RecordReader&nbsp;recordReader&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>;

&nbsp;&nbsp;<span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">@Override</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">void</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">setLocation</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(String&nbsp;s,&nbsp;Job&nbsp;job)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;FileInputFormat.setInputPaths(job,&nbsp;s);
&nbsp;&nbsp;}

&nbsp;&nbsp;<span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">@Override</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;InputFormat&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">getInputFormat</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">()</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">new</span>&nbsp;PigTextInputFormat();
&nbsp;&nbsp;}

&nbsp;&nbsp;<span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">@Override</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">void</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">prepareToRead</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(RecordReader&nbsp;recordReader,&nbsp;PigSplit&nbsp;pigSplit)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">this</span>.recordReader&nbsp;=&nbsp;recordReader;
&nbsp;&nbsp;}

&nbsp;&nbsp;<span class="hljs-meta" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(91, 218, 237); word-wrap: inherit !important; word-break: inherit !important;">@Override</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;Tuple&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">getNext</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">()</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">try</span>&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">boolean</span>&nbsp;flag&nbsp;=&nbsp;recordReader.nextKeyValue();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">if</span>&nbsp;(!flag)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">null</span>;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Text&nbsp;value&nbsp;=&nbsp;(Text)&nbsp;recordReader.getCurrentValue();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String[]&nbsp;strArray&nbsp;=&nbsp;value.toString().split(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;,&quot;</span>);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;List&nbsp;lst&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">new</span>&nbsp;ArrayList&lt;String&gt;();
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>&nbsp;i&nbsp;=&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">0</span>;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">for</span>&nbsp;(String&nbsp;singleItem&nbsp;:&nbsp;strArray)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;lst.add(i++,&nbsp;singleItem);
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;TupleFactory.getInstance().newTuple(lst);
&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">catch</span>&nbsp;(InterruptedException&nbsp;e)&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">throw</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">new</span>&nbsp;ExecException(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;Read&nbsp;data&nbsp;error&quot;</span>,&nbsp;PigException.REMOTE_ENVIRONMENT,&nbsp;e);
&nbsp;&nbsp;&nbsp;&nbsp;}
&nbsp;&nbsp;}
}
</code></pre>
</section>
<p>如上，你的loader类要继承自LoadFunc虚类，并且需要重写它的4个方法。其中，getNext方法是读取数据的方法，它做了读取出一行数据、按逗号分割字符串、构造一个元组（tuple）并返回的事情。这样我们就实现了按逗号分隔符加载数据的loader。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color: rgb(255, 0, 0); ">④</span></strong>关于load function不得不说的一些话题<br />
如果你要加载一个数据文件，例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;myfile&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
</code></pre>
</section>
<p>假设此文件的结构不复杂，你可以手工写 AS 语句，但如果此文件结构特别复杂，你总不可能每次都手工写上几十个甚至上百个字段名及类型定义吧？<br />
这个时候，如果我们可以让pig从哪里读出来要加载的数据的schema（模式），就显得特别重要了。<br />
在实现load function的时候，我们是通过实现&nbsp;<span style="color: rgb(0, 0, 255); ">LoadMetadata</span>&nbsp;这个接口中的&nbsp;<span style="color: rgb(255, 0, 0); ">getSchema</span>&nbsp;方法来做到这一点的。例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-class" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">class</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">MyLoadFunc</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">extends</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">LoadFunc</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">implements</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">LoadMetadata</span>&nbsp;</span>{

&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;ResourceSchema&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">getSchema</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(String&nbsp;filename,&nbsp;Job&nbsp;job)</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">throws</span>&nbsp;IOException&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">//<span class="hljs-doctag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">TODO:</span></span>
&nbsp;&nbsp;}
}
</code></pre>
</section>
<p>实现了 getSchema 方法之后，在pig脚本中加载数据的时候，就可以无需编写 AS 语句，就可以使用你在&nbsp;getSchema 方法中指定的模式了。例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">REGISTER&nbsp;&#39;myUDF.jar&#39;;
A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;myfile&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;com.codelast.MyLoadFunc();
B&nbsp;=&nbsp;foreach&nbsp;A&nbsp;generate&nbsp;col1;
SOTRE&nbsp;B&nbsp;INTO&nbsp;&#39;output&#39;;
</code></pre>
</section>
<p>看清楚了，在 LOAD 的时候，我们并没有写 AS 语句来指定字段名，但是在后面的 FOREACH 中，我们却可以使用 col1 这样的字段名，这正是因为&nbsp;<span style="color: rgb(255, 0, 0); ">getSchema</span>&nbsp;方法的实现为我们做到了这一点。在数据文件的结构特别复杂的时候，这个功能几乎是不可或缺的，否则难以想像会给分析数据的人带来多大的不便。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;重载(overloading)一个UDF<br />
类似于C++的函数重载，pig中也可以重载UDF，例如一个函数ADD可以对两个int进行操作，也可以对两个double进行操作，那么我们可以为该函数实现&nbsp;<span style="color: rgb(0, 0, 255); ">getArgToFuncMapping</span>&nbsp;方法，该函数返回一个&nbsp;<span style="color: rgb(0, 0, 255); ">List&lt;FuncSpec&gt;</span>&nbsp;对象，这个对象中包含了参数的类型信息。具体怎么实现，可以看<a href="http://ofps.oreilly.com/titles/9781449302641/writing_udfs.html" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 0, 0); ">这个链接</span></a>（搜索&ldquo;<span style="color: rgb(128, 0, 128); ">Overloading UDFs</span>&rdquo;定位到所需章节）。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;pig运行不起来，提示&ldquo;<span style="color: rgb(178, 34, 34); ">org.apache.hadoop.ipc.Client - Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s)</span>&rdquo;错误的解决办法<br />
发生这个错误时，请先检查Hadoop的各个进程是否都运行起来了，例如，在我的一次操作中，遇到这个错误时，我发现Hadoop namenode进程没有启动起来：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="bash language-bash hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">ps&nbsp;-ef&nbsp;|&nbsp;grep&nbsp;java&nbsp;|&nbsp;grep&nbsp;NameNode
</code></pre>
</section>
<p>应该有两个进程启动起来了：</p>
<blockquote>
<p>
		org.apache.hadoop.hdfs.server.namenode.NameNode<br />
		org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode</p>
</blockquote>
<p>如果没有，那么你要到Hadoop安装目录下的&ldquo;logs&rdquo;目录下，查看NameNode的日志记录文件（视用户不同，日志文件的名字也会有不同），例如，我的NameNone日志文件 hadoop--namenode-root-XXX.log&nbsp;的末尾，显示出有如下错误：</p>
<blockquote>
<p>
		ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.</p>
</blockquote>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
我到它提示的地方一看，果然不存在最后一级目录（我是伪分布式运行的Hadoop，不要觉得奇怪），于是手工创建了这个目录，然后停掉Hadoop：</p>
<blockquote>
<p>
		stop-all.sh</p>
</blockquote>
<p>稍候一会儿再重新启动Hadoop：</p>
<blockquote>
<p>
		start-all.sh</p>
</blockquote>
<p>然后再去看一下NameNode的日志，又发现了新的错误信息：</p>
<blockquote>
<p>
		ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: NameNode is not formatted.</p>
</blockquote>
<p>这表明NameNode没有被格式化。于是将其格式化：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;bin]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;hadoop&nbsp;namenode&nbsp;-format</span></code></pre>
</section>
<p>命令行问你是否要格式化的时候，选择YES即可。格式化完后会提示：</p>
<blockquote>
<p>
		common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.</p>
</blockquote>
<p>说明成功了。这个时候，再像前面一样重启Hadoop进程，再去看NameNode的日志文件的最后若干行，应该会发现前面的那些错误提示没了。这个时候，再检查Hadoop各进程是否都成功地启动了，如果是的话，那么这个时候你就可以在Hadoop的伪分布式模式下启动pig：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;home]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;pig</span></code></pre>
</section>
<p>而不用以本地模式来运行pig了（pig -x local）。<br />
总之，配置一个伪分布式的Hadoop来调试pig在某些情况下是很有用的，但是比较麻烦，因为还牵涉到Hadoop的正确配置，但是最好搞定它，以后大有用处啊。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;用Pig加载HBase数据时遇到的错误&ldquo;<span style="color: rgb(0, 0, 255); ">ERROR 2999: Unexpected internal error. could not instantiate &#39;com.twitter.elephantbird.pig.load.HBaseLoader&#39; with arguments XXX</span>&rdquo;的原因之一<br />
你也许早就知道了：Pig可以加载HBase数据，从而更方便地进行数据处理。但是在使用HBase的loader的时候，可能会遇到这样那样的问题，我这里就遇到了一例，给大家分析一下原因。<br />
使用&nbsp;<span style="color: rgb(178, 34, 34); ">org.apache.pig.backend.hadoop.hbase.HBaseStorage()</span>&nbsp;可以加载HBase数据，例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;hbase://table_name&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;org.apache.pig.backend.hadoop.hbase.HBaseStorage(<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;column_family_name:qualifier_name&#39;</span>,&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;-loadKey&nbsp;true&nbsp;-limit&nbsp;100&#39;</span>)&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:chararray);
</code></pre>
</section>
<p>其中，<span style="color: rgb(0, 0, 255); ">table_name</span>&nbsp;是你要加载数据的HBase表名，<span style="color: rgb(0, 0, 255); ">column_family_name:qualifier_name</span>&nbsp;是表的column family:qualifier（当然，可以有多个column family:qualifier，以空格隔开即可），<span style="color: rgb(0, 0, 255); ">-loadKey true -limit 100</span>&nbsp;是加载数据时指定的参数，支持的参数如下：</p>
<blockquote>
<div>
		-loadKey=(true|false) &nbsp;Load the row key as the first column</div>
<div>
		-gt=minKeyVal</div>
<div>
		-lt=maxKeyVal&nbsp;</div>
<div>
		-gte=minKeyVal</div>
<div>
		-lte=maxKeyVal</div>
<div>
		-limit=numRowsPerRegion max number of rows to retrieve per region</div>
<div>
		-delim=char delimiter to use when parsing column names (default is space or comma)</div>
<div>
		-ignoreWhitespace=(true|false) ignore spaces when parsing column names (default true)</div>
<div>
		-caching=numRows &nbsp;number of rows to cache (faster scans, more memory).</div>
<div>
		-noWAL=(true|false) Sets the write ahead to false for faster loading.</div>
<div>
		&nbsp; &nbsp; To be used with extreme caution, since this could result in data loss</div>
<div>
		&nbsp; &nbsp; (see http://hbase.apache.org/book.html#perf.hbase.client.putwal).</div>
</blockquote>
<div>
	由这些参数的解释，可知我上面的&nbsp;<span style="color: rgb(0, 0, 255); ">-loadKey true</span>&nbsp;使得加载出来的数据的第一列是HBase表的row key；<span style="color: rgb(0, 0, 255); ">-limit 100</span>&nbsp;使得从每一个region加载的最大数据行数为100（当你有N个region时，总共加载的数据是不是&nbsp;<span style="color: rgb(128, 0, 128); ">N*region总数</span>&nbsp;条，我没有试验）。<br />
	org.apache.pig.backend.hadoop.hbase.HBaseStorage() 包含在Pig的jar包中，所以你不需要REGISTER额外的jar包。<br />
	我遇到的问题是：在按上面的代码加载HBase数据之后，在grunt中一回车，马上报错：</div>
<blockquote>
<div>
		ERROR 2999: Unexpected internal error. could not instantiate &#39;com.twitter.elephantbird.pig.load.HBaseLoader&#39; with arguments XXX<br />
		Details at logfile: XXX</div>
</blockquote>
<div>
	这个时候，你当然应该去查看logfile，以确定具体问题是什么。<br />
	logfile内容较多，在其尾部，有下面的内容：</p>
<blockquote>
<div>
			Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HBaseConfiguration.create()Lorg/apache/hadoop/conf/Configuration;</div>
<div>
			&nbsp; &nbsp; &nbsp; &nbsp; at org.apache.pig.backend.hadoop.hbase.HBaseStorage.&lt;init&gt;(HBaseStorage.java:185)</div>
</blockquote>
<div>
		<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
		好吧，到了这里，只能去看看pig的源码了。打开&nbsp;HBaseStorage.java 文件，找到提示的185行，看到如下代码：</div>
</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">m_conf&nbsp;=&nbsp;HBaseConfiguration.create();
</code></pre>
</section>
<p>可见它调用了HBase代码中的一个类HBaseConfiguration的create方法。按上面的提示，它是找不到这个方法，于是我们再看看使用的HBase的&nbsp;HBaseConfiguration.java 里的代码，找遍全文，都找不到create方法！那么，我们再看看更新一点的版本的HBase的相同文件中是否有这个方法呢？下载0.90.4版本的HBase，果然在 HBaseConfiguration.java 中找到了create方法：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">&nbsp;&nbsp;<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">/**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;*&nbsp;Creates&nbsp;a&nbsp;Configuration&nbsp;with&nbsp;HBase&nbsp;resources&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;*&nbsp;<span class="hljs-doctag" style="font-size: inherit; color: inherit; line-height: inherit; margin: 0px; padding: 0px; word-wrap: inherit !important; word-break: inherit !important;">@return</span>&nbsp;a&nbsp;Configuration&nbsp;with&nbsp;HBase&nbsp;resources&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;*/</span>
&nbsp;&nbsp;<span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">public</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">static</span>&nbsp;Configuration&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">create</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">()</span>&nbsp;</span>{
&nbsp;&nbsp;&nbsp;&nbsp;Configuration&nbsp;conf&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">new</span>&nbsp;Configuration();
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">return</span>&nbsp;addHbaseResources(conf);
&nbsp;&nbsp;}
</code></pre>
</section>
<p>所以，问题就在这里了：Pig的HBase loader不能使用某些版本的HBase，升级HBase吧！<br />
另外，就算HBase版本适用了，你也得让Pig知道HBase的参数配置（要不然怎么可以指定一个HBase表名就可以加载其数据了呢），具体你可以看<a href="http://blog.whitepages.com/2011/10/27/hbase-storage-and-pig/" rel="noopener noreferrer" target="_blank">这个</a>链接的说明。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;JOIN的优化<br />
如果你对N个关系(relation)的某些字段进行JOIN，也就是所谓的&ldquo;多路的&rdquo;(multi-way)JOIN&mdash;&mdash;我不知道用中文这样描述是否正确&mdash;&mdash;在这种情况下，请遵循这样的原则来写JOIN语句：<br />
JOIN用到的key所对应的记录最多的那个关系(relation)应该被排在最后。例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">D&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1,&nbsp;B&nbsp;BY&nbsp;col1,&nbsp;C&nbsp;BY&nbsp;col1;
</code></pre>
</section>
<p>在这里，假设C这个relation就是上面所说的那种情况，所以把它排在最后。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
为什么要遵循这样的原则？这是由Pig处理JOIN的方式来决定的。在JOIN的n个关系中，前面的n-1个关系的处理结果会被cache在内存中，然后才会轮到第n个关系，因此，把最占内存的放在最后，有时候是能起到优化作用的。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;错误&ldquo;<span style="color:#0000ff;">Backend error : org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag</span>&rdquo;的原因<br />
如果你正在使用Pig 0.8，那么要注意了：出现这个错误，可能是Pig的bug导致的，详见<a href="https://issues.apache.org/jira/browse/PIG-1866" rel="noopener noreferrer" target="_blank">这个</a>链接。<br />
说得简单点就是：此bug会导致无法解引用一个tuple中的bag。通常我们取一个tuple中的bag，是为了FLATTEN它，将记录展开，但是此bug使得你根本连tuple中的bag都输出不了。<br />
此bug并不会影响你的Pig脚本语法解析，也就是说，你的Pig脚本只要写对了，就能运行起来，但是它执行到后面会报错。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;如何加载LZO压缩的纯文本数据<br />
如果你的数据是纯文本经由LZO压缩而成，那么你可以用elephant-bird的&nbsp;com.twitter.elephantbird.pig.store.LzoPigStorage 来加载它：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;/user/codelast/text-lzo-file&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">USING</span>&nbsp;com.twitter.elephantbird.pig.store.LzoPigStorage()&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
</code></pre>
</section>
<p>注意，这里没有写REGISTER jar包的命令，你需要自己补上。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;如何用Pig设置map端的并行度(map数)<br />
从<a href="http://www.codelast.com/?p=4577" rel="noopener noreferrer" target="_blank"><span style="background-color:#faebd7;">这个</span></a>链接中的第(9)条，我们知道，无法通过PARALLEL来设置Pig job map端的并行度，但是，有没有什么办法可以间接实现这一点呢？<br />
在Java编写的MapReduce程序中，你可以像<a href="http://www.codelast.com/?p=3182" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(175, 238, 238); ">这个</span></a>链接中的第(25)点所说的一样，通过FileInputFormat.setMinInputSplitSize()来间接更改map的数量，其实它就与设置 mapred.min.split.size 参数值的效果是一样的。<br />
在Pig中，我们是可以通过set命令来设置job参数的，所以，我们如果在Pig脚本的开头写上：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;mapred.min.split.size&nbsp;<span class="hljs-number" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(174, 135, 250); word-wrap: inherit !important; word-break: inherit !important;">2147483648</span>;
</code></pre>
</section>
<p>将使得对map来说，小于2G的文件将被作为一个split输入，从而一个小于2G的文件将只有一个map。假设我们的Pig job是一个纯map的job，那么，map数的减少将使得输出文件的数量减少，在某些情况下，这种功能还是很有用的。<br />
注意：上面的命令中，set的参数的单位是<span style="color:#ff0000;">字节</span>，所以2G=2*1024*1024*1024=2147483648。</p>
<p><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;Pig调用现存的静态Java方法<br />
不是每个人都会开发UDF，或者每个人都愿意去写一个UDF来完成一件极其简单的操作的，例如，把一个编码过的URL解码，如果我只想临时用一下这个功能，那么我还要去写一个UDF，累不累啊？<br />
我们知道，java.net.URLDecoder.decode 这个静态方法已经实现了URL解码功能：</p>
<blockquote>
<div>
		static String decode(String s, String enc)&nbsp;</div>
<div>
		使用指定的编码机制对 application/x-www-form-urlencoded 字符串解码</div>
</blockquote>
<p>那么，如何在Pig中使用这个现成的静态方法呢？为了展示这个使用过程，我造了一个数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;a.txt&nbsp;
http://zh.wikipedia.org/zh/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E
</code></pre>
</section>
<p>就一行，这个URL解码之后应该是：<br />
<span style="color:#0000ff;">http://zh.wikipedia.org/wiki/搜索引擎</span><br />
因为里面含中文，所以被编码了。<br />
处理此文件的Pig脚本url.pig如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">DEFINE&nbsp;DecodeURL&nbsp;InvokeForString(&#39;java.net.URLDecoder.decode&#39;,&nbsp;&#39;String&nbsp;String&#39;);
A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">url</span>:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;DecodeURL(url,&nbsp;&#39;UTF-8&#39;);
STORE&nbsp;B&nbsp;INTO&nbsp;&#39;url&#39;;
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
用 pig -x local url.pig 执行这个脚本，完成后我们查看输出目录下的&nbsp;part-m-00000 文件内容，可见它确实被解码成了正确的字符串。<br />
这样，我们就利用了现存的静态Java方法来偷了个懒，很方便。<br />
需要注意的是：只能调用静态方法，并且此调用比同样的UDF实现速度要慢，这是因为调用器没有使用Accumulator或Algebraic接口。根据<a href="http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/" rel="noopener noreferrer" target="_blank">这位</a>兄台的测试，百万条的记录规模下，调用Java静态方法比UDF大约要慢一倍。至于这样的代价能不能接受，就看你自己的判断了。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;REGISTER UDF的前后覆盖关系<br />
在Pig脚本中使用多个UDF时，我们通常会使用通配符的方式来引入：</p>
<blockquote>
<p>
		REGISTER /user/hadoop/*.jar</p>
</blockquote>
<p>假设你有另一个包含UDF的jar：/user/codelast/my-pig-lib.jar<br />
并且这个jar的旧版本已经包含在 /user/hadoop/*.jar&nbsp;中，那么，当你使用下面的方式来引入jar的时候：</p>
<blockquote>
<p>
		REGISTER /user/hadoop/*.jar<br />
		REGISTER /user/codelast/my-pig-lib.jar</p>
</blockquote>
<p>my-pig-lib.jar这个jar，使用的是新版还是旧版呢？这是不一定的，并不是说以最后REGISTER的为准，而是取决于CLASSPATH中jar的顺序。所以，如果你想保证能用到新版的UDF，为了保险你应该不要把旧版的UDF用 /user/hadoop/*.jar 的方式引入，而应该把 my-pig-lib.jar&nbsp;从 /user/hadoop/*.jar&nbsp;中剔除出去再REGISTER。</p>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /><br />
	以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9bapache-pig%e4%b8%ad%e6%96%87%e6%95%99%e7%a8%8b%ef%bc%88%e8%bf%9b%e9%98%b6%ef%bc%89/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>[原创]Apache Pig的一些基础概念及用法总结（1）</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9bpig%e4%b8%ad%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e6%80%bb%e7%bb%93/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9bpig%e4%b8%ad%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e6%80%bb%e7%bb%93/#comments</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Fri, 23 Sep 2011 09:42:47 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[apache pig]]></category>
		<category><![CDATA[pig中文教程]]></category>
		<category><![CDATA[UDF]]></category>
		<category><![CDATA[中文教程]]></category>
		<category><![CDATA[基础]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=3621</guid>

					<description><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>本文可以让刚接触pig的人对一些基础概念有个初步的了解。<br />
很久很久以前，本文大概是互联网上第一篇公开发表的且涵盖大量实际例子的Apache Pig中文教程（由Google搜索可知），文中的大量实例都是作者Darran Zhang（website: codelast.com）在工作、学习中总结的经验或解决的问题，并且添加了较为详尽的说明及注解，希望能帮助一部分人。</p>
<p><a href="http://pig.apache.org/" rel="noopener noreferrer" target="_blank">Apache pig</a>是用来处理大规模数据的高级查询语言，配合Hadoop使用，可以在处理海量数据时达到事半功倍的效果，比使用Java，C++等语言编写大规模数据处理程序的难度要小N倍，实现同样的效果的代码量也小N倍。<br />
但是刚接触pig时，可能会觉得里面的某些概念以及程序实现方法与想像中的很不一样，所以，你需要仔细地研究一下基础概念，这样在写pig程序的时候，才不会觉得非常别扭。<br />
<span id="more-3621"></span><br />
本文大部分内容基于 Pig 0.8.1 写作而成。众所周知，这个版本的Pig现在已经非常过时，但是，无论新版旧版，Pig的基础用法在很多情况下都是一致的，所以，这并不影响你学习。<br />
本文的部分内容来自Pig官方文档，但涉及到翻译的部分，也是我自己翻译的，因此可能理解与英文有偏差，如果你觉得有疑义，可参考英文内容。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
下面开始学习Pig。</p>
<p><span style="background-color:#00ff00;">➤</span> 关系（relation）、包（bag）、元组（tuple）、字段（field）、数据（data）的关系</p>
<ul>
<li>
		一个关系（relation）是一个包（bag），更具体地说，是一个外部的包（outer bag）。</li>
<li>
		一个包（bag）是一个元组（tuple）的集合。<span style="color:#0000cd;">在pig中表示数据时，用大括号{}括起来的东西表示一个包&#8212;&#8212;无论是在教程中的实例演示，还是在pig交互模式下的输出，都遵循这样的约定，请牢记这一点，因为不理解的话就会对数据结构的掌握产生偏差</span>。</li>
<li>
		一个元组（tuple）是若干字段（field）的一个有序集（ordered set）。<span style="color:#0000cd;">在pig中表示数据时，用小括号()括起来的东西表示一个元组</span>。</li>
<li>
		一个字段是一块数据（data）。</li>
</ul>
<p>&#8220;元组&#8221;这个词很抽象，你可以把它想像成关系型数据库表中的一行，它含有一个或多个字段，其中，每一个字段可以是任何数据类型，并且可以有或者没有数据。<br />
&#8220;关系&#8221;可以比喻成关系型数据库的一张表，而上面说了，&#8220;元组&#8221;可以比喻成数据表中的一行，那么这里有人要问了，在关系型数据库中，同一张表中的每一行都有固定的字段数，pig中的&#8220;关系&#8221;与&#8220;元组&#8221;之间，是否也是这样的情况呢？不是的。&#8220;关系&#8221;并不要求每一个&#8220;元组&#8221;都含有相同数量的字段，并且也不会要求各&#8220;元组&#8221;中在相同位置处的字段具有相同的数据类型（太随意了，是吧？）<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&#160;一个 计算多维度组合下的平均值 的实际例子<br />
为了帮助大家理解pig的一个基本的数据处理流程，我造了一些简单的数据来举个例子&#8212;&#8212;<br />
假设有数据文件：a.txt（各数值之间是以tab分隔的）：&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9bpig%e4%b8%ad%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e6%80%bb%e7%bb%93/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>查看更多Apache Pig的教程请点击<a href="https://www.codelast.com/?p=4550" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>本文可以让刚接触pig的人对一些基础概念有个初步的了解。<br />
很久很久以前，本文大概是互联网上第一篇公开发表的且涵盖大量实际例子的Apache Pig中文教程（由Google搜索可知），文中的大量实例都是作者Darran Zhang（website: codelast.com）在工作、学习中总结的经验或解决的问题，并且添加了较为详尽的说明及注解，希望能帮助一部分人。</p>
<p><a href="http://pig.apache.org/" rel="noopener noreferrer" target="_blank">Apache pig</a>是用来处理大规模数据的高级查询语言，配合Hadoop使用，可以在处理海量数据时达到事半功倍的效果，比使用Java，C++等语言编写大规模数据处理程序的难度要小N倍，实现同样的效果的代码量也小N倍。<br />
但是刚接触pig时，可能会觉得里面的某些概念以及程序实现方法与想像中的很不一样，所以，你需要仔细地研究一下基础概念，这样在写pig程序的时候，才不会觉得非常别扭。<br />
<span id="more-3621"></span><br />
本文大部分内容基于 Pig 0.8.1 写作而成。众所周知，这个版本的Pig现在已经非常过时，但是，无论新版旧版，Pig的基础用法在很多情况下都是一致的，所以，这并不影响你学习。<br />
本文的部分内容来自Pig官方文档，但涉及到翻译的部分，也是我自己翻译的，因此可能理解与英文有偏差，如果你觉得有疑义，可参考英文内容。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
下面开始学习Pig。</p>
<p><span style="background-color:#00ff00;">➤</span> 关系（relation）、包（bag）、元组（tuple）、字段（field）、数据（data）的关系</p>
<ul>
<li>
		一个关系（relation）是一个包（bag），更具体地说，是一个外部的包（outer bag）。</li>
<li>
		一个包（bag）是一个元组（tuple）的集合。<span style="color:#0000cd;">在pig中表示数据时，用大括号{}括起来的东西表示一个包&mdash;&mdash;无论是在教程中的实例演示，还是在pig交互模式下的输出，都遵循这样的约定，请牢记这一点，因为不理解的话就会对数据结构的掌握产生偏差</span>。</li>
<li>
		一个元组（tuple）是若干字段（field）的一个有序集（ordered set）。<span style="color:#0000cd;">在pig中表示数据时，用小括号()括起来的东西表示一个元组</span>。</li>
<li>
		一个字段是一块数据（data）。</li>
</ul>
<p>&ldquo;元组&rdquo;这个词很抽象，你可以把它想像成关系型数据库表中的一行，它含有一个或多个字段，其中，每一个字段可以是任何数据类型，并且可以有或者没有数据。<br />
&ldquo;关系&rdquo;可以比喻成关系型数据库的一张表，而上面说了，&ldquo;元组&rdquo;可以比喻成数据表中的一行，那么这里有人要问了，在关系型数据库中，同一张表中的每一行都有固定的字段数，pig中的&ldquo;关系&rdquo;与&ldquo;元组&rdquo;之间，是否也是这样的情况呢？不是的。&ldquo;关系&rdquo;并不要求每一个&ldquo;元组&rdquo;都含有相同数量的字段，并且也不会要求各&ldquo;元组&rdquo;中在相同位置处的字段具有相同的数据类型（太随意了，是吧？）<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;一个 计算多维度组合下的平均值 的实际例子<br />
为了帮助大家理解pig的一个基本的数据处理流程，我造了一些简单的数据来举个例子&mdash;&mdash;<br />
假设有数据文件：a.txt（各数值之间是以tab分隔的）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
a&nbsp;1&nbsp;2&nbsp;3&nbsp;4.2&nbsp;9.8
a&nbsp;3&nbsp;0&nbsp;5&nbsp;3.5&nbsp;2.1
b&nbsp;7&nbsp;9&nbsp;9&nbsp;-&nbsp;-
a&nbsp;7&nbsp;9&nbsp;9&nbsp;2.6&nbsp;6.2
a&nbsp;1&nbsp;2&nbsp;5&nbsp;7.7&nbsp;5.9
a&nbsp;1&nbsp;2&nbsp;3&nbsp;1.4&nbsp;0.2
</code></pre>
</section>
<p>问题如下：怎样求出在第2、3、4列的所有组合的情况下，最后两列的平均值分别是多少？<br />
例如，第2、3、4列有一个组合为（1，2，3），即第一行和最后一行数据。对这个维度组合来说，最后两列的平均值分别为：<br />
（4.2+1.4）/2＝2.8<br />
（9.8+0.2）/2＝5.0<br />
而对于第2、3、4列的其他所有维度组合，都分别只有一行数据，因此最后两列的平均值其实就是它们自身。<br />
特别地，组合（7，9，9）有两行记录：第三、四行，但是第三行数据的最后两列没有值，因此它不应该被用于平均值的计算，也就是说，在计算平均值时，第三行是无效数据。所以（7，9，9）组合的最后两列的平均值为 2.6 和 6.2。<br />
我们现在用pig来算一下，并且输出最终的结果。<br />
先进入本地调试模式（pig -x local），再依次输入如下pig代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col2,&nbsp;col3,&nbsp;col4);
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;group,&nbsp;AVG(A.col5),&nbsp;AVG(A.col6);
DUMP&nbsp;C;
</code></pre>
</section>
<p>pig输出结果如下：</p>
<blockquote>
<div>
		((1,2,3),2.8,5.0)</div>
<div>
		((1,2,5),7.7,5.9)</div>
<div>
		((3,0,5),3.5,2.1)</div>
<div>
		((7,9,9),2.6,6.2)</div>
</blockquote>
<p>这个结果对吗？手工算一下就知道是对的。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
下面，我们依次来看看每一句pig代码分别得到了什么样的数据。<br />
<strong><span style="color:#ff0000;">①</span></strong>加载 a.txt 文件，并指定每一列的数据类型分别为 chararray（字符串），int，int，int，double，double。同时，我们还给予了每一列别名，分别为 col1，col2，&hellip;&hellip;，col6。这个别名在后面的数据处理中会用到&mdash;&mdash;如果你不指定别名，那么在后面的处理中，就只能使用索引（$0，$1，&hellip;&hellip;）来标识相应的列了，这样可读性会变差，因此，在列固定的情况下，还是指定别名的好。<br />
将数据加载之后，保存到变量A中，A的数据结构如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A:&nbsp;{col1:&nbsp;chararray,col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int,col5:&nbsp;double,col6:&nbsp;double}
</code></pre>
</section>
<p>可见，A是用大括号括起来的东西。根据本文前面的说法，A是一个包（bag）。<br />
这个时候，A与你想像中的样子应该是一致的，也就是与前面打印出来的 a.txt 文件的内容是一样的，还是一行一行的类似于&ldquo;二维表&rdquo;的数据。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color:#ff0000;">②</span></strong>按照A的第2、3、4列，对A进行分组。pig会找出所有第2、3、4列的组合，并按照升序进行排列，然后将它们与对应的包A整合起来，得到如下的数据结构：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">B:&nbsp;{group:&nbsp;(col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int),A:&nbsp;{col1:&nbsp;chararray,col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int,col5:&nbsp;double,col6:&nbsp;double}}
</code></pre>
</section>
<p>可见，A的第2、3、4列的组合被pig赋予了一个别名：<span style="color:#0000ff;">group</span>，这很形象。同时我们也观察到，B的每一行其实就是由一个group和若干个A组成的&mdash;&mdash;注意，是若干个A。这里之所以只显示了一个A，是因为这里表示的是数据结构，而不表示具体数据有多少组。<br />
实际的数据为：</p>
<blockquote>
<div>
		((1,2,3),{(a,1,2,3,4.2,9.8),(a,1,2,3,1.4,0.2)})</div>
<div>
		((1,2,5),{(a,1,2,5,7.7,5.9)})</div>
<div>
		((3,0,5),{(a,3,0,5,3.5,2.1)})</div>
<div>
		((7,9,9),{(b,7,9,9,,),(a,7,9,9,2.6,6.2)})</div>
</blockquote>
<p>可见，与前面所说的一样，组合（1，2，3）对应了两行数据，组合（7，9，9）也对应了两行数据。<br />
这个时候，B的结构就不那么明朗了，可能与你想像中有一点不一样了。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color:#ff0000;">③</span></strong>计算每一种组合下的最后两列的平均值。<br />
根据上面得到的B的数据，你可以把B想像成一行一行的数据（只不过这些行不是对称的），FOREACH 的作用是对 B 的每一行数据进行遍历，然后进行计算。<br />
GENERATE 可以理解为要生成什么样的数据，这里的 group 就是上一步操作中B的第一项数据（即pig为A的第2、3、4列的组合赋予的别名），所以它告诉了我们：在数据集 C 的每一行里，第一项就是B中的group&mdash;&mdash;类似于（1，2，5）这样的东西）。<br />
而 AVG(A.col5) 这样的计算，则是调用了pig的一个求平均值的函数 AVG，用于对 A 的名为 col5 的列求平均值。前文说了，在加载数据到A的时候，我们已经给每一列起了个别名，col5就是倒数第二列。<br />
到这里，可能有人要迷糊了：难道 AVG(A.col5) 不是表示对 A 的col5这一列求平均值吗？也就是说，在遍历B（FOREACH B）的每一行时候，计算结果都是相同的啊！<br />
事实上并不是这样。我们遍历的是B，我们需要注意到，B的数据结构中，每一行数据里，一个group对应的是若干个A，因此，这里的 A.col5，指的是B的每一行中的A，而不是包含全部数据的那个A。拿B的第一行来举例：<br />
((1,2,3),{(a,1,2,3,4.2,9.8),(a,1,2,3,1.4,0.2)})<br />
遍历到B的这一行时，要计算AVG(A.col5)，pig会找到&nbsp;(a,1,2,3,4.2,9.8) 中的4.2，以及(a,1,2,3,1.4,0.2)中的1.4，加起来除以2，就得到了平均值。<br />
同理，我们也知道了AVG(A.col6)是怎么算出来的。但还有一点要注意的：对(7,9,9)这个组，它对应的数据(b,7,9,9,,)里最后两列是无值的，这是因为我们的数据文件对应位置上不是有效数字，而是两个&ldquo;-&rdquo;，pig在加载数据的时候自动将它置为空了，并且计算平均值的时候，也不会把这一组数据考虑在内（相当于忽略这组数据的存在）。<br />
到了这里，我们不难理解，为什么C的数据结构是这样的了：</p>
<blockquote>
<p>
		C: {group: (col2: int,col3: int,col4: int),double,double}</p>
</blockquote>
<p><span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color:#ff0000;">④</span></strong>DUMP C就是将C中的数据输出到控制台。如果要输出到文件，需要使用：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">STORE&nbsp;C&nbsp;INTO&nbsp;&#39;output&#39;;
</code></pre>
</section>
<p>这样pig就会在当前目录下新建一个&ldquo;output&rdquo;目录（该目录必须事先不存在），并把结果文件放到该目录下。</p>
<p>请想像一下，如果要实现相同的功能，用Java或C++写一个Map-Reduce应用程序需要多少时间？可能仅仅是写一个build.xml或者Makefile，所需的时间就是写这段pig代码的几十倍了！<br />
正因为pig有如此优势，它才得到了广泛应用。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;怎样统计数据行数<br />
在SQL语句中，要统计表中数据的行数，很简单：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">SELECT</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">COUNT</span>(*)&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">FROM</span>&nbsp;table_name&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">WHERE</span>&nbsp;condition
</code></pre>
</section>
<p>在pig中，也有一个COUNT函数，在pig手册中，对COUNT函数有这样的说明：</p>
<blockquote>
<p>
		Computes the number of elements in a bag.</p>
</blockquote>
<p>假设要计算数据文件a.txt的行数：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4.2&nbsp;9.8
a&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3.5&nbsp;2.1
b&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;-&nbsp;&nbsp;&nbsp;-
a&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;2.6&nbsp;6.2
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7.7&nbsp;5.9
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;1.4&nbsp;0.2
</code></pre>
</section>
<p>你是否可以这样做呢：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);
B&nbsp;=&nbsp;COUNT(*);
DUMP&nbsp;B;
</code></pre>
</section>
<p>答案是：绝对不行。pig会报错。pig手册中写得很明白：</p>
<blockquote>
<p>
		Note: You cannot use the tuple designator (*) with COUNT; that is, COUNT(*) will not work.</p>
</blockquote>
<p>那么，这样对某一列计数行不行呢：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">B&nbsp;=&nbsp;COUNT(A.col2);
</code></pre>
</section>
<p>答案是：仍然不行。pig会报错。<br />
这就与我们想像中的&ldquo;正确做法&rdquo;有点不一样了：我为什么不能直接统计一个字段的数目有多少呢？刚接触pig的时候，一定非常疑惑这样明显&ldquo;不应该出错&rdquo;的写法为什么行不通。<br />
要统计A中含col2字段的数据有多少行，正确的做法是：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;ALL;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;COUNT(A.col2);
DUMP&nbsp;C;
</code></pre>
</section>
<p>输出结果：</p>
<blockquote>
<p>
		(6)</p>
</blockquote>
<p>表明有6行数据。<br />
如此麻烦？没错。这是由pig的数据结构决定的。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
在这个例子中，统计COUNT(A.col2)和COUNT(A)的结果是一样的，但是，如果col2这一列中含有空值：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4.2&nbsp;9.8
a&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3.5&nbsp;2.1
b&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;-&nbsp;&nbsp;&nbsp;-
a&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;2.6&nbsp;6.2
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7.7&nbsp;5.9
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;1.4&nbsp;0.2
</code></pre>
</section>
<p>则以下pig程序及执行结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;ALL;
grunt&gt;&nbsp;C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;COUNT(A.col2);
grunt&gt;&nbsp;DUMP&nbsp;C;
(5)
</code></pre>
</section>
<p>可见，结果为5行。那是因为你LOAD数据的时候指定了col2的数据类型为int，而a.txt的第二行数据是空的，因此数据加载到A以后，有一个字段就是空的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;DUMP&nbsp;A;
(a,1,2,3,4.2,9.8)
(a,,0,5,3.5,2.1)
(b,7,9,9,,)
(a,7,9,9,2.6,6.2)
(a,1,2,5,7.7,5.9)
(a,1,2,3,1.4,0.2)
</code></pre>
</section>
<p>在COUNT的时候，null的字段不会被计入在内，所以结果是5。</p>
<blockquote>
<p>
		The COUNT function follows syntax semantics and ignores nulls. What this means is that a tuple in the bag will not be counted if the first field in this tuple is NULL. If you want to include NULL values in the count computation, use COUNT_STAR.</p>
</blockquote>
<p><span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a></p>
<div>
	<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;<span style="color:#ff0000;">FLATTEN</span>操作符的作用</div>
<div>
	这个玩意一开始还是挺让我费解的。从字面上看，flatten就是&ldquo;弄平&rdquo;的意思，但是在对一个pig的数据结构操作时，flatten到底是&ldquo;弄平&rdquo;了什么，又有什么作用呢？<br />
	我们还是采用前面的a.txt数据文件来说明：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4.2&nbsp;9.8
a&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3.5&nbsp;2.1
b&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;-&nbsp;&nbsp;&nbsp;-
a&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;2.6&nbsp;6.2
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7.7&nbsp;5.9
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;1.4&nbsp;0.2
</code></pre>
</section>
<p>如果我们按照前文的做法，计算多维度组合下的最后两列的平均值，则：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col2,&nbsp;col3,&nbsp;col4);
grunt&gt;&nbsp;C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;group,&nbsp;AVG(A.col5),&nbsp;AVG(A.col6);
grunt&gt;&nbsp;DUMP&nbsp;C;
((1,2,3),2.8,5.0)
((1,2,5),7.7,5.9)
((3,0,5),3.5,2.1)
((7,9,9),2.6,6.2)
</code></pre>
</section>
<p>可见，输出结果中，每一行的第一项是一个tuple（元组），我们来试试看 FLATTEN 的作用：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col2,&nbsp;col3,&nbsp;col4);
grunt&gt;&nbsp;C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;FLATTEN(group),&nbsp;AVG(A.col5),&nbsp;AVG(A.col6);
grunt&gt;&nbsp;DUMP&nbsp;C;
(1,2,3,2.8,5.0)
(1,2,5,7.7,5.9)
(3,0,5,3.5,2.1)
(7,9,9,2.6,6.2)
</code></pre>
</section>
<p>看到了吗？被 FLATTEN 的group本来是一个元组，现在变成了扁平的结构了。按照pig文档的说法，FLATTEN用于对元组（tuple）和包（bag）&ldquo;解嵌套&rdquo;（un-nest）：</p>
<blockquote>
<div>
		The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and result is different for each type of structure.</div>
<div>
		&nbsp;</div>
<div>
		For tuples, flatten substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1), will cause that tuple to become (a, b, c).</div>
</blockquote>
<p><span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
所以我们就看到了上面的结果。<br />
在有的时候，不&ldquo;解嵌套&rdquo;的数据结构是不利于观察的，输出这样的数据可能不利于外围数程序的处理（例如，pig将数据输出到磁盘后，我们还需要用其他程序做后续处理，而对一个元组，输出的内容里是含括号的，这就在处理流程上又要多一道去括号的工序），因此，FLATTEN提供了一个让我们在某些情况下可以清楚、方便地分析数据的机会。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;关于<span style="color:#ff0000;">GROUP</span>操作符<br />
在上文的例子中，已经演示了GROUP操作符会生成什么样的数据。在这里，需要说得更理论一些：</p>
<ul>
<li>
		用于GROUP的key如果多于一个字段（正如本文前面的例子），则GROUP之后的数据的key是一个元组（tuple），否则它就是与用于GROUP的key相同类型的东西。</li>
<li>
		GROUP的结果是一个关系（relation），在这个关系中，每一组包含一个元组（tuple），这个元组包含两个字段：<strong><span style="color:#008080;">（1）</span></strong>第一个字段被命名为&ldquo;<span style="color:#0000ff;">group</span>&rdquo;&mdash;&mdash;<span style="color:#ff0000;">这一点非常容易与GROUP关键字相混淆</span>，但请区分开来。该字段的类型与用于GROUP的key类型相同。<strong><span style="color:#008080;">（2）</span></strong>第二个字段是一个包（bag），它的类型与被GROUP的关系的类型相同。</li>
</ul>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;把数据当作&ldquo;元组&rdquo;（tuple）来加载<br />
还是假设有如下数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4.2&nbsp;9.8
a&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3.5&nbsp;2.1
b&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;-&nbsp;&nbsp;&nbsp;-
a&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;2.6&nbsp;6.2
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7.7&nbsp;5.9
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;1.4&nbsp;0.2
</code></pre>
</section>
<p>如果我们按照以下方式来加载数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>);&nbsp;
</code></pre>
</section>
<p>那么得到的A的数据结构为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;A;
A:&nbsp;{col1:&nbsp;chararray,col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int,col5:&nbsp;double,col6:&nbsp;double}
</code></pre>
</section>
<p>如果你要把A当作一个元组（tuple）来加载：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(T&nbsp;:&nbsp;tuple&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>));
</code></pre>
</section>
<p>也就是想要得到这样的数据结构：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;A;
A:&nbsp;{T:&nbsp;(col1:&nbsp;chararray,col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int,col5:&nbsp;double,col6:&nbsp;double)}
</code></pre>
</section>
<p>那么，上面的方法将得到一个空的A：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;DUMP&nbsp;A;
()
()
()
()
()
()
</code></pre>
</section>
<p>那是因为数据文件a.txt的结构不适合于这样加载成元组（tuple）。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
如果有数据文件b.txt：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;b.txt&nbsp;
(a,1,2,3,4.2,9.8)
(a,3,0,5,3.5,2.1)
(b,7,9,9,-,-)
(a,7,9,9,2.6,6.2)
(a,1,2,5,7.7,5.9)
(a,1,2,3,1.4,0.2)
</code></pre>
</section>
<p>则使用上面所说的加载方法及结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;b.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(T&nbsp;:&nbsp;tuple&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>));
grunt&gt;&nbsp;DUMP&nbsp;A;
((a,1,2,3,4.2,9.8))
((a,3,0,5,3.5,2.1))
((b,7,9,9,,))
((a,7,9,9,2.6,6.2))
((a,1,2,5,7.7,5.9))
((a,1,2,3,1.4,0.2))
</code></pre>
</section>
<p>可见，加载的数据的结构确实被定义成了元组（tuple）。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;在多维度组合下，如何计算某个维度组合里的不重复记录的条数<br />
以数据文件 c.txt 为例：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;c.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4.2&nbsp;9.8&nbsp;100
a&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;3.5&nbsp;2.1&nbsp;200
b&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;-&nbsp;&nbsp;&nbsp;-&nbsp;&nbsp;&nbsp;300
a&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;2.6&nbsp;6.2&nbsp;300
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;7.7&nbsp;5.9&nbsp;200
a&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;1.4&nbsp;0.2&nbsp;500
</code></pre>
</section>
<p>问题：如何计算在第2、3、4列的所有维度组合下，最后一列不重复的记录分别有多少条？例如，第2、3、4列有一个维度组合是（1，2，3），在这个维度维度下，最后一列有两种值：100 和 500，因此不重复的记录数为2。同理可求得其他的记录条数。<br />
pig代码及输出结果如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;c.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col5:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col6:<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">double</span>,&nbsp;col7:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col2,&nbsp;col3,&nbsp;col4);
grunt&gt;&nbsp;C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;{D&nbsp;=&nbsp;DISTINCT&nbsp;A.col7;&nbsp;GENERATE&nbsp;group,&nbsp;COUNT(D);};
grunt&gt;&nbsp;DUMP&nbsp;C;
((1,2,3),2)
((1,2,5),1)
((3,0,5),1)
((7,9,9),1)
</code></pre>
</section>
<p>我们来看看每一步分别生成了什么样的数据：<br />
<strong><span style="color:#ff0000;">①</span></strong>LOAD不用说了，就是加载数据；<br />
<strong><span style="color:#ff0000;">②</span></strong>GROUP也不用说了，和前文所说的一样。GROUP之后得到了这样的数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;DUMP&nbsp;B;
((1,2,3),{(a,1,2,3,4.2,9.8,100),(a,1,2,3,1.4,0.2,500)})
((1,2,5),{(a,1,2,5,7.7,5.9,200)})
((3,0,5),{(a,3,0,5,3.5,2.1,200)})
((7,9,9),{(b,7,9,9,,,300),(a,7,9,9,2.6,6.2,300)})
</code></pre>
</section>
<p>其实到这里，我们肉眼就可以看出来最后要求的结果是什么了，当然，必须要由pig代码来完成，要不然怎么应对海量数据？<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<strong><span style="color:#ff0000;">③</span></strong>这里的 FOREACH 与前面有点不一样，这就是所谓的&ldquo;<span style="color:#0000ff;">嵌套的FOREACH</span>&rdquo;。第一次看到这种写法，肯定会觉得很奇怪。先看一下用于<span style="color:#0000ff;">去重</span>的<span style="color:#ff0000;">DISTINCT</span>关键字的说明：</p>
<blockquote>
<p>
		Removes duplicate tuples in a relation.</p>
</blockquote>
<p>然后再解释一下：FOREACH 是对B的每一行进行遍历，其中，B的每一行里含有一个包（bag），每一个包中含有若干元组（tuple）A，因此，FOREACH 后面的大括号里的操作，其实是对所谓的&ldquo;内部包&rdquo;（<span style="color:#ff0000;">inner bag</span>）的操作（详情请参看<a href="http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#FOREACH" rel="noopener noreferrer" target="_blank"><span style="color:#800080;">FOREACH的说明</span></a>），在这里，我们指定了对A的col7这一列进行去重，去重的结果被命名为D，然后再对D计数（COUNT），就得到了我们想要的结果。<br />
<strong><span style="color:#ff0000;">④</span></strong>输出结果数据，与前文所述的差不多。<br />
这样就达成了我们的目的。从总体上说，刚接触pig不久的人会觉得这些写法怪怪的，就是扭不过来，但是要坚持，时间长了，连倒影也会让你觉得是正的了。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何将关系（relation）转换为标量（scalar）<br />
在前文中，我们要统计符合某些条件的数据的条数，使用了COUNT函数来计算，但在COUNT之后，我们得到的还是一个关系（relation），而不是一个标量的数字，如何把一个关系转换为标量，从而可以在后续处理中便于使用呢？<br />
具体请看<a href="http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars" rel="noopener noreferrer" target="_blank">这个链接</a>。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;pig中如何使用shell进行辅助数据处理<br />
pig中可以嵌套使用shell进行辅助处理，下面，就以一个实际的例子来说明。<br />
假设我们在某一步pig处理后，得到了类似于下面 b.txt 中的数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;b.txt&nbsp;
1&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;98&nbsp;&nbsp;=&nbsp;&nbsp;&nbsp;7
34&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;2
62&nbsp;&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;&nbsp;=&nbsp;&nbsp;&nbsp;65
</code></pre>
</section>
<p>问题：如何将数据中第4列中的&ldquo;=&rdquo;符号全部替换为9999？<br />
pig代码及输出结果如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;b.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:chararray,&nbsp;col5:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;STREAM&nbsp;A&nbsp;THROUGH&nbsp;`awk&nbsp;&#39;{if($4&nbsp;==&nbsp;&quot;=&quot;)&nbsp;print&nbsp;$1&quot;\t&quot;$2&quot;\t&quot;$3&quot;\t9999\t&quot;$5;&nbsp;else&nbsp;print&nbsp;$0}&#39;`;
grunt&gt;&nbsp;DUMP&nbsp;B;
(1,5,98,9999,7)
(34,8,6,3,2)
(62,0,6,9999,65)
</code></pre>
</section>
<p>我们来看看这段代码是如何做到的：<br />
<strong><span style="color:#ff0000;">①</span></strong>加载数据，这个没什么好说的。<br />
<strong><span style="color:#ff0000;">②</span></strong>通过&ldquo;STREAM &hellip; THROUGH &hellip;&rdquo;的方式，我们可以调用一个shell语句，用该shell语句对A的每一行数据进行处理。此处的shell逻辑为：当某一行数据的第4列为&ldquo;=&rdquo;符号时，将其替换为&ldquo;9999&rdquo;；否则就照原样输出这一行。<br />
<strong><span style="color:#ff0000;">③</span></strong>输出B，可见结果正确。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;向pig脚本中传入参数<br />
假设你的pig脚本输出的文件是通过外部参数指定的，则此参数不能写死，需要传入。在pig中，使用传入的参数如下所示：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">STORE&nbsp;A&nbsp;INTO&nbsp;&#39;$output_dir&#39;;
</code></pre>
</section>
<p>则这个&ldquo;output_dir&rdquo;就是个传入的参数。在调用这个pig脚本的shell脚本中，我们可以这样传入参数：</p>
<blockquote>
<p>
		pig -param output_dir=&quot;/home/my_ourput_dir/&quot; my_pig_script.pig</p>
</blockquote>
<p>这里传入的参数&ldquo;output_dir&rdquo;的值为&ldquo;/home/my_output_dir/&rdquo;。<br />
<span style="color: rgb(255, 255, 255); font-family: arial, helvetica, sans-serif; font-size: 14px; line-height: 20px; text-align: left; background-color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;就算是同样一段pig代码，多次计算所得的结果也有可能是不同的<br />
例如用<span style="color:#ff0000;">AVG</span>函数来计算平均值时，同样一段pig代码，多次计算所得的结果中，小数点的最后几位也有可能是不相同的（当然也有可能相同），大概是因为精度的原因吧。不过，一般来说小数点的最后几位已经不重要了。例如我对一个数据集进行处理后，小数点后13位才开始有不同，这样的精度完全足够了。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何编写及使用自定义函数（UDF）<br />
请看这个链接：《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color:#0000ff;">Apache Pig中文教程（进阶）</span></a>》</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;什么是聚合函数（Aggregate Function）<br />
在pig中，聚合函数就是那些接受一个输入包（bag），返回一个标量（scalar）值的函数。COUNT函数就是一个例子。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;COGROUP做了什么<br />
与GROUP操作符一样，<span style="color:#ff0000;"><span style="background-color:#e6e6fa;">CO</span>GROUP</span>也是用来分组的，不同的是，COGROUP可以按多个关系中的字段进行分组。<br />
还是以一个实例来说明，假设有以下两个数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;a.txt&nbsp;
uidk&nbsp;&nbsp;&nbsp;&nbsp;12&nbsp;&nbsp;3
hfd&nbsp;&nbsp;&nbsp;&nbsp;132&nbsp;99
bbN&nbsp;&nbsp;&nbsp;&nbsp;463&nbsp;231
UFD&nbsp;&nbsp;&nbsp;&nbsp;13&nbsp;&nbsp;10

[root@localhost&nbsp;pig]$&nbsp;cat&nbsp;b.txt&nbsp;
908&nbsp;&nbsp;&nbsp;&nbsp;uidk&nbsp;&nbsp;&nbsp;&nbsp;888
345&nbsp;&nbsp;&nbsp;&nbsp;hfd&nbsp;557
28790&nbsp;&nbsp;&nbsp;&nbsp;re&nbsp;&nbsp;00000
</code></pre>
</section>
<p>现在我们用pig做如下操作及得到的结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(acol1:chararray,&nbsp;acol2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;acol3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;b.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(bcol1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;bcol2:chararray,&nbsp;bcol3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;C&nbsp;=&nbsp;COGROUP&nbsp;A&nbsp;BY&nbsp;acol1,&nbsp;B&nbsp;BY&nbsp;bcol2;
grunt&gt;&nbsp;DUMP&nbsp;C;
(re,{},{(28790,re,0)})
(UFD,{(UFD,13,10)},{})
(bbN,{(bbN,463,231)},{})
(hfd,{(hfd,132,99)},{(345,hfd,557)})
(uidk,{(uidk,12,3)},{(908,uidk,888)})
</code></pre>
</section>
<p>每一行输出的第一项都是分组的key，第二项和第三项分别都是一个包（bag），其中，第二项是根据前面的key找到的A中的数据包，第三项是根据前面的key找到的B中的数据包。<br />
来看看第一行输出：&ldquo;re&rdquo;作为group的key时，其找不到对应的A中的数据，因此第二项就是一个空的包&ldquo;{}&rdquo;，&ldquo;re&rdquo;这个key在B中找到了对应的数据（28790 &nbsp; &nbsp;re &nbsp; &nbsp;00000），因此第三项就是bag {(28790,re,0)}。<br />
其他输出数据也类似。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;安装pig后，运行pig命令时提示&ldquo;Cannot find hadoop configurations in classpath&rdquo;等错误的解决办法<br />
pig安装好后，运行pig命令时提示以下错误：</p>
<blockquote>
<p>
		ERROR org.apache.pig.Main - ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use local mode, please put -x local option in command line</p>
</blockquote>
<p>显而易见，提示找不到与hadoop相关的配置文件。所以我们需要把hadoop安装目录下的&ldquo;conf&rdquo;子目录添加到系统环境变量PATH中：<br />
修改 /etc/profile 文件，添加：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="bash language-bash hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">export</span>&nbsp;HADOOP_HOME=/usr/<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">local</span>/hadoop
<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">export</span>&nbsp;PIG_CLASSPATH=<span class="hljs-variable" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(98, 151, 85); word-wrap: inherit !important; word-break: inherit !important;">$HADOOP_HOME</span>/conf

PATH=<span class="hljs-variable" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(98, 151, 85); word-wrap: inherit !important; word-break: inherit !important;">$JAVA_HOME</span>/bin:<span class="hljs-variable" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(98, 151, 85); word-wrap: inherit !important; word-break: inherit !important;">$HADOOP_HOME</span>/bin:<span class="hljs-variable" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(98, 151, 85); word-wrap: inherit !important; word-break: inherit !important;">$PIG_CLASSPATH</span>:<span class="hljs-variable" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(98, 151, 85); word-wrap: inherit !important; word-break: inherit !important;">$PATH</span></code></pre>
</section>
<p>然后重新加载 /etc/profile 文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="bash language-bash hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">source</span>&nbsp;/etc/profile
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color:#ffffff;">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;piggybank是什么东西</p>
<blockquote>
<p>
		Pig also hosts a UDF repository called piggybank that allows users to share UDFs that they have written.</p>
</blockquote>
<p>说白了就是Apache把大家写的自定义函数放在一块儿，起了个名字，就叫做piggybank。你可以把它理解为一个SVN代码仓库。具体请看<a href="https://cwiki.apache.org/confluence/display/PIG/PiggyBank" rel="noopener noreferrer" target="_blank"><span style="color:#ff0000;">这里</span></a>。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;UDF的构造函数会被调用几次<br />
你可能会想在UDF的构造函数中做一些初始化的工作，例如创建一些文件，等等。但是你不能假设UDF的构造函数只被调用一次，因此，如果你要在构造函数中做一些只能做一次的工作，你就要当心了&mdash;&mdash;可能会导致错误。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;LOAD数据时，如何一次LOAD多个目录下的数据<br />
例如，我要LOAD两个HDFS目录下的数据：/abc/2010 和 /abc/2011，则我们可以这样写LOAD语句：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;/abc/201{0,1}&#39;</span>;
</code></pre>
</section>
<p>
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;怎样自己写一个UDF中的加载函数(load function)<br />
请看这个链接：《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;重载(overloading)一个UDF<br />
请看这个链接：《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;pig运行不起来，提示&ldquo;<span style="color:#b22222;">org.apache.hadoop.ipc.Client - Retrying connect to server:&nbsp;</span><br />
请看这个链接：《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;用含有null的字段来GROUP，结果会如何<br />
假设有数据文件 a.txt 内容为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">1&nbsp;&nbsp;&nbsp;&nbsp;2&nbsp;&nbsp;&nbsp;5
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3
1&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;
6&nbsp;&nbsp;&nbsp;&nbsp;9&nbsp;&nbsp;&nbsp;8
</code></pre>
</section>
<p>其中，每两列数据之间是用tab分割的，第二行的第2列、第三行的第3列没有内容（也就是说，加载到Pig里之后，对应的数据会变成null），如果把这些数据按第1、第2列来GROUP的话，第1、2列中含有null的行会被忽略吗？<br />
来做一下试验：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col1,&nbsp;col2);
DUMP&nbsp;B;
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">((1,2),{(1,2,5)})
((1,3),{(1,3,)})
((1,),{(1,,3)})
((6,9),{(6,9,8)})
</code></pre>
</section>
<p>从上面的结果（第三行）可见，原数据中第1、2列里含有null的行也被计入在内了，也就是说，GROUP操作是不会忽略null的，这与COUNT有所不同（见本文前面的部分）。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何统计数据中某些字段的组合有多少种<br />
假设有如下数据：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;a.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;4&nbsp;&nbsp;&nbsp;7
1&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;4
2&nbsp;&nbsp;&nbsp;&nbsp;7&nbsp;&nbsp;&nbsp;0&nbsp;&nbsp;&nbsp;5
9&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;6&nbsp;&nbsp;&nbsp;6
</code></pre>
</section>
<p>现在我们要统计第1、2列的不同组合有多少种，对本例来说，组合有三种：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">1&nbsp;&nbsp;&nbsp;&nbsp;3
2&nbsp;&nbsp;&nbsp;&nbsp;7
9&nbsp;&nbsp;&nbsp;&nbsp;8
</code></pre>
</section>
<p>也就是说我们要的答案是3。<br />
用Pig怎么计算？<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
先写出全部的Pig代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col4:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col1,&nbsp;col2);&nbsp;
C&nbsp;=&nbsp;GROUP&nbsp;B&nbsp;ALL;
D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;COUNT(B);&nbsp;
DUMP&nbsp;D;
</code></pre>
</section>
<p>然后再来看看这些代码是如何计算出上面的结果的：<br />
<strong><span style="color:#ff0000;">①</span></strong>第一行代码加载数据，没什么好说的。<br />
<strong><span style="color:#ff0000;">②</span></strong>第二行代码，得到第1、2列数据的所有组合。B的数据结构为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;B;
B:&nbsp;{group:&nbsp;(col1:&nbsp;int,col2:&nbsp;int),A:&nbsp;{col1:&nbsp;int,col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int}}
</code></pre>
</section>
<p>把B DUMP出来，得到：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">((1,3),{(1,3,4,7),(1,3,5,4)})
((2,7),{(2,7,0,5)})
((9,8),{(9,8,6,6)})
</code></pre>
</section>
<p>非常明显，(1,3)，(2,7)，(9,8)的所有组合已经被排列出来了，这里得到了若干行数据。下一步我们要做的就是统计这样的数据一共有多少行，也就得到了第1、2列的组合有多少组。<br />
<strong><span style="color:#ff0000;">③</span></strong>第三和第四行代码，就实现了统计数据行数的功能。参考本文前面部分的&ldquo;怎样统计数据行数&rdquo;一节。就明白这两句代码是什么意思了。<br />
这里需要特别说明的是：<br />
<strong><span style="color:#0000ff;">a)</span></strong>为什么倒数第二句代码中是COUNT(B)，而不是COUNT(group)？<br />
我们是对C进行FOREACH，所以要先看看C的数据结构：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;C;
C:&nbsp;{group:&nbsp;chararray,B:&nbsp;{group:&nbsp;(col1:&nbsp;int,col2:&nbsp;int),A:&nbsp;{col1:&nbsp;int,col2:&nbsp;int,col3:&nbsp;int,col4:&nbsp;int}}}
</code></pre>
</section>
<p>可见，你可以把C想像成一个map的结构，key是一个group，value是一个包（bag），它的名字是B，这个包中有N个元素，每一个元素都对应到②中所说的一行。根据②的分析，我们就是要统计B中元素的个数，因此，这里当然就是COUNT(B)了。<br />
<strong><span style="color:#0000ff;">b)</span></strong>COUNT函数的作用是统计一个包（bag）中的元素的个数：</p>
<blockquote>
<div>
		COUNT</div>
<div>
		Computes the number of elements in a bag.</div>
</blockquote>
<div>
	从C的数据结构看，B是一个bag，所以COUNT函数是可以用于它的。<br />
	如果你试图把COUNT应用于一个非bag的数据结构上，会发生错误，例如：、</p>
<div>
<blockquote>
<div>
				java.lang.ClassCastException: org.apache.pig.data.BinSedesTuple cannot be cast to org.apache.pig.data.DataBag</div>
</blockquote></div>
</div>
<p>这是把Tuple传给COUNT函数时发生的错误。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;两个整型数相除，如何转换为浮点型，从而得到正确的结果<br />
这个问题其实很傻，或许不用说你也知道了：假设有int a = 3 和 int b = 2两个数，在大多数编程语言里，a/b得到的是1，想得到正确结果1.5的话，需要转换为float再计算。在Pig中其实和这种情况一样，下面就拿几行数据来做个实验：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;a.txt&nbsp;</span>
3&nbsp;&nbsp;&nbsp;&nbsp;2
4&nbsp;&nbsp;&nbsp;&nbsp;5
</code></pre>
</section>
<p>在Pig中：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1/col2;
grunt&gt;&nbsp;DUMP&nbsp;B;
(1)
(0)
</code></pre>
</section>
<p>可见，不加类型转换的计算结果是取整之后的值。<br />
那么，转换一下试试：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;(float)(col1/col2);
grunt&gt;&nbsp;DUMP&nbsp;B;
(1.0)
(0.0)
</code></pre>
</section>
<p>这样转换还是不行的，这与大多数编程语言的结果一致&mdash;&mdash;它只是把取整之后的数再转换为浮点数，因此当然是不行的。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
正确的做法应该是：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;
grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;(float)col1/col2;&nbsp;&nbsp;
grunt&gt;&nbsp;DUMP&nbsp;B;
(1.5)
(0.8)
</code></pre>
</section>
<p>或者这样也行：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
grunt&gt;&nbsp;B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1/(float)col2;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
grunt&gt;&nbsp;DUMP&nbsp;B;
(1.5)
(0.8)
</code></pre>
</section>
<p>这与我们的通常做法是一致的，因此，你要做除法运算的时候，需要注意这一点。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;UNION的一个例子<br />
假设有两个数据文件为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
0&nbsp;&nbsp;&nbsp;&nbsp;3
1&nbsp;&nbsp;&nbsp;&nbsp;5
0&nbsp;&nbsp;&nbsp;&nbsp;8

[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;2.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;6
0&nbsp;&nbsp;&nbsp;&nbsp;9
</code></pre>
</section>
<p>现在要求出：在第一列相同的情况下，第二列的和分别为多少？<br />
例如，第一列为 1 的时候，第二列有5和6两个值，和为11。同理，第一列为0的时候，第二列的和为 3+8+9=20。<br />
计算此问题的Pig代码如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(a:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;b:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;
B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(c:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;d:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;
C&nbsp;=&nbsp;UNION&nbsp;A,&nbsp;B;
D&nbsp;=&nbsp;GROUP&nbsp;C&nbsp;BY&nbsp;$0;&nbsp;
E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;FLATTEN(group),&nbsp;SUM(C.$1);
DUMP&nbsp;E;
</code></pre>
</section>
<p>输出为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(0,20)
(1,11)
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
我们来看看每一步分别做了什么：<br />
<strong><span style="color:#ff0000;">①</span></strong>第1行、第2行代码分别加载数据到关系A、B中，没什么好说的。<br />
<strong><span style="color:#ff0000;">②</span></strong>第3行代码，将关系A、B合并起来了。合并后的数据结构为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;C;
C:&nbsp;{a:&nbsp;int,b:&nbsp;int}
</code></pre>
</section>
<p>其数据为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;DUMP&nbsp;C;
(0,3)
(1,5)
(0,8)
(1,6)
(0,9)
</code></pre>
</section>
<p><strong><span style="color:#ff0000;">③</span></strong>第4行代码按第1列（即$0）进行分组，分组后的数据结构为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;D;
D:&nbsp;{group:&nbsp;int,C:&nbsp;{a:&nbsp;int,b:&nbsp;int}}
</code></pre>
</section>
<p>其数据为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;DUMP&nbsp;D;
(0,{(0,9),(0,3),(0,8)})
(1,{(1,5),(1,6)})
</code></pre>
</section>
<p><strong><span style="color:#ff0000;">④</span></strong>最后一行代码，遍历D，将D中每一行里的所有bag(即C)的第2列(即$1)进行累加，就得到了我们要的结果。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;错误&ldquo;<span style="color:#0000ff;">ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.</span>&rdquo;的可能原因<br />
<strong><span style="color:#ff0000;">①</span></strong>Pig的bug，详见<a href="https://issues.apache.org/jira/browse/PIG-1683" rel="noopener noreferrer" target="_blank">此链接</a>；<br />
<strong><span style="color:#ff0000;">②</span></strong>其他原因。我遇到并解决了一例。具体的代码不便在此陈列，但是基本可以说是由于自己写的Pig代码对复杂数据结构的处理不当导致的，后来我尝试更改了一种实现方式，就绕过了这个问题。关于这点，确实还是要具体问题具体分析的，在这里没有实例的话，无法给大家一个明确的解决问题的指南。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何在Pig中使用正则表达式对字符串进行匹配<br />
假设你有如下数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;a.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;http://ui.qq.com/abcd.html
5&nbsp;&nbsp;&nbsp;&nbsp;http://tr.qq.com/743.html
8&nbsp;&nbsp;&nbsp;&nbsp;http://vid.163.com/trees.php
9&nbsp;&nbsp;&nbsp;&nbsp;http:auto.qq.com/us.php
</code></pre>
</section>
<p>现在要找出该文件中，第二列符合&ldquo;<span style="color:#ff0000;">*//*.qq.com/*</span>&rdquo;模式的所有行（此处只有前两行符合条件），怎么做？<br />
Pig代码如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;col2&nbsp;matches&nbsp;&#39;.*//.*\\.qq\\.com/.*&#39;;&nbsp;&nbsp;
DUMP&nbsp;B;
</code></pre>
</section>
<p>我们看到，matches关键字对 col2 进行了正则匹配，它使用的是Java格式的正则表达式匹配规则。<br />
<span style="color:#ff0000;">.&nbsp;</span>表示任意字符，<strong><span style="color:#ff0000;">*&nbsp;</span></strong>表示字符出现任意次数；<strong><span style="color:#ff0000;">\.&nbsp;</span></strong>对&nbsp;<span style="color:#ff0000;">.&nbsp;</span>进行了转义，表示匹配&nbsp;<span style="color:#ff0000;">.&nbsp;</span>这个字符；<span style="color:#ff0000;">/&nbsp;</span>就是表示匹配&nbsp;<strong><span style="color:#ff0000;">/&nbsp;</span></strong>这个字符。<br />
这里需要注意的是，在引号中，用于转义的字符&nbsp;<strong><span style="color:#ff0000;">\&nbsp;</span></strong>需要打两个才能表示一个，所以上面的&nbsp;<strong><span style="color:#ff0000;">\\.&nbsp;</span></strong>就是与正则中的&nbsp;<strong><span style="color:#ff0000;">\.&nbsp;</span></strong>是一样的，即匹配<strong>&nbsp;</strong><span style="color:#ff0000;">.&nbsp;</span>这个字符。所以，如果你要匹配数字的话，应该用这种写法（<strong><span style="color:#ff0000;">\d</span></strong>表示匹配数字，在引号中必须用<strong><span style="color:#ff0000;">\\d</span></strong>）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;(col&nbsp;matches&nbsp;&#39;\\d.*&#39;);
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
最后输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,http://ui.qq.com/abcd.html)
(5,http://tr.qq.com/743.html)
</code></pre>
</section>
<p>可见结果是正确的。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何截取一个字符串中的某一段<br />
在处理数据时，如果你想提取出一个日期字符串的年份，例如提取出&ldquo;2011-10-26&rdquo;中的&ldquo;2011&rdquo;，可以用内置函数 <span style="color:#0000ff;">SUBSTRING</span> 来实现：</p>
<blockquote>
<div>
		<span style="font-size:20px;">SUBSTRING</span></div>
<div>
		Returns a substring from a given string.</div>
<div>
		<span style="font-size:18px;">Syntax</span></div>
<div>
		SUBSTRING(string, startIndex, stopIndex)</div>
</blockquote>
<p>下面举一个例子。假设有数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;a.txt&nbsp;</span>
2010-05-06&nbsp;&nbsp;&nbsp;&nbsp;abc
2008-06-18&nbsp;&nbsp;&nbsp;&nbsp;uio
2011-10-11&nbsp;&nbsp;&nbsp;&nbsp;tyr
2010-12-23&nbsp;&nbsp;&nbsp;&nbsp;fgh
2011-01-05&nbsp;&nbsp;&nbsp;&nbsp;vbn
</code></pre>
</section>
<p>第一列是日期，现在要找出所有不重复的年份有哪些，可以这样做：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(dateStr:&nbsp;chararray,&nbsp;flag:&nbsp;chararray);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;SUBSTRING(dateStr,&nbsp;0,&nbsp;4);
C&nbsp;=&nbsp;DISTINCT&nbsp;B;
DUMP&nbsp;C;
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(2008)
(2010)
(2011)
</code></pre>
</section>
<p>可见达到了我们想要的效果。<br />
上面的代码太简单了，不必多言，唯一需要说明一下的是 SUBSTRING 函数，它的第一个参数是要截取的字符串，第二个参数是起始索引（从0开始），第三个参数是结束索引。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何拼接两个字符串<br />
假设有以下数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
abc&nbsp;&nbsp;&nbsp;&nbsp;123
cde&nbsp;&nbsp;&nbsp;&nbsp;456
fgh&nbsp;&nbsp;&nbsp;&nbsp;789
ijk&nbsp;&nbsp;&nbsp;&nbsp;200
</code></pre>
</section>
<p>现在要把第一列和第二列作为字符串拼接起来，例如第一行会变成&ldquo;abc123&rdquo;，那么使用CONCAT这个求值函数（eval function）就可以做到：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;CONCAT(col1,&nbsp;(chararray)col2);
DUMP&nbsp;B;
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(abc123)
(cde456)
(fgh789)
(ijk200)
</code></pre>
</section>
<p>注意这里故意在加载数据的时候把第二列指定为int类型，这是为了说明数据类型不一致的时候CONCAT会出错（你可以试验一下）：</p>
<blockquote>
<p>
		ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.</p>
</blockquote>
<p>所以在后面CONCAT的时候，对第二列进行了类型转换。<br />
另外，如果数据文件内容为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
5&nbsp;&nbsp;&nbsp;&nbsp;123
7&nbsp;&nbsp;&nbsp;&nbsp;456
8&nbsp;&nbsp;&nbsp;&nbsp;789
0&nbsp;&nbsp;&nbsp;&nbsp;200
</code></pre>
</section>
<p>那么，如果对两列整数CONCAT：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;CONCAT(col1,&nbsp;col2);
</code></pre>
</section>
<p><span style="color:#ff0000;">同样也会出错</span>：</p>
<blockquote>
<p>
		ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.</p>
</blockquote>
<p>要注意这一点。<br />
有人可能会问：要拼接几个字符串的话怎么办？CONCAT 套 CONCAT 就要可以了（有点笨，但管用）： <span style="color:#0000ff;">CONCAT(a, CONCAT(b, c))</span></p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何求两个数据集的重合 &amp; 不同的数据类型JOIN会失败<br />
假设有以下两个数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
123
456
789
200
</code></pre>
</section>
<p>以及：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;2.txt&nbsp;</span>
200
333
789
</code></pre>
</section>
<p>现在要找出两个文件中，相同的数据有多少行，怎么做？这也就是所谓的求两个数据集的<span style="color:#ff0000;">重合</span>。<br />
用关系操作符JOIN，我们可以达到这个目的。在处理海量数据时，经常会有求重合的需求。所以JOIN是Pig中一个极其重要的操作。<br />
在本例中，两个文件中有两个相同的数据行：789以及200，因此，结果应该是2。<br />
我们先来看看正确的代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(a:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;&nbsp;&nbsp;
B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(b:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;a,&nbsp;B&nbsp;BY&nbsp;b;
D&nbsp;=&nbsp;GROUP&nbsp;C&nbsp;ALL;
E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;COUNT(C);
DUMP&nbsp;E;
</code></pre>
</section>
<p>解释一下：<br />
<span style="color:#ff0000;">①</span>第一、二行是加载数据，不必多言。<br />
<span style="color:#ff0000;">②</span>第三行按A的第1列、B的第二列进行&ldquo;结合&rdquo;，JOIN之后，a、b两列不相同的数据就被剔除掉了。C的数据结构为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">C:&nbsp;{A::a:&nbsp;int,B::b:&nbsp;int}
</code></pre>
</section>
<p>C的数据为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(200,200)
(789,789)
</code></pre>
</section>
<p><span style="color:#ff0000;">③</span>由于我们要统计的是数据行数，所以上面的Pig代码中的第4、5行就进行了计数的运算。<br />
<span style="color:#ff0000;">④</span>如果文件 2.txt 多一行数据&ldquo;200&rdquo;，结果会是什么？答案是：结果为3行。这个时候C的数据为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(200,200)
(200,200)
(789,789)
</code></pre>
</section>
<p>所以如果你要去除重复的，还需要用DISTINCE对C处理一下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(a:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(b:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;a,&nbsp;B&nbsp;BY&nbsp;b;
uniq_C&nbsp;=&nbsp;DISTINCT&nbsp;C;
D&nbsp;=&nbsp;GROUP&nbsp;uniq_C&nbsp;ALL;
E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;COUNT(uniq_C);
DUMP&nbsp;E;
</code></pre>
</section>
<p>这样得到的结果就是2了。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
尤其需要注意的是，如果JOIN的两列具有不同的数据类型，是会失败的。例如以下代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(a:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);&nbsp;&nbsp;&nbsp;
B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(b:&nbsp;chararray);
C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;a,&nbsp;B&nbsp;BY&nbsp;b;
D&nbsp;=&nbsp;GROUP&nbsp;C&nbsp;ALL;
E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;COUNT(C);
DUMP&nbsp;E;
</code></pre>
</section>
<div>
	在语法上是没有错误的，但是一运行就会报错：</div>
<blockquote>
<div>
		ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1107: Cannot merge join keys, incompatible types</div>
</blockquote>
<div>
	这是因为a、b具有不同的类型：int和chararray。</p>
<p>	<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;使用三目运算符&ldquo; <span style="color:#ff0000;">? :</span> &rdquo;有时候必须加括号<br />
	假设有以下数据文件：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;a.txt&nbsp;</span>
5&nbsp;&nbsp;&nbsp;&nbsp;8&nbsp;&nbsp;&nbsp;9
6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0
4&nbsp;&nbsp;&nbsp;&nbsp;3&nbsp;&nbsp;&nbsp;1
</code></pre>
</section>
<p>其中，第二行的第二列数据是有缺失的，因此，加载数据之后，它会成为null。顺便废话一句，在处理海量数据时，数据有缺失是经常遇到的现象。<br />
现在，我们如果要把所有缺失的数据填为 -1， 可以使用三目运算符来操作：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col3:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;col1,&nbsp;((col2&nbsp;is&nbsp;null)?&nbsp;-1&nbsp;:&nbsp;col2),&nbsp;col3;
DUMP&nbsp;B;
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(5,8,9)
(6,-1,0)
(4,3,1)
</code></pre>
</section>
<p><span style="color:#0000ff;">((col2 is null)? -1 : col2)</span> 的含义不用解释你也知道，就是当col2为null的时候将其置为-1，否则就保持原来的值，但是注意，它最外面是用括号括起来的，如果去掉括号，写成&nbsp;<span style="color:#0000ff;">(col2 is null)? -1 : col2</span>，那么就会有语法错误：</p>
<blockquote>
<div>
		ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered &quot; &quot;is&quot; &quot;is &quot;&quot; at line 1, column 36.</div>
<div>
		Was expecting one of （后面省略）</div>
</blockquote>
<div>
	错误提示有点不直观。所以，有时候使用三目运算符是必须要使用括号的。</p>
<p>	<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何补上缺失的数据<br />
	通过前面的文章，我们已经知道了如何按自己的需求补上缺失的数据，那么这里还有一个例子，可以让你多了解一些特殊的情况。<br />
	数据文件如下：</div>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;(4,9)
5&nbsp;&nbsp;&nbsp;&nbsp;
8&nbsp;&nbsp;&nbsp;&nbsp;(3,0)
5&nbsp;&nbsp;&nbsp;&nbsp;(9,2)
6&nbsp;&nbsp;&nbsp;&nbsp;
</code></pre>
</section>
<p>这些数据的布局比较怪，我们要把它加载成什么样的schema呢？答：第一列为一个int，第二列为一个tuple，此tuple又含两个int。加载成这样的模式不是为了制造复杂度，而是为了说明后面的问题而设计的。<br />
同时，我们也注意到，第二列数据是有缺失的。<br />
问题：怎样求在第一列数据相同的情况下，第二列数据中的第一个整数的和分别为多少？<br />
例如，第一列为1的数据只有一行（即第一行），因此，第二列的第一个整数的和就是4。<br />
但是对最后一行，也就是第一列为6时，由于其第二列数据缺失，我们希望它输出的结果是0。<br />
先来看看Pig代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(a:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;b:tuple(x:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;y:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>));
B&nbsp;=&nbsp;FOREACH&nbsp;A&nbsp;GENERATE&nbsp;a,&nbsp;FLATTEN(b);
C&nbsp;=&nbsp;GROUP&nbsp;B&nbsp;BY&nbsp;a;
D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;group,&nbsp;SUM(B.x);
DUMP&nbsp;D;
</code></pre>
</section>
<p>结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,4)
(5,9)
(6,)
(8,3)
</code></pre>
</section>
<p>我们注意到，(5,9) 这一行是由数据文件 1.txt 的第 2、4行计算得到的，其中，第2行数据有缺失，但这并不影响求和计算，因为另一行数据没有缺失。你可以这样想：一个包（bag）中有多个数，当其中一个为null，而其他不为null时，把它们相加会自动忽略null。<br />
然而，第三行 (6,) 是不是太刺眼了？没错，因为数据文件 1.txt 的最后一行缺失了第二列，所以，在 SUM(B.x) 中的 B.x 为null就会导致计算结果为null，从而什么也输出不了。<br />
这就与我们期望的输出有点不同了。我们希望这种缺失的数据不要空着，而是输出0。该怎么做呢？<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="color:#ff0000;">想法1</span>：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;group,&nbsp;((IsEmpty(B.x))&nbsp;?&nbsp;0&nbsp;:&nbsp;SUM(B.x));
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,4)
(5,9)
(6,)
(8,3)
</code></pre>
</section>
<p>可见行不通。从这个结果我们知道，IsEmpty(B.x) 为false，即B.x不是empty的，所以不能这样做。<br />
<span style="color:#ff0000;">想法2</span>：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;group,&nbsp;((B.x&nbsp;is&nbsp;null)&nbsp;?&nbsp;0&nbsp;:&nbsp;SUM(B.x));
</code></pre>
</section>
<p>输出结果还是与上面一样！仍然行不通。这更奇怪了：B.x既非empty，也非null，那么它是什么情况？按照我的理解，当group为6时，它应该是一个非空的包（bag），里面有一个null的东西，所以，这个包不是empty的，它也非null。我不知道这样理解是否正确，但是它看上去就像是这样的。<br />
<span style="color:#ff0000;">想法3</span>：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;group,&nbsp;SUM(B.x)&nbsp;AS&nbsp;s;
E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;group,&nbsp;((s&nbsp;is&nbsp;null)&nbsp;?&nbsp;-1&nbsp;:&nbsp;s);
DUMP&nbsp;E;
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,4)
(5,9)
(6,-1)
(8,3)
</code></pre>
</section>
<p>可见达到了我们想要的结果。这与本文前面部分的做法是一致的，即：先得到含null的结果，再把这个结果中的null替换为指定的值。<br />
有人会问：就没有办法在生成数据集D的时候，就直接通过判断语句来实现这个效果吗？据我目前所知是不行的，如果哪位读者知道，不妨告知。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;DISTINCT操作用于去重，正因为它要把数据集合到一起，才知道哪些数据是重复的，因此，它会产生reduce过程。同时，在map阶段，它也会利用combiner来先去除一部分重复数据以加快处理速度。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何将Pig job的优先级设为HIGH<br />
嫌Pig job运行太慢？只需在Pig脚本的开头加上一句：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;job.priority&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">HIGH</span>;
</code></pre>
</section>
<p>即可将Pig job的优先级设为高了。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;&ldquo;<span style="color:#ff0000;">Scalars can be only used with projections</span>&rdquo;错误的原因<br />
这个错误提示比较不直观，光看这句话是不容易发现错误所在的，但是，只要你一Google，可能就找到原因了，例如<a href="https://issues.apache.org/jira/browse/PIG-1788" rel="noopener noreferrer" target="_blank">这个链接</a>里的反馈。<br />
在这里，我也想用一个简单的例子给大家用演示一下产生这个错误的原因之一。<br />
假设有如下数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;1.txt&nbsp;
a&nbsp;&nbsp;&nbsp;&nbsp;1
b&nbsp;&nbsp;&nbsp;&nbsp;8
c&nbsp;&nbsp;&nbsp;&nbsp;3
c&nbsp;&nbsp;&nbsp;&nbsp;3
d&nbsp;&nbsp;&nbsp;&nbsp;6
d&nbsp;&nbsp;&nbsp;&nbsp;3
c&nbsp;&nbsp;&nbsp;&nbsp;5
e&nbsp;&nbsp;&nbsp;&nbsp;7
</code></pre>
</section>
<p>现在要统计：在第1列的每一种组合下，第二列为3和6的数据分别有多少条？<br />
例如，当第1列为 c 时，第二列为3的数据有2条，为6的数据有0条；当第1列为d时，第二列为3的数据有1条，为6的数据有1条。其他的依此类推。<br />
Pig代码如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:chararray,&nbsp;col2:<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;col1;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;{
&nbsp;&nbsp;&nbsp;&nbsp;D&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;col2&nbsp;==&nbsp;3;
&nbsp;&nbsp;&nbsp;&nbsp;E&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;col2&nbsp;==&nbsp;6;
&nbsp;&nbsp;&nbsp;&nbsp;GENERATE&nbsp;group,&nbsp;COUNT(D),&nbsp;COUNT(E);
};
DUMP&nbsp;C;
</code></pre>
</section>
<p>输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(a,0,0)
(b,0,0)
(c,2,0)
(d,1,1)
(e,0,0)
</code></pre>
</section>
<p>可见结果是正确的。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
那么，如果我在上面的代码中，把&ldquo;D = FILTER <span style="color:#ff0000;">A</span> BY col2 == 3&rdquo;不小心写成了&ldquo;D = FILTER <span style="color:#ff0000;">B</span> BY col2 == 3&rdquo;，就肯定会得到&ldquo;<span style="color:#0000ff;">Scalars can be only used with projections</span>&rdquo;的错误提示。<br />
说白了，还是要时刻注意你每一步生成的数据的结构，眼睛睁大，千万不要用错了relation。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;什么是嵌套的FOREACH/内部的FOREACH<br />
嵌套的（nested）FOREACH和内部的（inner）FOREACH是一个意思，正如你在本文第(35)条中所见，一个FOREACH可以对每一条记录施以多种不同的关系操作，然后再GENERATE得到想要的结果，这就是嵌套的/内部的FOREACH。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;错误&ldquo;Could not infer the matching function for org.apache.pig.builtin.CONCAT&rdquo;的原因之一<br />
如果你遇到这个错误，那么有可能是你在多级CONCAT嵌套的时候，没有写对语句，例如&ldquo;CONCAT(CONCAT(CONCAT(a, b), c), d)&rdquo;这样的嵌套，由于括号众多，所以写错了是一点也不奇怪的。我遇这个错误的时候，是由于CONCAT太多，自己多写了一个都没有发现。希望我的提醒能给你一点解决问题的提示。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;用Pig加载HBase数据时遇到的错误&ldquo;<span style="color:#0000ff;">ERROR 2999: Unexpected internal error. could not instantiate &#39;com.twitter.elephantbird.pig.load.HBaseLoader&#39; with arguments XXX</span>&rdquo;的原因之一<br />
请看这个链接：《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;错误&ldquo;<span style="color:#0000ff;">ERROR 1039: In alias XX, incompatible types in EqualTo Operator left hand side:XXX right hand side:XXX</span>&rdquo;的原因<br />
其实这个错误提示太明显了，就是类型不匹配造成的。上面的XXX可以指代不同的类型。<br />
这说明，前面可能有一个类型为long的字段，后面你却把它当chararray来用了，例如：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">long</span>);
B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;col2&nbsp;==&nbsp;&#39;123456789&#39;;
C&nbsp;=&nbsp;GROUP&nbsp;B&nbsp;ALL;
D&nbsp;=&nbsp;FOREACH&nbsp;C&nbsp;GENERATE&nbsp;COUNT(B);
DUMP&nbsp;D;
</code></pre>
</section>
<p>就会出错：</p>
<blockquote>
<p>
		ERROR 1039: In alias B, incompatible types in EqualTo Operator left hand side:long right hand side:chararray</p>
</blockquote>
<p>只要把col2强制类型转换一下（或者一开始就将其类型指定为chararray）就可以解决问题。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
不仅在进行数据比较中，在JOIN时也经常出现数据类型不匹配导致的错误问题。我在实际工作中发现，有的同学写了比较长的Pig代码，出现了这样的错误却不会仔细去看错误提示，而是绞尽脑汁地逐句去检查语法（语法是没有错的），结果费了很大的劲才知道是类型问题，得不偿失，还不如仔细看错误提示想想为什么。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;在grunt交互模式下，如何在编辑Pig代码的时候跳到行首和行末/行尾<br />
在grunt模式下，如果你写了一句超长的Pig代码，那么，你想通过HOME/END键跳到行首和行末是做不到的。<br />
按HOME时，Pig会在你的光标处插入一个&ldquo;1~&rdquo;：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">col</span>:&nbsp;int1~);
</code></pre>
</section>
<p>按END时，Pig会在你的光标处插入一个&ldquo;4~&rdquo;：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">col</span>:&nbsp;int4~);
</code></pre>
</section>
<p>正确的做法是：按Ctrl+A 和 Ctrl+E 代替 HOME 和 END，就可以跳到行首和行末了。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;不能对同一个关系（relation）进行JOIN<br />
假设有如下文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;a
2&nbsp;&nbsp;&nbsp;&nbsp;e
3&nbsp;&nbsp;&nbsp;&nbsp;v
4&nbsp;&nbsp;&nbsp;&nbsp;n
</code></pre>
</section>
<p>我想对第一列这样JOIN：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);&nbsp;
B&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1,&nbsp;A&nbsp;BY&nbsp;col1;
</code></pre>
</section>
<p>那么当你试图 DUMP B 的时候，会报如下的错：</p>
<blockquote>
<p>
		ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108: Duplicate schema alias: A::col1 in &quot;B&quot;</p>
</blockquote>
<p>这是因为Pig会弄不清JOIN之后的字段名&mdash;&mdash;两个字段均为A::col1，使得一个关系（relation）中出现了重复的名字，这是不允许的。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
要解决这个问题，只需将数据LOAD两次，并且给它们起不同的名字就可以了：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
grunt&gt;&nbsp;B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
grunt&gt;&nbsp;C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1,&nbsp;B&nbsp;BY&nbsp;col1;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;C;
C:&nbsp;{A::col1:&nbsp;int,A::col2:&nbsp;chararray,B::col1:&nbsp;int,B::col2:&nbsp;chararray}
grunt&gt;&nbsp;DUMP&nbsp;C;
(1,a,1,a)
(2,e,2,e)
(3,v,3,v)
(4,n,4,n)
</code></pre>
</section>
<p>从上面的 C 的schema，你可以看出来，如果对同一个关系A的第一列进行JOIN，会导致schema中出现相同的字段名，所以当然会出错。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;外部的JOIN(outer JOIN)<br />
初次使用JOIN时，一般人使用的都是所谓的&ldquo;内部的JOIN&rdquo;(inner JOIN)，也即类似于 C = JOIN A BY col1, B BY col2 这样的JOIN。Pig也支持&ldquo;外部的JOIN&rdquo;(outer JOIN)，下面就举一个例子。<br />
假设有文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;1.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;a
2&nbsp;&nbsp;&nbsp;&nbsp;e
3&nbsp;&nbsp;&nbsp;&nbsp;v
4&nbsp;&nbsp;&nbsp;&nbsp;n
</code></pre>
</section>
<p>以及：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;2.txt&nbsp;</span>
9&nbsp;&nbsp;&nbsp;&nbsp;a
2&nbsp;&nbsp;&nbsp;&nbsp;e
3&nbsp;&nbsp;&nbsp;&nbsp;v
0&nbsp;&nbsp;&nbsp;&nbsp;n
</code></pre>
</section>
<p>现在来对这两个文件的第一列作一个outer JOIN：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
grunt&gt;&nbsp;B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
grunt&gt;&nbsp;C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1&nbsp;LEFT&nbsp;OUTER,&nbsp;B&nbsp;BY&nbsp;col1;
grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;C;
C:&nbsp;{A::col1:&nbsp;int,A::col2:&nbsp;chararray,B::col1:&nbsp;int,B::col2:&nbsp;chararray}
grunt&gt;&nbsp;DUMP&nbsp;C;
(1,a,,)
(2,e,2,e)
(3,v,3,v)
(4,n,,)
</code></pre>
</section>
<p>在outer JOIN中，&ldquo;OUTER&rdquo;关键字是可以省略的。从上面的结果，我们注意到：如果换成一个inner JOIN，则两个输入文件的第一、第四行都不会出现在结果中（因为它们的第一列不相同），而在LEFT OUTER JOIN中，文件1.txt的第一、四行却被输出了，所以这就是LEFT OUTER JOIN的特点：<span style="color:#800080;">对左边的记录来说，即使<span class="KSFIND_CLASS_SELECT" id="0KSFindDIV">它与右边的记录不匹配，它也会被</span>包含在输出数据中</span>。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
同理可知RIGHT OUTER JOIN的功能&mdash;&mdash;把上面的 LEFT 换成 RIGHT，结果如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(,,0,n)
(2,e,2,e)
(3,v,3,v)
(,,9,a)
</code></pre>
</section>
<p>可见，与左边的记录不匹配的右边的记录被保存了下来，而左边的记录没有保存下来（两个逗号表明其为空），这就是RIGHT OUTER JOIN的效果，与我们想像的一样。<br />
有人会问，OUTER JOIN在实际中可以用来做什么？举一个例子：可以用来求&ldquo;<span style="color:#a52a2a;">不在某数据集中的那些数据（即：不重合的数据）</span>&rdquo;。还是以上面的两个数据文件为例，现在我要求出 1.txt 中，第一列不在 2.txt 中的第一列的那些记录，肉眼一看就知道，1和4这两个数字在 2.txt 的第一列里没有出现，而2和3出现了，因此，我们要找的记录就是：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">1&nbsp;&nbsp;&nbsp;&nbsp;a
4&nbsp;&nbsp;&nbsp;&nbsp;n
</code></pre>
</section>
<p>要实现这个效果，Pig代码及结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
grunt&gt;&nbsp;B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray);
grunt&gt;&nbsp;C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1&nbsp;LEFT&nbsp;OUTER,&nbsp;B&nbsp;BY&nbsp;col1;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
grunt&gt;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">DESCRIBE</span>&nbsp;C;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
C:&nbsp;{A::col1:&nbsp;int,A::col2:&nbsp;chararray,B::col1:&nbsp;int,B::col2:&nbsp;chararray}
grunt&gt;&nbsp;D&nbsp;=&nbsp;FILTER&nbsp;C&nbsp;BY&nbsp;(B::col1&nbsp;is&nbsp;null);&nbsp;
grunt&gt;&nbsp;E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;A::col1&nbsp;AS&nbsp;col1,&nbsp;A::col2&nbsp;AS&nbsp;col2;
grunt&gt;&nbsp;DUMP&nbsp;E;
(1,a)
(4,n)
</code></pre>
</section>
<p>可见，我们确实找出了&ldquo;不重合的记录&rdquo;。在作海量数据分析时，这种功能是极为有用的。<br />
最后来一个总结：<br />
假设有两个数据集（在1.txt和2.txt中），分别都只有1列，则如下代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray);
B&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;2.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray);&nbsp;&nbsp;
C&nbsp;=&nbsp;JOIN&nbsp;A&nbsp;BY&nbsp;col1&nbsp;LEFT&nbsp;OUTER,&nbsp;B&nbsp;BY&nbsp;col1;
D&nbsp;=&nbsp;FILTER&nbsp;C&nbsp;BY&nbsp;(B::col1&nbsp;is&nbsp;null);
E&nbsp;=&nbsp;FOREACH&nbsp;D&nbsp;GENERATE&nbsp;A::col1&nbsp;AS&nbsp;col1;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
DUMP&nbsp;E;
</code></pre>
</section>
<p>计算结果为：在A中，但不在B中的记录。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;JOIN的优化<br />
请看这个链接：《<a href="http://www.codelast.com/?p=4249" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255); ">Apache Pig中文教程（进阶）</span></a>》</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;GROUP时按所有字段分组可以用GROUP ALL吗<br />
假设你有如下数据文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]<span class="hljs-comment" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(128, 128, 128); word-wrap: inherit !important; word-break: inherit !important;">#&nbsp;cat&nbsp;3.txt&nbsp;</span>
1&nbsp;&nbsp;&nbsp;&nbsp;9
2&nbsp;&nbsp;&nbsp;&nbsp;2
3&nbsp;&nbsp;&nbsp;&nbsp;3
4&nbsp;&nbsp;&nbsp;&nbsp;0
1&nbsp;&nbsp;&nbsp;&nbsp;9
1&nbsp;&nbsp;&nbsp;&nbsp;9
4&nbsp;&nbsp;&nbsp;&nbsp;0
</code></pre>
</section>
<p>现在要找出第1、2列的组合中，每一种的个数分别为多少，例如，(1,9)组合有3个，(4,0)组合有两个，依此类推。<br />
显而易见，我们只需要用GROUP就可以轻易完成这个任务。于是写出如下代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;3.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;ALL;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;group,&nbsp;COUNT(A);
DUMP&nbsp;C;
</code></pre>
</section>
<p>可惜，结果不是我们想要的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(all,7)
</code></pre>
</section>
<p>为什么呢？我们的本意是按所有列来GROUP，于是使用了GROUP ALL，但是这实际上变成了统计行数，下面的代码就是一段标准的统计数据行数的代码：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;3.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;ALL;
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;COUNT(A);
DUMP&nbsp;C;
</code></pre>
</section>
<p>因此，上面的&nbsp;<span style="color:#0000ff;">C = FOREACH B GENERATE group, COUNT(A)</span> 也无非就是多打印了一个group的名字（<span style="color:#ff0000;">all</span>）而已&mdash;&mdash;group的名字被设置为&ldquo;<span style="color:#ff0000;">all</span>&rdquo;，这是Pig帮你做的。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
正确的做法很简单，只需要按所有字段GROUP，就可以了：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;3.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>);
B&nbsp;=&nbsp;GROUP&nbsp;A&nbsp;BY&nbsp;(col1,&nbsp;col2);
C&nbsp;=&nbsp;FOREACH&nbsp;B&nbsp;GENERATE&nbsp;group,&nbsp;COUNT(A);
DUMP&nbsp;C;
</code></pre>
</section>
<p>结果如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">((1,9),3)
((2,2),1)
((3,3),1)
((4,0),2)
</code></pre>
</section>
<p>这与我们前面分析的正确结果是一样的。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;在Pig中使用中文字符串<br />
有读者来信问我，<span style="color:#800080;">如何在Pig中使用中文作为FILTER的条件</span>？我做了如下测试，结论是可以使用中文。<br />
数据文件 data.txt 内容为（每一列之间以TAB为分隔符）：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">1&nbsp;&nbsp;&nbsp;&nbsp;北京市&nbsp;a
2&nbsp;&nbsp;&nbsp;&nbsp;上海市&nbsp;b
3&nbsp;&nbsp;&nbsp;&nbsp;北京市&nbsp;c
4&nbsp;&nbsp;&nbsp;&nbsp;北京市&nbsp;f
5&nbsp;&nbsp;&nbsp;&nbsp;天津市&nbsp;e
</code></pre>
</section>
<p>Pig脚本文件 test.pig 内容为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;data.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray);
B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;(col2&nbsp;==&nbsp;&#39;北京市&#39;);
DUMP&nbsp;B;
</code></pre>
</section>
<p>首先，我这两个文件的编码都是<span style="color:#ff0000;">UTF-8(无BOM)</span>，在Linux命令行下，我直接以本地模式执行Pig脚本 test.pig：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">pig&nbsp;-x&nbsp;local&nbsp;test.pig
</code></pre>
</section>
<p>得到的输出结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1,北京市,a)
(3,北京市,c)
(4,北京市,f)
</code></pre>
</section>
<p>可见结果是正确的。<br />
<span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="color:#0000ff;">但是</span>，如果我在grunt交互模式下，把 test.pig 的内容粘贴进去执行，是得不到任何输出结果的：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">grunt&gt;&nbsp;A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;data.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;<span class="hljs-built_in" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">int</span>,&nbsp;col2:&nbsp;chararray,&nbsp;col3:&nbsp;chararray);
grunt&gt;&nbsp;B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;(col2&nbsp;==&nbsp;&#39;北京市&#39;);
grunt&gt;&nbsp;DUMP&nbsp;B;
</code></pre>
</section>
<p>具体原因我不清楚，但是至少有一点是肯定的：可以使用中文作为FILTER的条件，只要不在交互模式下执行你的Pig脚本即可。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何统计 tuple 中的 field 数，bag 中的 tuple 数，map 中的 key/value 组数<br />
一句话：用Pig内建的 <span style="color:#0000ff;">SIZE</span> 函数：</p>
<blockquote>
<p>
		Computes the number of elements based on any Pig data type.</p>
</blockquote>
<p>具体可看<a href="http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#SIZE" rel="noopener noreferrer" target="_blank">这个</a>链接。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;一个字符串为null，与它为空不一定等价<br />
在某些情况下，要获取&ldquo;不为空&rdquo;的字符串，仅仅用 is not null 来判断是不够的，还应该加上 SIZE(field_name) &gt; 0 的条件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">B&nbsp;=&nbsp;FILTER&nbsp;A&nbsp;BY&nbsp;(field_name&nbsp;is&nbsp;not&nbsp;null&nbsp;AND&nbsp;(SIZE(field_name)&nbsp;&gt;&nbsp;0L));
</code></pre>
</section>
<p>注意，这只是在某些情况下需要这样做，在一般情况下，仅用 is not null 来过滤就可以了。我并没有总结出特殊情况是哪些情况，我只能说我我不是第一次遇到此情况了，所以才有了这一个结论。<br />
注意上面使用的是&ldquo;0<span style="color:#ff0000;">L</span>&rdquo;，因为SIZE()返回的是long类型，如果不加L，在Pig0.10下会出现一个警告，例如：</p>
<blockquote>
<p>
		[main] WARN &nbsp;org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_LONG 1 time(s)</p>
</blockquote>
<p>
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;Pig中的各operator（操作符），哪些会触发reduce过程<br />
<span style="color:#ff0000;">①</span>GROUP：由于GROUP操作会将所有具有相同key的记录收集到一起，所以数据如果正在map中处理的话，就会触发shuffle&rarr;reduce的过程。<br />
<span style="color: rgb(255, 0, 0); ">②</span>ORDER：由于需要将所有相等的记录收集到一起（才能排序），所以ORDER会触发reduce过程。同时，除了你写的那个Pig job之外，Pig还会添加一个额外的M-R job到你的数据流程中，因为Pig需要对你的数据集做采样，以确定数据的分布情况，从而解决数据分布严重不均的情况下job效率过于低下的问题。<br />
<span style="color: rgb(255, 0, 0); ">③</span>DISTINCT：由于需要将记录收集到一起，才能确定它们是不是重复的，因此DISTINCT会触发reduce过程。当然，DISTINCT也会利用combiner在map阶段就把重复的记录移除。<br />
<span style="color: rgb(255, 0, 0); ">④</span>JOIN：JOIN用于求重合，由于求重合的时候，需要将具有相同key的记录收集到一起，因此，JOIN会触发reduce过程。<br />
<span style="color: rgb(255, 0, 0); ">⑤</span>LIMIT：由于需要将记录收集到一起，才能统计出它返回的条数，因此，LIMIT会触发reduce过程。<br />
<span style="color: rgb(255, 0, 0); ">⑥</span>COGROUP：与GROUP类似（参看本文前面的部分），因此它会触发reduce过程。<br />
<font color="#ff0000">⑦</font>CROSS：计算两个或多个关系的叉积。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;如何统计一个字符串中包含的指定字符数<br />
这可以不算是个Pig的问题了，你可以把它认为是一个shell的问题。从本文前面部分我们已经知道，Pig中可以用 STREAM ... THROUGH 来调用shell进行辅助数据处理，所以在这我们也能这样干。<br />
假设有文本文件：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">[root@localhost&nbsp;~]$&nbsp;cat&nbsp;1.txt&nbsp;
123&nbsp;&nbsp;&nbsp;&nbsp;abcdef:243789174
456&nbsp;&nbsp;&nbsp;&nbsp;DFJKSDFJ:3646:555558888
789&nbsp;&nbsp;&nbsp;&nbsp;yKDSF:00000%0999:2343324:11111:33333
</code></pre>
</section>
<p>现在要统计：每一行中，第二列里所包含的冒号（&ldquo;:&rdquo;）分别为多少？代码如下：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;1.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray);
B&nbsp;=&nbsp;STREAM&nbsp;A&nbsp;THROUGH&nbsp;`awk&nbsp;-F&quot;:&quot;&nbsp;&#39;{print&nbsp;NF-1}&#39;`&nbsp;AS&nbsp;(colon_count:&nbsp;int);
DUMP&nbsp;B;
</code></pre>
</section>
<p>结果为：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">(1)
(2)
(4)
</code></pre>
</section>
<p><span style="color: rgb(255, 255, 255); ">文章来源：</span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255); ">http://www.codelast.com/</span></a><br />
<span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;UDF是区分大小写的<br />
因为UDF是由Java类来实现的，所以区分大小写，就这么简单。</p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;<span style="font-family: 文泉驿等宽微米黑;">设置Pig job的job name<br />
在Pig脚本开头加上一句：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">set</span>&nbsp;job.name&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;My-Job-Name&#39;</span>;
</code></pre>
</section>
<p><span style="font-family: 文泉驿等宽微米黑;">那么，执行该Pig脚本之后，在Hadoop的Job Tracker中看到的&ldquo;Name&rdquo;就是&ldquo;My-Job-Name&rdquo;了。<br />
如果不设置，显示的name是类似于&ldquo;</span>Job6245768625829738970.jar<span style="font-family: 文泉驿等宽微米黑;">&rdquo;这样的东西，job多的时候完全没有标识度，建议一定要设置一个特殊的job name。</span></p>
<p><span style="background-color: rgb(0, 255, 0);">➤</span>&nbsp;<span style="font-family: 文泉驿等宽微米黑;">把纯文本转化为JSON<br />
假设输入文件 a.txt 内容为：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">1&nbsp;&nbsp;&nbsp;&nbsp;2
9&nbsp;&nbsp;&nbsp;&nbsp;8
</code></pre>
</section>
<p><span style="font-family: 文泉驿等宽微米黑;">则如下Pig代码将把它转化为JSON格式：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">A&nbsp;=&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">LOAD</span>&nbsp;<span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#39;a.txt&#39;</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">AS</span>&nbsp;(col1:&nbsp;chararray,&nbsp;col2:&nbsp;chararray);
B&nbsp;=&nbsp;STORE&nbsp;A&nbsp;INTO&nbsp;&#39;result&#39;&nbsp;USING&nbsp;JsonStorage();
</code></pre>
</section>
<p><span style="font-family: 文泉驿等宽微米黑;">查看输出文件的内容是：</span></p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="sql language-sql hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">{&quot;col1&quot;:&quot;1&quot;,&quot;col2&quot;:&quot;2&quot;}
{&quot;col1&quot;:&quot;9&quot;,&quot;col2&quot;:&quot;8&quot;}
</code></pre>
</section>
<p><span style="font-family: 文泉驿等宽微米黑;">可见，你LOAD输入数据时定义的字段名，就是输出文件中的JSON字段名。</span></p>
<p>
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：<br />
<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="color: rgb(77, 77, 77); font-size: 13px; width: 200px; height: 200px;" /><br />
以及我的微信视频号：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="" src="https://www.codelast.com/wechat_shipinhao_qr_code.jpg" style="text-align: center; width: 200px; height: 199px;" /></p>
<div id="KSFIND_MASK" style="background-color: rgb(0, 0, 0); opacity: 0.22; position: absolute !important; left: 0px !important; top: 0px !important; border: 0px none !important; padding: 0px !important; z-index: 1000000 !important; height: 0px; width: 0px; display: none; cursor: auto; ">
	&nbsp;</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9bpig%e4%b8%ad%e7%9a%84%e4%b8%80%e4%ba%9b%e5%9f%ba%e7%a1%80%e6%a6%82%e5%bf%b5%e6%80%bb%e7%bb%93/feed/</wfw:commentRss>
			<slash:comments>34</slash:comments>
		
		
			</item>
	</channel>
</rss>
