<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Merge Small Files &#8211; 编码无悔 /  Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/%E5%90%88%E5%B9%B6%E5%B0%8F%E6%96%87%E4%BB%B6/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>The Path of Optimization</description>
	<lastBuildDate>Mon, 27 Apr 2020 17:57:02 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>[Original] Merging small plain-text input files for a Java Hadoop job</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-java-hadoop-job%e5%90%88%e5%b9%b6%e8%be%93%e5%85%a5%e7%9a%84%e5%b0%8f%e7%ba%af%e6%96%87%e6%9c%ac%e6%96%87%e4%bb%b6/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-java-hadoop-job%e5%90%88%e5%b9%b6%e8%be%93%e5%85%a5%e7%9a%84%e5%b0%8f%e7%ba%af%e6%96%87%e6%9c%ac%e6%96%87%e4%bb%b6/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 22 Jun 2017 02:44:30 +0000</pubDate>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Original]]></category>
		<category><![CDATA[CombineTextInputFormat]]></category>
		<category><![CDATA[Hadoop job]]></category>
		<category><![CDATA[merge small files]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=9376</guid>

					<description><![CDATA[<p>
Suppose the input to your Java MapReduce job is a large number of plain-text files, each of them small (say a few hundred KB). Once the job starts, it will occupy a large number of mappers and consume an excessive share of the Hadoop cluster's resources. You can avoid this by combining the input files.<br />
<span id="more-9376"></span><br />
Hadoop supports this out of the box. The approach is simple; the code looks roughly like this:</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &#34;Helvetica Neue&#34;, Helvetica, &#34;Hiragino Sans GB&#34;, &#34;Microsoft YaHei&#34;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">Job job = Job.getInstance(getConf());
job.setJarByClass(getClass());
job.setInputFormatClass(CombineTextInputFormat.class);
FileInputFormat.addInputPaths(job, <span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&#34;your_input_path&#34;</span>);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
</code></pre>
</section>
<p>The key is to use CombineTextInputFormat.<br />
Then, in the shell script that invokes this Java program, pass it the parameter:</p>
<blockquote>
<p>
		-D mapreduce.input.fileinputformat.split.maxsize=268435456</p>
</blockquote>
<p>This caps each split at 268435456 bytes (i.e. 256 MB, since 256*1024*1024 = 268435456). If you omit it, the job generally will not work: the framework merges all of the input into a single split, and with a very large input the job simply cannot make progress.</p>
<p><span style="color: rgb(255, 0, 0);">➤➤</span> Copyright notice <span style="color: rgb(255, 0, 0);">➤➤</span> <br />
Reprints must credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u> <br />
Thanks for following my WeChat official account (scan the QR code in WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" />&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-java-hadoop-job%e5%90%88%e5%b9%b6%e8%be%93%e5%85%a5%e7%9a%84%e5%b0%8f%e7%ba%af%e6%96%87%e6%9c%ac%e6%96%87%e4%bb%b6/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
Suppose the input to your Java MapReduce job is a large number of plain-text files, each of them small (say a few hundred KB). Once the job starts, it will occupy a large number of mappers and consume an excessive share of the Hadoop cluster's resources. You can avoid this by combining the input files.<br />
<span id="more-9376"></span><br />
Hadoop supports this out of the box. The approach is simple; the code looks roughly like this:</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="java language-java hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;">Job job = Job.getInstance(getConf());
job.setJarByClass(getClass());
job.setInputFormatClass(CombineTextInputFormat.class);
FileInputFormat.addInputPaths(job, <span class="hljs-string" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(238, 220, 112); word-wrap: inherit !important; word-break: inherit !important;">&quot;your_input_path&quot;</span>);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
</code></pre>
</section>
<p>The key is to use CombineTextInputFormat.<br />
Then, in the shell script that invokes this Java program, pass it the parameter:</p>
<blockquote>
<p>
		-D mapreduce.input.fileinputformat.split.maxsize=268435456</p>
</blockquote>
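<p>For context, the launch script might look like the sketch below. The jar name, driver class, and paths are placeholders, not names from this article. Note that <code>-D</code> generic options are consumed by Hadoop's GenericOptionsParser, which requires the driver to run through ToolRunner (consistent with the <code>getConf()</code> call in the code above), and they must come before the program's own arguments:</p>

```shell
# Hypothetical launch script: the jar, driver class, and HDFS paths
# below are placeholders -- substitute your own.
# The -D generic option is parsed by GenericOptionsParser (via ToolRunner)
# and must precede the job's own arguments.
hadoop jar your-job.jar com.example.YourDriver \
    -D mapreduce.input.fileinputformat.split.maxsize=268435456 \
    /your_input_path /your_output_path
```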
<p>This caps each split at 268435456 bytes (i.e. 256 MB, since 256*1024*1024 = 268435456). If you omit it, the job generally will not work: the framework merges all of the input into a single split, and with a very large input the job simply cannot make progress.</p>
<p><span style="color: rgb(255, 0, 0);">➤➤</span> Copyright notice <span style="color: rgb(255, 0, 0);">➤➤</span> <br />
Reprints must credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u> <br />
Thanks for following my WeChat official account (scan the QR code in WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-java-hadoop-job%e5%90%88%e5%b9%b6%e8%be%93%e5%85%a5%e7%9a%84%e5%b0%8f%e7%ba%af%e6%96%87%e6%9c%ac%e6%96%87%e4%bb%b6/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
