<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Reinforcement Learning &#8211; 编码无悔 / Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/reinforcement-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>最优化之路</description>
	<lastBuildDate>Mon, 27 Apr 2020 17:23:35 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>[原创] 总有一天，失业不再遥远</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e6%80%bb%e6%9c%89%e4%b8%80%e5%a4%a9%ef%bc%8c%e5%a4%b1%e4%b8%9a%e4%b8%8d%e5%86%8d%e9%81%a5%e8%bf%9c/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e6%80%bb%e6%9c%89%e4%b8%80%e5%a4%a9%ef%bc%8c%e5%a4%b1%e4%b8%9a%e4%b8%8d%e5%86%8d%e9%81%a5%e8%bf%9c/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sat, 25 Apr 2020 18:14:12 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[RL]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11992</guid>

					<description><![CDATA[<p></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="text-align: center;margin: 10px 0%;">
<section style="text-align:left;width:35px;height:35px;margin-left:20px;background-color: rgb(209,95,238);"></section>
<p></p>
<section style="margin-top: -1.5em; border-style: solid; border-width: 1px; border-color: rgb(178, 58, 238); padding: 8px; box-sizing: border-box;">
<section style="border-style: solid; border-width: 3px; border-color: rgb(178, 58, 238); padding: 15px; box-sizing: border-box;">
<p class="brush active" style="letter-spacing: 2px; font-size: 26px; color: rgb(178, 58, 238); min-width: 1px; text-align: left;">
	<span style="font-size:16px;">尽管人类离[通用人工智能]还有很长的路要走，但越来越多新技术的出现，正让我们在这条路上不断加速。</span></p>
</section>
</section>
<section style="width:35px;height:35px;margin-left:auto;margin-top: -1.2em;margin-right:20px;background-color: rgb(209,95,238);"></section>
</section>
</section>
<p></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="margin-right: auto; margin-left: auto; display: flex; justify-content: center; align-items: center;">
<section style="display: flex;flex-direction: row;justify-content: center;align-items: center;">
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
<section style="margin-right: 10px; margin-left: 10px;text-align: center;">
<p class="title active" style=" color: rgb(51, 51, 51); letter-spacing: 1.5px; line-height: 1.75;min-width:1px;">
	What？强化学习设计芯片？
</section>
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
</section>
</section>
</section>
<p>就这几天的事：Google已经开始用强化学习技术来设计芯片了！<br />
如果说用强化学习来玩游戏、下围棋，甚至用来帮助提升互联网广告的点击率、收入，都不是什么新鲜事的话，那么用强化学习来设计芯片，可就太新鲜了吧？但Google就做到了<span style="color:#0000ff;"><sup>[1]</sup></span>：</p>
<blockquote>
<p>
		我们提出了一种基于学习的芯片布局方法，这是芯片设计过程中最复杂、最耗时的阶段之一。与之前的方法不同，我们的方法能够从过去的经验中学习并随着时间的推移而改进。特别是，随着我们在更多的芯片块上进行训练，我们的方法能更快地为以前未见过的芯片块生成优化布局。为了实现这些结果，我们将芯片布局作为一个强化学习（RL）问题，并训练一个Agent将芯片网表的节点放置到芯片画布上。为了使我们的RL策略能够泛化到未见过的芯片块，我们将表征学习建立在预测布局质量的有监督任务之上。通过设计一个能够准确预测各种网表及其布局质量的神经架构，我们能够为输入网表生成丰富的特征嵌入。然后，我们使用这个架构作为策略网络和价值网络的编码器来实现迁移学习。我们的目标是最小化PPA（功率、性能和面积）。我们表明，在6个小时内，我们的方法可以在现代加速器网表上生成超越人类或与之相媲美的芯片布局，而现有的基线方法则需要人类专家参与其中，并花费数周时间。</p>
</blockquote>
<p>硬件工程师为之虎躯一颤。<br />
<span id="more-11992"></span><br />
这是我今年看到的第二个跟我多少有点关系，并且又让我马上喊出一句&#8220;卧槽&#8221;的技术应用了。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
在机器学习领域，强化学习可能是目前人类发明的、最接近人类成长过程的机器学习范式了。从婴儿的咿呀学步，到掌握海量知识，人都是在不断接受外界反馈中对自我行为做出修正，而强化学习正是模仿了这一过程。<br />
目前科学家们正在不断拓展强化学习的应用边界，从一开始的相对简单领域，到越来越复杂的工作，都尝试用强化学习来完成。<br />
事实上，在现实世界中，真正大规模的、普通人摸得着看得见的强化学习应用，当属游戏领域的AI玩家。但考虑到游戏受众占总人口的比例很小，客观地说，强化学习并没有像人脸识别、语音识别等机器学习技术一样渗透到民生的方方面面。不过，由于强化学习可预见的潜力很大，我们有理由相信，它会在很多领域代替人类的工作，而这些工作不是低水平的重复劳动，而是需要较高知识储备才能胜任的。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="margin-right: auto; margin-left: auto; display: flex; justify-content: center; align-items: center;">
<section style="display: flex;flex-direction: row;justify-content: center;align-items: center;">
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
<section style="margin-right: 10px; margin-left: 10px;text-align: center;">
<p class="title active" style=" color: rgb(51, 51, 51); letter-spacing: 1.5px; line-height: 1.75;min-width:1px;">
	米娜？你真的可以无障碍聊天？</p>
</section>
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
</section>
</section>
</section>
<p>还是Google，在今年1月的时候发布了一个聊天机器人：Meena<span style="color:#0000ff;"><sup>[2]</sup></span>（&#8220;米娜&#8221;？）。当然，说是发布，其实并没有公开地提供这个服务，也没有App提供下载，Google只是发了篇论文说他们达到了什么样的技术成果。<br />
这个Meena有多牛呢？<br />
举个大家生活中随处可见的例子：无论你是在京东淘宝上购物的时候在线咨询，还是在打各种客服电话的时候接线的是个&#8220;机器人&#8221;，可能都会很容易遇到这样一种情况：只要问题问得不是那么直接，那些&#8220;聊天机器人&#8221;就不知道怎么回答了。<br />
再比如，我家里有一个&#8220;小爱同学&#8221;（小米的智能音箱），我问她&#8220;明天的天气怎么样&#8221;，她能完美回答我；但如果我用和人类随意聊天的方式来和她对话，她马上就会进入懵逼状态：&#8220;哎呀，你说的这个问题小爱不懂&#8221;。<br />
理想和现实的差距，就是人类和市面上所有聊天机器人的差距。<br />
而Google的Meena是一个&#8220;<span style="color:#0000ff;">开放领域聊天机器人</span>&#8221;。开放领域聊天机器人并不局限于某个特定领域，而是能够和用户聊近乎所有的话题&#8212;&#8212;这不正是人类的正常表现嘛。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
<blockquote>
<p>
		Meena是一个有着26亿参数的端到端神经对话模型，也就是GPT-2模型最大版本（15 亿参数）的1.7倍。通过实验可以看到，Meena 比现有的 SOTA 聊天机器人能够更好地完成对话，对话内容显得更为具体、清楚。</p>
</blockquote>
<p>Google也给出了一些实例，用来说明Meena与人类的对话有多自然。<br />
如果Meena真能达到真人水平，那她一定是我做梦都想拥有的一个chatbot。<br />
我现在每周都在<a href="http://cambly.com/invite/DZZZ" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255);">Cambly</span></a>上和外国人聊天练口语，当然想把这笔钱省下来。我也曾找过英语的chatbot，但没有找到满意的：在语言学习方面，和人类的交流目前仍不可替代。我可以和外国人聊新冠疫情的近况、聊时事政治的发展，但如果我和一个chatbot讲这些，它可能当我是傻子（其实它才是傻子）。<br />
所以，如果有一个像Meena那样的chatbot可以和我在开放领域以人类水平用英语聊天，那我真要笑开了花！<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e6%80%bb%e6%9c%89%e4%b8%80%e5%a4%a9%ef%bc%8c%e5%a4%b1%e4%b8%9a%e4%b8%8d%e5%86%8d%e9%81%a5%e8%bf%9c/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="text-align: center;margin: 10px 0%;">
<section style="text-align:left;width:35px;height:35px;margin-left:20px;background-color: rgb(209,95,238);"></section>
<p></p>
<section style="margin-top: -1.5em; border-style: solid; border-width: 1px; border-color: rgb(178, 58, 238); padding: 8px; box-sizing: border-box;">
<section style="border-style: solid; border-width: 3px; border-color: rgb(178, 58, 238); padding: 15px; box-sizing: border-box;">
<p class="brush active" style="letter-spacing: 2px; font-size: 26px; color: rgb(178, 58, 238); min-width: 1px; text-align: left;">
	<span style="font-size:16px;">尽管人类离[通用人工智能]还有很长的路要走，但越来越多新技术的出现，正让我们在这条路上不断加速。</span></p>
</section>
</section>
<section style="width:35px;height:35px;margin-left:auto;margin-top: -1.2em;margin-right:20px;background-color: rgb(209,95,238);"></section>
</section>
</section>
<p></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="margin-right: auto; margin-left: auto; display: flex; justify-content: center; align-items: center;">
<section style="display: flex;flex-direction: row;justify-content: center;align-items: center;">
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
<section style="margin-right: 10px; margin-left: 10px;text-align: center;">
<p class="title active" style=" color: rgb(51, 51, 51); letter-spacing: 1.5px; line-height: 1.75;min-width:1px;">
	What？强化学习设计芯片？
</section>
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
</section>
</section>
</section>
<p>就这几天的事：Google已经开始用强化学习技术来设计芯片了！<br />
如果说用强化学习来玩游戏、下围棋，甚至用来帮助提升互联网广告的点击率、收入，都不是什么新鲜事的话，那么用强化学习来设计芯片，可就太新鲜了吧？但Google就做到了<span style="color:#0000ff;"><sup>[1]</sup></span>：</p>
<blockquote>
<p>
		我们提出了一种基于学习的芯片布局方法，这是芯片设计过程中最复杂、最耗时的阶段之一。与之前的方法不同，我们的方法能够从过去的经验中学习并随着时间的推移而改进。特别是，随着我们在更多的芯片块上进行训练，我们的方法能更快地为以前未见过的芯片块生成优化布局。为了实现这些结果，我们将芯片布局作为一个强化学习（RL）问题，并训练一个Agent将芯片网表的节点放置到芯片画布上。为了使我们的RL策略能够泛化到未见过的芯片块，我们将表征学习建立在预测布局质量的有监督任务之上。通过设计一个能够准确预测各种网表及其布局质量的神经架构，我们能够为输入网表生成丰富的特征嵌入。然后，我们使用这个架构作为策略网络和价值网络的编码器来实现迁移学习。我们的目标是最小化PPA（功率、性能和面积）。我们表明，在6个小时内，我们的方法可以在现代加速器网表上生成超越人类或与之相媲美的芯片布局，而现有的基线方法则需要人类专家参与其中，并花费数周时间。</p>
</blockquote>
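为了更直观地理解"把芯片布局当作强化学习问题"这句话，下面给出一个极简的示意性草图（注意：这并不是Google论文的方法，网表、网格大小和奖励函数都是为演示而假设的）：把网表节点随机放到网格画布上，以负的总线长作为奖励，从多次尝试中挑出奖励最高的布局。真实系统中，这里的随机策略会被一个可学习、可迁移的策略网络替代。

```python
import random

def toy_place(num_nodes, grid, seed=0):
    """玩具版"布局动作"：用随机策略把每个节点放到 grid x grid 画布上互不相同的格子里。"""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(grid) for y in range(grid)]
    return dict(zip(range(num_nodes), rng.sample(cells, num_nodes)))

def wirelength(pos, edges):
    """网表各条边的曼哈顿线长之和；奖励取其负值，线长越短奖励越高。"""
    return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
               for a, b in edges)

# 假设的4节点链式网表；尝试32个随机布局，保留线长最短（奖励最高）的那个
edges = [(0, 1), (1, 2), (2, 3)]
best = min((toy_place(4, 8, seed=s) for s in range(32)),
           key=lambda p: wirelength(p, edges))
```

这个草图只体现了"动作-奖励"这层抽象；论文的核心难点（泛化到未见过的网表、用监督任务学出网表嵌入）在此完全省略。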
<p>硬件工程师为之虎躯一颤。<br />
<span id="more-11992"></span><br />
这是我今年看到的第二个跟我多少有点关系，并且又让我马上喊出一句&ldquo;卧槽&rdquo;的技术应用了。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
在机器学习领域，强化学习可能是目前人类发明的、最接近人类成长过程的机器学习范式了。从婴儿的咿呀学步，到掌握海量知识，人都是在不断接受外界反馈中对自我行为做出修正，而强化学习正是模仿了这一过程。<br />
目前科学家们正在不断拓展强化学习的应用边界，从一开始的相对简单领域，到越来越复杂的工作，都尝试用强化学习来完成。<br />
事实上，在现实世界中，真正大规模的、普通人摸得着看得见的强化学习应用，当属游戏领域的AI玩家。但考虑到游戏受众占总人口的比例很小，客观地说，强化学习并没有像人脸识别、语音识别等机器学习技术一样渗透到民生的方方面面。不过，由于强化学习可预见的潜力很大，我们有理由相信，它会在很多领域代替人类的工作，而这些工作不是低水平的重复劳动，而是需要较高知识储备才能胜任的。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="margin-right: auto; margin-left: auto; display: flex; justify-content: center; align-items: center;">
<section style="display: flex;flex-direction: row;justify-content: center;align-items: center;">
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
<section style="margin-right: 10px; margin-left: 10px;text-align: center;">
<p class="title active" style=" color: rgb(51, 51, 51); letter-spacing: 1.5px; line-height: 1.75;min-width:1px;">
	米娜？你真的可以无障碍聊天？</p>
</section>
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
</section>
</section>
</section>
<p>还是Google，在今年1月的时候发布了一个聊天机器人：Meena<span style="color:#0000ff;"><sup>[2]</sup></span>（&ldquo;米娜&rdquo;？）。当然，说是发布，其实并没有公开地提供这个服务，也没有App提供下载，Google只是发了篇论文说他们达到了什么样的技术成果。<br />
这个Meena有多牛呢？<br />
举个大家生活中随处可见的例子：无论你是在京东淘宝上购物的时候在线咨询，还是在打各种客服电话的时候接线的是个&ldquo;机器人&rdquo;，可能都会很容易遇到这样一种情况：只要问题问得不是那么直接，那些&ldquo;聊天机器人&rdquo;就不知道怎么回答了。<br />
再比如，我家里有一个&ldquo;小爱同学&rdquo;（小米的智能音箱），我问她&ldquo;明天的天气怎么样&rdquo;，她能完美回答我；但如果我用和人类随意聊天的方式来和她对话，她马上就会进入懵逼状态：&ldquo;哎呀，你说的这个问题小爱不懂&rdquo;。<br />
理想和现实的差距，就是人类和市面上所有聊天机器人的差距。<br />
而Google的Meena是一个&ldquo;<span style="color:#0000ff;">开放领域聊天机器人</span>&rdquo;。开放领域聊天机器人并不局限于某个特定领域，而是能够和用户聊近乎所有的话题&mdash;&mdash;这不正是人类的正常表现嘛。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
<blockquote>
<p>
		Meena是一个有着26亿参数的端到端神经对话模型，也就是GPT-2模型最大版本（15 亿参数）的1.7倍。通过实验可以看到，Meena 比现有的 SOTA 聊天机器人能够更好地完成对话，对话内容显得更为具体、清楚。</p>
</blockquote>
<p>Google也给出了一些实例，用来说明Meena与人类的对话有多自然。<br />
如果Meena真能达到真人水平，那她一定是我做梦都想拥有的一个chatbot。<br />
我现在每周都在<a href="http://cambly.com/invite/DZZZ" rel="noopener noreferrer" target="_blank"><span style="color: rgb(0, 0, 255);">Cambly</span></a>上和外国人聊天练口语，当然想把这笔钱省下来。我也曾找过英语的chatbot，但没有找到满意的：在语言学习方面，和人类的交流目前仍不可替代。我可以和外国人聊新冠疫情的近况、聊时事政治的发展，但如果我和一个chatbot讲这些，它可能当我是傻子（其实它才是傻子）。<br />
所以，如果有一个像Meena那样的chatbot可以和我在开放领域以人类水平用英语聊天，那我真要笑开了花！<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="margin-right: auto; margin-left: auto; display: flex; justify-content: center; align-items: center;">
<section style="display: flex;flex-direction: row;justify-content: center;align-items: center;">
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
<section style="margin-right: 10px; margin-left: 10px;text-align: center;">
<p class="title active" style=" color: rgb(51, 51, 51); letter-spacing: 1.5px; line-height: 1.75;min-width:1px;">
	有生之年的期盼</p>
</section>
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
</section>
</section>
</section>
<p>在我有生之年，我一定会看到很多本来&ldquo;不可替代&rdquo;的人，因为技术的发展而失业，这当中或许就包括我这样的工程师。技术的目标之一就是降低成本，我也相信在未来几十年，AI在语言学习上一定可以代替人类，和学生进行几乎无障碍的交流对话。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
<section class="RankEditor" data-opacity="1" data-rotate="0" data-width="100%" style="width: 100%; margin: 0px auto; opacity: 1;transform: rotateZ(0deg);-webkit-transform: rotateZ(0deg);-moz-transform: rotateZ(0deg);-ms-transform: rotateZ(0deg);-o-transform: rotateZ(0deg);">
<section style="margin-right: auto; margin-left: auto; display: flex; justify-content: center; align-items: center;">
<section style="display: flex;flex-direction: row;justify-content: center;align-items: center;">
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
<section style="margin-right: 10px; margin-left: 10px;text-align: center;">
<p class="title active" style=" color: rgb(51, 51, 51); letter-spacing: 1.5px; line-height: 1.75;min-width:1px;">
	链接</p>
</section>
<section style="display: flex;flex-direction: column;justify-content: center;align-items: center;">
<section style="width: 43px; height: 3px; background: rgb(255, 0, 0); border-radius: 2px; flex-shrink: 0; box-sizing: border-box;"></section>
<section style="width: 43px; height: 3px; background: rgb(255, 211, 155); border-radius: 2px; flex-shrink: 0; margin-top: 2px; box-sizing: border-box;"></section>
</section>
</section>
</section>
</section>
<p> [1]&nbsp;<a href="https://ai.googleblog.com/2020/04/chip-design-with-deep-reinforcement.html" rel="noopener noreferrer" target="_blank">https://ai.googleblog.com/2020/04/chip-design-with-deep-reinforcement.html</a><br />
[2]&nbsp;<a href="https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html" rel="noopener noreferrer" target="_blank">https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html</a></p>
<p><span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e6%80%bb%e6%9c%89%e4%b8%80%e5%a4%a9%ef%bc%8c%e5%a4%b1%e4%b8%9a%e4%b8%8d%e5%86%8d%e9%81%a5%e8%bf%9c/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 强化学习框架 rlpyt 源码分析：(10) 基于CPU的并行采样器CpuSampler，worker的实现</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a10-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a10-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Tue, 21 Jan 2020 05:15:53 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[并行]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11674</guid>

					<description><![CDATA[<p>
查看关于 rlpyt&#160;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&#160;本文是<a href="https://www.codelast.com/?p=11613" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">上一篇</span></a>文章的续文，继续分析CpuSampler的源码。<br />
本文将分析 CPU并行模式下的 ParallelSamplerBase 类的worker实现。</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;worker的代码在哪<br />
<span style="color:#0000ff;">rlpyt/samplers/parallel/worker.py</span><br />
<span id="more-11674"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;worker是做什么用的<br />
用于采样agent与environment交互得到的数据。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;代码分析<br />
我直接在代码里加了大量注释：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">initialize_worker</span>(rank<span style="color:#cc7832;">, </span>seed=<span style="color:#cc7832;">None, </span>cpu=<span style="color:#cc7832;">None, </span>torch_threads=<span style="color:#cc7832;">None</span>):
    <span style="color:#629755;font-style:italic;">&#34;&#34;&#34;
</span><span style="color:#629755;font-style:italic;">    </span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">初始化采样用的</span><span style="color:#629755;font-style:italic;">worker</span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">。
</span>
<span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">    </span><span style="color:#629755;font-weight:bold;font-style:italic;">:param</span><span style="color:#629755;font-style:italic;"> rank: </span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">采样进程的标识序号。
</span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">    </span><span style="color:#629755;font-weight:bold;font-style:italic;">:param</span><span style="color:#629755;font-style:italic;"> seed: </span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">种子，一个整数值。
</span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">    </span><span style="color:#629755;font-weight:bold;font-style:italic;">:param</span><span style="color:#629755;font-style:italic;"> cpu: CPU</span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">序号，例如</span><span style="color:#629755;font-style:italic;"> 0, 1, 2 </span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">等等。
</span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">    </span><span style="color:#629755;font-weight:bold;font-style:italic;">:param</span><span style="color:#629755;font-style:italic;"> torch_threads: CPU</span><span style="color:#629755;font-style:italic;font-family:'AR PL UKai CN';">并发执行的线程数。</span>
<span style="color:#629755;font-style:italic;">    &#34;&#34;&#34;
</span><span style="color:#629755;font-style:italic;">    </span>log_str = <span style="color:#6a8759;">f&#34;Sampler rank </span><span style="color:#cc7832;">{</span>rank<span style="color:#cc7832;">}</span><span style="color:#6a8759;"> initialized&#34;
</span><span style="color:#6a8759;">    </span>cpu = [cpu] <span style="color:#cc7832;">if </span><span style="color:#8888c6;">isinstance</span>(cpu<span style="color:#cc7832;">, </span><span style="color:#8888c6;">int</span>) <span style="color:#cc7832;">else </span>cpu
    p = psutil.Process()</pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a10-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
查看关于 rlpyt&nbsp;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&nbsp;本文是<a href="https://www.codelast.com/?p=11613" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">上一篇</span></a>文章的续文，继续分析CpuSampler的源码。<br />
本文将分析 CPU并行模式下的 ParallelSamplerBase 类的worker实现。</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;worker的代码在哪<br />
<span style="color:#0000ff;">rlpyt/samplers/parallel/worker.py</span><br />
<span id="more-11674"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;worker是做什么用的<br />
用于采样agent与environment交互得到的数据。<br />
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;代码分析<br />
我直接在代码里加了大量注释：</p>
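在逐行注释之前，先用一个可独立运行的小草图演示 initialize_worker 中的两个关键动作：把单个CPU序号规范成列表，以及把当前进程绑定到指定的CPU核上。rlpyt 原代码用的是 psutil.Process().cpu_affinity()；为了不依赖第三方库，这里用标准库的 os.sched_setaffinity 作示意性替代（仅Linux提供该接口，其他平台直接跳过，这与原代码对MacOS的容错思路一致）。

```python
import os

def normalize_cpu(cpu):
    """与 initialize_worker 相同的规范化：单个 int 包装成单元素列表，其余原样返回。"""
    return [cpu] if isinstance(cpu, int) else cpu

def pin_to_cpus(cpus):
    """示意：在 Linux 上把当前进程绑定到给定核，返回绑定后的亲和性；
    平台不支持或未指定核时返回 "UNAVAILABLE"（对应原代码的降级分支）。"""
    if cpus is None or not hasattr(os, "sched_setaffinity"):
        return "UNAVAILABLE"
    os.sched_setaffinity(0, cpus)          # 0 表示当前进程
    return sorted(os.sched_getaffinity(0)) # 读回实际生效的核集合
```

例如 normalize_cpu(3) 得到 [3]，而 pin_to_cpus([0]) 在 Linux 上会把进程钉在0号核。绑核的目的与原文一致：让每个采样worker独占固定的核，避免进程间互相抢占。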
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
def initialize_worker(rank, seed=None, cpu=None, torch_threads=None):
    """
    Initialize a sampling worker.

    :param rank: identifying index of this sampling process.
    :param seed: random seed, an integer.
    :param cpu: CPU index, e.g. 0, 1, 2, etc.
    :param torch_threads: number of threads torch may use on the CPU.
    """
    log_str = f"Sampler rank {rank} initialized"
    cpu = [cpu] if isinstance(cpu, int) else cpu
    p = psutil.Process()
    try:
        if cpu is not None:
            p.cpu_affinity(cpu)  # Set CPU affinity (unsupported on MacOS).
        cpu_affin = p.cpu_affinity()
    except AttributeError:
        cpu_affin = "UNAVAILABLE MacOS"
    log_str += f", CPU affinity {cpu_affin}"
    torch_threads = (1 if torch_threads is None and cpu is not None else
        torch_threads)  # Default to 1 to avoid possible MKL hang.
    if torch_threads is not None:
        torch.set_num_threads(torch_threads)  # Set the number of torch CPU threads.
    log_str += f", Torch threads {torch.get_num_threads()}"
    if seed is not None:
        set_seed(seed)
        time.sleep(0.3)  # (so the printing from set_seed is not intermixed)
        log_str += f", Seed {seed}"
    logger.log(log_str)


def sampling_process(common_kwargs, worker_kwargs):
    """
    Arguments fed from the Sampler class in master process.

    The function run by each sampling process.

    :param common_kwargs: arguments common to all workers.
    :param worker_kwargs: arguments that may differ between workers.
    """
    c, w = AttrDict(**common_kwargs), AttrDict(**worker_kwargs)
    initialize_worker(w.rank, w.seed, w.cpus, c.torch_threads)
    # Build the environment instances and the collector used for training.
    envs = [c.EnvCls(**c.env_kwargs) for _ in range(w.n_envs)]
    collector = c.CollectorCls(
        rank=w.rank,
        envs=envs,
        samples_np=w.samples_np,
        batch_T=c.batch_T,
        TrajInfoCls=c.TrajInfoCls,
        agent=c.get("agent", None),  # Optional depending on parallel setup.
        sync=w.get("sync", None),
        step_buffer_np=w.get("step_buffer_np", None),
        global_B=c.get("global_B", 1),
        env_ranks=w.get("env_ranks", None),
    )
    agent_inputs, traj_infos = collector.start_envs(c.max_decorrelation_steps)  # Collects (samples) the first batch of data.
    collector.start_agent()  # Collector initialization.

    # Build the environment instances and the collector used for evaluation.
    if c.get("eval_n_envs", 0) &gt; 0:
        eval_envs = [c.EnvCls(**c.eval_env_kwargs) for _ in range(c.eval_n_envs)]
        eval_collector = c.eval_CollectorCls(
            rank=w.rank,
            envs=eval_envs,
            TrajInfoCls=c.TrajInfoCls,
            traj_infos_queue=c.eval_traj_infos_queue,
            max_T=c.eval_max_T,
            agent=c.get("agent", None),
            sync=w.get("sync", None),
            step_buffer_np=w.get("eval_step_buffer_np", None),
        )
    else:
        eval_envs = list()

    ctrl = c.ctrl  # Controller that keeps the concurrently running worker processes coordinated.
    ctrl.barrier_out.wait()  # One wait() per worker, plus one in ParallelSamplerBase.initialize(): exactly n_worker + 1 in total.
    while True:
        collector.reset_if_needed(agent_inputs)  # Outside barrier?
        ctrl.barrier_in.wait()
        if ctrl.quit.value:  # Once the master process sets this to True, every worker process quits sampling.
            break
        if ctrl.do_eval.value:  # Evaluation data is collected only when the master's evaluate_agent() sets this to True.
            eval_collector.collect_evaluation(ctrl.itr.value)  # Traj_infos to queue inside.
        else:  # Not doing evaluation.
            agent_inputs, traj_infos, completed_infos = collector.collect_batch(
                agent_inputs, traj_infos, ctrl.itr.value)
            for info in completed_infos:
                c.traj_infos_queue.put(info)  # Push this worker's statistics onto the queue shared by all workers.
        ctrl.barrier_out.wait()

    # Clean up the environments.
    for env in envs + eval_envs:
        env.close()
</pre>
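The while-True loop and the two barriers form a lock-step handshake between the master process and the workers. The sketch below reproduces that handshake with threads instead of processes, purely for illustration: the names barrier_in, barrier_out and quit mirror the ctrl fields above, and collect_batch() is replaced by a queue put.

```python
import queue
import threading

n_worker = 3
barrier_in = threading.Barrier(n_worker + 1)   # master + workers enter an iteration
barrier_out = threading.Barrier(n_worker + 1)  # master + workers leave an iteration
quit_flag = {"value": False}
results = queue.Queue()

def worker(rank):
    barrier_out.wait()  # matches the initial wait() after setup, as in sampling_process()
    while True:
        barrier_in.wait()
        if quit_flag["value"]:
            break
        results.put((rank, "batch"))  # stand-in for collector.collect_batch()
        barrier_out.wait()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_worker)]
for t in threads:
    t.start()
barrier_out.wait()  # the master's wait() in ParallelSamplerBase.initialize()

for itr in range(2):      # two sampling iterations driven by the master
    barrier_in.wait()     # release the workers into collect_batch()
    barrier_out.wait()    # block until every worker has finished its batch

quit_flag["value"] = True
barrier_in.wait()         # wake the workers once more so they see quit and exit
for t in threads:
    t.join()
print(results.qsize())  # 6: 3 workers x 2 iterations
```

Note the count of barrier parties, n_worker + 1: the master is always one of the participants, which is exactly the n_worker + 1 bookkeeping mentioned in the comment on ctrl.barrier_out.wait() above.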
<p><span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
The trickiest part of the worker code is this: how does a worker get the data it samples back into the replay buffer?<br />
In the <a href="https://www.codelast.com/?p=11613" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">previous</span></a> article, we saw that ParallelSamplerBase.initialize() sets up the replay buffer:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
examples = <span style="color:#94558d;">self</span>._build_buffers(env<span style="color:#cc7832;">, </span>bootstrap_value)</pre>
<p>as well as:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">_build_buffers</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>env<span style="color:#cc7832;">, </span>bootstrap_value):
    <span style="color:#94558d;">self</span>.samples_pyt<span style="color:#cc7832;">, </span><span style="color:#94558d;">self</span>.samples_np<span style="color:#cc7832;">, </span>examples = build_samples_buffer(
        <span style="color:#94558d;">self</span>.agent<span style="color:#cc7832;">, </span>env<span style="color:#cc7832;">, </span><span style="color:#94558d;">self</span>.batch_spec<span style="color:#cc7832;">, </span>bootstrap_value<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">        </span><span style="color:#aa4926;">agent_shared</span>=<span style="color:#cc7832;">True, </span><span style="color:#aa4926;">env_shared</span>=<span style="color:#cc7832;">True, </span><span style="color:#aa4926;">subprocess</span>=<span style="color:#cc7832;">True</span>)
    <span style="color:#cc7832;">return </span>examples</pre>
<p>Here, self.samples_np is the storage object behind the replay buffer. When the worker arguments&nbsp;workers_kwargs are built, self.samples_np is split into slices, one passed to each worker:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#aa4926;">samples_np</span>=<span style="color:#94558d;">self</span>.samples_np[:<span style="color:#cc7832;">, </span>slice_B]<span style="color:#cc7832;">,</span></pre>
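This single line carries a lot of weight: basic NumPy slicing returns a view, not a copy, so whatever a worker's collector writes into its slice lands directly in the master's full buffer. A standalone sketch (shapes invented for illustration; rlpyt additionally backs the array with shared memory so this also works across processes):

```python
import numpy as np

# Master-side buffer: (T, B) = (time steps per batch, environment instances).
T, B, n_worker = 4, 6, 2
samples_np = np.zeros((T, B))

# Split the B dimension into one slice per worker, as ParallelSamplerBase does.
per = B // n_worker
worker_views = [samples_np[:, slice(w * per, (w + 1) * per)] for w in range(n_worker)]

assert worker_views[0].base is samples_np  # a view into the same memory, not a copy

# A "worker" filling its view mutates the shared buffer directly.
worker_views[1][:] = 7.0
print(samples_np[0])  # [0. 0. 0. 7. 7. 7.]
```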
<p>Inside the worker, this samples_np is handed on to the collector's constructor when the collector object is built, which is what ties the replay buffer to the collector.<br />
Finally, collector.collect_batch() writes the sampled data into samples_np, which amounts to putting it into the replay buffer.<br />
That is all for this installment; more in the next one.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
Please credit the source when reposting: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a10-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] Reinforcement learning framework rlpyt source code analysis: (9) the CPU-based parallel sampler CpuSampler</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a9-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a9-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Mon, 20 Jan 2020 09:16:20 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[并行]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11613</guid>

					<description><![CDATA[<p>
For more articles on rlpyt,&#160;see <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">this page</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it earlier.&#160;This article continues the <a href="https://www.codelast.com/?p=11441" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">previous</span></a> one and carries on analyzing the CpuSampler source code.<br />
We already know that CpuSampler has two parent classes: BaseSampler&#160;and&#160;ParallelSamplerBase. BaseSampler mostly just defines a set of interfaces, so there is little to say about it; this article therefore turns to the other parent class,&#160;ParallelSamplerBase. Its initialization function&#160;initialize() does so much important work that it deserves a long article of its own, and that is the main subject of this post.<br />
<span id="more-11613"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;What important work does the initialization function initialize()&#160;do<br />
In one sentence: initialize() computes the values of some special parameters, initializes the agent, creates the <span style="color:#0000ff;">parallel controller</span>, and creates and starts multiple worker processes.<br />
<span style="color:#ff0000;">✍</span> The &#8220;<span style="color: rgb(0, 0, 255);">parallel controller</span>&#8221; (parallel ctrl) refers to the variables needed to coordinate the parallel processes when parallelism is implemented with Python's&#160;multiprocessing module, so that they all work together correctly. These coordinating variables make up the &#8220;parallel controller&#8221;.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Computing the values of special parameters<br />
In parallel mode, some parameters (for example the number of sampling workers) are not set directly by the user but computed, and there are quite a few of them, which is why large stretches of code are devoted to this.<br />
Without comments, the code below would leave anyone baffled:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
n_envs_list = <span style="color:#94558d;">self</span>._get_n_envs_list(</pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a9-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
For more articles on rlpyt,&nbsp;see <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">this page</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it earlier.&nbsp;This article continues the <a href="https://www.codelast.com/?p=11441" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">previous</span></a> one and carries on analyzing the CpuSampler source code.<br />
We already know that CpuSampler has two parent classes: BaseSampler&nbsp;and&nbsp;ParallelSamplerBase. BaseSampler mostly just defines a set of interfaces, so there is little to say about it; this article therefore turns to the other parent class,&nbsp;ParallelSamplerBase. Its initialization function&nbsp;initialize() does so much important work that it deserves a long article of its own, and that is the main subject of this post.<br />
<span id="more-11613"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;What important work does the initialization function initialize()&nbsp;do<br />
In one sentence: initialize() computes the values of some special parameters, initializes the agent, creates the <span style="color:#0000ff;">parallel controller</span>, and creates and starts multiple worker processes.<br />
<span style="color:#ff0000;"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/270d.png" alt="✍" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span> The &ldquo;<span style="color: rgb(0, 0, 255);">parallel controller</span>&rdquo; (parallel ctrl) refers to the variables needed to coordinate the parallel processes when parallelism is implemented with Python's&nbsp;multiprocessing module, so that they all work together correctly. These coordinating variables make up the &ldquo;parallel controller&rdquo;.</p>
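To make the &ldquo;parallel controller&rdquo; concrete, here is roughly the kind of bundle it amounts to, built from multiprocessing primitives. The field names quit, do_eval, itr, barrier_in and barrier_out come from the worker code; the plain-dict packaging is my own simplification (rlpyt wraps them in an attribute dict):

```python
import multiprocessing as mp

n_worker = 2
ctrl = dict(
    quit=mp.Value("b", False),      # master sets True -> all workers exit sampling
    do_eval=mp.Value("b", False),   # master sets True -> workers collect evaluation data
    itr=mp.Value("l", 0),           # current iteration number, read by the workers
    barrier_in=mp.Barrier(n_worker + 1),   # entry handshake: n_worker workers + master
    barrier_out=mp.Barrier(n_worker + 1),  # exit handshake
)
ctrl["itr"].value = 5
print(ctrl["itr"].value)  # 5
```

Shared Values and Barriers like these are inherited by (or passed to) the worker processes at creation time, which is what lets sampling_process() read ctrl.quit.value and wait on ctrl.barrier_in.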
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;Computing the values of special parameters<br />
In parallel mode, some parameters (for example the number of sampling workers) are not set directly by the user but computed, and there are quite a few of them, which is why large stretches of code are devoted to this.<br />
Without comments, the code below would leave anyone baffled:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
n_envs_list = self._get_n_envs_list(affinity=affinity)  # The user-set worker count may not match the environment count; readjust here.
self.n_worker = n_worker = len(n_envs_list)  # Worker count after the adjustment.
B = self.batch_spec.B  # Number of environment instances.
global_B = B * world_size  # Number of environment instances across all "parallel universes".
env_ranks = list(range(rank * B, (rank + 1) * B))  # For the meaning, see: https://www.codelast.com/?p=10932
self.world_size = world_size
self.rank = rank

if self.eval_n_envs &gt; 0:  # Parameter passed in from example_*.py.
    self.eval_n_envs_per = max(1, self.eval_n_envs // n_worker)  # Evaluation environments carried per worker (at least 1).
    self.eval_n_envs = eval_n_envs = self.eval_n_envs_per * n_worker  # Guarantee at least as many eval environment instances as workers.
    logger.log(f"Total parallel evaluation envs: {eval_n_envs}.")
    self.eval_max_T = eval_max_T = int(self.eval_max_steps // eval_n_envs)</pre>
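<p>The first line above calls self._get_n_envs_list(); its balancing act can be sketched roughly as follows (a simplified stand-in that ignores the affinity argument, not rlpyt's actual code):</p>

```python
def get_n_envs_list(B, n_worker):
    """Distribute B environment instances over at most n_worker workers.

    If B < n_worker, surplus workers are dropped (one env per worker);
    otherwise each worker gets B // n_worker envs and the remainder is
    spread over the first workers.  A simplified sketch, not rlpyt's code.
    """
    if B < n_worker:
        return [1] * B                 # fewer envs than workers: shrink the worker count
    n_envs_list = [B // n_worker] * n_worker
    for i in range(B % n_worker):      # hand out the leftover envs one by one
        n_envs_list[i] += 1
    return n_envs_list

print(get_n_envs_list(10, 4))  # [3, 3, 2, 2]
print(get_n_envs_list(2, 4))   # [1, 1]
```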
<p>
The most &ldquo;magical&rdquo; part is <span style="color:#0000ff;">self._get_n_envs_list()</span>, which computes <span style="color:#b22222;">how many environment instances each worker carries</span>. That phrasing may sound strange. The reason: the user can specify the number of environment instances and the number of workers independently, but the two may not match, so either there are too few workers or too many. In the first case, a worker has to carry more than one environment instance; in the second, fewer workers are needed, so the worker count is reduced until each worker carries exactly one environment instance.<br />
I have annotated self._get_n_envs_list()&nbsp;with comments that should make its behavior clear:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">_get_n_envs_list</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>affinity=<span style="color:#cc7832;">None, </span>n_worker=<span style="color:#cc7832;">None, </span>B=<span style="color:#cc7832;">None</span>):
    <span style="color:#629755;font-style:italic;">&quot;&quot;&quot;
    From the number of environment instances (the so-called &quot;B&quot;) and the
    user-specified number of sampling workers (n_worker), compute a list whose
    length is the final number of workers and whose elements are the number of
    environment instances carried by each worker.

    :param affinity: a dict containing the hardware affinity settings.
    :param n_worker: the user-specified number of sampling workers.
    :param B: the number of environment instances.
    :return: a list, as described above.
    &quot;&quot;&quot;</span>
    B = <span style="color:#94558d;">self</span>.batch_spec.B <span style="color:#cc7832;">if </span>B <span style="color:#cc7832;">is None else </span>B  <span style="color:#808080;"># see the BatchSpec class; B can be read as the number of environment instances</span>
    n_worker = <span style="color:#8888c6;">len</span>(affinity[<span style="color:#6a8759;">&quot;workers_cpus&quot;</span>]) <span style="color:#cc7832;">if </span>n_worker <span style="color:#cc7832;">is None else </span>n_worker  <span style="color:#808080;"># number of workers (must not exceed the number of physical CPUs, or an error is raised elsewhere)</span>
    <span style="color:#6a8759;">&quot;&quot;&quot;
    When there are fewer environment instances than workers, e.g. 8 workers (i.e. 8
    physical CPUs) but only 5 environment instances, with one environment per physical
    CPU, 3 CPUs would sit idle; so the worker count is lowered to match the number of
    environment instances, leaving each CPU running exactly one environment instance.
    &quot;&quot;&quot;</span>
    <span style="color:#cc7832;">if </span>B &lt; n_worker:
        logger.log(<span style="color:#6a8759;">f&quot;WARNING: requested fewer envs (</span><span style="color:#cc7832;">{</span>B<span style="color:#cc7832;">}</span><span style="color:#6a8759;">) than available worker &quot;</span>
            <span style="color:#6a8759;">f&quot;processes (</span><span style="color:#cc7832;">{</span>n_worker<span style="color:#cc7832;">}</span><span style="color:#6a8759;">). Using fewer workers (but maybe better to &quot;</span>
            <span style="color:#6a8759;">&quot;increase sampler&#39;s `batch_B`.&quot;</span>)
        n_worker = B
    n_envs_list = [B // n_worker] * n_worker
    <span style="color:#6a8759;">&quot;&quot;&quot;
    When the number of environment instances is not an integer multiple of the number
    of workers, the environment instances are distributed unevenly across workers.
    &quot;&quot;&quot;</span>
    <span style="color:#cc7832;">if not </span>B % n_worker == <span style="color:#6897bb;">0</span>:
        logger.log(<span style="color:#6a8759;">&quot;WARNING: unequal number of envs per process, from &quot;</span>
            <span style="color:#6a8759;">f&quot;batch_B </span><span style="color:#cc7832;">{</span><span style="color:#94558d;">self</span>.batch_spec.B<span style="color:#cc7832;">}</span><span style="color:#6a8759;"> and n_worker </span><span style="color:#cc7832;">{</span>n_worker<span style="color:#cc7832;">} </span><span style="color:#6a8759;">&quot;</span>
            <span style="color:#6a8759;">&quot;(possible suboptimal speed).&quot;</span>)
        <span style="color:#cc7832;">for </span>b <span style="color:#cc7832;">in </span><span style="color:#8888c6;">range</span>(B % n_worker):
            n_envs_list[b] += <span style="color:#6897bb;">1</span>
    <span style="color:#cc7832;">return </span>n_envs_list</pre>
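<p>The allocation logic above boils down to a few lines of plain Python. The following is a simplified sketch (affinity handling and logging stripped out), not rlpyt's actual code:</p>

```python
def get_n_envs_list(B, n_worker):
    """Distribute B environment instances across at most n_worker workers."""
    if B < n_worker:
        n_worker = B                       # drop surplus workers
    n_envs_list = [B // n_worker] * n_worker
    for b in range(B % n_worker):          # spread the remainder over the first workers
        n_envs_list[b] += 1
    return n_envs_list

print(get_n_envs_list(5, 8))   # fewer envs than workers -> [1, 1, 1, 1, 1]
print(get_n_envs_list(7, 3))   # uneven split            -> [3, 2, 2]
```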
<p>
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Initializing the agent<br />
<span style="color:#0000ff;">There is only one agent object</span>! It is not the case that each worker process has its own agent object. This is an important concept to grasp when reading CpuSampler.<br />
The agent is initialized by the following code (in the ParallelSamplerBase.initialize() function):</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
env = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">EnvCls</span>(**<span style="color:#94558d;">self</span>.env_kwargs)
<span style="color:#94558d;">self</span>.<span style="color:#cc7833;">_agent_init</span>(agent<span style="color:#cc7832;">, </span>env<span style="color:#cc7832;">, </span><span style="color:#aa4926;">global_B</span>=global_B<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">env_ranks</span>=env_ranks)
examples = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">_build_buffers</span>(env<span style="color:#cc7832;">, </span>bootstrap_value)
env.<span style="color:#cc7833;">close</span>()
<span style="color:#cc7832;font-weight:bold;">del </span>env</pre>
<p>As shown here, an environment object is created and passed as an argument to the agent initialization function self._agent_init(). In fact, self._agent_init() only uses the env object&#39;s <span style="color:#0000ff;">spaces</span>&nbsp;attribute and keeps no reference to the whole env object, so it is safe to clean the environment up afterwards with env.close()&nbsp;and del env.<br />
self._build_buffers() is a rather complex operation whose main job is to build the <span style="color:#0000ff;">replay buffer</span> that reinforcement learning requires. Intuitively one might assume a replay buffer is just a list or some similar data structure, but it is not that simple: drilling down level by level reveals quite a lot of code, which even uses Python&nbsp;multiprocessing, so the replay buffer construction is not analyzed in this article.<br />
The implementation of self._agent_init()&nbsp;is quite simple:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">_agent_init</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>agent<span style="color:#cc7832;">, </span>env<span style="color:#cc7832;">, </span>global_B=<span style="color:#6897bb;">1</span><span style="color:#cc7832;">, </span>env_ranks=<span style="color:#cc7832;font-weight:bold;">None</span>):
    agent.<span style="color:#cc7833;">initialize</span>(env.spaces<span style="color:#cc7832;">, </span><span style="color:#aa4926;">share_memory</span>=<span style="color:#cc7832;font-weight:bold;">True</span><span style="color:#cc7832;">,
</span><span style="color:#cc7832;">        </span><span style="color:#aa4926;">global_B</span>=global_B<span style="color:#cc7832;">, </span><span style="color:#aa4926;">env_ranks</span>=env_ranks)
    <span style="color:#94558d;">self</span>.agent = agent</pre>
<p>Here we see that, after initialization, the agent is assigned to self.agent; this is the one and only agent&nbsp;object used in&nbsp;CpuSampler.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Creating the parallel controller<br />
The parallel controller (parallel ctrl) coordinates the sampling worker processes.<br />
In&nbsp;initialize(), the parallel controller is created by a single call, to the following function:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">_build_parallel_ctrl</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>n_worker):
    <span style="color:#629755;font-style:italic;">&quot;&quot;&quot;
    Create the data structures used to control the parallel training process.

    multiprocessing.RawValue: a value shared across processes, with no lock.
    multiprocessing.Barrier: a simple synchronization primitive for a fixed number
        of processes to wait for each other; once every process has called wait(),
        they all proceed at the same time.
    multiprocessing.Queue: a message queue for passing data between processes.

    :param n_worker: the actual number of workers (not necessarily the value the user originally set).
    &quot;&quot;&quot;</span>
    <span style="color:#94558d;">self</span>.ctrl = AttrDict(
        <span style="color:#aa4926;">quit</span>=mp.RawValue(ctypes.c_bool<span style="color:#cc7832;">, False</span>)<span style="color:#cc7832;">,</span>
        <span style="color:#aa4926;">barrier_in</span>=mp.Barrier(n_worker + <span style="color:#6897bb;">1</span>)<span style="color:#cc7832;">,</span>
        <span style="color:#aa4926;">barrier_out</span>=mp.Barrier(n_worker + <span style="color:#6897bb;">1</span>)<span style="color:#cc7832;">,</span>
        <span style="color:#aa4926;">do_eval</span>=mp.RawValue(ctypes.c_bool<span style="color:#cc7832;">, False</span>)<span style="color:#cc7832;">,</span>
        <span style="color:#aa4926;">itr</span>=mp.RawValue(ctypes.c_long<span style="color:#cc7832;">, </span><span style="color:#6897bb;">0</span>)<span style="color:#cc7832;">,</span>
    )
    <span style="color:#94558d;">self</span>.traj_infos_queue = mp.Queue()
    <span style="color:#94558d;">self</span>.eval_traj_infos_queue = mp.Queue()
    <span style="color:#94558d;">self</span>.sync = AttrDict(<span style="color:#aa4926;">stop_eval</span>=mp.RawValue(ctypes.c_bool<span style="color:#cc7832;">, False</span>))</pre>
<p>Here, AttrDict is an &ldquo;extended&rdquo; dict and mp is the Python&nbsp;multiprocessing module. Python&nbsp;multiprocessing is a huge topic that I only know at a basic level, so rather than attempt a thorough treatment, here are two examples illustrating what these parallel-control structures do:<br />
<span style="color:#0000ff;">✔</span>&nbsp;ctrl.quit can be thought of as a bool shared across processes. In minibatch_rl.py, when training finishes, shutdown() is executed; it calls sampler.shutdown(), which sets ctrl.quit to True. Meanwhile, in worker.py, the sampling loop exits as soon as it sees that ctrl.quit has become True. All sampling worker processes are governed by this one variable, which is how the main process controls the workers running in parallel.<br />
<span style="color:#0000ff;">✔</span>&nbsp;multiprocessing.Queue() passes messages between processes. Every sampling worker process puts the trajectory info it collects into the same traj_infos_queue; the main process then aggregates that trajectory info into statistics, which it logs, prints to the screen, and so on.<br />
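</p>
<p>Both mechanisms can be seen working together in a toy sketch. This is illustrative code, not rlpyt's (the function and variable names are made up): a master process stops its workers by flipping a shared RawValue, and the workers report back through a shared Queue:</p>

```python
import ctypes
import multiprocessing as mp
import time

def sampling_worker(rank, quit_flag, traj_queue):
    """Loop like a sampling process: keep working until the shared quit flag flips."""
    steps = 0
    while not quit_flag.value:
        steps += 1                     # stand-in for one sampling iteration
        time.sleep(0.01)
    traj_queue.put((rank, steps))      # report "trajectory info" back to the master

def run_demo(n_worker=2):
    quit_flag = mp.RawValue(ctypes.c_bool, False)   # lock-free shared bool
    traj_queue = mp.Queue()
    workers = [mp.Process(target=sampling_worker, args=(rank, quit_flag, traj_queue))
               for rank in range(n_worker)]
    for w in workers:
        w.start()
    time.sleep(0.1)                    # let workers "sample" for a while
    quit_flag.value = True             # master flips the flag; workers exit their loops
    results = [traj_queue.get(timeout=5) for _ in range(n_worker)]
    for w in workers:
        w.join(timeout=5)
    return results

if __name__ == "__main__":
    print(run_demo(2))                 # e.g. [(0, 9), (1, 9)] -- step counts vary
```
<p>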
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Creating and starting the worker processes<br />
Worker processes sample data (produced by the agent interacting with the environment).<br />
Before these processes can be created, their arguments must be assembled:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
common_kwargs = <span style="color:#94558d;">self</span>._assemble_common_kwargs(affinity<span style="color:#cc7832;">, </span>global_B)
workers_kwargs = <span style="color:#94558d;">self</span>._assemble_workers_kwargs(affinity<span style="color:#cc7832;">, </span>seed<span style="color:#cc7832;">, </span>n_envs_list)</pre>
<p>Why split the arguments into&nbsp;<span style="color:#0000ff;">common_kwargs</span> and&nbsp;<span style="color:#0000ff;">workers_kwargs</span>? Because some arguments are shared by every worker process while others are worker-specific (e.g. the number of CPUs each worker uses, or the number of environment instances it carries), so rlpyt splits them into two separate objects.</p>
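<p>A toy sketch of the per-worker half of this split (the field names here are illustrative, not rlpyt's exact ones): each worker's dict carries only what differs between workers, such as its rank, its CPUs, its environment count, and a decorrelated seed:</p>

```python
def assemble_workers_kwargs(n_envs_list, workers_cpus, seed):
    """Build one kwargs dict per worker: rank, its CPUs, its env count, a distinct seed."""
    workers_kwargs = []
    env_offset = 0
    for rank, n_envs in enumerate(n_envs_list):
        workers_kwargs.append(dict(
            rank=rank,
            cpus=workers_cpus[rank],
            n_envs=n_envs,
            seed=seed + rank,          # decorrelate the workers' randomness
            env_offset=env_offset,     # where this worker's envs start in the batch
        ))
        env_offset += n_envs
    return workers_kwargs

ws = assemble_workers_kwargs([3, 2, 2], workers_cpus=[[0], [1], [2]], seed=100)
print([w["env_offset"] for w in ws])  # [0, 3, 5]
```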
<p>With the arguments ready, the worker processes are created and started:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#808080;"># Create a batch of child processes.</span>
target = sampling_process <span style="color:#cc7832;">if </span>worker_process <span style="color:#cc7832;">is None else </span>worker_process
<span style="color:#94558d;">self</span>.workers = [mp.Process(<span style="color:#aa4926;">target</span>=target<span style="color:#cc7832;">,</span>
    <span style="color:#aa4926;">kwargs</span>=<span style="color:#8888c6;">dict</span>(<span style="color:#aa4926;">common_kwargs</span>=common_kwargs<span style="color:#cc7832;">, </span><span style="color:#aa4926;">worker_kwargs</span>=w_kwargs))
    <span style="color:#cc7832;">for </span>w_kwargs <span style="color:#cc7832;">in </span>workers_kwargs]
<span style="color:#808080;"># Start the child processes.</span>
<span style="color:#cc7832;">for </span>w <span style="color:#cc7832;">in </span><span style="color:#94558d;">self</span>.workers:
    w.start()

<span style="color:#94558d;">self</span>.ctrl.barrier_out.wait()  <span style="color:#808080;"># Wait for workers ready (e.g. decorrelate).</span></pre>
<p>Here the processes are created with multiprocessing.Process(), where target is the process function. The process function can be supplied by the user, and rlpyt also provides a default implementation: the&nbsp;sampling_process() function in worker.py. Although worker.py is not long, fully understanding it is not easy, so it is left for a later article.<br />
Once a worker process starts, it enters a continuous sampling loop. Note the last line of the code above,&nbsp;<span style="color:#0000ff;">self.ctrl.barrier_out.wait()</span>, which uses a multiprocessing Barrier to synchronize the worker processes. Since barrier_out is created like this:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'JetBrains Mono';font-size:13.5pt;">
<span style="color:#aa4926;">barrier_out</span>=mp.Barrier(n_worker + <span style="color:#6897bb;">1</span>)</pre>
<p>it takes <span style="color:#0000ff;">n_worker + 1 </span>calls to wait() before all processes are &ldquo;unlocked&rdquo; (i.e. start executing) at the same time. The&nbsp;<span style="color: rgb(0, 0, 255);">self.ctrl.barrier_out.wait()&nbsp;</span>in initialize() counts as one, and each worker function, i.e. sampling_process(), contains one barrier_out.wait() of its own. Together these add up to exactly <span style="color:#0000ff;">n_worker + 1</span> waits, so that by the time initialize() finishes, all workers are released together and begin sampling.<br />
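</p>
<p>The n_worker + 1 accounting can be reproduced in a minimal sketch (illustrative names, not rlpyt code): n workers each wait once at the barrier, and the master contributes the final wait that releases everyone:</p>

```python
import multiprocessing as mp

def worker(rank, barrier, queue):
    # Each worker does its setup, reports readiness, then blocks at the barrier.
    queue.put(f"worker {rank} ready")
    barrier.wait()                    # one of the n_worker waits

def run(n_worker=3):
    barrier = mp.Barrier(n_worker + 1)   # n workers + the master
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(rank, barrier, queue))
             for rank in range(n_worker)]
    for p in procs:
        p.start()
    msgs = [queue.get(timeout=5) for _ in range(n_worker)]
    barrier.wait(timeout=5)           # the "+1": master's wait releases all workers at once
    for p in procs:
        p.join(timeout=5)
    return msgs

if __name__ == "__main__":
    print(sorted(run(3)))
```
<p>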
That&#39;s all for this installment; to be continued in the next article.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
Reposts must credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan the QR code with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a9-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 强化学习框架 rlpyt 源码分析：(8) 基于CPU的并行采样器CpuSampler</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a8-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a8-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sun, 12 Jan 2020 09:40:26 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[并行]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11441</guid>

					<description><![CDATA[<p>
<em>While writing this article I revised it to the point of questioning my life choices: some of my own conclusions I overturned repeatedly after re-reading the source many times, so this article took me a very long time to finish. Even now I cannot guarantee it is absolutely correct, but it is at least the version I currently believe to be right. Long-form writing is hard; thanks for your understanding.</em></p>
<p>For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it earlier.&#160;</p>
<p>Rich single-machine parallelism (Parallelism) is a feature that sets rlpyt apart from many other reinforcement learning frameworks: it can run the training process in parallel using CPUs only, or a mix of CPUs and GPUs.<br />
<span id="more-11441"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Overview of rlpyt&#8217;s sampler module<br />
rlpyt has a type of module called a &#8220;<a href="https://www.codelast.com/?p=10750" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Sampler</span></a>&#8221;, which samples/collects the data from agent-environment interaction. For the different training modes (serial, parallel, asynchronous), rlpyt provides different sampler implementations:</p>
<blockquote>
<div>
		├── <span style="color:#0000ff;">async_</span></div>
<div>
		│&#160; &#160;├── action_server.py</div>
<div>
		│&#160; &#160;├── alternating_sampler.py</div>
<div>
		│&#160; &#160;├── base.py</div>
<div>
		│&#160; &#160;├── collectors.py</div>
<div>
		│&#160; &#160;├── cpu_sampler.py</div></blockquote>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a8-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
<em>While writing this article I revised it to the point of questioning my life choices: some of my own conclusions I overturned repeatedly after re-reading the source many times, so this article took me a very long time to finish. Even now I cannot guarantee it is absolutely correct, but it is at least the version I currently believe to be right. Long-form writing is hard; thanks for your understanding.</em></p>
<p>For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it earlier.&nbsp;</p>
<p>Rich single-machine parallelism (Parallelism) is a feature that sets rlpyt apart from many other reinforcement learning frameworks: it can run the training process in parallel using CPUs only, or a mix of CPUs and GPUs.<br />
<span id="more-11441"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Overview of rlpyt&rsquo;s sampler module<br />
rlpyt has a type of module called a &ldquo;<a href="https://www.codelast.com/?p=10750" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Sampler</span></a>&rdquo;, which samples/collects the data from agent-environment interaction. For the different training modes (serial, parallel, asynchronous), rlpyt provides different sampler implementations:</p>
<blockquote>
<div>
		├── <span style="color:#0000ff;">async_</span></div>
<div>
		│&nbsp; &nbsp;├── action_server.py</div>
<div>
		│&nbsp; &nbsp;├── alternating_sampler.py</div>
<div>
		│&nbsp; &nbsp;├── base.py</div>
<div>
		│&nbsp; &nbsp;├── collectors.py</div>
<div>
		│&nbsp; &nbsp;├── cpu_sampler.py</div>
<div>
		│&nbsp; &nbsp;├── gpu_sampler.py</div>
<div>
		│&nbsp; &nbsp;└── serial_sampler.py</div>
<div>
		├── base.py</div>
<div>
		├── buffer.py</div>
<div>
		├── collections.py</div>
<div>
		├── collectors.py</div>
<div>
		├── <span style="color:#0000ff;">parallel</span></div>
<div>
		│&nbsp; &nbsp;├── base.py</div>
<div>
		│&nbsp; &nbsp;├── cpu</div>
<div>
		│&nbsp; &nbsp;│&nbsp; &nbsp;├── collectors.py</div>
<div>
		│&nbsp; &nbsp;│&nbsp; &nbsp;└── sampler.py</div>
<div>
		│&nbsp; &nbsp;├── gpu</div>
<div>
		│&nbsp; &nbsp;│&nbsp; &nbsp;├── action_server.py</div>
<div>
		│&nbsp; &nbsp;│&nbsp; &nbsp;├── alternating_sampler.py</div>
<div>
		│&nbsp; &nbsp;│&nbsp; &nbsp;├── collectors.py</div>
<div>
		│&nbsp; &nbsp;│&nbsp; &nbsp;└── sampler.py</div>
<div>
		│&nbsp; &nbsp;└── worker.py</div>
<div>
		├── <span style="color:#0000ff;">serial</span></div>
<div>
		│&nbsp; &nbsp;├── collectors.py</div>
<div>
		│&nbsp; &nbsp;└── sampler.py</div>
</blockquote>
<p>
A first impression: the serial (<span style="color:#0000ff;">serial</span>) sampler code is the simplest; under the parallel (<span style="color:#0000ff;">parallel</span>) mode, the CPU implementation is somewhat simpler than the GPU implementation; and the asynchronous (<span style="color:#0000ff;">async_</span>) implementations are the most complex.<br />
You may wonder why the asynchronous mode&#39;s module name is <span style="color:#0000ff;">async_</span>, with a trailing underscore, rather than async: async is a keyword in Python 3, and the rlpyt author presumably added the underscore to avoid the clash.<br />
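</p>
<p>This is easy to verify: since Python 3.7, async has been a hard keyword, so it cannot be used as a module or variable name, while async_ is an ordinary identifier:</p>

```python
import keyword

print(keyword.iskeyword("async"))   # True  (so `import async` is a SyntaxError)
print(keyword.iskeyword("async_"))  # False (a legal identifier, hence the module name)

# compile() shows the clash directly:
try:
    compile("async = 1", "<demo>", "exec")
except SyntaxError:
    print("`async` cannot be used as a name")
```
<p>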
In the earlier articles of this series I analyzed the sampler code for the serial (<span style="color: rgb(0, 0, 255);">serial</span>) mode; this article analyzes the CPU parallel implementation under the parallel (<span style="color: rgb(0, 0, 255);">parallel</span>) mode, i.e. this part of the tree:</p>
<div>
<blockquote>
<div>
			├── cpu</div>
<div>
			│&nbsp; &nbsp;├── collectors.py</div>
<div>
			│&nbsp; &nbsp;└── sampler.py</div>
</blockquote>
<div>
		When sampling/collecting data, the CPU sampler does not use the GPU at all, so it is much simpler than the GPU sampler (relatively speaking). It consists of just two code files, although, since the classes in those two files inherit from other parent classes, far more files are ultimately involved. Let&#39;s analyze it in detail.<br />
		<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;CPU sampler overview<br />
		The CPU sampler is implemented by the&nbsp;CpuSampler class, which has several parent classes going up the hierarchy:</div>
</div>
<p><img decoding="async" alt="rlpyt" src="https://www.codelast.com/wp-content/uploads/2020/01/sampler_class_inheritance.png" style="width: 600px; height: 360px;" /><br />
This BaseSampler is also the topmost parent class of&nbsp;GpuSampler.<br />
As <a href="https://www.codelast.com/?p=10932" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">an earlier article</span></a> explained, a sampler is a layer wrapped around a collector; the collector classes do the actual data collection. For&nbsp;CpuSampler, the corresponding collectors are implemented in collectors.py, which contains several collector classes: CpuResetCollector, CpuWaitResetCollector, CpuEvalCollector, and so on.<br />
So the sampler classes should be analyzed along two lines: one is&nbsp;<span style="color:#0000ff;">CpuSampler</span>&rarr;<span style="color:#0000ff;">ParallelSamplerBase</span>&rarr;<span style="color:#0000ff;">BaseSampler</span>, the other is the collector classes. To keep this article from getting too long, it covers only the first line and leaves the collector classes to a later article.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;BaseSampler: a parent class that mainly defines interfaces<br />
The topmost parent class, BaseSampler, mainly defines interfaces; many of its functions are left unimplemented:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">initialize</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>*args<span style="color:#cc7832;">, </span>**kwargs):
    <span style="color:#cc7832;font-weight:bold;">raise </span><span style="color:#8888c6;">NotImplementedError
</span>
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">obtain_samples</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>itr):
    <span style="color:#cc7832;font-weight:bold;">raise </span><span style="color:#8888c6;">NotImplementedError  </span><span style="color:#808080;"># type: Samples
</span>
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">evaluate_agent</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>itr):
    <span style="color:#cc7832;font-weight:bold;">raise </span><span style="color:#8888c6;">NotImplementedError
</span>
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">shutdown</span>(<span style="color:#94558d;">self</span>):
    <span style="color:#cc7832;font-weight:bold;">pass</span></pre>
<p>而__init__()函数还是像<span style="background-color:#ffa07a;"><a href="https://www.codelast.com/?p=10831" rel="noopener noreferrer" target="_blank">之前见识过的套路</a></span>一样，使用save__init__args()来把可变参数保存到对象属性里：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7833;">save__init__args</span>(<span style="color:#8888c6;">locals</span>())</pre>
<p>其余就没啥好说的了。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;CpuSampler：主要充当一个入口<br />
CpuSampler类的代码相当少，它主要充当一个入口，而不是实现主要逻辑：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">CpuSampler</span>(ParallelSamplerBase):

    <span style="color:#cc7832;font-weight:bold;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>*args<span style="color:#cc7832;">, </span>CollectorCls=CpuResetCollector<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">            </span>eval_CollectorCls=CpuEvalCollector<span style="color:#cc7832;">, </span>**kwargs):
        <span style="color:#808080;"># e.g. or use CpuWaitResetCollector, etc...
</span><span style="color:#808080;">        </span><span style="color:#8888c6;">super</span>().<span style="color:#b200b2;">__init__</span>(*args<span style="color:#cc7832;">, </span><span style="color:#aa4926;">CollectorCls</span>=CollectorCls<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">            </span><span style="color:#aa4926;">eval_CollectorCls</span>=eval_CollectorCls<span style="color:#cc7832;">, </span>**kwargs)

    <span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">obtain_samples</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>itr):
        <span style="color:#94558d;">self</span>.agent.<span style="color:#cc7833;">sync_shared_memory</span>()  <span style="color:#808080;"># New weights in workers, if needed.
</span><span style="color:#808080;">        </span><span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#8888c6;">super</span>().<span style="color:#cc7833;">obtain_samples</span>(itr)

    <span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">evaluate_agent</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>itr):
        <span style="color:#94558d;">self</span>.agent.<span style="color:#cc7833;">sync_shared_memory</span>()
        <span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#8888c6;">super</span>().<span style="color:#cc7833;">evaluate_agent</span>(itr)</pre>
<p>其中，obtain_samples() 用于采样一批数据，evaluate_agent() 用于评估agent，也就是评估模型的效果。<br />
这两个函数都调用父类<span style="color:#0000ff;">ParallelSamplerBase</span>的同名函数来实现对应功能，后面会在其他文章里具体分析。<br />
在这两个函数的开头，都有一个&nbsp;self.agent.sync_shared_memory()&nbsp;的操作，这是干嘛？<br />
其功能是：<span style="color:#b22222;">在并行模式下，采样/评估之前先同步shared model</span>。<br />
<span style="color:#0000ff;">sync_shared_memory()</span>&nbsp;函数的实现是：</p>
<section class="output_wrapper" id="output_wrapper_id" style="font-size: 16px; color: rgb(62, 62, 62); line-height: 1.6; letter-spacing: 0px; font-family: &quot;Helvetica Neue&quot;, Helvetica, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;">
<pre style="font-size: inherit; color: inherit; line-height: inherit; margin-top: 0px; margin-bottom: 0px; padding: 0px;">
<code class="python language-python hljs" style="margin: 0px 2px; line-height: 18px; font-size: 14px; letter-spacing: 0px; font-family: Consolas, Inconsolata, Courier, monospace; border-radius: 0px; color: rgb(169, 183, 198); background: rgb(40, 43, 46); padding: 0.5em; overflow-wrap: normal !important; word-break: normal !important; overflow: auto !important; display: -webkit-box !important;"><span class="hljs-function" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;"><span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; overflow-wrap: inherit !important; word-break: inherit !important;">def</span>&nbsp;<span class="hljs-title" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(165, 218, 45); word-wrap: inherit !important; word-break: inherit !important;">sync_shared_memory</span><span class="hljs-params" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(255, 152, 35); word-wrap: inherit !important; word-break: inherit !important;">(self)</span>:</span>
&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">if</span>&nbsp;self.shared_model&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">is</span>&nbsp;<span class="hljs-keyword" style="font-size: inherit; line-height: inherit; margin: 0px; padding: 0px; color: rgb(248, 35, 117); word-wrap: inherit !important; word-break: inherit !important;">not</span>&nbsp;self.model:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.shared_model.load_state_dict(strip_ddp_state_dict(
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.model.state_dict()))
</code></pre>
</section>
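<p>这里补充一点背景：DDP（DistributedDataParallel）包装过的模型，其 state_dict 的键会统一带上 &ldquo;module.&rdquo; 前缀，因此不能直接 load 进未包装的 shared_model，需要先把前缀去掉。下面用普通 dict 演示这个思路（概念性草图，并非 rlpyt 中 strip_ddp_state_dict 的原始实现）：</p>

```python
# 示意：去掉 DDP 给参数名附加的 "module." 前缀，
# 使 state_dict 能加载到未被 DDP 包装的模型里。
# （概念示意，并非 rlpyt 的原始代码。）

def strip_ddp_prefix(state_dict, prefix="module."):
    """去掉 DDP 附加的参数名前缀；没有前缀的键原样保留。"""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

ddp_style = {"module.fc.weight": [0.1, 0.2], "module.fc.bias": [0.0]}
plain = strip_ddp_prefix(ddp_style)
# plain == {"fc.weight": [0.1, 0.2], "fc.bias": [0.0]}
```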
<p>这里的意思是：当 self.model 被训练过之后，可能已经和 self.shared_model 不是一个东西了，此时需要把 self.model 的参数copy到 self.shared_model&nbsp;里。<br />
<span style="color: rgb(0, 0, 255);">strip_ddp_state_dict()</span>函数是一个很tricky的操作，为什么从 self.model&nbsp;取出来的 state_dict&nbsp;不能直接用 load_state_dict()&nbsp;加载到 self.shared_model&nbsp;里呢？关于这一点，我觉得代码的注释里写得比较清楚，建议直接去看它。<br />
这里就产生了两个问题：<span style="color:#0000ff;">✓</span> <span style="color:#ff0000;">什么是shared model？</span>&nbsp;<span style="color:#0000ff;">✓</span> <span style="color:#ff0000;">为什么要同步shared model？</span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;什么是shared model<br />
从名字上猜测，shared model就是一个&ldquo;共享的模型&rdquo;，之所以会有&ldquo;共享&rdquo;这个概念，是因为在多个进程中都需要使用模型，所以才需要&ldquo;共享&rdquo;。<br />
<span style="color:#ff8c00;"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span> rlpyt在并行(<span style="color: rgb(0, 0, 255);">parallel</span>)模式下，会产生多个&ldquo;worker&rdquo;跑在多个进程里，这些worker会各自在environment中采样，采样得到的数据用于优化模型。<br />
<span style="color: rgb(255, 140, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;worker在采样的时候会选择action，此时会用模型来做action selection。<br />
<span style="color: rgb(255, 140, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;所有worker关联到同一个agent对象(agent包含了策略网络的参数)，只有一个进程会去做优化模型(也就是反向传播之类)的工作，这一点要特别注意，是一个进程，而不是所有worker进程！<br />
<span style="color: rgb(255, 140, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2714.png" alt="✔" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span>&nbsp;在每个agent对象内部，会有一个类型为 torch.nn.Module 的 self.model 对象，还有一个 self.shared_model 对象，我们可以从agent的父类&nbsp;BaseAgent&nbsp;的__init__()函数中看到这一点：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>ModelCls=<span style="color:#cc7832;font-weight:bold;">None</span><span style="color:#cc7832;">, </span>model_kwargs=<span style="color:#cc7832;font-weight:bold;">None</span><span style="color:#cc7832;">, </span>initial_model_state_dict=<span style="color:#cc7832;font-weight:bold;">None</span>):
    <span style="color:#cc7833;">save__init__args</span>(<span style="color:#8888c6;">locals</span>())
    <span style="color:#94558d;">self</span>.model = <span style="color:#cc7832;font-weight:bold;">None  </span><span style="color:#808080;"># type: torch.nn.Module
</span><span style="color:#808080;">    </span><span style="color:#94558d;">self</span>.shared_model = <span style="color:#cc7832;font-weight:bold;">None</span></pre>
<p>在agent对象初始化的时候，即在 BaseAgent.initialize() 函数中，会把 self.shared_model&nbsp;初始化成和 self.model&nbsp;一样：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">initialize</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>env_spaces<span style="color:#cc7832;">, </span>share_memory=<span style="color:#cc7832;font-weight:bold;">False</span><span style="color:#cc7832;">, </span>**kwargs):
    <span style="color:#629755;font-style:italic;">&quot;&quot;&quot;In this default setup, self.model is treated as the model needed
</span><span style="color:#629755;font-style:italic;">    for action selection, so it is the only one shared with workers.&quot;&quot;&quot;
</span><span style="color:#629755;font-style:italic;">    </span><span style="color:#94558d;">self</span>.env_model_kwargs = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">make_env_to_model_kwargs</span>(env_spaces)
    <span style="color:#94558d;">self</span>.model = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">ModelCls</span>(**<span style="color:#94558d;">self</span>.env_model_kwargs<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">        </span>**<span style="color:#94558d;">self</span>.model_kwargs)
    <span style="color:#cc7832;font-weight:bold;">if </span>share_memory:
        <span style="color:#94558d;">self</span>.model.<span style="color:#cc7833;">share_memory</span>()
        <span style="color:#94558d;">self</span>.shared_model = <span style="color:#94558d;">self</span>.model</pre>
<p>上面代码中的 if share_memory&nbsp;这个条件是否得到满足呢？<br />
在并行模式下，也就是从 ParallelSamplerBase._agent_init()&nbsp;函数的代码我们可以发现，agent初始化的时候 share_memory&nbsp;参数被设置成了&nbsp;True：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
agent.<span style="color:#cc7833;">initialize</span>(env.spaces<span style="color:#cc7832;">, </span><span style="color:#aa4926;">share_memory</span>=<span style="color:#cc7832;font-weight:bold;">True</span><span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">global_B</span>=global_B<span style="color:#cc7832;">, </span><span style="color:#aa4926;">env_ranks</span>=env_ranks)</pre>
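<p>把 share_memory=True 和前面 BaseAgent.initialize() 里的 self.shared_model = self.model 连起来看：此时两者指向同一个对象，于是 sync_shared_memory() 里的 &ldquo;if self.shared_model is not self.model&rdquo; 永远为假，真正的参数拷贝不会发生。下面用普通 Python 对象演示这种 is 判断（示意代码，非 rlpyt 原始实现）：</p>

```python
# 示意：share_memory=True 时 shared_model 和 model 是同一个对象，
# 因此 sync_shared_memory() 里的 is not 判断为假，同步是空操作。
# （示意代码，非 rlpyt 原始实现。）

class DummyAgent:
    def __init__(self):
        self.model = object()      # 代表 torch.nn.Module 模型
        self.shared_model = None
        self.copied = False

    def initialize(self, share_memory=False):
        if share_memory:
            self.shared_model = self.model   # 指向同一个对象

    def sync_shared_memory(self):
        if self.shared_model is not self.model:
            self.copied = True     # 只有两者不同时才会拷贝参数

agent = DummyAgent()
agent.initialize(share_memory=True)
agent.sync_shared_memory()
# agent.copied == False：同步在这种情况下是空操作
```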
<p>所以 if share_memory&nbsp;的条件是满足的。<br />
如果使用GPU训练模型，rlpyt会把model挪到用户指定的设备上，而shared_model仍然放在CPU上(<a href="https://towardsdatascience.com/speed-up-your-algorithms-part-1-pytorch-56d8a4ae7051" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">经查</span></a>，PyTorch的Tensor和模型参数也可以放在GPU上共享，但有一些容易出错、需要谨慎处理的细节，我猜这正是作者选择把shared_model放在CPU上的原因)。因此这里单独创建了一个self.shared_model，以应对之后self.model被挪到GPU的情况：一旦发生，这个留在CPU上的self.shared_model才是多个进程间真正共享的模型。<br />
那么这个shared_model在CpuSampler中真的有用吗？下面我们就一层层地挖下去，看看这个东西到底有没有用。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;为什么要同步shared model<br />
先说结论：在CpuSampler里，完全不需要同步。<br />
为了确认这个结论，我们看看在使用CPU sampler的时候，BaseAgent类里的 self.shared_model&nbsp;到底用在了什么地方。通过搜索代码，发现除了 <span style="color:#0000ff;">sync_shared_memory()</span>&nbsp;函数之外，只有两个地方在用：<br />
1、上面提到的&nbsp;BaseAgent.initialize()&nbsp;函数。在这里，对 self.shared_model&nbsp;只有赋值操作，没有使用。<br />
2、to_device()&nbsp;函数：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">to_device</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>cuda_idx=<span style="color:#cc7832;font-weight:bold;">None</span>):
<span style="color:#629755;font-style:italic;">    </span><span style="color:#cc7832;font-weight:bold;">if </span>cuda_idx <span style="color:#cc7832;font-weight:bold;">is None</span>:
        <span style="color:#cc7832;font-weight:bold;">return
</span><span style="color:#cc7832;font-weight:bold;">    if </span><span style="color:#94558d;">self</span>.shared_model <span style="color:#cc7832;font-weight:bold;">is not None</span>:
        <span style="color:#94558d;">self</span>.model = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">ModelCls</span>(**<span style="color:#94558d;">self</span>.env_model_kwargs<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">            </span>**<span style="color:#94558d;">self</span>.model_kwargs)
        <span style="color:#94558d;">self</span>.model.<span style="color:#cc7833;">load_state_dict</span>(<span style="color:#94558d;">self</span>.shared_model.<span style="color:#cc7833;">state_dict</span>())
    <span style="color:#94558d;">self</span>.device = torch.<span style="color:#cc7833;">device</span>(<span style="color:#008080;">&quot;cuda&quot;</span><span style="color:#cc7832;">, </span><span style="color:#aa4926;">index</span>=cuda_idx)
    <span style="color:#94558d;">self</span>.model.<span style="color:#cc7833;">to</span>(<span style="color:#94558d;">self</span>.device)</pre>
<p>在这一段代码中，当使用CPU sampler时，cuda_idx&nbsp;为 None，因此直接return了，self.shared_model&nbsp;根本触达不到。<br />
此外，BaseAgent的其他所有使用 self.shared_model&nbsp;的地方，都是和异步(<span style="color: rgb(0, 0, 255);">async_</span>)模式相关的，和并行(<span style="color:#0000ff;">parallel</span>)模式无关。<br />
因此，对CpuSampler来说，shared_model没用，不需要调用 sync_shared_memory()&nbsp;来同步shared_model。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;shared model在什么情况下有意义<br />
对CpuSampler来说，BaseAgent里的 self.model&nbsp;对各个采样的worker来说都会实时更新，在action&nbsp;selection的时候使用的也是 self.model，而不是 self.shared_model，所以 shared_model&nbsp;对CpuSampler来说其实没有意义。<br />
但在其他模式下 shared model 还是有意义的，而且机制更复杂。<br />
这一节就到这，且听下回分解。<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a8-%e5%9f%ba%e4%ba%8ecpu%e7%9a%84%e5%b9%b6%e8%a1%8c%e9%87%87%e6%a0%b7/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 强化学习框架 rlpyt 源码分析：(7) 模型参数是在哪更新的</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a7-%e6%a8%a1%e5%9e%8b%e5%8f%82%e6%95%b0%e6%98%af%e5%9c%a8%e5%93%aa/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a7-%e6%a8%a1%e5%9e%8b%e5%8f%82%e6%95%b0%e6%98%af%e5%9c%a8%e5%93%aa/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 08 Jan 2020 17:55:58 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11528</guid>

					<description><![CDATA[<p>
查看关于 rlpyt&#160;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&#160;如果你想用这个框架来开发自己的强化学习程序（尤其是那些不属于Atari游戏领域的强化学习程序），那么需要对它的源码有一定的了解。<br />
本文简要分析一下在rlpyt中，强化学习模型的参数是在什么地方被更新、怎么被更新的。<br />
<span id="more-11528"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;概述<br />
模型参数是在<span style="color:#0000ff;">Algorithm</span>模块的<span style="color:#0000ff;">optimize_agent()</span>函数里被更新的，它在Runner类(例如&#160;<span style="color:#0000ff;">MinibatchRl</span>)的<span style="color:#0000ff;">train()</span>函数里被调用。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Runner类的调用<br />
以<span style="color: rgb(0, 0, 255);">MinibatchRl</span>这个Runner类为例，它的 train()&#160;函数中有这么一句：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
opt_info = <span style="color:#94558d;">self</span>.algo.<span style="color:#cc7833;">optimize_agent</span>(itr<span style="color:#cc7832;">, </span>samples)</pre>
<p>其中，self.algo&#160;就是一个Algorithm类的对象，这里的&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a7-%e6%a8%a1%e5%9e%8b%e5%8f%82%e6%95%b0%e6%98%af%e5%9c%a8%e5%93%aa/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
查看关于 rlpyt&nbsp;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&nbsp;如果你想用这个框架来开发自己的强化学习程序（尤其是那些不属于Atari游戏领域的强化学习程序），那么需要对它的源码有一定的了解。<br />
本文简要分析一下在rlpyt中，强化学习模型的参数是在什么地方被更新、怎么被更新的。<br />
<span id="more-11528"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;概述<br />
模型参数是在<span style="color:#0000ff;">Algorithm</span>模块的<span style="color:#0000ff;">optimize_agent()</span>函数里被更新的，它在Runner类(例如&nbsp;<span style="color:#0000ff;">MinibatchRl</span>)的<span style="color:#0000ff;">train()</span>函数里被调用。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;Runner类的调用<br />
以<span style="color: rgb(0, 0, 255);">MinibatchRl</span>这个Runner类为例，它的 train()&nbsp;函数中有这么一句：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
opt_info = <span style="color:#94558d;">self</span>.algo.<span style="color:#cc7833;">optimize_agent</span>(itr<span style="color:#cc7832;">, </span>samples)</pre>
<p>其中，self.algo&nbsp;就是一个Algorithm类的对象，这里的<span style="color:#0000ff;">optimize_agent()</span>函数会用采样得到的一批数据(samples)更新一次模型参数。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;Algorithm类更新模型参数的实现<br />
在<a href="https://www.codelast.com/?p=10750" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">前文</span></a>中提到了rlpyt有一个模块叫做<span style="color:#0000ff;">Algorithm</span>，它们位于项目的 <span style="color:#0000ff;">rlpyt/algos/</span>&nbsp;路径下：</p>
<blockquote>
<div>
		├── base.py</div>
<div>
		├── dqn</div>
<div>
		│&nbsp; &nbsp;├── cat_dqn.py</div>
<div>
		│&nbsp; &nbsp;├── dqn.py</div>
<div>
		│&nbsp; &nbsp;└── r2d1.py</div>
<div>
		├── pg</div>
<div>
		│&nbsp; &nbsp;├── a2c.py</div>
<div>
		│&nbsp; &nbsp;├── base.py</div>
<div>
		│&nbsp; &nbsp;└── ppo.py</div>
<div>
		├── qpg</div>
<div>
		│&nbsp; &nbsp;├── ddpg.py</div>
<div>
		│&nbsp; &nbsp;├── sac.py</div>
<div>
		│&nbsp; &nbsp;├── sac_v.py</div>
<div>
		│&nbsp; &nbsp;└── td3.py</div>
<div>
		└── utils.py</div>
</blockquote>
<div>
	这些就是rlpyt里面的&ldquo;算法&rdquo;模块，它们实现了DQN，PPO等算法。<br />
	以DQN为例(<span style="color: rgb(0, 0, 255);">rlpyt/algos/dqn/dqn.py</span>)，其<span style="color: rgb(0, 0, 255);">optimize_agent()</span>函数有这么几句：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.optimizer.<span style="color:#cc7833;">zero_grad</span>()  <span style="color:#808080;"># 将所有参数的梯度都置零
</span>loss<span style="color:#cc7832;">, </span>td_abs_errors = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">loss</span>(samples_from_replay)
loss.<span style="color:#cc7833;">backward</span>()  <span style="color:#808080;"># 误差反向传播计算参数梯度
</span>grad_norm = torch.nn.utils.<span style="color:#cc7833;">clip_grad_norm_</span>(<span style="color:#94558d;">self</span>.agent.<span style="color:#cc7833;">parameters</span>()<span style="color:#cc7832;">, </span><span style="color:#94558d;">self</span>.clip_grad_norm)
<span style="color:#94558d;">self</span>.optimizer.<span style="color:#cc7833;">step</span>()  <span style="color:#808080;"># 通过梯度做一步参数更新</span></pre>
<p>	加上注释的几句就是主要的模型参数更新逻辑。其中，self.optimizer其实就是PyTorch的optimizer对象(例如 torch.optim.Adam)，用于优化神经网络的参数。<br />
	但是乍一看，这几句optimizer的操作，貌似和模型(torch.nn.Module)的参数没有关系？<br />
	所以这就涉及到另一个问题：optimizer和model是怎么关联上的？<br />
	在<span style="color:#0000ff;">DQN.optim_initialize()</span>函数中创建了 <span style="color:#0000ff;">self.optimizer</span>&nbsp;对象：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.optimizer = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">OptimCls</span>(<span style="color:#94558d;">self</span>.agent.<span style="color:#cc7833;">parameters</span>()<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">lr</span>=<span style="color:#94558d;">self</span>.learning_rate<span style="color:#cc7832;">, </span>**<span style="color:#94558d;">self</span>.optim_kwargs)</pre>
<p>	其中，<span style="color:#0000ff;">self.OptimCls</span>&nbsp;就是PyTorch的optimizer类，例如 torch.optim.Adam。其构造函数可以接受一个 <span style="color:#0000ff;">params</span>&nbsp;参数：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>params<span style="color:#cc7832;">, </span>lr=<span style="color:#6897bb;">1e-3</span><span style="color:#cc7832;">, </span>betas=(<span style="color:#6897bb;">0.9</span><span style="color:#cc7832;">, </span><span style="color:#6897bb;">0.999</span>)<span style="color:#cc7832;">, </span>eps=<span style="color:#6897bb;">1e-8</span><span style="color:#cc7832;">,
</span><span style="color:#cc7832;">             </span>weight_decay=<span style="color:#6897bb;">0</span><span style="color:#cc7832;">, </span>amsgrad=<span style="color:#cc7832;font-weight:bold;">False</span>):</pre>
<p>	官方文档对&nbsp;<span style="color: rgb(0, 0, 255);">params</span>参数的说明：</div>
<div>
<blockquote>
<p>
			params (iterable): iterable of parameters to optimize or dicts defining parameter groups</p>
</blockquote>
<p>	在创建 self.optimizer&nbsp;对象的时候，传入了一个 <span style="color:#0000ff;">self.agent.parameters()</span> 参数，这个函数的实现在 <span style="color:#0000ff;">BaseAgent.parameters()</span>&nbsp;这里：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">parameters</span>(<span style="color:#94558d;">self</span>):
    <span style="color:#629755;font-style:italic;">&quot;&quot;&quot;Parameters to be optimized (overwrite in subclass if multiple models).&quot;&quot;&quot;
</span><span style="color:#629755;font-style:italic;">    </span><span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#94558d;">self</span>.model.<span style="color:#cc7833;">parameters</span>()</pre>
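<p>把上面几段代码串起来看：optimizer 构造时拿到的是模型参数对象的引用，step() 就地修改这些对象，model 自然也就被更新了。下面用一个不依赖 PyTorch 的草图演示这种&ldquo;按引用关联&rdquo;的机制（概念示意，并非 rlpyt 或 PyTorch 的原始实现）：</p>

```python
# 示意：optimizer 保存的是模型参数对象的引用，step() 就地更新它们，
# 所以优化器"更新的就是模型的参数"。
# （概念示意，并非 rlpyt 或 PyTorch 的原始实现。）

class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class TinyModel:
    def __init__(self):
        self.w = Param(1.0)
    def parameters(self):
        return [self.w]

class TinySGD:
    def __init__(self, params, lr=0.1):
        self.params = list(params)   # 保存引用，而不是拷贝
        self.lr = lr
    def step(self):
        for p in self.params:
            p.value -= self.lr * p.grad   # 就地更新参数对象

model = TinyModel()
opt = TinySGD(model.parameters(), lr=0.1)
model.w.grad = 2.0    # 假装 backward() 已经算出了梯度
opt.step()
# model.w.value 由 1.0 变成 0.8：optimizer 更新的正是 model 的参数
```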
<p>	其中，self.model&nbsp;就是&nbsp;torch.nn.Module&nbsp;类型的对象，其 parameters()&nbsp;函数返回的就是模型要优化的参数。<br />
	于是 model 就这样和 optimizer 关联起来了。<br />
	这一节就到这，且听下回分解。<br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
	感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a7-%e6%a8%a1%e5%9e%8b%e5%8f%82%e6%95%b0%e6%98%af%e5%9c%a8%e5%93%aa/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 强化学习框架 rlpyt 并行(parallelism)原理初探</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e5%b9%b6%e8%a1%8cparallelism%e5%8e%9f%e7%90%86%e5%88%9d%e6%8e%a2/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e5%b9%b6%e8%a1%8cparallelism%e5%8e%9f%e7%90%86%e5%88%9d%e6%8e%a2/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Mon, 23 Dec 2019 05:26:47 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[并行]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11346</guid>

					<description><![CDATA[<p>
查看关于 rlpyt&#160;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&#160;</p>
<p>在单机上全面的并行（Parallelism）特性是 rlpyt 有别于很多其他强化学习框架的一个显著特征。在前面的简介文章中，已经介绍了 rlpyt 支持多种场景下的并行训练。而这种&#8220;武功&#8221;是怎么修炼出来的呢？它是站在了巨人的肩膀上&#8212;&#8212;通过PyTorch的多进程(multiprocessing)机制来实现的。<br />
所以你知道为什么 rlpyt 不使用TensorFlow这样的框架来作为后端了吧：TensorFlow自身并没有提供与 PyTorch multiprocessing 对等的多进程共享内存机制，只能靠类似于<a href="https://github.com/ray-project/ray" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Ray</span></a>这样的并行计算框架的帮助，才能支撑起全方位的并行特性。<br />
<span id="more-11346"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;为什么说TensorFlow自身的并行能力并不适用于强化学习场景<br />
限于我掌握的知识，我不保证下面的结论都是正确的，请专家们不吝赐教。<br />
相信很多刚开始学写强化学习程序的人，都是从<a href="https://morvanzhou.github.io/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">莫凡</span></a>的强化学习教程开始的，莫凡的强化学习教程使用的是TensorFlow来实现的（很久以前看到是这样，后来我没有再去关注过，不知道他有没有发布在其他ML框架下的RL教程）。<br />
看过一部分莫凡RL代码的人都会知道，里面用TensorFlow实现的静态图多进程&#8220;并行&#8221;训练逻辑有多么晦涩（而且并行其实是伪并行，说到底还是串行）。<br />
我个人认为，如果一个初学者从这样的程序入手，其实就相当于&#8220;劝退&#8221;，也就是说：这程序这么难写，你还是别学了吧。如果有与莫凡的RL代码逻辑对等的PyTorch代码，那绝对会是另一番景象。<br />
有人会说，明明TensorFlow就支持并行训练的啊！现在很多模型不就是通过多机多卡分布式训练的吗？<br />
然而到了强化学习场景下，就不是这么一回事了：强化学习和监督学习很不一样。在强化学习场景下，如果要并行训练的话，会需要多个agent与多个environment交互，对应到程序就是多个进程/线程。与environment交互的过程，可以是纯CPU计算，也可以是CPU/GPU混合计算（例如，inference得到action的过程就可以放在GPU上加速），但这个过程不能是纯GPU计算的过程。以Atari游戏模拟器为例，调用<a href="https://github.com/mgbellemare/Arcade-Learning-Environment" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">ALE</span></a>接口得到Atari环境的反馈，这个过程就是CPU计算的，不能在GPU上计算。<br />
整个强化学习的流程，数据就是这样不断地在CPU/GPU之间流转。当然你可以使用纯CPU，但假设你使用了GPU的话，也只能在一小部分工作中使用GPU，其实CPU的工作也很重。反观supervised learning，当你把数据预处理好了之后，就可以一次性地喂给GPU；GPU在单机单卡训练的时候，可以把结果全部算完了再吐回给CPU。<br />
就算是Distributed TensorFlow，也不适用于强化学习：Distributed TensorFlow的并行功能是为了并行地使用GPU，但强化学习的采样过程使用的是CPU，按我的理解这部分工作不能使用Distributed TensorFlow来并行，相反PyTorch有<span style="color:#0000ff;">multiprocessing</span>可以做到；而计算梯度之类的工作用Distributed TensorFlow就可以并行&#8212;&#8212;但别的DL框架例如PyTorch也可以啊。<br />
所以Distributed TensorFlow在RL场景下有什么优势？没看出来。<br />
关于TensorFlow在强化学习场景下的应用，莫凡当时也<a href="https://www.zhihu.com/question/63342728/answer/297818331" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">在知乎向网友提问</span></a>如何能在TF下较好地实现强化学习的并行功能，结论大概就是：还是用PyTorch吧！<br />
另外，知乎上有<a href="https://www.zhihu.com/question/308716947/answer/571637089" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">一个讨论</span></a>，提问者对GPU并行训练DRL模型的并行过程提出了疑问。第一个回答里面说&#8220;采样过程可以并行&#8221;，但作者说的并不是指Distributed TensorFlow支持这个功能。<br />
所以我认为，TensorFlow由于缺少了类似于PyTorch&#160;multiprocessing那样的模块，它只能借助于类似于Ray的并行计算框架，也就是在外面再&#8220;包装一层&#8221;，才能把TF对&#8220;全面的并行强化学习&#8221;的缺陷给修补上。<br />
&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e5%b9%b6%e8%a1%8cparallelism%e5%8e%9f%e7%90%86%e5%88%9d%e6%8e%a2/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
查看关于 rlpyt&nbsp;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&nbsp;</p>
<p>在单机上全面的并行（Parallelism）特性是 rlpyt 有别于很多其他强化学习框架的一个显著特征。在前面的简介文章中，已经介绍了 rlpyt 支持多种场景下的并行训练。而这种&ldquo;武功&rdquo;是怎么修炼出来的呢？它是站在了巨人的肩膀上&mdash;&mdash;通过PyTorch的多进程(multiprocessing)机制来实现的。<br />
所以你知道为什么 rlpyt 不使用TensorFlow这样的框架来作为后端了吧：TensorFlow自身并没有提供与 PyTorch multiprocessing 对等的多进程共享内存机制，只能靠类似于<a href="https://github.com/ray-project/ray" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Ray</span></a>这样的并行计算框架的帮助，才能支撑起全方位的并行特性。<br />
<span id="more-11346"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;为什么说TensorFlow自身的并行能力并不适用于强化学习场景<br />
限于我掌握的知识，我不保证下面的结论都是正确的，请专家们不吝赐教。<br />
相信很多刚开始学写强化学习程序的人，都是从<a href="https://morvanzhou.github.io/" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">莫凡</span></a>的强化学习教程开始的，莫凡的强化学习教程使用的是TensorFlow来实现的（很久以前看到是这样，后来我没有再去关注过，不知道他有没有发布在其他ML框架下的RL教程）。<br />
看过一部分莫凡RL代码的人都会知道，里面用TensorFlow实现的静态图多进程&ldquo;并行&rdquo;训练逻辑有多么晦涩（而且并行其实是伪并行，说到底还是串行）。<br />
我个人认为，如果一个初学者从这样的程序入手，其实就相当于&ldquo;劝退&rdquo;，也就是说：这程序这么难写，你还是别学了吧。如果有与莫凡的RL代码逻辑对等的PyTorch代码，那绝对会是另一番景象。<br />
有人会说，明明TensorFlow就支持并行训练的啊！现在很多模型不就是通过多机多卡分布式训练的吗？<br />
然而到了强化学习场景下，就不是这么一回事了：强化学习和监督学习很不一样。在强化学习场景下，如果要并行训练的话，会需要多个agent与多个environment交互，对应到程序就是多个进程/线程。与environment交互的过程，可以是纯CPU计算，也可以是CPU/GPU混合计算（例如，inference得到action的过程就可以放在GPU上加速），但这个过程不能是纯GPU计算的过程。以Atari游戏模拟器为例，调用<a href="https://github.com/mgbellemare/Arcade-Learning-Environment" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">ALE</span></a>接口得到Atari环境的反馈，这个过程就是CPU计算的，不能在GPU上计算。<br />
整个强化学习的流程，数据就是这样不断地在CPU/GPU之间流转。当然你可以使用纯CPU，但假设你使用了GPU的话，也只能在一小部分工作中使用GPU，其实CPU的工作也很重。反观supervised learning，当你把数据预处理好了之后，就可以一次性地喂给GPU；GPU在单机单卡训练的时候，可以把结果全部算完了再吐回给CPU。<br />
就算是Distributed TensorFlow，也不适用于强化学习：Distributed TensorFlow的并行功能是为了并行地使用GPU，但强化学习的采样过程使用的是CPU，按我的理解这部分工作不能使用Distributed TensorFlow来并行，相反PyTorch有<span style="color:#0000ff;">multiprocessing</span>可以做到；而计算梯度之类的工作用Distributed TensorFlow就可以并行&mdash;&mdash;但别的DL框架例如PyTorch也可以啊。<br />
所以Distributed TensorFlow在RL场景下有什么优势？没看出来。<br />
关于TensorFlow在强化学习场景下的应用，莫凡当时也<a href="https://www.zhihu.com/question/63342728/answer/297818331" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">在知乎向网友提问</span></a>如何能在TF下较好地实现强化学习的并行功能，结论大概就是：还是用PyTorch吧！<br />
另外，知乎上有<a href="https://www.zhihu.com/question/308716947/answer/571637089" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">一个讨论</span></a>，提问者对GPU并行训练DRL模型的并行过程提出了疑问。第一个回答里面说&ldquo;采样过程可以并行&rdquo;，但作者说的并不是指Distributed TensorFlow支持这个功能。<br />
所以我认为，TensorFlow由于缺少了类似于PyTorch&nbsp;multiprocessing那样的模块，它只能借助于类似于Ray的并行计算框架，也就是在外面再&ldquo;包装一层&rdquo;，才能把TF对&ldquo;全面的并行强化学习&rdquo;的缺陷给修补上。<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;PyTorch的多进程处理功能<br />
参考<span style="background-color:#ffa07a;"><a href="https://www.jiqizhixin.com/articles/2019-12-02-6" rel="noopener noreferrer" target="_blank">这段话</a></span>：</p>
<blockquote>
<div>
		由于全局解释器锁（global interpreter lock，GIL）的 Python 默认实现不允许并行线程进行并行执行，所以为了解决该问题，Python 社区已经建立了一个标准的多进程处理模块，其中包含了大量的实用程序（utility），它们可以使得用户轻易地生成子进程并能够实现基础的进程间通信原语（communication primitive）。</div>
<div>
		&nbsp;</div>
<div>
		然而，原语的实现使用了与磁盘上持久性（on-disk persistence）相同格式的序列化，这在处理大规模数组时效率不高。所以，PyTorch 将Python 的 multiprocessing 模块扩展为 torch.multiprocessing，这就替代了内置包，并且自动将发送至其他进程的张量数据移动至共享内存中，而不用再通过通信渠道发送。</div>
<div>
		&nbsp;</div>
<div>
		PyTorch 的这一设计极大地提升了性能，并且弱化了进程隔离（process isolation），从而产生了更类似于普通线程程序的编程模型。</div>
</blockquote>
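引文中提到的 torch.multiprocessing 接口与 Python 标准库的 multiprocessing 基本一致。下面用标准库给出一个"多进程并行采样"的最小示意（sample_env、parallel_sample 均为假设的演示函数，用简单计算代替真实的 env.step()，并非 rlpyt 的实际代码）：

```python
import multiprocessing as mp


def sample_env(seed):
    """模拟一个 worker 进程与 environment 交互采样：
    这里用简单的算术代替真实的 env.step()。"""
    return [seed * 10 + t for t in range(3)]  # 假装采了 3 步


def parallel_sample(n_workers):
    # fork 上下文在 Linux 下可直接使用；Windows 下需改用 spawn 并加 __main__ 保护
    ctx = mp.get_context("fork")
    with ctx.Pool(n_workers) as pool:
        # 每个 worker 进程独立地跑 sample_env，结果汇总回主进程
        return pool.map(sample_env, range(n_workers))


trajs = parallel_sample(2)
print(trajs)  # [[0, 1, 2], [10, 11, 12]]
```

torch.multiprocessing 在此之上的增量，主要就是引文所说的：发送给子进程的张量会被自动放入共享内存，省去序列化大数组的开销。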
<div>
	看看就好，想深入了解的话请移步PyTorch文档。<br />
	<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;rlpyt的并行(parallelism)功能的局限<br />
	rlpyt瞄准的是单机上的RL训练效率的极致优化，它不支持多机训练。在单机硬件资源允许的范围内，rlpyt可以让RL模型训练很快，但如果你的训练数据占用的资源远远超过了单机硬件的范围，那么就只能利用支持分布式训练的框架了，例如构建在<span style="background-color: rgb(255, 160, 122);"><a href="https://github.com/ray-project/ray" rel="noopener noreferrer" target="_blank">Ray</a></span>之上的框架<a href="https://ray.readthedocs.io/en/latest/rllib.html" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">RLlib</span></a>，又例如基于PaddlePaddle的<a href="https://github.com/PaddlePaddle/PARL" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">PARL</span></a>等。<br />
	这里值得一提的是，PARL号称它与RLlib进行了IMPALA算法下的<a href="https://www.jiqizhixin.com/articles/2019-04-28-5" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">对比测试</span></a>，其数据吞吐量（同等计算资源下的数据收集速度）足以吊打RLlib，所以PARL看起来是一个有前途的框架。<br />
	感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e5%b9%b6%e8%a1%8cparallelism%e5%8e%9f%e7%90%86%e5%88%9d%e6%8e%a2/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 强化学习的Atari环境下的frame skipping(跳帧)是指什么</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e7%9a%84atari%e7%8e%af%e5%a2%83%e4%b8%8b%e7%9a%84frame-skipping%e8%b7%b3%e5%b8%a7%e6%98%af%e6%8c%87%e4%bb%80%e4%b9%88/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e7%9a%84atari%e7%8e%af%e5%a2%83%e4%b8%8b%e7%9a%84frame-skipping%e8%b7%b3%e5%b8%a7%e6%98%af%e6%8c%87%e4%bb%80%e4%b9%88/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sat, 14 Dec 2019 17:24:26 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Atari]]></category>
		<category><![CDATA[frame skipping]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[强化学习]]></category>
		<category><![CDATA[跳帧]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11387</guid>

					<description><![CDATA[<p>
查看更多强化学习的文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>Atari是强化学习领域最常用的一个游戏实验环境，在很多文章以及代码中，会看到frame skipping（跳帧）这个概念，那么它到底是指什么呢？<br />
<span id="more-11387"></span><br />
使用<a href="https://github.com/mgbellemare/Arcade-Learning-Environment" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">ALE</span></a>接口来实现agent与Atari环境的交互时，Atari环境会返回游戏的每一帧图像作为observation，agent需要为这个observation选择一个action，再让Atari环境去执行这个action。<br />
由于游戏是一个持续不断进行的过程，因此，为了减少运算量，一种叫做 <a href="http://nn.cs.utexas.edu/pub-view.php?PubID=127530" rel="noopener noreferrer" target="_blank"><span style="color:#0000ff;"><span style="background-color:#ffa07a;">frame skipping</span></span></a>（<span style="color:#b22222;">跳帧</span>）的技术被发明出来了，即，原来agent与environment的交互应该是这种画风：<br />
<span style="color:#b22222;">Atari给出一帧图像&#8594;agent选择一个action&#8594;Atari执行该action给出下一帧图像&#8594;agent选择下一个action&#8594;（如此循环下去）</span><br />
现在变成了这种画风：<br />
Atari给出一帧图像&#8594;agent选择一个action&#8594;Atari执行该action给出下一帧图像&#8594;<span style="color:#008000;">agent重复使用上次的action给Atari执行&#8594;Atari执行该action给出下一帧图像</span>&#8594;<span style="color:#008000;">agent重复使用上次的action给Atari执行&#8594;Atari执行该action给出下一帧图像</span>&#8594;（<span style="color:#0000ff;">如此重复N帧</span>）&#8594;agent重新选择一个action&#8594;Atari执行该action给出下一帧图像&#8594;......<br />
注意上面的重复部分，简单地说就是：每经过N帧，agent才会选择一次action，在接下来的N帧内，会重复使用之前最后一次选择的那个action。<br />
为什么要这样做？因为action&#160;selection的过程是一个计算量较大的过程（想像成model的inference过程），而Atari环境向前走一步相对来说是计算量较小的过程，让Atari每走N步才选择一次action的话，可以让玩一次游戏的时间大幅减少，因此agent就能在单位时间内得到更充分的训练。<br />
这种每跳过N帧才选择一次action的技术，就叫&#160;<span style="color: rgb(0, 0, 255);">frame skipping</span>（<span style="color: rgb(178, 34, 34);">跳帧</span>），在很多强化学习框架中，也会看到这个参数的设定。<br />
&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e7%9a%84atari%e7%8e%af%e5%a2%83%e4%b8%8b%e7%9a%84frame-skipping%e8%b7%b3%e5%b8%a7%e6%98%af%e6%8c%87%e4%bb%80%e4%b9%88/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
查看更多强化学习的文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p>Atari是强化学习领域最常用的一个游戏实验环境，在很多文章以及代码中，会看到frame skipping（跳帧）这个概念，那么它到底是指什么呢？<br />
<span id="more-11387"></span><br />
使用<a href="https://github.com/mgbellemare/Arcade-Learning-Environment" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">ALE</span></a>接口来实现agent与Atari环境的交互时，Atari环境会返回游戏的每一帧图像作为observation，agent需要为这个observation选择一个action，再让Atari环境去执行这个action。<br />
由于游戏是一个持续不断进行的过程，因此，为了减少运算量，一种叫做 <a href="http://nn.cs.utexas.edu/pub-view.php?PubID=127530" rel="noopener noreferrer" target="_blank"><span style="color:#0000ff;"><span style="background-color:#ffa07a;">frame skipping</span></span></a>（<span style="color:#b22222;">跳帧</span>）的技术被发明出来了，即，原来agent与environment的交互应该是这种画风：<br />
<span style="color:#b22222;">Atari给出一帧图像&rarr;agent选择一个action&rarr;Atari执行该action给出下一帧图像&rarr;agent选择下一个action&rarr;（如此循环下去）</span><br />
现在变成了这种画风：<br />
Atari给出一帧图像&rarr;agent选择一个action&rarr;Atari执行该action给出下一帧图像&rarr;<span style="color:#008000;">agent重复使用上次的action给Atari执行&rarr;Atari执行该action给出下一帧图像</span>&rarr;<span style="color:#008000;">agent重复使用上次的action给Atari执行&rarr;Atari执行该action给出下一帧图像</span>&rarr;（<span style="color:#0000ff;">如此重复N帧</span>）&rarr;agent重新选择一个action&rarr;Atari执行该action给出下一帧图像&rarr;......<br />
注意上面的重复部分，简单地说就是：每经过N帧，agent才会选择一次action，在接下来的N帧内，会重复使用之前最后一次选择的那个action。<br />
为什么要这样做？因为action&nbsp;selection的过程是一个计算量较大的过程（想像成model的inference过程），而Atari环境向前走一步相对来说是计算量较小的过程，让Atari每走N步才选择一次action的话，可以让玩一次游戏的时间大幅减少，因此agent就能在单位时间内得到更充分的训练。<br />
这种每跳过N帧才选择一次action的技术，就叫&nbsp;<span style="color: rgb(0, 0, 255);">frame skipping</span>（<span style="color: rgb(178, 34, 34);">跳帧</span>），在很多强化学习框架中，也会看到这个参数的设定。<br />
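上述"每 N 帧才选一次 action、中间帧重复上次 action"的逻辑，可以用一段纯 Python 的最小示意来表达（select_action、run_episode 均为假设的演示函数，并非某个框架的真实接口）：

```python
def select_action(frame):
    """模拟开销较大的 action selection（可想象成 model 的 inference）。"""
    return frame % 3  # 假设动作空间大小为 3，仅作演示


def run_episode(num_frames, frame_skip):
    """每隔 frame_skip 帧才重新选择一次 action，其余帧重复上一次的 action。"""
    actions = []
    action = None
    for frame in range(num_frames):
        if frame % frame_skip == 0:   # 只在每 N 帧调用一次"昂贵"的选择
            action = select_action(frame)
        actions.append(action)        # 中间帧直接复用上次的 action
    return actions


print(run_episode(8, 4))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

可以看到 frame_skip=4 时，8 帧里只做了 2 次 action selection，这就是跳帧节省计算量的来源。<br />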
感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e7%9a%84atari%e7%8e%af%e5%a2%83%e4%b8%8b%e7%9a%84frame-skipping%e8%b7%b3%e5%b8%a7%e6%98%af%e6%8c%87%e4%bb%80%e4%b9%88/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[原创] 强化学习框架 rlpyt：如何使用预训练(pre-trained)的model</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e9%a2%84%e8%ae%ad%e7%bb%83pre-trained%e7%9a%84model/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e9%a2%84%e8%ae%ad%e7%bb%83pre-trained%e7%9a%84model/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 11 Dec 2019 08:58:12 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11303</guid>

					<description><![CDATA[<p>
查看关于 rlpyt&#160;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&#160;<br />
本文描述了在 rlpyt 框架下，如何使用一个预训练过的（pre-trained）model作为起点，来训练自己的RL模型的过程。<br />
<span id="more-11303"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;什么是预训练模型<br />
引用<a href="https://cloud.tencent.com/developer/article/1077499" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">一篇文章</span></a>：</p>
<blockquote>
<div>
		简单来说，预训练模型(pre-trained model)是前人为了解决类似问题所创造出来的模型。你在解决问题的时候，不用从零开始训练一个新模型，可以从在类似问题中训练过的模型入手。</div>
<div>
		比如说，如果你想做一辆自动驾驶汽车，可以花数年时间从零开始构建一个性能优良的图像识别算法，也可以从Google在ImageNet数据集上训练得到的inception model(一个预训练模型)起步，来识别图像。</div>
<div>
		一个预训练模型可能对于你的应用中并不是100%的准确对口，但是它可以为你节省大量功夫。</div>
</blockquote>
<div>
	训练一个强化学习模型也可能会需要消耗大量计算资源，尤其是你手上没有强大算力的时候，靠一台普通电脑去train一个model可能会用掉很长时间，因此，在别人已经train好的model的基础上继续train自己的model是一个好办法。<br />
	<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;rlpyt 对预训练模型的支持<br />
	以使用 DQN 算法的 example_1 为例，class DQN(RlAlgorithm) 的&#160;<span style="color:#0000ff;">__init__()</span> 函数有一个&#160;<span style="color:#b22222;">initial_optim_state_dict</span>&#160;参数：
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
initial_optim_state_dict=<span style="color:#cc7832;">None,</span></pre>
<p>	另外，AtariDqnAgent 类的其中一个父类：DqnAgent，它又有一个父类&#160;BaseAgent，在 __init__() 初始化的时候也有一个 <span style="color:#b22222;">initial_model_state_dict</span> 参数：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>ModelCls=<span style="color:#cc7832;">None, </span>model_kwargs=<span style="color:#cc7832;">None, </span>initial_model_state_dict=<span style="color:#cc7832;">None</span>):</pre>
<p>	这两个地方，就是当你使用预训练模型的时候需要传入的参数。<br />
	但<span style="color:#0000ff;">为什么会有两个参数？它们有什么区别？</span><br />
	<span style="color:#ff8c00;">✔</span> 前一个是Optimizer（优化器，例如 torch.optim.Adam）的</p></div>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e9%a2%84%e8%ae%ad%e7%bb%83pre-trained%e7%9a%84model/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
查看关于 rlpyt&nbsp;的更多文章请点击<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">这里</span></a>。</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;是<span style="color: rgb(0, 0, 255);">BAIR</span>(Berkeley Artificial Intelligence Research，伯克利人工智能研究所)开源的一个强化学习(<span style="color: rgb(255, 0, 0);">RL</span>)框架。我之前写了一篇它的<a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">简介</span></a>。&nbsp;<br />
本文描述了在 rlpyt 框架下，如何使用一个预训练过的（pre-trained）model作为起点，来训练自己的RL模型的过程。<br />
<span id="more-11303"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;什么是预训练模型<br />
引用<a href="https://cloud.tencent.com/developer/article/1077499" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">一篇文章</span></a>：</p>
<blockquote>
<div>
		简单来说，预训练模型(pre-trained model)是前人为了解决类似问题所创造出来的模型。你在解决问题的时候，不用从零开始训练一个新模型，可以从在类似问题中训练过的模型入手。</div>
<div>
		比如说，如果你想做一辆自动驾驶汽车，可以花数年时间从零开始构建一个性能优良的图像识别算法，也可以从Google在ImageNet数据集上训练得到的inception model(一个预训练模型)起步，来识别图像。</div>
<div>
		一个预训练模型可能对于你的应用中并不是100%的准确对口，但是它可以为你节省大量功夫。</div>
</blockquote>
<div>
	训练一个强化学习模型也可能会需要消耗大量计算资源，尤其是你手上没有强大算力的时候，靠一台普通电脑去train一个model可能会用掉很长时间，因此，在别人已经train好的model的基础上继续train自己的model是一个好办法。<br />
	<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;rlpyt 对预训练模型的支持<br />
	以使用 DQN 算法的 example_1 为例，class DQN(RlAlgorithm) 的&nbsp;<span style="color:#0000ff;">__init__()</span> 函数有一个&nbsp;<span style="color:#b22222;">initial_optim_state_dict</span>&nbsp;参数：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
initial_optim_state_dict=<span style="color:#cc7832;">None,</span></pre>
<p>	另外，AtariDqnAgent 类的其中一个父类：DqnAgent，它又有一个父类&nbsp;BaseAgent，在 __init__() 初始化的时候也有一个 <span style="color:#b22222;">initial_model_state_dict</span> 参数：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>ModelCls=<span style="color:#cc7832;">None, </span>model_kwargs=<span style="color:#cc7832;">None, </span>initial_model_state_dict=<span style="color:#cc7832;">None</span>):</pre>
<p>	这两个地方，就是当你使用预训练模型的时候需要传入的参数。<br />
	但<span style="color:#0000ff;">为什么会有两个参数？它们有什么区别？</span><br />
	<span style="color:#ff8c00;">✔</span> 前一个是Optimizer（优化器，例如 torch.optim.Adam）的 state_dict，其包含的参数有 learning rate 等。<br />
	<span style="color: rgb(255, 140, 0);">✔</span>&nbsp;后一个是model的 state_dict，其包含的参数有 model 的 weight、bias 等。<br />
	直观点，来个图（图片可放大）：<br />
	<img decoding="async" alt="pre-trained model" src="https://www.codelast.com/wp-content/uploads/2019/12/load_pre_trained_model.png" style="width: 750px; height: 111px;" /><br />
	从图中可以清楚地看到model里存储的数据，<span style="color:#0000ff;">optimizer_state_dict</span> 就是&nbsp;Optimizer 的 state_dict，<span style="color:#0000ff;">agent_state_dict</span> 就是model的 state_dict。<br />
	<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;代码实操：加载预训练模型<br />
	首先我们要有一个预训练模型文件，因此，我们先把没有修改过代码的 example_1 运行一段时间，生成一个&nbsp;params.pkl 模型文件，假设此文件路径为：<span style="color:#b22222;">/home/codelast/rlpyt/data/local/20191111/example_1/run_0/params.pkl</span><br />
	现在修改 example_1.py，可以加载预训练模型了：</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#808080;"># </span><span style="color:#808080;font-family:'AR PL UKai CN';">加载预训练模型
</span>model_loaded = torch.load(<span style="color:#6a8759;">&#39;/home/codelast/rlpyt/data/local/20191111/example_1/run_0/params.pkl&#39;</span>)
optimizer_state_dict = model_loaded[<span style="color:#6a8759;">&#39;optimizer_state_dict&#39;</span>]
agent_state_dict = model_loaded[<span style="color:#6a8759;">&#39;agent_state_dict&#39;</span>]

algo = DQN(<span style="color:#aa4926;">min_steps_learn</span>=<span style="color:#6897bb;">1e3</span><span style="color:#cc7832;">, </span><span style="color:#aa4926;">initial_optim_state_dict</span>=optimizer_state_dict)
agent = AtariDqnAgent(<span style="color:#aa4926;">initial_model_state_dict</span>=agent_state_dict[<span style="color:#6a8759;">&#39;model&#39;</span>])</pre>
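torch.save()/torch.load() 底层基于 pickle，params.pkl 本质上就是一个序列化的 dict。为了不依赖 PyTorch，下面用标准库 pickle 模拟它的保存与读取过程（字段名取自上文，具体数值均为假设，仅作结构示意）：

```python
import os
import pickle
import tempfile

# 用一个普通 dict 模拟 params.pkl 中保存的内容结构（数值为假设）
checkpoint = {
    "optimizer_state_dict": {"lr": 2.5e-4},
    "agent_state_dict": {"model": {"weight": [0.1], "bias": [0.0]}},
}

path = os.path.join(tempfile.mkdtemp(), "params.pkl")
with open(path, "wb") as f:
    pickle.dump(checkpoint, f)   # 对应训练过程中保存 checkpoint

with open(path, "rb") as f:
    loaded = pickle.load(f)      # 对应上文的 torch.load(...)

# 按 key 取出两份 state_dict，分别传给 DQN 和 AtariDqnAgent
optimizer_state_dict = loaded["optimizer_state_dict"]
agent_state_dict = loaded["agent_state_dict"]
print(sorted(loaded.keys()))  # ['agent_state_dict', 'optimizer_state_dict']
```

注意真实的 params.pkl 由 torch.save 写入，须用 torch.load 读取；这里只是用 pickle 演示"一个 dict、两份 state_dict"的结构。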
<p>	其他代码无需修改，就这么简单！<br />
	再重新运行修改过的example，现在就已经是在pre-trained model的基础上继续进行的训练了。<br />
	感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bd%bf%e7%94%a8%e9%a2%84%e8%ae%ad%e7%bb%83pre-trained%e7%9a%84model/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] The RL framework rlpyt: how to save every model produced during training</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bf%9d%e5%ad%98%e8%ae%ad%e7%bb%83%e8%bf%87%e7%a8%8b%e4%b8%ad%e7%9a%84%e6%89%80%e6%9c%89mo/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bf%9d%e5%ad%98%e8%ae%ad%e7%bb%83%e8%bf%87%e7%a8%8b%e4%b8%ad%e7%9a%84%e6%89%80%e6%9c%89mo/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 11 Dec 2019 06:24:26 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11293</guid>

					<description><![CDATA[<p>
For more articles about rlpyt, click&#160;<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&#160;<br />
This post describes how to save every model produced during iterative training, and the logic behind it.<br />
<span id="more-11293"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Can every model produced during training be saved?<br />
Of course. Take example_1 as an example; it contains the following code:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">with </span>logger_context(log_dir<span style="color:#cc7832;">, </span>run_ID<span style="color:#cc7832;">, </span>name<span style="color:#cc7832;">, </span>config<span style="color:#cc7832;">, </span><span style="color:#aa4926;">snapshot_mode</span>=<span style="color:#6a8759;">&#34;last&#34;</span>):
    runner.train()</pre>
<p>Simply change&#160;snapshot_mode=&#34;last&#34;&#160;to&#160;snapshot_mode=&#34;<span style="color:#0000ff;">all</span>&#34;, and every model produced during the iterations will be saved to disk.<br />
&#34;last&#34; means only the model file from the final iteration is saved.<br />
<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bf%9d%e5%ad%98%e8%ae%ad%e7%bb%83%e8%bf%87%e7%a8%8b%e4%b8%ad%e7%9a%84%e6%89%80%e6%9c%89mo/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt, click&nbsp;<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&nbsp;<br />
This post describes how to save every model produced during iterative training, and the logic behind it.<br />
<span id="more-11293"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Can every model produced during training be saved?<br />
Of course. Take example_1 as an example; it contains the following code:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">with </span>logger_context(log_dir<span style="color:#cc7832;">, </span>run_ID<span style="color:#cc7832;">, </span>name<span style="color:#cc7832;">, </span>config<span style="color:#cc7832;">, </span><span style="color:#aa4926;">snapshot_mode</span>=<span style="color:#6a8759;">&quot;last&quot;</span>):
    runner.train()</pre>
<p>Simply change&nbsp;snapshot_mode=&quot;last&quot;&nbsp;to&nbsp;snapshot_mode=&quot;<span style="color:#0000ff;">all</span>&quot;, and every model produced during the iterations will be saved to disk.<br />
&quot;last&quot; means only the model file from the final iteration is saved.<br />
<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;The model-saving logic<br />
Models are written to disk by the&nbsp;save_itr_params() function in logger.py:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">save_itr_params</span>(itr<span style="color:#cc7832;">, </span>params):
    <span style="color:#cc7832;">if </span>_snapshot_dir:
        <span style="color:#cc7832;">if </span>_snapshot_mode == <span style="color:#6a8759;">&#39;all&#39;</span>:
            file_name = osp.join(get_snapshot_dir()<span style="color:#cc7832;">, </span><span style="color:#6a8759;">&#39;itr_%d.pkl&#39; </span>% itr)
        <span style="color:#cc7832;">elif </span>_snapshot_mode == <span style="color:#6a8759;">&#39;last&#39;</span>:
            <span style="color:#808080;"># override previous params
</span><span style="color:#808080;">            </span>file_name = osp.join(get_snapshot_dir()<span style="color:#cc7832;">, </span><span style="color:#6a8759;">&#39;params.pkl&#39;</span>)
        <span style="color:#cc7832;">elif </span>_snapshot_mode == <span style="color:#6a8759;">&quot;gap&quot;</span>:
            <span style="color:#cc7832;">if </span>itr == <span style="color:#6897bb;">0 </span><span style="color:#cc7832;">or </span>(itr + <span style="color:#6897bb;">1</span>) % _snapshot_gap == <span style="color:#6897bb;">0</span>:
                file_name = osp.join(get_snapshot_dir()<span style="color:#cc7832;">, </span><span style="color:#6a8759;">&#39;itr_%d.pkl&#39; </span>% itr)
            <span style="color:#cc7832;">else</span>:
                <span style="color:#cc7832;">return
</span><span style="color:#cc7832;">        elif </span>_snapshot_mode == <span style="color:#6a8759;">&#39;none&#39;</span>:
            <span style="color:#cc7832;">return
</span><span style="color:#cc7832;">        else</span>:
            <span style="color:#cc7832;">raise </span><span style="color:#8888c6;">NotImplementedError
</span><span style="color:#8888c6;">        </span>torch.save(params<span style="color:#cc7832;">, </span>file_name)  <span style="color:#808080;"># save the model parameters to a file</span></pre>
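The branching above can be reproduced in isolation. A small pure-Python sketch of the file-naming decision (the function name is mine, not rlpyt's; the real save_itr_params() also performs the torch.save() call):

```python
import os.path as osp

def snapshot_file_name(snapshot_dir, mode, itr, snapshot_gap=10):
    """Sketch of the file-naming branch in rlpyt's save_itr_params().

    Returns the target file path, or None when nothing should be saved.
    (Hypothetical helper; the real function also writes the file.)
    """
    if mode == 'all':
        # one file per iteration
        return osp.join(snapshot_dir, 'itr_%d.pkl' % itr)
    elif mode == 'last':
        # same name every time, so the previous params are overwritten
        return osp.join(snapshot_dir, 'params.pkl')
    elif mode == 'gap':
        # save the first iteration and then every snapshot_gap-th one
        if itr == 0 or (itr + 1) % snapshot_gap == 0:
            return osp.join(snapshot_dir, 'itr_%d.pkl' % itr)
        return None
    elif mode == 'none':
        return None
    raise NotImplementedError(mode)
```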
<p>It uses the&nbsp;<span style="color:#0000ff;">_snapshot_mode</span>&nbsp;variable to control the saving logic:<br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;all: save a model file for every iteration.<br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;last: save only the final iteration's model file.<br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;gap: save a model file every N iterations; N can be set via the <span style="color:#b22222;">logger.set_snapshot_gap()</span> function.<br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;none: save no model files at all.<br />
<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
And&nbsp;<span style="color: rgb(0, 0, 255);">_snapshot_mode</span>&nbsp;is ultimately set by the <span style="color:#b22222;">snapshot_mode</span> argument of the logger_context() function.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
Reproduction must credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e4%bf%9d%e5%ad%98%e8%ae%ad%e7%bb%83%e8%bf%87%e7%a8%8b%e4%b8%ad%e7%9a%84%e6%89%80%e6%9c%89mo/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] The RL framework rlpyt: how to output gaussian and categorical actions at the same time</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e5%90%8c%e6%97%b6%e8%be%93%e5%87%bagaussian%ef%bc%88%e9%ab%98%e6%96%af%ef%bc%89%e5%92%8ccat/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e5%90%8c%e6%97%b6%e8%be%93%e5%87%bagaussian%ef%bc%88%e9%ab%98%e6%96%af%ef%bc%89%e5%92%8ccat/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 11 Dec 2019 03:18:01 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11277</guid>

					<description><![CDATA[<p>
For more articles about rlpyt, click&#160;<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&#160;<br />
This post records problems raised in some rlpyt issues, along with their solutions.<br />
<span id="more-11277"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Outputting both gaussian and categorical actions<br />
<span style="color:#6699ff;">✔</span> Issue link: <a href="https://github.com/astooke/rlpyt/issues/39" rel="noopener noreferrer" target="_blank">here</a><br />
<span style="color: rgb(102, 153, 255);">✔</span>&#160;Problem: normally an action is either a gaussian distribution or a categorical value. How can the two be mixed, i.e. how can gaussian and categorical actions be output at the same time?<br />
<span style="color: rgb(102, 153, 255);">✔</span>&#160;My understanding: a gaussian action means the policy network outputs a probability distribution over actions rather than one specific action (e.g. a 70% chance of choosing action 1 and a 30% chance of choosing action 2), and a concrete action is then sampled from that distribution. In a Python program, for example, you might use the&#160;np.random.choice(a, size=None, replace=True, p=None) function to sample an action from a given distribution, with the probabilities passed via the p parameter.<br />
A categorical action, by contrast, means the policy network directly outputs one specific action, such as action 1 or action 2, rather than handing out probabilities for the caller to sample from.<br />
The goal of this issue is an &#8220;unconventional&#8221; usage in which the output action has both properties at once; I cannot think of a concrete example to illustrate it.<br />
<span style="color: rgb(102, 153, 255);">✔</span>&#160;Solution: rlpyt has a&#160;<span style="color:#b22222;">Composite</span> action space:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">class </span>Composite(Space):</pre>
<p>Implement two action spaces, one gaussian and one categorical, then wrap them inside a Composite action space. The action space that interacts with the environment is this Composite action space.<br />
<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e5%90%8c%e6%97%b6%e8%be%93%e5%87%bagaussian%ef%bc%88%e9%ab%98%e6%96%af%ef%bc%89%e5%92%8ccat/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt, click&nbsp;<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&nbsp;<br />
This post records problems raised in some rlpyt issues, along with their solutions.<br />
<span id="more-11277"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Outputting both gaussian and categorical actions<br />
<span style="color:#6699ff;">✔</span> Issue link: <a href="https://github.com/astooke/rlpyt/issues/39" rel="noopener noreferrer" target="_blank">here</a><br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;Problem: normally an action is either a gaussian distribution or a categorical value. How can the two be mixed, i.e. how can gaussian and categorical actions be output at the same time?<br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;My understanding: a gaussian action means the policy network outputs a probability distribution over actions rather than one specific action (e.g. a 70% chance of choosing action 1 and a 30% chance of choosing action 2), and a concrete action is then sampled from that distribution. In a Python program, for example, you might use the&nbsp;np.random.choice(a, size=None, replace=True, p=None) function to sample an action from a given distribution, with the probabilities passed via the p parameter.<br />
A categorical action, by contrast, means the policy network directly outputs one specific action, such as action 1 or action 2, rather than handing out probabilities for the caller to sample from.<br />
The goal of this issue is an &ldquo;unconventional&rdquo; usage in which the output action has both properties at once; I cannot think of a concrete example to illustrate it.<br />
<span style="color: rgb(102, 153, 255);">✔</span>&nbsp;Solution: rlpyt has a&nbsp;<span style="color:#b22222;">Composite</span> action space:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">class </span>Composite(Space):</pre>
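For a Composite action space built from independent gaussian and categorical parts, the log probability of a composite action is simply the sum of the component log probabilities. A minimal pure-Python sketch of that computation (function names are mine, not rlpyt's):

```python
import math

def gaussian_log_prob(x, mean, std):
    # log density of a 1-D Gaussian N(mean, std^2) evaluated at x
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)

def categorical_log_prob(index, probs):
    # log probability of the chosen discrete action
    return math.log(probs[index])

def composite_log_prob(cont, mean, std, disc, probs):
    # Independent components: the joint log probability is the sum,
    # which is what a custom distribution for a Composite space must compute.
    return gaussian_log_prob(cont, mean, std) + categorical_log_prob(disc, probs)
```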
<p>Implement two action spaces, one gaussian and one categorical, then wrap them inside a Composite action space. The action space that interacts with the environment is this Composite action space.<br />
<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
Note, however, that algorithms (e.g. PPO) do not support a Composite action space, so the algorithm class also needs a small modification: define a separate <span style="color:#0000ff;">distribution</span> class that can compute the log probability of a Composite action. Taking PPO as an example again, it uses self.agent.distribution to select actions from the action space; replace that distribution with the custom distribution class and you are done.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
Reproduction must credit the source: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt%ef%bc%9a%e5%a6%82%e4%bd%95%e5%90%8c%e6%97%b6%e8%be%93%e5%87%bagaussian%ef%bc%88%e9%ab%98%e6%96%af%ef%bc%89%e5%92%8ccat/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] The RL framework rlpyt, source-code analysis: (6) when do model metrics change from nan to meaningful values</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a6-%e6%a8%a1%e5%9e%8b%e6%8c%87%e6%a0%87%e4%bb%80%e4%b9%88%e6%97%b6/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a6-%e6%a8%a1%e5%9e%8b%e6%8c%87%e6%a0%87%e4%bb%80%e4%b9%88%e6%97%b6/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sun, 08 Dec 2019 14:32:42 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11252</guid>

					<description><![CDATA[<p>
For more articles about rlpyt, click&#160;<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&#160;If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This post analyzes part of that source code through an example bundled with rlpyt; I hope it helps at least a few readers.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;A question raised by the training log<br />
Take example_1&#160;as an example: during training, the program keeps printing logs like the following (excerpt):<br />
<span id="more-11252"></span></p>
<div>
<blockquote>
<div>
			2019-11-08 20:38:42.067188&#160; &#124; StepsInEval&#160; &#160; &#160; &#160; &#160; &#160; &#160; 3796</div>
<div>
			2019-11-08 20:38:42.067216&#160; &#124; TrajsInEval&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;5</div>
<div>
			2019-11-08 20:38:42.067240&#160; &#124; CumEvalTime&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 23.1265</div>
<div>
			2019-11-08 20:38:42.067276&#160; &#124; CumTrainTime&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 2.64641</div>
<div>
			2019-11-08 20:38:42.067297&#160; &#124; Iteration&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;249</div>
<div>
			2019-11-08 20:38:42.067315&#160; &#124; CumTime (s)&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 25.7729</div>
<div>
			2019-11-08 20:38:42.067333&#160; &#124; CumSteps&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;1000</div>
<div>
			2019-11-08 20:38:42.067350&#160; &#124; CumCompletedTrajs&#160; &#160; &#160; &#160; &#160; &#160;1</div>
<div>
			2019-11-08 20:38:42.067368&#160; &#124; CumUpdates&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 0</div>
<div>
			2019-11-08 20:38:42.067385&#160; &#124; StepsPerSecond&#160; &#160; &#160; &#160; &#160; &#160; 386.079</div>
<div>
			2019-11-08 20:38:42.067402&#160; &#124; UpdatesPerSecond&#160; &#160; &#160; &#160; &#160; &#160; 0</div>
<div>
			2019-11-08 20:38:42.067419&#160; &#124; ReplayRatio&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;0</div>
<div>
			2019-11-08 20:38:42.067436&#160; &#124; CumReplayRatio&#160; &#160; &#160; &#160; &#160; &#160; &#160; 0</div>
<div>
			2019-11-08 20:38:42.067453&#160; &#124; LengthAverage&#160; &#160; &#160; &#160; &#160; &#160; &#160;759.2</div>
<div>
			2019-11-08 20:38:42.067480&#160; &#124; LengthStd&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;1.16619</div>
<div>
			2019-11-08 20:38:42.067499&#160; &#124; LengthMedian&#160; &#160; &#160; &#160; &#160; &#160; &#160; 759</div>
<div>
			2019-11-08 20:38:42.067516&#160; &#124; LengthMin&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;758</div>
<div>
			2019-11-08 20:38:42.067533&#160; &#124; LengthMax&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;761</div>
<div>
			2019-11-08 20:38:42.067550&#160; &#124; ReturnAverage&#160; &#160; &#160; &#160; &#160; &#160; &#160;-21</div>
<div>
			2019-11-08 20:38:42.067567&#160; &#124; ReturnStd&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;0</div>
<div>
			2019-11-08 20:38:42.067584&#160; &#124; ReturnMedian&#160; &#160; &#160; &#160; &#160; &#160; &#160; -21</div>
<div>
			2019-11-08 20:38:42.067601&#160; &#124; ReturnMin&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;-21</div>
<div>
			2019-11-08 20:38:42.067618&#160; &#124; ReturnMax&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;-21</div>
<div>
			2019-11-08 20:38:42.067635&#160; &#124; NonzeroRewardsAverage&#160; &#160; &#160; 21</div>
<div>
			2019-11-08 20:38:42.067652&#160; &#124; NonzeroRewardsStd&#160; &#160; &#160; &#160; &#160; &#160;0</div>
<div>
			2019-11-08 20:38:42.067669&#160; &#124; NonzeroRewardsMedian&#160; &#160; &#160; &#160;21</div>
<div>
			2019-11-08 20:38:42.067686&#160; &#124; NonzeroRewardsMin&#160; &#160; &#160; &#160; &#160; 21</div>
<div>
			2019-11-08 20:38:42.067703&#160; &#124; NonzeroRewardsMax&#160; &#160; &#160; &#160; &#160; 21</div>
<div>
			2019-11-08 20:38:42.067720&#160; &#124; DiscountedReturnAverage&#160; &#160; -1.87771</div>
<div>
			2019-11-08 20:38:42.067737&#160; &#124; DiscountedReturnStd&#160; &#160; &#160; &#160; &#160;0.0219605</div>
<div>
			2019-11-08 20:38:42.067754&#160; &#124; DiscountedReturnMedian&#160; &#160; &#160;-1.88136</div>
<div>
			2019-11-08 20:38:42.067771&#160; &#124; DiscountedReturnMin&#160; &#160; &#160; &#160; -1.90036</div>
<div>
			2019-11-08 20:38:42.067788&#160; &#124; DiscountedReturnMax&#160; &#160; &#160; &#160; -1.84392</div>
<div>
			2019-11-08 20:38:42.067805&#160; &#124; lossAverage&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067822&#160; &#124; lossStd&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067839&#160; &#124; lossMedian&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; nan</div>
<div>
			2019-11-08 20:38:42.067856&#160; &#124; lossMin&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067873&#160; &#124; lossMax&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067890&#160; &#124; gradNormAverage&#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067907&#160; &#124; gradNormStd&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067924&#160; &#124; gradNormMedian&#160; &#160; &#160; &#160; &#160; &#160; nan</div>
<div>
			2019-11-08 20:38:42.067941&#160; &#124; gradNormMin&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067958&#160; &#124; gradNormMax&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067975&#160; &#124; tdAbsErrAverage&#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.067992&#160; &#124; tdAbsErrStd&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.068009&#160; &#124; tdAbsErrMedian&#160; &#160; &#160; &#160; &#160; &#160; nan</div>
<div>
			2019-11-08 20:38:42.068026&#160; &#124; tdAbsErrMin&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
<div>
			2019-11-08 20:38:42.068043&#160; &#124; tdAbsErrMax&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160;nan</div>
</blockquote>
</div>
<div>
	<span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></div>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a6-%e6%a8%a1%e5%9e%8b%e6%8c%87%e6%a0%87%e4%bb%80%e4%b9%88%e6%97%b6/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt, click&nbsp;<a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;is an open-source reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework from <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&nbsp;If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This post analyzes part of that source code through an example bundled with rlpyt; I hope it helps at least a few readers.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;A question raised by the training log<br />
Take example_1&nbsp;as an example: during training, the program keeps printing logs like the following (excerpt):<br />
<span id="more-11252"></span></p>
<div>
<blockquote>
<div>
			2019-11-08 20:38:42.067188&nbsp; | StepsInEval&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3796</div>
<div>
			2019-11-08 20:38:42.067216&nbsp; | TrajsInEval&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;5</div>
<div>
			2019-11-08 20:38:42.067240&nbsp; | CumEvalTime&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 23.1265</div>
<div>
			2019-11-08 20:38:42.067276&nbsp; | CumTrainTime&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 2.64641</div>
<div>
			2019-11-08 20:38:42.067297&nbsp; | Iteration&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;249</div>
<div>
			2019-11-08 20:38:42.067315&nbsp; | CumTime (s)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 25.7729</div>
<div>
			2019-11-08 20:38:42.067333&nbsp; | CumSteps&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1000</div>
<div>
			2019-11-08 20:38:42.067350&nbsp; | CumCompletedTrajs&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1</div>
<div>
			2019-11-08 20:38:42.067368&nbsp; | CumUpdates&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0</div>
<div>
			2019-11-08 20:38:42.067385&nbsp; | StepsPerSecond&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 386.079</div>
<div>
			2019-11-08 20:38:42.067402&nbsp; | UpdatesPerSecond&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0</div>
<div>
			2019-11-08 20:38:42.067419&nbsp; | ReplayRatio&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0</div>
<div>
			2019-11-08 20:38:42.067436&nbsp; | CumReplayRatio&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0</div>
<div>
			2019-11-08 20:38:42.067453&nbsp; | LengthAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;759.2</div>
<div>
			2019-11-08 20:38:42.067480&nbsp; | LengthStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1.16619</div>
<div>
			2019-11-08 20:38:42.067499&nbsp; | LengthMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 759</div>
<div>
			2019-11-08 20:38:42.067516&nbsp; | LengthMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;758</div>
<div>
			2019-11-08 20:38:42.067533&nbsp; | LengthMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;761</div>
<div>
			2019-11-08 20:38:42.067550&nbsp; | ReturnAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-21</div>
<div>
			2019-11-08 20:38:42.067567&nbsp; | ReturnStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0</div>
<div>
			2019-11-08 20:38:42.067584&nbsp; | ReturnMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; -21</div>
<div>
			2019-11-08 20:38:42.067601&nbsp; | ReturnMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-21</div>
<div>
			2019-11-08 20:38:42.067618&nbsp; | ReturnMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;-21</div>
<div>
			2019-11-08 20:38:42.067635&nbsp; | NonzeroRewardsAverage&nbsp; &nbsp; &nbsp; 21</div>
<div>
			2019-11-08 20:38:42.067652&nbsp; | NonzeroRewardsStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0</div>
<div>
			2019-11-08 20:38:42.067669&nbsp; | NonzeroRewardsMedian&nbsp; &nbsp; &nbsp; &nbsp;21</div>
<div>
			2019-11-08 20:38:42.067686&nbsp; | NonzeroRewardsMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 21</div>
<div>
			2019-11-08 20:38:42.067703&nbsp; | NonzeroRewardsMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 21</div>
<div>
			2019-11-08 20:38:42.067720&nbsp; | DiscountedReturnAverage&nbsp; &nbsp; -1.87771</div>
<div>
			2019-11-08 20:38:42.067737&nbsp; | DiscountedReturnStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0219605</div>
<div>
			2019-11-08 20:38:42.067754&nbsp; | DiscountedReturnMedian&nbsp; &nbsp; &nbsp;-1.88136</div>
<div>
			2019-11-08 20:38:42.067771&nbsp; | DiscountedReturnMin&nbsp; &nbsp; &nbsp; &nbsp; -1.90036</div>
<div>
			2019-11-08 20:38:42.067788&nbsp; | DiscountedReturnMax&nbsp; &nbsp; &nbsp; &nbsp; -1.84392</div>
<div>
			2019-11-08 20:38:42.067805&nbsp; | lossAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067822&nbsp; | lossStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067839&nbsp; | lossMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; nan</div>
<div>
			2019-11-08 20:38:42.067856&nbsp; | lossMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067873&nbsp; | lossMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067890&nbsp; | gradNormAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067907&nbsp; | gradNormStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067924&nbsp; | gradNormMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; nan</div>
<div>
			2019-11-08 20:38:42.067941&nbsp; | gradNormMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067958&nbsp; | gradNormMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067975&nbsp; | tdAbsErrAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.067992&nbsp; | tdAbsErrStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.068009&nbsp; | tdAbsErrMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; nan</div>
<div>
			2019-11-08 20:38:42.068026&nbsp; | tdAbsErrMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
<div>
			2019-11-08 20:38:42.068043&nbsp; | tdAbsErrMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;nan</div>
</blockquote>
</div>
<div>
	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	A close look shows that the last several model metrics are all &ldquo;<span style="color:#0000ff;">nan</span>&rdquo;. After training for a while, these values turn into meaningful numbers, for example:</p>
<blockquote>
<div>
			2019-11-08 20:40:40.941580&nbsp; | lossAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0129165</div>
<div>
			2019-11-08 20:40:40.941597&nbsp; | lossStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0137061</div>
<div>
			2019-11-08 20:40:40.941614&nbsp; | lossMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.0150348</div>
<div>
			2019-11-08 20:40:40.941631&nbsp; | lossMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.000105323</div>
<div>
			2019-11-08 20:40:40.941648&nbsp; | lossMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0602407</div>
<div>
			2019-11-08 20:40:40.941665&nbsp; | gradNormAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0283939</div>
<div>
			2019-11-08 20:40:40.941682&nbsp; | gradNormStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0168219</div>
<div>
			2019-11-08 20:40:40.941699&nbsp; | gradNormMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.0301482</div>
<div>
			2019-11-08 20:40:40.941716&nbsp; | gradNormMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.00661218</div>
<div>
			2019-11-08 20:40:40.941732&nbsp; | gradNormMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.086334</div>
<div>
			2019-11-08 20:40:40.941749&nbsp; | tdAbsErrAverage&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.0529054</div>
<div>
			2019-11-08 20:40:40.941766&nbsp; | tdAbsErrStd&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;0.168416</div>
<div>
			2019-11-08 20:40:40.941783&nbsp; | tdAbsErrMedian&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 0.0233203</div>
<div>
			2019-11-08 20:40:40.941800&nbsp; | tdAbsErrMin&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;8.33329e-05</div>
<div>
			2019-11-08 20:40:40.941817&nbsp; | tdAbsErrMax&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;1</div>
</blockquote>
<div>
		So when do these values change from &ldquo;<span style="color:#0000ff;">nan</span>&rdquo; into meaningful numbers? Why can they not be obtained shortly after training starts? In theory, once training has begun these numbers exist, however wildly wrong they may be, so they should not be &ldquo;<span style="color:#0000ff;">nan</span>&rdquo;, right? So why does &ldquo;<span style="color:#0000ff;">nan</span>&rdquo; show up here?<br />
		<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Where the nan log lines are written<br />
		To answer that question, we have to find the source: the code that prints the &ldquo;nan&rdquo; log lines. The &ldquo;nan&rdquo; entries shown above are written by the&nbsp;record_tabular_misc_stat()&nbsp;function in&nbsp;<span style="color:#b22222;">rlpyt/utils/logging/logger.py</span>:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">record_tabular_misc_stat</span>(key<span style="color:#cc7832;">, </span>values<span style="color:#cc7832;">, </span>placement=<span style="color:#008080;">&#39;back&#39;</span>):
    <span style="color:#cc7832;font-weight:bold;">if </span>placement == <span style="color:#008080;">&#39;front&#39;</span>:
        prefix = <span style="color:#008080;">&quot;&quot;
</span><span style="color:#008080;">        </span>suffix = key
    <span style="color:#cc7832;font-weight:bold;">else</span>:
        prefix = key
        suffix = <span style="color:#008080;">&quot;&quot;
</span><span style="color:#008080;">    </span><span style="color:#cc7832;font-weight:bold;">if </span><span style="color:#8888c6;">len</span>(values) &gt; <span style="color:#6897bb;">0</span>:
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Average&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.<span style="color:#cc7833;">average</span>(values))
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Std&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.<span style="color:#cc7833;">std</span>(values))
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Median&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.<span style="color:#cc7833;">median</span>(values))
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Min&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.<span style="color:#cc7833;">min</span>(values))
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Max&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.<span style="color:#cc7833;">max</span>(values))
    <span style="color:#cc7832;font-weight:bold;">else</span>:
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Average&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.nan)
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Std&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.nan)
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Median&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.nan)
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Min&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.nan)
        <span style="color:#cc7833;">record_tabular</span>(prefix + <span style="color:#008080;">&quot;Max&quot; </span>+ suffix<span style="color:#cc7832;">, </span>np.nan)</pre>
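<p>The fallback branch above is what produces the nan rows. A minimal, self-contained sketch of the same behavior (the summarize() helper below is a hypothetical simplification, not rlpyt code):</p>

```python
import numpy as np

# Hypothetical, simplified version of record_tabular_misc_stat(): when the
# list of values is empty, every statistic is recorded as np.nan.
def summarize(key, values):
    if len(values) > 0:
        return {key + "Average": np.average(values),
                key + "Std": np.std(values),
                key + "Median": np.median(values),
                key + "Min": np.min(values),
                key + "Max": np.max(values)}
    # No data collected yet -> every statistic degenerates to nan.
    return {key + s: np.nan
            for s in ("Average", "Std", "Median", "Min", "Max")}

print(summarize("loss", []))          # all values are nan, like the early logs
print(summarize("loss", [0.1, 0.3]))  # real statistics once data exists
```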
<p>		This function computes statistics for certain model metrics, all of which share one trait: they can be summarized by an <span style="color:#ff0000;">average</span>, a <span style="color:#ff0000;">standard deviation</span>, and similar statistics. What does that mean? For example, the metric &ldquo;CumTrainTime&rdquo; (cumulative training time) has no notion of an &ldquo;average&rdquo;, whereas a metric like <span style="color:#0000ff;">loss</span> (the value of the loss function) can meaningfully be averaged over multiple training iterations.<br />
		And there is more than one metric like loss. To simplify the code, the metric names are assembled by concatenation: names in the log such as &quot;lossAverage&quot; and &quot;gradNormAverage&quot; are built from parts rather than hard-coded, as the code above shows.<br />
		As that code also shows, when the &ldquo;<span style="color:#0000ff;">values</span>&rdquo; argument is empty, the recorded metrics come out as &ldquo;<span style="color:#0000ff;">nan</span>&rdquo;.<br />
		So the question now becomes: when is the &ldquo;<span style="color:#0000ff;">values</span>&rdquo; argument empty?<br />
		<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;How MinibatchRlEval, the logger's caller, updates the metrics<br />
		The runner used by example_1 is&nbsp;MinibatchRlEval, and it is the caller of the&nbsp;logger. The MinibatchRlEval.train()&nbsp;function defines the model's training and evaluation workflow.<br />
		This line of code:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
opt_info = <span style="color:#94558d;">self</span>.algo.<span style="color:#cc7833;">optimize_agent</span>(itr<span style="color:#cc7832;">, </span>samples)</pre>
<p>		collects loss and other quantities into the opt_info&nbsp;object, while this line:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.<span style="color:#cc7833;">store_diagnostics</span>(itr<span style="color:#cc7832;">, </span>traj_infos<span style="color:#cc7832;">, </span>opt_info)</pre>
<p>		stores opt_info&nbsp;in memory. Finally, this line:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.<span style="color:#cc7833;">log_diagnostics</span>(itr<span style="color:#cc7832;">, </span>eval_traj_infos<span style="color:#cc7832;">, </span>eval_time)</pre>
<p>		writes the in-memory information to the log and prints it to the screen.</p>
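<p>Put together, the flow can be sketched in a few lines of plain Python (the Toy* classes below are illustrative stand-ins, not rlpyt's real runner and algorithm):</p>

```python
# Toy stand-ins showing how optimize_agent()'s return value flows into the
# stored diagnostics: early iterations contribute only empty statistics.
class ToyAlgo:
    min_itr_learn = 2
    def optimize_agent(self, itr, samples):
        if itr < self.min_itr_learn:
            return []            # no statistics yet -> logger will show nan
        return [float(itr)]      # pretend this is the list of loss values

class ToyRunner:
    def __init__(self, algo):
        self.algo = algo
        self.history = []        # what store_diagnostics() accumulates
    def train(self, n_itr):
        for itr in range(n_itr):
            samples = [itr]                                   # fake sampler output
            opt_info = self.algo.optimize_agent(itr, samples)
            self.history.append(opt_info)                     # store_diagnostics
        return self.history

print(ToyRunner(ToyAlgo()).train(4))  # [[], [], [2.0], [3.0]]
```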
<p>		So we only need to understand how&nbsp;self.algo.optimize_agent()&nbsp;builds the opt_info it returns, and we will know under what conditions loss&nbsp;and the other metrics are &ldquo;<span style="color:#0000ff;">nan</span>&rdquo;.<br />
		<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;The root cause: how the algorithm class updates the metrics<br />
		The algorithm class used by example_1&nbsp;is:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">DQN</span>(RlAlgorithm):</pre>
<p>		Its optimize_agent()&nbsp;function contains this code:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
opt_info = <span style="color:#cc7833;">OptInfo</span>(*([] <span style="color:#cc7832;font-weight:bold;">for </span>_ <span style="color:#cc7832;font-weight:bold;">in </span><span style="color:#8888c6;">range</span>(<span style="color:#8888c6;">len</span>(OptInfo._fields))))
<span style="color:#cc7832;font-weight:bold;">if </span>itr &lt; <span style="color:#94558d;">self</span>.min_itr_learn:
    <span style="color:#cc7832;font-weight:bold;">return </span>opt_info</pre>
<p>		Here, opt_info&nbsp;is simply a namedtuple&nbsp;whose fields are all empty lists:</p>
<blockquote>
<p>
				OptInfo(loss=[], gradNorm=[], tdAbsErr=[])</p>
</blockquote>
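<p>That expression is easy to verify in isolation (the field names below are taken from the log output; the construction mirrors the line of code above):</p>

```python
from collections import namedtuple

# Stand-in for rlpyt's OptInfo namedtuple, with the fields seen in the log.
OptInfo = namedtuple("OptInfo", ["loss", "gradNorm", "tdAbsErr"])

# One fresh empty list per field, exactly like the line in optimize_agent():
opt_info = OptInfo(*([] for _ in range(len(OptInfo._fields))))
print(opt_info)  # OptInfo(loss=[], gradNorm=[], tdAbsErr=[])
```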
<p>		The answer is now obvious: whenever the current training iteration count is &lt; self.min_itr_learn, loss&nbsp;and the other model metrics come out as &ldquo;<span style="color:#0000ff;">nan</span>&rdquo;.<br />
		self.min_itr_learn&nbsp;is initialized in the DQN.initialize()&nbsp;function:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.min_itr_learn = <span style="color:#8888c6;">int</span>(<span style="color:#94558d;">self</span>.min_steps_learn // sampler_bs)</pre>
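<p>To make the arithmetic concrete (the sampler batch size below is an assumed example value; in rlpyt it is the number of environment steps gathered per iteration):</p>

```python
# Assumed example values: min_steps_learn as in example_1, and a sampler
# that gathers 4 time steps from 8 parallel environments per iteration.
min_steps_learn = 1e3
sampler_bs = 4 * 8                                  # 32 env steps per iteration
min_itr_learn = int(min_steps_learn // sampler_bs)
print(min_itr_learn)  # 31 -> the first 31 iterations return empty statistics
```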
<p>		Never mind the slightly odd-looking arithmetic; just know that the larger&nbsp;self.min_steps_learn&nbsp;is, the more times &ldquo;<span style="color:#0000ff;">nan</span>&rdquo; gets printed.<br />
		The self.min_steps_learn&nbsp;parameter is passed in when the DQN&nbsp;object is constructed (example_1.py):</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
algo = <span style="color:#cc7833;">DQN</span>(<span style="color:#aa4926;">min_steps_learn</span>=<span style="color:#6897bb;">1e3</span>)</pre>
<p>		So simply lowering that value reduces the number of times &ldquo;<span style="color:#0000ff;">nan</span>&rdquo; appears.<br />
		<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Why it works this way, and caveats when tuning the min_steps_learn parameter<br />
		Why does rlpyt&nbsp;gate the metric computation behind a parameter? It is not really about when to suppress &ldquo;<span style="color:#0000ff;">nan</span>&rdquo;; look at these lines of the DQN.optimize_agent()&nbsp;function:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">if </span>samples <span style="color:#cc7832;font-weight:bold;">is not None</span>:
    samples_to_buffer = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">samples_to_buffer</span>(samples)
    <span style="color:#94558d;">self</span>.replay_buffer.<span style="color:#cc7833;">append_samples</span>(samples_to_buffer)
opt_info = <span style="color:#cc7833;">OptInfo</span>(*([] <span style="color:#cc7832;font-weight:bold;">for </span>_ <span style="color:#cc7832;font-weight:bold;">in </span><span style="color:#8888c6;">range</span>(<span style="color:#8888c6;">len</span>(OptInfo._fields))))
<span style="color:#cc7832;font-weight:bold;">if </span>itr &lt; <span style="color:#94558d;">self</span>.min_itr_learn:
    <span style="color:#cc7832;font-weight:bold;">return </span>opt_info</pre>
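<p>The same warm-up gating can be mimicked in a few lines (a plain-Python sketch, not rlpyt's code: here the replay buffer is just a list and the returned statistics are just a list of losses):</p>

```python
# Sketch of the warm-up gating: before min_itr_learn, samples only feed the
# replay buffer; no optimization happens, so no loss statistics exist.
def optimize_agent(itr, samples, replay_buffer, min_itr_learn):
    if samples is not None:
        replay_buffer.extend(samples)       # stands in for append_samples()
    opt_info = []                           # would be OptInfo with empty lists
    if itr < min_itr_learn:
        return opt_info                     # empty -> logger records nan
    # ... sample minibatches from replay_buffer, backprop, collect losses ...
    opt_info.append(0.5)                    # placeholder loss value
    return opt_info

buf = []
print(optimize_agent(0, [1, 2], buf, min_itr_learn=3))  # [] -> nan in the log
print(optimize_agent(3, [3, 4], buf, min_itr_learn=3))  # [0.5] -> real stats
print(buf)  # [1, 2, 3, 4]: sampling continues during warm-up
```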
<p>		and you will see that, before the training iteration count reaches&nbsp;self.min_itr_learn, the algorithm keeps collecting the samples obtained from interacting with the environment into the <span style="color:#b22222;">Replay Buffer</span>. If the <span style="color:#b22222;">Replay Buffer</span>&nbsp;holds too little data, short of the preset amount, there is no point in starting to optimize the policy network. Only once the condition itr &gt;= self.min_itr_learn&nbsp;is met do backpropagation and the rest of the optimization begin.<br />
		So in my view, min_steps_learn&nbsp;really should not be set too small.<br />
		That is all for this installment; more in the next one.<br />
		<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
		Please credit the source when reposting:&nbsp;<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
		Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
			<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
</p></div>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a6-%e6%a8%a1%e5%9e%8b%e6%8c%87%e6%a0%87%e4%bb%80%e4%b9%88%e6%97%b6/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] viskit: the data visualization tool for the rlpyt reinforcement learning framework</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%95%b0%e6%8d%ae%e5%8f%af%e8%a7%86%e5%8c%96/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%95%b0%e6%8d%ae%e5%8f%af%e8%a7%86%e5%8c%96/#comments</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Tue, 03 Dec 2019 13:15:54 +0000</pubDate>
				<category><![CDATA[原创]]></category>
		<category><![CDATA[综合]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[viskit]]></category>
		<category><![CDATA[可视化]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11183</guid>

					<description><![CDATA[<p>
For more articles about rlpyt,&#160;click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&#160;<br />
While training reinforcement learning models, rlpyt produces piles of training logs that are tedious to read. This article shows how to visualize that log data with <a href="https://github.com/vitchyr/viskit" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">viskit</span></a>.<br />
<span id="more-11183"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;What viskit is<br />
<a href="https://github.com/vitchyr/viskit" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">viskit</span></a> is a visualization component of <a href="https://github.com/rll/rllab" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">rllab</span></a>, a once fairly well-known reinforcement learning framework whose development unfortunately stopped long ago. viskit, however, was pulled out on its own and is now used as a standalone visualization tool. The training logs produced by rlpyt can also be visualized with it, because rlpyt writes its logs in the format viskit expects.<br />
<span style="color:#ff0000;">What viskit does</span>: it reads log data it can parse, starts a web page, and displays the log contents as charts on that page.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Installing viskit<br />
Download the code:</p>
<blockquote>
<p>
		git clone&#160;git@github.com:vitchyr/viskit.git<br />
		cd&#160;viskit/</p>
</blockquote>
<p>Install viskit's dependencies:</p>
<blockquote>
<div>
		conda install -c anaconda matplotlib</div>
<div>
		conda install -c anaconda flask</div>
<div>
		conda install -c anaconda plotly</div>
</blockquote>
<p>Note that I installed these in an Anaconda environment; if you do not use Anaconda, you can install them with the corresponding pip install commands instead.<br />
flask is needed because viskit serves a web page to display the data visually, and it is implemented with Flask (a web development microframework written in Python).<br />
&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%95%b0%e6%8d%ae%e5%8f%af%e8%a7%86%e5%8c%96/" class="read-more">Read More </a></p>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt,&nbsp;click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&nbsp;is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&nbsp;<br />
While training reinforcement learning models, rlpyt produces piles of training logs that are tedious to read. This article shows how to visualize that log data with <a href="https://github.com/vitchyr/viskit" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">viskit</span></a>.<br />
<span id="more-11183"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;What viskit is<br />
<a href="https://github.com/vitchyr/viskit" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">viskit</span></a> is a visualization component of <a href="https://github.com/rll/rllab" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">rllab</span></a>, a once fairly well-known reinforcement learning framework whose development unfortunately stopped long ago. viskit, however, was pulled out on its own and is now used as a standalone visualization tool. The training logs produced by rlpyt can also be visualized with it, because rlpyt writes its logs in the format viskit expects.<br />
<span style="color:#ff0000;">What viskit does</span>: it reads log data it can parse, starts a web page, and displays the log contents as charts on that page.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Installing viskit<br />
Download the code:</p>
<blockquote>
<p>
		git clone&nbsp;git@github.com:vitchyr/viskit.git<br />
		cd&nbsp;viskit/</p>
</blockquote>
<p>Install viskit's dependencies:</p>
<blockquote>
<div>
		conda install -c anaconda matplotlib</div>
<div>
		conda install -c anaconda flask</div>
<div>
		conda install -c anaconda plotly</div>
</blockquote>
<p>Note that I installed these in an Anaconda environment; if you do not use Anaconda, you can install them with the corresponding pip install commands instead.<br />
flask is needed because viskit serves a web page to display the data visually, and it is implemented with Flask (a web development microframework written in Python).<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;Using viskit<br />
Before you can use viskit, you first need logs for it to read. Assume the log path is: <span style="color:#0000ff;">/path/to/your/log/dir</span><br />
The log files under that path should look something like this:</p>
<div>
	<span style="color:#b22222;">├── debug.log</span></div>
<div>
	<span style="color:#b22222;">├── params.json</span></div>
<div>
	<span style="color:#b22222;">└── progress.csv</span></div>
<p>
You can run one of the examples bundled with rlpyt to generate log data like this.</p>
<p>Set PYTHONPATH:</p>
<blockquote>
<p>
		export PYTHONPATH=/path/to/your/viskit:$PYTHONPATH</p>
</blockquote>
<p>Here, <span style="color:#0000ff;">/path/to/your/viskit</span> is the path to your viskit source code.<br />
Run viskit:</p>
<blockquote>
<p>
		python viskit/frontend.py /path/to/your/log/dir</p>
</blockquote>
<p>The last argument is the directory containing the logs.<br />
If nothing goes wrong, you will see output like this on the command line:</p>
<blockquote>
<div>
		Importing data from [&#39;/path/to/your/log/dir&#39;]...</div>
<div>
		Reading /path/to/your/log/dir/progress.csv</div>
<div>
		View http://localhost:5000 in your browser</div>
<div>
		&nbsp;* Serving Flask app &quot;frontend&quot; (lazy loading)</div>
<div>
		&nbsp;* Environment: production</div>
<div>
		&nbsp; &nbsp;WARNING: This is a development server. Do not use it in a production deployment.</div>
<div>
		&nbsp; &nbsp;Use a production WSGI server instead.</div>
<div>
		&nbsp;* Debug mode: off</div>
<div>
		&nbsp;* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)</div>
</blockquote>
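<p>As an aside, progress.csv is an ordinary CSV file whose header row holds the metric names, so you can also inspect it without viskit. A minimal sketch with a made-up two-row sample (real files have far more columns; the column names here just follow the log output above):</p>

```python
import csv
import io

# Tiny inline stand-in for a progress.csv file.
sample = io.StringIO(
    "Iteration,ReturnAverage,lossAverage\n"
    "0,-21,nan\n"
    "1,-20.5,0.0129\n"
)
rows = list(csv.DictReader(sample))
returns = [float(r["ReturnAverage"]) for r in rows]
print(returns)  # [-21.0, -20.5]
```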
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&nbsp;What it looks like<br />
Open&nbsp;<span style="color:#ff0000;">http://localhost:5000/</span> in a browser to bring up the visualization page:<br />
<img decoding="async" alt="viskit" src="https://www.codelast.com/wp-content/uploads/2019/12/viskit.png" style="width: 700px; height: 411px;" /></p>
<p>In the &ldquo;<span style="color:#0000ff;">Y-Axis Attributes</span>&rdquo; drop-down list you can choose which metrics to plot as charts; click the &ldquo;<span style="color:#0000ff;">Update</span>&rdquo; button to refresh the charts below.<br />
Charts of various metrics:<br />
<img decoding="async" alt="viskit graph" src="https://www.codelast.com/wp-content/uploads/2019/12/viskit_graph.png" style="width: 700px; height: 560px;" /><br />
A nice touch is that the charts are interactive: you can zoom and more via the row of tool buttons at the top right of each chart; click one to see the effect.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
Please credit the source when reposting:&nbsp;<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%95%b0%e6%8d%ae%e5%8f%af%e8%a7%86%e5%8c%96/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] Source-code analysis of the rlpyt reinforcement learning framework: (5) Mixin classes that supply extra parameters</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a5-%e4%b8%bamodel%e7%b1%bb%e6%8f%90%e4%be%9b%e9%a2%9d%e5%a4%96%e5%8f%82/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a5-%e4%b8%bamodel%e7%b1%bb%e6%8f%90%e4%be%9b%e9%a2%9d%e5%a4%96%e5%8f%82/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Sun, 01 Dec 2019 05:36:34 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11163</guid>

					<description><![CDATA[<p>
For more articles about rlpyt,&#160;click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a>&#160;is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it.&#160;If you want to use this framework to build your own reinforcement learning programs (especially ones outside the Atari game domain), you will need some familiarity with its source code. This article analyzes part of that source through an example bundled with rlpyt, in the hope of helping at least a few readers.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;Mixin classes in brief<br />
rlpyt&#160;contains a large number of *Mixin&#160;classes, such as&#160;AtariMixin, MujocoMixin, and RecurrentAgentMixin. The author wrote no comments at all for these oddly named classes; judging purely from where they are used, many of the Mixin classes are associated with agent classes.<br />
<span id="more-11163"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span>&#160;A concrete example: AtariMixin<br />
To fully understand the design intent of the Mixin classes, it helps to analyze a concrete class: AtariMixin. It is one of the parent classes of&#160;AtariDqnAgent:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">AtariDqnAgent</span>(AtariMixin<span style="color:#cc7832;">, </span>DqnAgent):
    <span style="color:#cc7832;font-weight:bold;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>ModelCls=AtariDqnModel<span style="color:#cc7832;">, </span>**kwargs):
        <span style="color:#8888c6;">super</span>().<span style="color:#cc7833;">__init__</span>(<span style="color:#aa4926;">ModelCls</span>=ModelCls<span style="color:#cc7832;">, </span>**kwargs)</pre>
<p>Here, the other parent class, DqnAgent, is the class that implements the agent logic. AtariMixin implements only one very simple function, which returns a dict:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">AtariMixin</span>:
    <span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">make_env_to_model_kwargs</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>env_spaces):<span style="color:#629755;font-style:italic;">
</span><span style="color:#629755;font-style:italic;">        </span><span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#8888c6;">dict</span>(<span style="color:#aa4926;">image_shape</span>=env_spaces.observation.shape</pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a5-%e4%b8%bamodel%e7%b1%bb%e6%8f%90%e4%be%9b%e9%a2%9d%e5%a4%96%e5%8f%82/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a> is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it. If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This article analyzes part of that source code through one of the examples shipped with rlpyt; I hope it helps.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> A brief look at the Mixin classes<br />
rlpyt contains many *Mixin classes, such as AtariMixin, MujocoMixin, and RecurrentAgentMixin. The author wrote no comments for these oddly named classes; judging only from where they are used, many of the Mixin classes are associated with the agent classes.<br />
<span id="more-11163"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> A concrete example: AtariMixin<br />
To fully understand the design intent of the Mixin classes, let's analyze a concrete class: AtariMixin. It is one of the parent classes of AtariDqnAgent:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">AtariDqnAgent</span>(AtariMixin<span style="color:#cc7832;">, </span>DqnAgent):
    <span style="color:#cc7832;font-weight:bold;">def </span><span style="color:#b200b2;">__init__</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>ModelCls=AtariDqnModel<span style="color:#cc7832;">, </span>**kwargs):
        <span style="color:#8888c6;">super</span>().<span style="color:#cc7833;">__init__</span>(<span style="color:#aa4926;">ModelCls</span>=ModelCls<span style="color:#cc7832;">, </span>**kwargs)</pre>
<p>Here, the other parent class, DqnAgent, is the class that implements the agent logic. AtariMixin implements only one very simple function, which returns a dict:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">AtariMixin</span>:
    <span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">make_env_to_model_kwargs</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>env_spaces):<span style="color:#629755;font-style:italic;">
</span><span style="color:#629755;font-style:italic;">        </span><span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#8888c6;">dict</span>(<span style="color:#aa4926;">image_shape</span>=env_spaces.observation.shape<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">                    </span><span style="color:#aa4926;">output_size</span>=env_spaces.action.n)</pre>
<p>Where is this function called? This is where it gets a bit tricky: it is called in the initialize() function of BaseAgent, the parent class of DqnAgent:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.env_model_kwargs = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">make_env_to_model_kwargs</span>(env_spaces)</pre>
<p><span style="color: rgb(255, 255, 255);">Source: </span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
Let's sort this out; the call chain is quite interesting:<br />
<img decoding="async" alt="rlpyt mixin class hierarchy" src="https://www.codelast.com/wp-content/uploads/2019/12/rlpyt_mixin_class_hierarchy.png" style="width: 600px; height: 486px;" /><br />
As the figure shows, when the agent class's initialize() calls make_env_to_model_kwargs(), what actually runs is the make_env_to_model_kwargs() implemented by the Mixin class.<br />
Looking at the inheritance diagram above, if you find yourself wondering "Python can really do this?", I suggest you write a few simple classes and try it yourself: it really works.<br />
But isn't this roundabout logic too much trouble?<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> Why insert a Mixin class at all<br />
At first I wondered: why not implement the interface make_env_to_model_kwargs(), defined by the parent class BaseAgent, directly in DqnAgent? Wouldn't that save writing a Mixin class?<br />
To figure this out, let's look at what BaseAgent does after calling make_env_to_model_kwargs():</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.env_model_kwargs = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">make_env_to_model_kwargs</span>(env_spaces)
<span style="color:#94558d;">self</span>.model = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">ModelCls</span>(**<span style="color:#94558d;">self</span>.env_model_kwargs<span style="color:#cc7832;">, </span>**<span style="color:#94558d;">self</span>.model_kwargs)</pre>
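<p>The call-resolution trick above can be reproduced outside rlpyt with a minimal sketch (the class names below are simplified placeholders, not rlpyt's real classes): a base class calls a method that only the mixin defines, and Python's method resolution order (MRO) finds the mixin's implementation first.</p>

```python
# Minimal sketch of the rlpyt mixin pattern (placeholder names, not rlpyt code).
class BaseAgent:
    def initialize(self, env_spaces):
        # Resolved via the MRO: the mixin's implementation is found first,
        # even though BaseAgent itself does not define this method.
        self.env_model_kwargs = self.make_env_to_model_kwargs(env_spaces)

class AtariLikeMixin:
    def make_env_to_model_kwargs(self, env_spaces):
        # Map environment spaces to the model's constructor kwargs.
        return dict(image_shape=env_spaces["obs_shape"],
                    output_size=env_spaces["n_actions"])

class Agent(AtariLikeMixin, BaseAgent):
    pass

agent = Agent()
agent.initialize({"obs_shape": (4, 84, 84), "n_actions": 6})
print(agent.env_model_kwargs)  # {'image_shape': (4, 84, 84), 'output_size': 6}
```

<p>Because Agent lists AtariLikeMixin before BaseAgent, the mixin sits earlier in the MRO, so the lookup from inside BaseAgent.initialize() lands on the mixin's method.</p>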
<p>As you can see, it uses the returned dict, <span style="color:#0000ff;">self.env_model_kwargs</span>, to instantiate the model class.<br />
Keep in mind that <span style="color:#ff0000;">rlpyt is a general RL framework, not an RL library dedicated to Atari games; we can use it to build RL applications that have nothing to do with games</span>. Each RL application has its own model class, and the model class's parameters (usually related to the environment spaces) differ from application to application. We cannot dictate what those parameters must be named; the naming should be general and left to the application's developer.<br />
Take AtariMixin as an example: the dict it returns contains two parameters, image_shape and output_size, i.e. the shape of the input image and the size of the output. What if my own RL application is not a game and has no such thing as an image at all?<br />
In that case, I need more suitable names to describe the parameters.<br />
So the seemingly out-of-nowhere Mixin class is actually designed for the extensibility of the rlpyt framework: it supplies the model class with the application-specific parameters needed for instantiation.<br />
That said, not every Mixin class in rlpyt serves the model class; for example, DiscreteMixin, a parent class of EpsilonGreedy, has nothing to do with the model class. But it carries the same idea: provide subclasses with extra functionality that would be awkward to implement in the subclasses themselves.<br />
That's it for this installment; stay tuned for the next one.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span> Copyright notice <span style="color: rgb(255, 0, 0);">➤➤</span> <br />
Please credit the source when reposting: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u> <br />
Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a5-%e4%b8%bamodel%e7%b1%bb%e6%8f%90%e4%be%9b%e9%a2%9d%e5%a4%96%e5%8f%82/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] RL framework rlpyt source-code analysis: (3) the concise and ingenious EpsilonGreedy class</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a3-%e7%9b%b8%e5%bd%93%e7%ae%80%e6%b4%81%e5%8f%88%e5%8d%81%e5%88%86/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a3-%e7%9b%b8%e5%bd%93%e7%ae%80%e6%b4%81%e5%8f%88%e5%8d%81%e5%88%86/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 21 Nov 2019 19:00:45 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=11023</guid>

					<description><![CDATA[<p>
For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a> is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it. If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This article analyzes part of that source code through one of the examples shipped with rlpyt; I hope it helps.<br />
<span id="more-11023"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> Where does the EpsilonGreedy class come from, and what is it for<br />
When the agent steps through the environment, it selects an action based on the output of the policy network, and then computes the corresponding reward for that action. For example_1, the agent class is DqnAgent, whose step() function performs the stepping:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#bbb529;">@torch.no_grad</span>()
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">step</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>observation<span style="color:#cc7832;">, </span>prev_action<span style="color:#cc7832;">, </span>prev_reward):
    prev_action = <span style="color:#94558d;">self</span>.distribution.</pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a3-%e7%9b%b8%e5%bd%93%e7%ae%80%e6%b4%81%e5%8f%88%e5%8d%81%e5%88%86/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a> is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it. If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This article analyzes part of that source code through one of the examples shipped with rlpyt; I hope it helps.<br />
<span id="more-11023"></span><br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> Where does the EpsilonGreedy class come from, and what is it for<br />
When the agent steps through the environment, it selects an action based on the output of the policy network, and then computes the corresponding reward for that action. For example_1, the agent class is DqnAgent, whose step() function performs the stepping:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#bbb529;">@torch.no_grad</span>()
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">step</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>observation<span style="color:#cc7832;">, </span>prev_action<span style="color:#cc7832;">, </span>prev_reward):
    prev_action = <span style="color:#94558d;">self</span>.distribution.<span style="color:#cc7833;">to_onehot</span>(prev_action)
    model_inputs = <span style="color:#cc7833;">buffer_to</span>((observation<span style="color:#cc7832;">, </span>prev_action<span style="color:#cc7832;">, </span>prev_reward)<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">        </span><span style="color:#aa4926;">device</span>=<span style="color:#94558d;">self</span>.device)
    q = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">model</span>(*model_inputs)
    q = q.<span style="color:#cc7833;">cpu</span>()
    action = <span style="color:#94558d;">self</span>.distribution.<span style="color:#cc7833;">sample</span>(q)
    agent_info = <span style="color:#cc7833;">AgentInfo</span>(<span style="color:#aa4926;">q</span>=q)<span style="color:#808080;">
</span><span style="color:#808080;">    </span><span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#cc7833;">AgentStep</span>(<span style="color:#aa4926;">action</span>=action<span style="color:#cc7832;">, </span><span style="color:#aa4926;">agent_info</span>=agent_info)</pre>
<p><span style="color:#0000ff;">action = self.distribution.sample(q)</span> uses the <span style="color:#0000ff;">EpsilonGreedy</span> class implemented in rlpyt/distributions/epsilon_greedy.py. From the name alone, one would guess it is an implementation of the &epsilon;-greedy algorithm (and indeed it is).<br />
&epsilon;-greedy is an exploration strategy used by RL algorithms. The goal here is to use &epsilon;-greedy to select an action.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> The EpsilonGreedy class in detail<br />
EpsilonGreedy has two parent classes: DiscreteMixin and Distribution. DiscreteMixin implements some helper functions, while Distribution is essentially a set of unimplemented interface definitions.<br />
As for EpsilonGreedy itself, its essence lies in the sample() function, only five lines long:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">sample</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>q):
    arg_select = torch.<span style="color:#cc7833;">argmax</span>(q<span style="color:#cc7832;">, </span><span style="color:#aa4926;">dim</span>=-<span style="color:#6897bb;">1</span>)
    mask = torch.<span style="color:#cc7833;">rand</span>(arg_select.shape) &lt; <span style="color:#94558d;">self</span>._epsilon
    arg_rand = torch.<span style="color:#cc7833;">randint</span>(<span style="color:#aa4926;">low</span>=<span style="color:#6897bb;">0</span><span style="color:#cc7832;">, </span><span style="color:#aa4926;">high</span>=q.shape[-<span style="color:#6897bb;">1</span>]<span style="color:#cc7832;">, </span><span style="color:#aa4926;">size</span>=(mask.<span style="color:#cc7833;">sum</span>()<span style="color:#cc7832;">,</span>))
    arg_select[mask] = arg_rand
    <span style="color:#cc7832;font-weight:bold;">return </span>arg_select</pre>
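<p>To make the mechanics concrete before the line-by-line walkthrough, here is a plain-Python sketch of the same logic (no torch; the function name epsilon_greedy_sample is mine, not rlpyt's), using the example matrix discussed below. With epsilon = 0 it reduces to a pure per-row argmax.</p>

```python
import random

def epsilon_greedy_sample(q, epsilon):
    """Plain-Python sketch of EpsilonGreedy.sample(): per-row argmax,
    each replaced by a random action index with probability epsilon."""
    actions = []
    for row in q:
        if random.random() < epsilon:
            actions.append(random.randrange(len(row)))  # explore: random action
        else:
            actions.append(row.index(max(row)))         # exploit: argmax action
    return actions

q = [[-0.2187, -0.2758, 0.4933, 1.0700],
     [ 0.2689,  3.5079, 1.5640, 1.1730],
     [-0.6858,  0.2571, 1.0396, 0.6344]]
print(epsilon_greedy_sample(q, epsilon=0.0))  # → [3, 1, 2] (pure argmax when epsilon is 0)
```

<p>The torch version does the same thing, just vectorized: one argmax over the whole batch, then a boolean mask picks which rows get overwritten with random actions.</p>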
<p>At first glance you might think: what is this jumble of operations? It's not at all obvious what it does.<br />
But from the analysis above we can guess that this function implements the &epsilon;-greedy algorithm, and with that in mind the code becomes much easier to read.<br />
Below, with the help of some concrete examples, let's analyze the code line by line; I promise you'll get it!<br />
The input parameter q of sample() is a tensor: as we saw above, q is the result of the policy network's forward pass. Here, let's assume q is the following matrix:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#629755;font-style:italic;">[[-0.2187, -0.2758,  0.4933,  1.0700],
</span><span style="color:#629755;font-style:italic;">[ 0.2689,  3.5079,  1.5640,  1.1730],
</span><span style="color:#629755;font-style:italic;">[-0.6858,  0.2571,  1.0396,  0.6344]]</span></pre>
<p>Now also assume that self._epsilon, a variable used by sample(), equals 0.3. Note that although I use the scalar 0.3 here for simplicity, self._epsilon does not have to be a scalar. If you read the code of another class, <span style="color:#0000ff;">EpsilonGreedyAgentMixin</span>, carefully, you'll find it calls EpsilonGreedy.set_epsilon():</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#94558d;">self</span>.distribution.set_epsilon(<span style="color:#94558d;">self</span>.eps_sample)</pre>
<p>And EpsilonGreedy.set_epsilon() is defined as:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">set_epsilon</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>epsilon):
    <span style="color:#94558d;">self</span>._epsilon = epsilon</pre>
<p>The epsilon set here may be a tensor rather than a scalar.<br />
Keeping that in mind, let's continue with the simple scalar case, i.e. self._epsilon = 0.3.<br />
<span style="color: rgb(255, 0, 0);">✔</span> Line 1:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
arg_select = torch.<span style="color:#cc7833;">argmax</span>(q<span style="color:#cc7832;">, </span><span style="color:#aa4926;">dim</span>=-<span style="color:#6897bb;">1</span>)</pre>
<p>This line returns the index of the largest value along the specified dimension (dim; -1 means the last dimension).<br />
The result is that arg_select equals [3, 1, 2]: for the input matrix, the largest value in the first row is 1.0700, at index 3; the largest in the second row is 3.5079, at index 1; the largest in the third row is 1.0396, at index 2. Put together, that gives [3, 1, 2].<br />
<span style="color: rgb(255, 0, 0);">✔</span> Line 2:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
mask = torch.<span style="color:#cc7833;">rand</span>(arg_select.shape) &lt; <span style="color:#94558d;">self</span>._epsilon</pre>
<p>This yields a boolean matrix indicating, for each element of the random array generated by torch.rand, whether it is smaller than self._epsilon.<br />
The result is that mask equals [True, False, True]: here torch.rand(arg_select.shape) happened to produce the random matrix [0.2983, 0.4749, 0.2926] (<span style="color:#0000ff;">being random, it won't be the same every run; this is just the result of one particular run, used as an example</span>). Comparing each of its three numbers against self._epsilon gives [True, False, True].<br />
<span style="color: rgb(255, 0, 0);">✔</span> Line 3 is the most complex:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
arg_rand = torch.<span style="color:#cc7833;">randint</span>(<span style="color:#aa4926;">low</span>=<span style="color:#6897bb;">0</span><span style="color:#cc7832;">, </span><span style="color:#aa4926;">high</span>=q.shape[-<span style="color:#6897bb;">1</span>]<span style="color:#cc7832;">, </span><span style="color:#aa4926;">size</span>=(mask.<span style="color:#cc7833;">sum</span>()<span style="color:#cc7832;">,</span>))</pre>
<p>torch.randint() returns uniformly distributed random integers in [low, high), and mask.sum() gives the number of True elements in the boolean matrix (call it x), so arg_rand is x random integers in [low, high). For example, print(torch.randint(0, 20, (6, ))) might output: tensor([14,&nbsp; 4,&nbsp; 7, 17, 16,&nbsp; 3]).<br />
mask.sum() equals 2, because it is equivalent to torch.sum(mask), i.e. the sum of all elements of the tensor mask; for bool elements, True counts as 1 and False as 0, so the result is 2.<br />
q.shape[-1] equals 4, because the shape is (3, 4), and shape[-1] is its last value, i.e. 4.<br />
So this line effectively executes torch.randint(low=0, high=4, size=(2, )), i.e. it draws two random integers from [0, 4); the result is [2, 3].<br />
<span style="color:#ff0000;">✔</span> Line 4:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
arg_select[mask] = arg_rand</pre>
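<p>The last step can be traced in plain Python with the article's example values (the boolean-mask assignment is emulated with a loop here, since plain lists don't support tensor-style masked indexing):</p>

```python
arg_select = [3, 1, 2]            # per-row argmax of q
mask       = [True, False, True]  # positions chosen for exploration
arg_rand   = [2, 3]               # random actions, one per True in mask

# Emulate torch's arg_select[mask] = arg_rand: each True position in mask
# consumes the next value from arg_rand; False positions keep the argmax.
it = iter(arg_rand)
result = [next(it) if m else a for a, m in zip(arg_select, mask)]
print(result)  # [2, 1, 3]
```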
<p>mask is a bool tensor; when it is used to index another tensor, arg_select, it selects the entries where mask is True.<br />
Before arg_select[mask] = arg_rand executes, arg_select is [3, 1, 2], mask is [True, False, True], and arg_rand is [2, 3]. For the two positions where mask is True, the corresponding positions in arg_select are replaced with the values from arg_rand, giving the final result: [2, 1, 3].<br />
As the final result [2, 1, 3] shows, it no longer identifies the index of the maximum of every row of the input matrix q.<br />
So, summarizing the logic above, what sample() implements is this:<br />
<span style="color:#ff0000;"><strong>find the maximum values along a given dimension of the input matrix, then with a certain probability (namely epsilon) "do not select" the index of the maximum, finally producing a matrix of argmax indices with "a little randomness" mixed in.</strong></span><br />
Isn't that exactly what the &epsilon;-greedy algorithm does? Now you can see why the EpsilonGreedy class has that name.<br />
That's it for this installment; stay tuned for the next one.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span> Copyright notice <span style="color: rgb(255, 0, 0);">➤➤</span> <br />
Please credit the source when reposting: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u> <br />
Thanks for following my WeChat official account (scan with WeChat):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a3-%e7%9b%b8%e5%bd%93%e7%ae%80%e6%b4%81%e5%8f%88%e5%8d%81%e5%88%86/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>[Original] RL framework rlpyt source-code analysis: (4) the sampler classes that collect training data</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a4-%e6%94%b6%e9%9b%86%e8%ae%ad%e7%bb%83%e6%95%b0%e6%8d%ae%e7%9a%84sample/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a4-%e6%94%b6%e9%9b%86%e8%ae%ad%e7%bb%83%e6%95%b0%e6%8d%ae%e7%9a%84sample/#respond</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Thu, 21 Nov 2019 18:59:48 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[rlpyt]]></category>
		<category><![CDATA[强化学习]]></category>
		<guid isPermaLink="false">https://www.codelast.com/?p=10932</guid>

					<description><![CDATA[<p>
For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a> is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it. If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This article analyzes part of that source code through one of the examples shipped with rlpyt; I hope it helps.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> The sampler's main job<br />
Training an RL model requires training data, and collecting that data is the job of the sampler classes.<br />
Collecting training data requires stepping through the environment, so instantiating the environment is also done in the sampler.<br />
<span id="more-10932"></span><br />
In many RL tutorials, collecting data is also called <span style="color:#0000ff;">sampling data</span>, which is where the name "sampler" comes from. Note, however, that the class that actually does the data collection is called a <span style="color:#0000ff;">collector</span>; the sampler also initializes the collector object in its initialize().<br />
So the sampler can be seen as a wrapper around the collector.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> The meaning of <span style="color:#ff0000;">T</span> and <span style="color:#ff0000;">B</span> in BatchSpec<br />
In SerialSampler's initialize(), you'll see the code that instantiates the environments:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
B = <span style="color:#94558d;">self</span>.batch_spec.B</pre>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a4-%e6%94%b6%e9%9b%86%e8%ae%ad%e7%bb%83%e6%95%b0%e6%8d%ae%e7%9a%84sample/" class="read-more">Read More </a>]]></description>
										<content:encoded><![CDATA[<p>
For more articles about rlpyt, click <a href="https://www.codelast.com/?p=10907" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">here</span></a>.</p>
<p><a href="https://github.com/astooke/rlpyt" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">rlpyt</span></a> is a reinforcement learning (<span style="color: rgb(255, 0, 0);">RL</span>) framework open-sourced by <span style="color: rgb(0, 0, 255);">BAIR</span> (Berkeley Artificial Intelligence Research). I previously wrote an <a href="https://www.codelast.com/?p=10643" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">introduction</span></a> to it. If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This article analyzes part of that source code through one of the examples shipped with rlpyt; I hope it helps.</p>
<p><span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> The sampler's main job<br />
Training an RL model requires training data, and collecting that data is the job of the sampler classes.<br />
Collecting training data requires stepping through the environment, so instantiating the environment is also done in the sampler.<br />
<span id="more-10932"></span><br />
In many RL tutorials, collecting data is also called <span style="color:#0000ff;">sampling data</span>, which is where the name "sampler" comes from. Note, however, that the class that actually does the data collection is called a <span style="color:#0000ff;">collector</span>; the sampler also initializes the collector object in its initialize().<br />
So the sampler can be seen as a wrapper around the collector.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);">▶▶</span></span> The meaning of <span style="color:#ff0000;">T</span> and <span style="color:#ff0000;">B</span> in BatchSpec<br />
In SerialSampler's initialize(), you'll see the code that instantiates the environments:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
B = <span style="color:#94558d;">self</span>.batch_spec.B<span style="color:#808080;">
</span>envs = [<span style="color:#94558d;">self</span>.<span style="color:#cc7833;">EnvCls</span>(**<span style="color:#94558d;">self</span>.env_kwargs) <span style="color:#cc7832;font-weight:bold;">for </span>_ <span style="color:#cc7832;font-weight:bold;">in </span><span style="color:#8888c6;">range</span>(B)]</pre>
<p>This constructs B&nbsp;environment&nbsp;objects.<br />
In my opinion, B is a poorly chosen variable name. Looking closer, self.batch_spec&nbsp;is assigned in&nbsp;SerialSampler's parent class&nbsp;BaseSampler:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.batch_spec = <span style="color:#cc7833;">BatchSpec</span>(batch_T<span style="color:#cc7832;">, </span>batch_B)</pre>
<p>And&nbsp;BatchSpec&nbsp;is a class whose parent is a&nbsp;namedtuple:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">class </span><span style="font-weight:bold;">BatchSpec</span>(<span style="color:#cc7833;">namedtuple</span>(<span style="color:#008080;">&quot;BatchSpec&quot;</span><span style="color:#cc7832;">, </span><span style="color:#008080;">&quot;T B&quot;</span>)):</pre>
<p>From the behavior of Python's&nbsp;namedtuple, constructing an object with&nbsp;BatchSpec(batch_T, batch_B)&nbsp;gives it two member variables, self.T&nbsp;and&nbsp;self.B, holding batch_T&nbsp;and batch_B&nbsp;respectively.<br />
This is also why the&nbsp;size()&nbsp;property of&nbsp;BatchSpec&nbsp;can be written as:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#bbb529;">@property
</span><span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">size</span>(<span style="color:#94558d;">self</span>):<span style="color:#629755;font-style:italic;">
</span><span style="color:#629755;font-style:italic;">    </span><span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#94558d;">self</span>.T * <span style="color:#94558d;">self</span>.B</pre>
<p>From the docstring of&nbsp;BatchSpec&nbsp;we learn that T&nbsp;is the number of time steps and B&nbsp;is the number of separate trajectory segments.<br />
A time step T means that as the agent interacts with an environment, it advances from state to state in time order, one step at a time. This value is &gt;= 1.<br />
A separate trajectory segment refers to an independent trajectory, so B is the number of independent trajectories, i.e. the number of environment instances. This value is &gt;= 1.</p>
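<p>The namedtuple behavior described above is easy to reproduce in isolation. Below is a minimal sketch of the BatchSpec pattern (a namedtuple subclass with a derived size property), not the full rlpyt class:</p>

```python
from collections import namedtuple


class BatchSpec(namedtuple("BatchSpec", "T B")):
    """T: number of time steps per batch; B: number of environment instances."""
    __slots__ = ()

    @property
    def size(self):
        # total samples per batch = time steps x environments
        return self.T * self.B


spec = BatchSpec(5, 4)
print(spec.T, spec.B, spec.size)  # -> 5 4 20
```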
<p>Given all this, it makes sense that environments are instantiated B&nbsp;at a time.<br />
Going one step further:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
global_B = B * world_size</pre>
<p>Given the concept of world_size&nbsp;introduced in an <a href="https://www.codelast.com/?p=10883" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">earlier article</span></a>, it is clear that global_B&nbsp;here is the total number of environments across all the &ldquo;parallel universes&rdquo;.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;The concept of env_ranks<br />
env_ranks&nbsp;is yet another thing that is &ldquo;uncommented and hard to figure out&rdquo;.</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
env_ranks = <span style="color:#8888c6;">list</span>(<span style="color:#8888c6;">range</span>(rank * B<span style="color:#cc7832;">, </span>(rank + <span style="color:#6897bb;">1</span>) * B))</pre>
<p>In example_1, env_ranks evaluates to a one-element list: [0].<br />
In general it is a list of length B, where B is the number of environments. You have to dig down layer by layer to find out what it is actually for.</p>
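<p>The expression above can be checked standalone: the rank of the current sampler process selects a contiguous slice of global environment indices. A small sketch using the same expression as the source (the function name is mine, for illustration):</p>

```python
def compute_env_ranks(rank, B):
    # each sampler process of rank `rank` owns B consecutive global env indices
    return list(range(rank * B, (rank + 1) * B))


print(compute_env_ranks(0, 1))  # example_1: rank 0, one env -> [0]
print(compute_env_ranks(2, 4))  # rank 2, four envs each -> [8, 9, 10, 11]
```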
<p>env_ranks is used in two places in rlpyt/samplers/serial/sampler.py: the agent's initialize() function, and the collector's constructor, as follows:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
agent.initialize(envs[<span style="color:#6897bb;">0</span>].spaces<span style="color:#cc7832;">, </span><span style="color:#aa4926;">share_memory</span>=<span style="color:#cc7832;">False,
</span><span style="color:#cc7832;">                 </span><span style="color:#aa4926;">global_B</span>=global_B<span style="color:#cc7832;">, </span><span style="color:#aa4926;">env_ranks</span>=env_ranks)</pre>
<p>and:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
collector = <span style="color:#94558d;">self</span>.CollectorCls(
    <span style="color:#aa4926;">rank</span>=<span style="color:#6897bb;">0</span><span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">envs</span>=envs<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">samples_np</span>=samples_np<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">batch_T</span>=<span style="color:#94558d;">self</span>.batch_spec.T<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">TrajInfoCls</span>=<span style="color:#94558d;">self</span>.TrajInfoCls<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">agent</span>=agent<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">global_B</span>=global_B<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">env_ranks</span>=env_ranks<span style="color:#cc7832;">,  </span><span style="color:#808080;"># Might get applied redundantly to agent.
</span>)</pre>
<p>The author commented the second case: &ldquo;Might get applied redundantly to agent.&rdquo; In other words, it may duplicate the logic inside the agent. The analysis below shows that the two call sites ultimately invoke the same function, so they really do perform duplicate work.<br />
Let's look at what each of these two places does with env_ranks.<br />
<span style="color:#ff0000;"><span style="background-color:#ffff00;">★</span></span> The agent's initialize() function<br />
In DqnAgent's&nbsp;initialize() function, only one piece of the env_ranks-related code actually matters:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">if </span>env_ranks <span style="color:#cc7832;">is not None</span>:
    <span style="color:#94558d;">self</span>.make_vec_eps(global_B<span style="color:#cc7832;">, </span>env_ranks)</pre>
<p>What gets called here is&nbsp;EpsilonGreedyAgentMixin's&nbsp;make_vec_eps() function. As it happens, this is the same as the second case below, so let's go straight to the second case.<br />
<span style="color: rgb(255, 0, 0); background-color: rgb(255, 255, 0);">★</span>&nbsp;The collector's constructor<br />
The collector class used by example_1 is&nbsp;<span style="color:#0000ff;">CpuResetCollector</span>. Its own code (<span style="color:#b22222;">rlpyt/samplers/parallel/cpu/collectors.py</span>) never uses env_ranks, but in the start_agent() function of&nbsp;<span style="color:#0000ff;">BaseCollector</span>, the parent of its parent&nbsp;<span style="color:#0000ff;">DecorrelatingStartCollector</span>&nbsp;(a mouthful: the &ldquo;parent class's parent class&rdquo;), we do find env_ranks being used:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">start_agent</span>(<span style="color:#94558d;">self</span>):
    <span style="color:#cc7832;">if </span><span style="color:#8888c6;">getattr</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span><span style="color:#6a8759;">&quot;agent&quot;</span><span style="color:#cc7832;">, None</span>) <span style="color:#cc7832;">is not None</span>:  <span style="color:#808080;"># Not in GPU collectors.
</span><span style="color:#808080;">        </span><span style="color:#94558d;">self</span>.agent.collector_initialize(
            <span style="color:#aa4926;">global_B</span>=<span style="color:#94558d;">self</span>.global_B<span style="color:#cc7832;">,  </span><span style="color:#808080;"># Args used e.g. for vector epsilon greedy.
</span><span style="color:#808080;">            </span><span style="color:#aa4926;">env_ranks</span>=<span style="color:#94558d;">self</span>.env_ranks<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">        </span>)
        <span style="color:#94558d;">self</span>.agent.reset()
        <span style="color:#94558d;">self</span>.agent.sample_mode(<span style="color:#aa4926;">itr</span>=<span style="color:#6897bb;">0</span>)</pre>
<p>Here self.env_ranks is what was passed into __init__(), i.e. the env_ranks handed over by the sampler.<br />
We can also see that, for example_1, the condition <span style="color:#0000ff;">if getattr(self, &quot;agent&quot;, None) is not None</span> holds, so agent.collector_initialize() does get executed.<br />
example_1's agent class is&nbsp;DqnAgent, which has two parent classes: BaseAgent and&nbsp;EpsilonGreedyAgentMixin. BaseAgent provides only an empty collector_initialize():</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">collector_initialize</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>global_B=<span style="color:#6897bb;">1</span><span style="color:#cc7832;">, </span>env_ranks=<span style="color:#cc7832;">None</span>):
    <span style="color:#629755;font-style:italic;">&quot;&quot;&quot;If need to initialize within CPU sampler (e.g. vector eps greedy)&quot;&quot;&quot;
</span><span style="color:#629755;font-style:italic;">    </span><span style="color:#cc7832;">pass</span></pre>
<p>EpsilonGreedyAgentMixin, on the other hand, does implement&nbsp;collector_initialize(), so that is what ultimately gets called (layer upon layer of nesting, maddening):</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">collector_initialize</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>global_B=<span style="color:#6897bb;">1</span><span style="color:#cc7832;">, </span>env_ranks=<span style="color:#cc7832;">None</span>):
    <span style="color:#cc7832;">if </span>env_ranks <span style="color:#cc7832;">is not None</span>:
        <span style="color:#94558d;">self</span>.make_vec_eps(global_B<span style="color:#cc7832;">, </span>env_ranks)</pre>
<p>So what does make_vec_eps() actually do?</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Droid Sans Mono';font-size:13.5pt;">
<span style="color:#cc7832;">def </span><span style="color:#ffc66d;">make_vec_eps</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>global_B<span style="color:#cc7832;">, </span>env_ranks):
    <span style="color:#cc7832;">if </span><span style="color:#94558d;">self</span>.eps_final_min <span style="color:#cc7832;">is not None and </span><span style="color:#94558d;">self</span>.eps_final_min != <span style="color:#94558d;">self</span>._eps_final_scalar:  <span style="color:#808080;"># vector epsilon.
</span><span style="color:#808080;">        </span><span style="color:#cc7832;">if </span><span style="color:#94558d;">self</span>.alternating:  <span style="color:#808080;"># In FF case, sampler sets agent.alternating.
</span><span style="color:#808080;">            </span><span style="color:#cc7832;">assert </span>global_B % <span style="color:#6897bb;">2 </span>== <span style="color:#6897bb;">0
</span><span style="color:#6897bb;">            </span>global_B = global_B // <span style="color:#6897bb;">2  </span><span style="color:#808080;"># Env pairs will share epsilon.
</span><span style="color:#808080;">            </span>env_ranks = <span style="color:#8888c6;">list</span>(<span style="color:#8888c6;">set</span>([i // <span style="color:#6897bb;">2 </span><span style="color:#cc7832;">for </span>i <span style="color:#cc7832;">in </span>env_ranks]))
        <span style="color:#94558d;">self</span>.eps_init = <span style="color:#94558d;">self</span>._eps_init_scalar * torch.ones(<span style="color:#8888c6;">len</span>(env_ranks))
        global_eps_final = torch.logspace(
            torch.log10(torch.tensor(<span style="color:#94558d;">self</span>.eps_final_min))<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">            </span>torch.log10(torch.tensor(<span style="color:#94558d;">self</span>._eps_final_scalar))<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">            </span>global_B)
        <span style="color:#94558d;">self</span>.eps_final = global_eps_final[env_ranks]
    <span style="color:#94558d;">self</span>.eps_sample = <span style="color:#94558d;">self</span>.eps_init</pre>
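<p>The vector-epsilon branch in the code above can be sketched with NumPy instead of torch (a hedged re-implementation for illustration, not rlpyt's code, and ignoring the alternating case): each global environment gets a log-spaced final epsilon, and env_ranks picks out the ones this process owns.</p>

```python
import numpy as np


def vec_eps_final(eps_final_min, eps_final, global_B, env_ranks):
    # one final-epsilon per global environment, log-spaced between the bounds
    global_eps_final = np.logspace(np.log10(eps_final_min),
                                   np.log10(eps_final), global_B)
    # this process keeps only the epsilons of the env indices it owns
    return global_eps_final[env_ranks]


# e.g. 4 global environments, this process owns ranks [0, 1]
eps = vec_eps_final(0.01, 1.0, 4, [0, 1])
print(eps)  # two epsilons, starting at 0.01
```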
<p>As we can see, this function exists to compute&nbsp;<span style="color:#0000ff;">self.eps_final</span> and&nbsp;<span style="color:#0000ff;">self.eps_sample</span>.<br />
For&nbsp;example_1, self.eps_final_min&nbsp;is None, so the outermost if&nbsp;in&nbsp;make_vec_eps()&nbsp;is False and only the last line, self.eps_sample = self.eps_init, has any effect. In other words, <span style="color:#ff0000;">env_ranks&nbsp;does nothing at all here</span>!<br />
&ldquo;You made me read all that just to tell me it's useless?!&rdquo; Sorry, but that's the truth.<br />
That said, even though env_ranks is useless for example_1, it does matter in other scenarios, and so far I still haven't said clearly what env_ranks is actually for. Here is <span style="color:#ff0000;">my understanding</span>: <span style="color:#0000ff;">different environment instances may use different values of &epsilon; when selecting actions &epsilon;-greedily. Because rlpyt's different parallel modes give rise to different notions of a &ldquo;virtual environment count&rdquo; (for example, in Alternating mode each pair of environments shares the same &epsilon; value, so the two environments are treated as one virtual environment), every scenario needs a way to map to its own virtual environment count, and that is what env_ranks expresses.</span><br />
Again, this is only my current understanding; if I reach a better one some day, I may come back and revise this wording.<br />
<span style="color: rgb(0, 0, 255);"><span style="background-color: rgb(0, 255, 0);"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/25b6.png" alt="▶" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></span>&nbsp;Where training data actually gets collected: the obtain_samples()&nbsp;function<br />
obtain_samples()&nbsp;in fact calls the collector's&nbsp;collect_batch()&nbsp;function to collect the training data:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#cc7832;font-weight:bold;">def </span><span style="font-weight:bold;">obtain_samples</span>(<span style="color:#94558d;">self</span><span style="color:#cc7832;">, </span>itr):<span style="color:#808080;">
</span><span style="color:#808080;">    </span>agent_inputs<span style="color:#cc7832;">, </span>traj_infos<span style="color:#cc7832;">, </span>completed_infos = <span style="color:#94558d;">self</span>.collector.<span style="color:#cc7833;">collect_batch</span>(
        <span style="color:#94558d;">self</span>.agent_inputs<span style="color:#cc7832;">, </span><span style="color:#94558d;">self</span>.traj_infos<span style="color:#cc7832;">, </span>itr)
    <span style="color:#94558d;">self</span>.collector.<span style="color:#cc7833;">reset_if_needed</span>(agent_inputs)
    <span style="color:#94558d;">self</span>.agent_inputs = agent_inputs
    <span style="color:#94558d;">self</span>.traj_infos = traj_infos
    <span style="color:#cc7832;font-weight:bold;">return </span><span style="color:#94558d;">self</span>.samples_pyt<span style="color:#cc7832;">, </span>completed_infos</pre>
<p><span style="color:#0000ff;">One thing that looks odd here</span>: the collected data&nbsp;self.samples_pyt&nbsp;is never updated inside&nbsp;collect_batch(), so why is the&nbsp;self.samples_pyt&nbsp;we get after collecting each batch always up to date?<br />
Patterns like this are everywhere in rlpyt, and they quietly make the source harder to understand.<br />
To get to the bottom of this, let's see how&nbsp;self.samples_pyt&nbsp;is defined. In the&nbsp;initialize()&nbsp;function there is:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
<span style="color:#94558d;">self</span>.samples_pyt = samples_pyt</pre>
<p>And&nbsp;samples_pyt&nbsp;is created by another function:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
samples_pyt<span style="color:#cc7832;">, </span>samples_np<span style="color:#cc7832;">, </span>examples = <span style="color:#cc7833;">build_samples_buffer</span>(agent<span style="color:#cc7832;">, </span>envs[<span style="color:#6897bb;">0</span>]<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#94558d;">self</span>.batch_spec<span style="color:#cc7832;">, </span>bootstrap_value<span style="color:#cc7832;">, </span><span style="color:#aa4926;">agent_shared</span>=<span style="color:#cc7832;font-weight:bold;">False</span><span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">env_shared</span>=<span style="color:#cc7832;font-weight:bold;">False</span><span style="color:#cc7832;">, </span><span style="color:#aa4926;">subprocess</span>=<span style="color:#cc7832;font-weight:bold;">False</span>)</pre>
<p>Stepping into that function shows that samples_pyt&nbsp;is simply samples_np&nbsp;converted to the corresponding tensor&nbsp;form. Since PyTorch&nbsp;tensors and NumPy arrays&nbsp;can share their underlying memory, modifying one modifies the other, so <span style="color:#ff0000;">samples_pyt</span>&nbsp;<span style="color:#ff0000;">and&nbsp;</span><span style="color:#0000ff;">samples_np</span>&nbsp;<span style="color:#ff0000;">can be regarded as the same thing underneath</span>&nbsp;(<span style="color:#0000ff;">I hold some reservations here: I cannot yet fully confirm this claim and would need a deeper read of the rlpyt&nbsp;source to give a definitive answer, but let's go with this understanding for now</span>).<br />
Furthermore, samples_np&nbsp;is passed to the collector's constructor:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
collector = <span style="color:#94558d;">self</span>.<span style="color:#cc7833;">CollectorCls</span>(
    <span style="color:#aa4926;">rank</span>=<span style="color:#6897bb;">0</span><span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">envs</span>=envs<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">samples_np</span>=samples_np<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">batch_T</span>=<span style="color:#94558d;">self</span>.batch_spec.T<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">TrajInfoCls</span>=<span style="color:#94558d;">self</span>.TrajInfoCls<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">agent</span>=agent<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">global_B</span>=global_B<span style="color:#cc7832;">,
</span><span style="color:#cc7832;">    </span><span style="color:#aa4926;">env_ranks</span>=env_ranks<span style="color:#cc7832;">,  </span><span style="color:#808080;"># Might get applied redundantly to agent.
</span>)</pre>
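<p>Incidentally, the memory-sharing claim made above is easy to verify in isolation: torch.from_numpy produces a tensor view over the NumPy buffer, so a write through either side is visible through the other. A standalone check (not rlpyt code):</p>

```python
import numpy as np
import torch

buf_np = np.zeros(3, dtype=np.float32)
buf_pyt = torch.from_numpy(buf_np)  # tensor view over the same memory

buf_np[1] = 7.0            # write through the NumPy side...
print(buf_pyt[1].item())   # -> 7.0: visible through the tensor side too
```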
<p>So this effectively ties&nbsp;samples_pyt&nbsp;to the collector&nbsp;class.<br />
Looking again at example_1's collector&nbsp;class (i.e.&nbsp;<span style="color:#0000ff;">CpuResetCollector</span>) and its&nbsp;collect_batch()&nbsp;function, it does indeed use samples_np&nbsp;when computing its return values:</p>
<pre style="background-color:#2b2b2b;color:#a9b7c6;font-family:'Menlo';font-size:12.0pt;">
agent_buf<span style="color:#cc7832;">, </span>env_buf = <span style="color:#94558d;">self</span>.samples_np.agent<span style="color:#cc7832;">, </span><span style="color:#94558d;">self</span>.samples_np.env</pre>
<p>After this detour, the&nbsp;self.samples_pyt&nbsp;returned by&nbsp;obtain_samples()&nbsp;makes sense.<br />
That's all for this installment; stay tuned for the next one.<br />
<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;Copyright notice&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
Please credit the source when reposting: <u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
Thanks for following my WeChat official account (scan the QR code):</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
	<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%bc%ba%e5%8c%96%e5%ad%a6%e4%b9%a0%e6%a1%86%e6%9e%b6-rlpyt-%e6%ba%90%e7%a0%81%e5%88%86%e6%9e%90%ef%bc%9a4-%e6%94%b6%e9%9b%86%e8%ae%ad%e7%bb%83%e6%95%b0%e6%8d%ae%e7%9a%84sample/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
