<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Deep Deterministic Policy Gradient &#8211; 编码无悔 /  Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/deep-deterministic-policy-gradient/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>最优化之路</description>
	<lastBuildDate>Mon, 27 Apr 2020 17:15:16 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
[Original] How to Understand the "Deterministic" in DDPG (Deep Deterministic Policy Gradient)

Link: https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e6%80%8e%e4%b9%88%e7%90%86%e8%a7%a3ddpgdeep-deterministic-policy-gradient%e9%87%8c%e7%9a%84deterministic/
Author: learnhard
Published: Sun, 29 Sep 2019 07:20:36 +0000
Categories: Algorithm, Original, DDPG, Deep Deterministic Policy Gradient
DDPG (Deep Deterministic Policy Gradient) is a well-known algorithm in reinforcement learning.

How should we understand the word "Deterministic" in its name?

Informally: the action taken in a given state may be random. Across two interactions with the environment, even if the state is exactly the same, the actions taken may differ; such a policy is not "deterministic". For a "deterministic" policy, as long as the state is the same, the action it outputs is necessarily the same.
Stochastic policy: \pi(a|s) = P[a|s]
Deterministic policy: a = \mu(s)

Here a denotes the action and s denotes the state. These simple formulas show the difference: under a stochastic policy, when in some state, the chance of taking a particular action a is not 100% but some probability P, like drawing a lottery. Under a deterministic policy there is no probability involved: feeding in the same s necessarily produces the same a.
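To make the contrast concrete, here is a minimal sketch (my own illustration, not from the original post) on a toy problem with a 3-dimensional state and 4 discrete actions; the weight matrix W and the softmax parameterization are assumptions made purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # hypothetical weights: 3-dim state -> 4 discrete actions

def stochastic_policy(s):
    """pi(a|s) = P[a|s]: the action is drawn from a probability distribution."""
    logits = s @ W
    p = np.exp(logits - logits.max())
    p = p / p.sum()                 # softmax: the probabilities P[a|s]
    return rng.choice(len(p), p=p)  # sampling: the same s can yield different a

def deterministic_policy(s):
    """a = mu(s): the same state always maps to the same action."""
    return int(np.argmax(s @ W))    # no sampling, so the output never varies

s = np.array([0.5, -1.0, 2.0])
print([stochastic_policy(s) for _ in range(5)])     # typically varies across calls
print([deterministic_policy(s) for _ in range(5)])  # identical every call
```

(DDPG itself works with continuous actions, where \mu(s) outputs an action vector directly, but the contrast is the same.)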
Exploration is an important factor in training a good agent. Under a deterministic policy, since a given state always yields the same action, the policy itself loses all exploration. One way to restore exploration is to add noise to the parameters of the policy network [1], so that the same state, after passing through the policy network, can still produce different actions.
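A minimal sketch of that parameter-noise idea (again my own illustration, following [1]; the network shape and the noise scale sigma are assumptions): the noise is added to the weights themselves rather than to the action, so the perturbed deterministic policy gives varied actions for the same state:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))  # hypothetical policy weights: 3-dim state -> 2-dim action

def mu(s, weights):
    """Deterministic policy a = mu(s) for a continuous action space."""
    return np.tanh(s @ weights)

def perturbed(weights, sigma=0.1):
    """Parameter noise: perturb the weights, not the output action [1]."""
    return weights + sigma * rng.normal(size=weights.shape)

s = np.array([0.5, -1.0, 2.0])
print(mu(s, W))                 # no noise: the same action every time
for _ in range(3):
    print(mu(s, perturbed(W)))  # same state, different actions -> exploration
```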
<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a></p>
References:
[1] Better Exploration with Parameter Noise, OpenAI, https://openai.com/blog/better-exploration-with-parameter-noise/
					
