<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>gradient descent &#8211; 编码无悔 /  Intent &amp; Focused</title>
	<atom:link href="https://www.codelast.com/tag/gradient-descent/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.codelast.com</link>
	<description>The path of optimization</description>
	<lastBuildDate>Tue, 28 Apr 2020 02:11:54 +0000</lastBuildDate>
	<language>zh-Hans</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>[Original] Revisiting Gradient Descent / Steepest Descent</title>
		<link>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%86%8d%e8%b0%88-%e6%9c%80%e9%80%9f%e4%b8%8b%e9%99%8d%e6%b3%95%e6%a2%af%e5%ba%a6%e6%b3%95steepest-descent/</link>
					<comments>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%86%8d%e8%b0%88-%e6%9c%80%e9%80%9f%e4%b8%8b%e9%99%8d%e6%b3%95%e6%a2%af%e5%ba%a6%e6%b3%95steepest-descent/#comments</comments>
		
		<dc:creator><![CDATA[learnhard]]></dc:creator>
		<pubDate>Wed, 02 Apr 2014 16:23:41 +0000</pubDate>
				<category><![CDATA[Algorithm]]></category>
		<category><![CDATA[Math]]></category>
		<category><![CDATA[原创]]></category>
		<category><![CDATA[gradient descent]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[steepest descent]]></category>
		<category><![CDATA[最优化]]></category>
		<category><![CDATA[最速下降法]]></category>
		<category><![CDATA[梯度下降法]]></category>
		<guid isPermaLink="false">http://www.codelast.com/?p=8006</guid>

					<description><![CDATA[<p>
Today, deep learning applications have permeated every aspect of our lives, and the core problem behind deep learning technology is optimization. Optimization is a branch of applied mathematics: the collective name for the disciplines that study how, under given constraints, to choose the values of certain factors so that one or more objectives reach their optimum.<br />
Gradient descent (also called steepest descent) is the oldest and simplest algorithm in unconstrained <a href="http://zh.wikipedia.org/wiki/%E6%9C%80%E4%BC%98%E5%8C%96" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">optimization</span></a>. Taken on its own, it became "outdated" long ago, but its idea is a building block of other algorithms; in other words, you can see the "shadow" of gradient descent in them. For example, deep learning libraries commonly use SGD (Stochastic Gradient Descent) or one of its variants as the optimizer.<br />
Let's revisit the basics of gradient descent today.<br />
<span id="more-8006"></span><br />
<span style="background-color:#00ff00;">『1』</span>What the names mean<br />
In many machine learning algorithms, we minimize the value of a loss function through many rounds of iterative computation; in optimization terms, this loss function is the so-called "objective function".<br />
While searching for the optimum, gradient descent uses only the first-order derivative information of the objective function, as the name "gradient" already hints. Its intent is to take the direction in which the objective value decreases fastest as the search direction, which is where the name "steepest descent" comes from.<br />
So a question naturally arises: along which direction does the value of the objective function  <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_50bbd36e1fd2333108437a2ca378be62.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x)" /></span><script type='math/tex'>f(x)</script>  decrease fastest?</p>
<p><span style="background-color:#00ff00;">『2』</span>Which direction makes the function value decrease fastest<br />
Conclusion first: the function value decreases fastest along the negative gradient direction <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_aaf665c8dc32efb0ed392432cb3091ef.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d = - {g_k}" /></span><script type='math/tex'>d = - {g_k}</script> . Here we use <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_8277e0910d750195b448797616e091ad.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d" /></span><script type='math/tex'>d</script> for a direction and <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_b2f5ff47436671b6e533d8dc3614845d.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="g" /></span><script type='math/tex'>g</script> for a gradient.<br />
Let's derive it.<br />
Expand the objective function <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_50bbd36e1fd2333108437a2ca378be62.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x)" /></span><script type='math/tex'>f(x)</script> in a Taylor series at the point <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_550187f469eda08b9e5b55143f19c4ce.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{x_k}" /></span><script type='math/tex'>{x_k}</script> , writing <span class='MathJax_Preview'>x = x_k + αd_k</span><script type='math/tex'>x = {x_k} + \alpha {d_k}</script> (a standard maneuver in optimization):<br />
 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_37ec8b5a5be749597c9ef1bb8e9b1d2d.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x) = f({x_k}) + \alpha g_k^T{d_k} + o(\alpha )" /></span><script type='math/tex'>f(x) = f({x_k}) + \alpha g_k^T{d_k} + o(\alpha )</script> <br />
The higher-order infinitesimal <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_c42304a467a6fa377fc950fcc6a5ccf9.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="o(\alpha )" /></span><script type='math/tex'>o(\alpha )</script> can be neglected. Since the step size <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_7f54bc2116eb53e3231634d008c92d90.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\alpha > 0" /></span><script type='math/tex'>\alpha > 0</script> (in ML, the step size is what is usually called the learning rate), whenever <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_ead119a63ddef55ab91efbf5514e3609.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="g_k^T{d_k} < 0" /></span><script type='math/tex'>g_k^T{d_k} < 0</script> we get <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_320ccfc2694ebec4aa2bccff656c7baa.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x) < f({x_k})" /></span><script type='math/tex'>f(x) < f({x_k})</script> , i.e. the function value <span style="color:#0000ff;">decreases</span>. Such a <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_66eea6bfeea7fcb327d435f627a2390b.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k}" /></span><script type='math/tex'>{d_k}</script> is a descent direction.<br />
But for exactly which <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_66eea6bfeea7fcb327d435f627a2390b.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k}" /></span><script type='math/tex'>{d_k}</script> does the objective value decrease fastest?<br />
<span style="color: rgb(255, 255, 255);">Source: </span><a href="http://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">http://www.codelast.com/</span></a>&#8230; <a href="https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%86%8d%e8%b0%88-%e6%9c%80%e9%80%9f%e4%b8%8b%e9%99%8d%e6%b3%95%e6%a2%af%e5%ba%a6%e6%b3%95steepest-descent/" class="read-more">Read More </a></p>]]></description>
										<content:encoded><![CDATA[<p>
Today, deep learning applications have permeated every aspect of our lives, and the core problem behind deep learning technology is optimization. Optimization is a branch of applied mathematics: the collective name for the disciplines that study how, under given constraints, to choose the values of certain factors so that one or more objectives reach their optimum.<br />
Gradient descent (also called steepest descent) is the oldest and simplest algorithm in unconstrained <a href="http://zh.wikipedia.org/wiki/%E6%9C%80%E4%BC%98%E5%8C%96" rel="noopener noreferrer" target="_blank"><span style="background-color: rgb(255, 160, 122);">optimization</span></a>. Taken on its own, it became "outdated" long ago, but its idea is a building block of other algorithms; in other words, you can see the "shadow" of gradient descent in them. For example, deep learning libraries commonly use SGD (Stochastic Gradient Descent) or one of its variants as the optimizer.<br />
Let's revisit the basics of gradient descent today.<br />
<span id="more-8006"></span><br />
<span style="background-color:#00ff00;">『1』</span>What the names mean<br />
In many machine learning algorithms, we minimize the value of a loss function through many rounds of iterative computation; in optimization terms, this loss function is the so-called "objective function".<br />
While searching for the optimum, gradient descent uses only the first-order derivative information of the objective function, as the name "gradient" already hints. Its intent is to take the direction in which the objective value decreases fastest as the search direction, which is where the name "steepest descent" comes from.<br />
So a question naturally arises: along which direction does the value of the objective function  <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_50bbd36e1fd2333108437a2ca378be62.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x)" /></span><script type='math/tex'>f(x)</script>  decrease fastest?</p>
<p><span style="background-color:#00ff00;">『2』</span>Which direction makes the function value decrease fastest<br />
Conclusion first: the function value decreases fastest along the negative gradient direction <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_aaf665c8dc32efb0ed392432cb3091ef.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d = - {g_k}" /></span><script type='math/tex'>d = - {g_k}</script> . Here we use <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_8277e0910d750195b448797616e091ad.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d" /></span><script type='math/tex'>d</script> for a direction and <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_b2f5ff47436671b6e533d8dc3614845d.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="g" /></span><script type='math/tex'>g</script> for a gradient.<br />
Let's derive it.<br />
Expand the objective function <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_50bbd36e1fd2333108437a2ca378be62.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x)" /></span><script type='math/tex'>f(x)</script> in a Taylor series at the point <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_550187f469eda08b9e5b55143f19c4ce.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{x_k}" /></span><script type='math/tex'>{x_k}</script> , writing <span class='MathJax_Preview'>x = x_k + αd_k</span><script type='math/tex'>x = {x_k} + \alpha {d_k}</script> (a standard maneuver in optimization):<br />
 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_37ec8b5a5be749597c9ef1bb8e9b1d2d.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x) = f({x_k}) + \alpha g_k^T{d_k} + o(\alpha )" /></span><script type='math/tex'>f(x) = f({x_k}) + \alpha g_k^T{d_k} + o(\alpha )</script> <br />
The higher-order infinitesimal <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_c42304a467a6fa377fc950fcc6a5ccf9.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="o(\alpha )" /></span><script type='math/tex'>o(\alpha )</script> can be neglected. Since the step size <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_7f54bc2116eb53e3231634d008c92d90.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\alpha > 0" /></span><script type='math/tex'>\alpha > 0</script> (in ML, the step size is what is usually called the learning rate), whenever <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_ead119a63ddef55ab91efbf5514e3609.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="g_k^T{d_k} < 0" /></span><script type='math/tex'>g_k^T{d_k} < 0</script> we get <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_320ccfc2694ebec4aa2bccff656c7baa.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x) < f({x_k})" /></span><script type='math/tex'>f(x) < f({x_k})</script> , i.e. the function value <span style="color:#0000ff;">decreases</span>. Such a <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_66eea6bfeea7fcb327d435f627a2390b.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k}" /></span><script type='math/tex'>{d_k}</script> is a descent direction.<br />
But for exactly which <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_66eea6bfeea7fcb327d435f627a2390b.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k}" /></span><script type='math/tex'>{d_k}</script> does the objective value decrease fastest?<br />
In mathematics there is a very famous inequality, the <a href="http://www.codelast.com/?p=8022" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">Cauchy-Schwarz inequality</span></a><sup><span style="font-size: 13.3333px;">1</span></sup>, which comes in handy in many situations:</p>
<div>
	<span style="color:#b22222;"> <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_2faef7bfff1303da60951a77f6c7d4d0.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="({a_1}{b_1} + {a_2}{b_2} + \cdots + {a_n}{b_n}) \le \sqrt {(a_1^2 + a_2^2 + \cdots + a_n^2)} \sqrt {(b_1^2 + b_2^2 + \cdots + b_n^2)} " /></span><script type='math/tex'>({a_1}{b_1} + {a_2}{b_2} + \cdots + {a_n}{b_n}) \le \sqrt {(a_1^2 + a_2^2 + \cdots + a_n^2)} \sqrt {(b_1^2 + b_2^2 + \cdots + b_n^2)} </script> </span></div>
<div>
	with equality if and only if</div>
<div>
	<span style="color:#b22222;"> <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_6337b1fff757e26d559865b0f907b86f.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\frac{{{a_1}}}{{{b_1}}} = \frac{{{a_2}}}{{{b_2}}} = \cdots = \frac{{{a_n}}}{{{b_n}}}" /></span><script type='math/tex'>\frac{{{a_1}}}{{{b_1}}} = \frac{{{a_2}}}{{{b_2}}} = \cdots = \frac{{{a_n}}}{{{b_n}}}</script> </span></div>
<div>
	(that is, when the two vectors are proportional).<br />
	&nbsp;</div>
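The inequality and its equality condition are easy to check numerically. A small illustrative sketch (not from the original post; the vectors are arbitrary random choices):

```python
import math
import random

def dot(u, v):
    # plain dot product of two equal-length vectors
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

random.seed(0)
for _ in range(1000):
    a = [random.uniform(-5, 5) for _ in range(4)]
    b = [random.uniform(-5, 5) for _ in range(4)]
    # Cauchy-Schwarz: a.b <= ||a|| * ||b|| (small tolerance for rounding)
    assert dot(a, b) <= norm(a) * norm(b) + 1e-12

# equality exactly when one vector is a multiple of the other
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0 * ai for ai in a]
assert abs(dot(a, b) - norm(a) * norm(b)) < 1e-9
```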
<p>By the Cauchy-Schwarz inequality:<br />
 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_0a0acce3485256ce4904d0abea1a4717.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\left| {d_k^T{g_k}} \right| \le \left\| {{d_k}} \right\|\left\| {{g_k}} \right\|" /></span><script type='math/tex'>\left| {d_k^T{g_k}} \right| \le \left\| {{d_k}} \right\|\left\| {{g_k}} \right\|</script> <br />
Equality holds when <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_a7c9e705147a8c9a65808417da4ea36f.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k} = {g_k}" /></span><script type='math/tex'>{d_k} = {g_k}</script> (more generally, when the two vectors are parallel), and then <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_62f84a80c057c4892a5e22755b4976aa.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d_k^T{g_k}" /></span><script type='math/tex'>d_k^T{g_k}</script> is maximal (&gt;0).<br />
So when <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_6f02d7b4f8531090efdc877eb851f093.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k} = - {g_k}" /></span><script type='math/tex'>{d_k} = - {g_k}</script> , <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_62f84a80c057c4892a5e22755b4976aa.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d_k^T{g_k}" /></span><script type='math/tex'>d_k^T{g_k}</script> is minimal (&lt;0), and the decrease in <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_50bbd36e1fd2333108437a2ca378be62.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x)" /></span><script type='math/tex'>f(x)</script> is the largest.<br />
Hence <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_be9976a20363f7c49bb370084b76dca7.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt=" - {g_k}" /></span><script type='math/tex'> - {g_k}</script> is the <span style="color:#ff0000;">steepest</span> descent direction.</p>
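This conclusion can also be checked numerically: for a small fixed step, the normalized negative gradient should decrease the function at least as much as any other unit direction. A minimal sketch under illustrative assumptions (the function f, the point, and the step size are my own choices, not from the original post):

```python
import math
import random

def f(x1, x2):
    # an arbitrary smooth function used only for this check
    return x1 ** 2 + 3 * x2 ** 2 + x1 * x2

def grad(x1, x2):
    # analytic gradient of f
    return (2 * x1 + x2, x1 + 6 * x2)

x = (1.0, 2.0)
alpha = 1e-6  # small step so the first-order term dominates

def decrease(d):
    # decrease of f after one step of length alpha along unit direction d
    n = math.hypot(d[0], d[1])
    u = (d[0] / n, d[1] / n)
    return f(*x) - f(x[0] + alpha * u[0], x[1] + alpha * u[1])

g = grad(*x)
best = decrease((-g[0], -g[1]))  # step along the negative gradient

random.seed(0)
others = [decrease((random.gauss(0, 1), random.gauss(0, 1)))
          for _ in range(1000)]
# no sampled direction does better (up to floating-point noise)
assert best >= max(others) - 1e-12
```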
<p><span style="background-color:#00ff00;">『3』</span>Drawbacks<br />
Is it really the "fastest", as its name claims? From many classic optimization books you will learn: it is not.<br />
In fact, the "steepest" property holds only locally; over the whole process of finding the optimum, it can make the objective value decrease very slowly.</p>
<p><span style="background-color:#00ff00;">『4』</span>A feel for how "slow" it is<br />
First, look at a figure<sup>2</sup>:</p>
<div style="text-align: center;">
	<img decoding="async" alt="" src="http://www.codelast.com/wp-content/uploads/ckfinder/images/Rosenbrock_function.png" style="width: 378px; height: 302px;" /></div>
<p>
This figure shows the process of searching for the optimum of an objective function; the zigzag path is the projection of the search path onto the 2-D plane. Notice that the zigzags start large (each covering a long distance) and then get smaller and smaller, like a person whose walking strides start out long and keep shortening.<br />
The expression of this function is:<br />
 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_4526ba6075c8c9c1d532166726779a6a.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f({x_1},{x_2}) = {(1 - {x_1})^2} + 100 \cdot {({x_2} - {x_1}^2)^2}" /></span><script type='math/tex'>f({x_1},{x_2}) = {(1 - {x_1})^2} + 100 \cdot {({x_2} - {x_1}^2)^2}</script> <br />
It is the <span style="color:#0000ff;">Rosenbrock function<sup>3</sup></span>, a non-convex function that serves as a performance test for optimization algorithms. It also has a catchier, funnier name: the banana function.<br />
Let's look at its graph in 3-D space:</p>
<div style="text-align: center;">
	<img decoding="async" alt="Rosenbrock function 3D" src="http://www.codelast.com/wp-content/uploads/ckfinder/images/Rosenbrock_function_3d.jpg" style="width: 450px; height: 338px;" /></div>
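To feel this slowness concretely, one can run fixed-step gradient descent on the Rosenbrock function itself. In the sketch below, the starting point, step size, and iteration count are my own illustrative choices; even after tens of thousands of iterations the iterate has only just crawled to the neighbourhood of the optimum:

```python
def rosenbrock(x1, x2):
    return (1 - x1) ** 2 + 100 * (x2 - x1 ** 2) ** 2

def rosenbrock_grad(x1, x2):
    # analytic partial derivatives of the Rosenbrock function
    d1 = -2 * (1 - x1) - 400 * x1 * (x2 - x1 ** 2)
    d2 = 200 * (x2 - x1 ** 2)
    return d1, d2

x1, x2 = -1.0, 1.0   # start on the far side of the "valley"
alpha = 1e-3         # fixed step size (learning rate)

for _ in range(50_000):
    g1, g2 = rosenbrock_grad(x1, x2)
    x1, x2 = x1 - alpha * g1, x2 - alpha * g2

# after 50,000 iterations the point has slowly crept toward (1, 1)
```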
<div>
	Its global optimum lies inside a long, narrow, parabola-shaped flat "valley".</div>
<div>
	Finding the "valley" is not hard; what is hard is converging to the global optimum at (1,1).</div>
<div>
	As the saying goes: <span style="color:#800000;">the greatest distance in the world is not being thousands of miles apart, but that you are right before my eyes and I still have to take thousands upon thousands of steps to reach you</span>.</div>
<div>
	Now let's look at the optimization process<sup>4</sup> for another objective function <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_8befafb138a9640e761679a0f15f30bb.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="f(x,y) = \sin \left( {\frac{1}{2}{x^2} - \frac{1}{4}{y^2} + 3} \right)\cos \left( {2x + 1 - {e^y}} \right)" /></span><script type='math/tex'>f(x,y) = \sin \left( {\frac{1}{2}{x^2} - \frac{1}{4}{y^2} + 3} \right)\cos \left( {2x + 1 - {e^y}} \right)</script> :</div>
<div style="text-align: center;">
	<img decoding="async" alt="" src="http://www.codelast.com/wp-content/uploads/ckfinder/images/function_find_opt_process.jpg" style="width: 310px; height: 309px;" /></div>
<div>
	Like the Rosenbrock function above, its search path is also "zigzag".<br />
	Its graph in 3-D space looks like this:</div>
<div style="text-align: center;">
	<img decoding="async" alt="" src="http://www.codelast.com/wp-content/uploads/ckfinder/images/function_find_opt_process_3d.jpg" style="width: 365px; height: 300px;" /></div>
<div>
	In short: when the level sets of the objective are close to circles (spheres), descent is fairly fast; when they resemble elongated ellipsoids, it is fast at first and then becomes very slow.</div>
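This remark about level sets can be made quantitative on the simplest possible example, a separable quadratic whose level sets are ellipses. The function, tolerance, and step-size rule below are my own illustration (the step 2/(1 + kappa) is the classical optimal fixed step for this quadratic):

```python
def steps_to_converge(kappa, tol=1e-8):
    """Fixed-step gradient descent on f(x, y) = (x**2 + kappa * y**2) / 2.
    Level sets are ellipses; kappa = 1 gives circles.
    Returns iterations until the gradient norm drops below tol."""
    alpha = 2.0 / (1.0 + kappa)
    x, y = 1.0, 1.0
    steps = 0
    while (x * x + (kappa * y) ** 2) ** 0.5 > tol:  # gradient is (x, kappa*y)
        x, y = x - alpha * x, y - alpha * kappa * y
        steps += 1
    return steps

round_case = steps_to_converge(kappa=1.0)     # circular level sets: immediate
elongated = steps_to_converge(kappa=100.0)    # elongated ellipses: a long crawl
```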
<div>
	<span style="background-color:#00ff00;">『5』</span>Why it is "slow"<br />
	The colorful figures above show how "arduous" the search for the optimum is, but we should not just enjoy the show; let's analyze the cause.<br />
	In optimization algorithms, an exact line search satisfies a <a href="http://www.codelast.com/?p=7838" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">first-order necessary condition</span></a>: the dot product of the gradient and the search direction is zero. That is, the gradient at the point reached by moving along <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_66eea6bfeea7fcb327d435f627a2390b.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k}" /></span><script type='math/tex'>{d_k}</script> , namely <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_dfce03f6c63c6112ec0a9e19d3390177.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{x_k} + {\alpha _k}{d_k}" /></span><script type='math/tex'>{x_k} + {\alpha _k}{d_k}</script> , is orthogonal to the current search direction <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_66eea6bfeea7fcb327d435f627a2390b.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k}" /></span><script type='math/tex'>{d_k}</script> .<br />
	Hence:<br />
	 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_79c40a2ebc114054140981fdaed79c9f.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\nabla f{({x_k} + {\alpha _k}{d_k})^T}{d_k} = 0" /></span><script type='math/tex'>\nabla f{({x_k} + {\alpha _k}{d_k})^T}{d_k} = 0</script> ，即 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_2b4eb83bd55e78edd9cda00404acfeb0.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="g_{k + 1}^T{d_k} = 0" /></span><script type='math/tex'>g_{k + 1}^T{d_k} = 0</script> <br />
	So, since gradient descent takes <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_6f02d7b4f8531090efdc877eb851f093.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="{d_k} = - {g_k}" /></span><script type='math/tex'>{d_k} = - {g_k}</script> , we get:<br />
	 <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_03f082fe728fefc7e4909366635af3e9.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="g_{k + 1}^T{d_k} = g_{k + 1}^T( - {g_k}) = - g_{k + 1}^T{g_k} = - d_{k + 1}^T{d_k} = 0 \Rightarrow " /></span><script type='math/tex'>g_{k + 1}^T{d_k} = g_{k + 1}^T( - {g_k}) = - g_{k + 1}^T{g_k} = - d_{k + 1}^T{d_k} = 0 \Rightarrow </script> <span style="color:#ff0000;"> <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_fb24eac9024e19c63ae8e21214df893f.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d_{k + 1}^T{d_k} = 0" /></span><script type='math/tex'>d_{k + 1}^T{d_k} = 0</script> </span><br />
	That is, consecutive search directions are orthogonal to each other (projected onto a 2-D plane, that is exactly the zigzag pattern).<br />
	If you insist on asking why <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_fb24eac9024e19c63ae8e21214df893f.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="d_{k + 1}^T{d_k} = 0" /></span><script type='math/tex'>d_{k + 1}^T{d_k} = 0</script> means the two vectors are orthogonal: by the formula for the angle between two vectors,<br />
	 <span class='MathJax_Preview'>cos θ = d_{k+1}ᵀd_k / (‖d_{k+1}‖‖d_k‖) = 0</span><script type='math/tex'>\cos \theta = \frac{{d_{k + 1}^T{d_k}}}{{\left\| {{d_{k + 1}}} \right\|\left\| {{d_k}} \right\|}} = \frac{0}{{\left\| {{d_{k + 1}}} \right\|\left\| {{d_k}} \right\|}} = 0\;</script> <br />
	=&gt;<span style="color:#ff0000;">  <span class='MathJax_Preview'><img src='https://www.codelast.com/wp-content/plugins/latex/cache/tex_f898dd39bd002624d891bc76fb86aa9f.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\theta = \frac{\pi }{2}" /></span><script type='math/tex'>\theta = \frac{\pi }{2}</script> </span><br />
	so the angle between the two vectors is 90 degrees, i.e. they are orthogonal.</p>
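The zigzag can be reproduced in a few lines: run steepest descent with an exact line search on a quadratic f(x) = x^T A x / 2 - b^T x, where the exact step has the closed form alpha = d^T d / d^T A d, and check that consecutive directions are orthogonal. The matrix A, vector b, and starting point below are my own illustrative choices:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric positive definite (illustrative)
b = [1.0, -1.0]
x = [5.0, -3.0]                # arbitrary starting point

cosines = []                   # |cos| of angle between consecutive directions
prev_d = None
for _ in range(8):
    g = [gi - bi for gi, bi in zip(matvec(A, x), b)]  # gradient: A x - b
    d = [-g[0], -g[1]]                                # steepest descent direction
    alpha = dot(d, d) / dot(d, matvec(A, d))          # exact line search step
    if prev_d is not None:
        cosines.append(abs(dot(d, prev_d))
                       / (dot(d, d) ** 0.5 * dot(prev_d, prev_d) ** 0.5))
    x = [xi + alpha * di for xi, di in zip(x, d)]
    prev_d = d

# every consecutive pair of search directions is (numerically) orthogonal
assert max(cosines) < 1e-9
```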
<p>	<span style="background-color:#00ff00;">『6』</span>Advantages<br />
	Is this method, which we have been talking down, really that bad? It does have merits: the program is simple and the computation per iteration is cheap; it places no special requirements on the starting point; moreover, many algorithms use the steepest descent direction (the negative gradient) as their initial or restart direction.<br />
	<span style="background-color:#00ff00;">『7』</span>Convergence and convergence rate<br />
	Gradient descent is globally convergent: it has no special requirements on the starting point.<br />
	With an <a href="http://www.codelast.com/?p=2348" rel="noopener noreferrer" target="_blank"><span style="background-color:#ffa07a;">exact line search</span></a>, the convergence rate of gradient descent is linear.<br />
	&nbsp;</div>
<ul>
<li>
		References</li>
</ul>
<div>
	(1) https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality<br />
	(2) https://en.wikipedia.org/wiki/Gradient_descent<br />
	(3) https://en.wikipedia.org/wiki/Rosenbrock_function<br />
	(4) https://en.wikipedia.org/wiki/Gradient_descent</p>
<p>	<span style="color: rgb(255, 255, 255);">文章来源：</span><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><span style="color: rgb(255, 255, 255);">https://www.codelast.com/</span></a><br />
	<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;版权声明&nbsp;<span style="color: rgb(255, 0, 0);">➤➤</span>&nbsp;<br />
	转载需注明出处：<u><a href="https://www.codelast.com/" rel="noopener noreferrer" target="_blank"><em><span style="color: rgb(0, 0, 255);"><strong style="font-size: 16px;"><span style="font-family: arial, helvetica, sans-serif;">codelast.com</span></strong></span></em></a></u>&nbsp;<br />
	感谢关注我的微信公众号（微信扫一扫）：</p>
<p style="border: 0px; font-size: 13px; margin: 0px 0px 9px; outline: 0px; padding: 0px; color: rgb(77, 77, 77);">
		<img decoding="async" alt="wechat qrcode of codelast" src="https://www.codelast.com/codelast_wechat_qr_code.jpg" style="width: 200px; height: 200px;" /></p>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://www.codelast.com/%e5%8e%9f%e5%88%9b-%e5%86%8d%e8%b0%88-%e6%9c%80%e9%80%9f%e4%b8%8b%e9%99%8d%e6%b3%95%e6%a2%af%e5%ba%a6%e6%b3%95steepest-descent/feed/</wfw:commentRss>
			<slash:comments>9</slash:comments>
		
		
			</item>
	</channel>
</rss>
