[Original] Source Code Analysis of the Reinforcement Learning Framework rlpyt: Preface


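rlpyt is a reinforcement learning (RL) framework open-sourced by BAIR (Berkeley Artificial Intelligence Research). I previously wrote an introduction to it. If you want to use this framework to develop your own RL programs (especially ones outside the Atari game domain), you need some understanding of its source code. This article tries to analyze part of that source code by following one of the examples that ships with rlpyt, in the hope that it helps at least a few people.
A disclaimer first: the rlpyt source is fairly complex, and fully understanding every module takes a lot of effort. This "source code analysis" series does not walk through all of rlpyt. It covers only the tip of the iceberg, with the main goal of showing readers its basic structure and how it basically works.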

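▶▶ A rant
An opening section would not be complete without a bit of complaining.
From my "non-professional" point of view, the rlpyt source uses a lot of Python tricks, and the author often ignores the PEP8 coding conventions, which makes the code formatting look rather messy. So my guess is that the author does not use PyCharm for Python development: if you ignore PEP8, PyCharm fills the screen with squiggly underlines, which would surely drive the author crazy. Here is a screenshot to give you a feel for it:
[Figure: rlpyt PEP8 problem]
Not every squiggly line in the screenshot above is a PEP8 issue, but at least 6 of them are. So I figure the author, like many Python experts, develops without an IDE and just uses something like Notepad? In any case, seeing this in PyCharm drives me crazy.
Source: https://www.codelast.com/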

▶▶ Main modules
First, let's look at the description of the code modules given by the rlpyt author:

Runner - Connects the sampler, agent, and algorithm; manages the training loop and logging of diagnostics.

Sampler - Manages agent / environment interaction to collect training data, can initialize parallel workers.

Collector - Steps environments (and maybe operates agent) and records samples, attached to sampler.

Environment - The task to be learned.

Observation Space/Action Space - Interface specifications from environment to agent.

TrajectoryInfo - Diagnostics logged on a per-trajectory basis.

Agent - Chooses control action to the environment in sampler; trained by the algorithm. Interface to model.

Model - Torch neural network module, attached to the agent.

Distribution - Samples actions for stochastic agents and defines related formulas for use in loss function, attached to the agent.

Algorithm - Uses gathered samples to train the agent (e.g. defines a loss function and performs gradient descent).

Optimizer - Training update rule (e.g. Adam), attached to the algorithm.

OptimizationInfo - Diagnostics logged on a per-training batch basis.

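Read on their own, the descriptions above may make your head spin. Don't panic. From them we can take away roughly the following:
☀ rlpyt's modular abstraction is well done. At first glance, though, several modules seem to depend on one another, which makes it hard to form a clear overall picture in your head.
☀ rlpyt is mainly divided into the following important modules:
☃ Runner: the module that controls the training flow. It is tied to several other modules.
☃ Sampler: the module for collecting/sampling data. As discussed later, it does not sample data directly but calls the Collector module below to do so.
☃ Collector: the module that actually performs the sampling.
☃ Environment: the environment module, which is what you need to model the real world as.
☃ Agent: selects/computes the action, controls some logic related to the Algorithm, and so on.
☃ Model: defines the model structure, for example the structure of a policy network.
☃ Distribution: samples actions for stochastic agents. "Stochastic" here is meant in contrast to "deterministic", as in the "Deterministic" of DDPG. For a stochastic agent we might use the ε-greedy algorithm to choose an action, and that selection happens in the Distribution.
☃ Algorithm: the algorithm module. Implementations such as DQN, DDPG, and PPO belong here.
☃ Optimizer: simply a PyTorch optimizer, for example torch.optim.Adam, used to optimize/adjust the neural network's parameters.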
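To make the division of labor above a little more concrete, here is a minimal, self-contained sketch in plain Python. It is not rlpyt code and does not use rlpyt's API: every class and method name below (obtain_samples, collect, optimize, build_and_train, and so on) is a simplified stand-in chosen for illustration, the "model" is a Q-table rather than a Torch network, the learning rate plays the role of the optimizer, and the toy corridor environment is made up. It only shows one plausible way the pieces could be wired together.

```python
import random


class Environment:
    """Toy task: walk a 5-cell corridor; reaching cell 4 pays reward 1."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action 0 = left, 1 = right
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == 4
        return self.pos, (1.0 if done else 0.0), done


class Model:
    """Stand-in for the neural network: a plain table of Q-values."""
    def __init__(self, n_states=5, n_actions=2):
        self.q = [[0.0] * n_actions for _ in range(n_states)]


class EpsilonGreedy:
    """Distribution: turns Q-values into a (possibly random) action."""
    def __init__(self, epsilon, rng):
        self.epsilon, self.rng = epsilon, rng

    def sample(self, q_values):
        if self.rng.random() < self.epsilon:  # explore
            return self.rng.randrange(len(q_values))
        best = max(q_values)  # exploit, breaking ties at random
        return self.rng.choice([a for a, q in enumerate(q_values) if q == best])


class Agent:
    """Owns the model and the distribution; chooses actions."""
    def __init__(self, model, distribution):
        self.model, self.distribution = model, distribution

    def step(self, obs):
        return self.distribution.sample(self.model.q[obs])


class Collector:
    """Actually steps the environment and records the samples."""
    def __init__(self, env, agent):
        self.env, self.agent = env, agent
        self.obs = env.reset()

    def collect(self, n_steps):
        samples = []
        for _ in range(n_steps):
            action = self.agent.step(self.obs)
            next_obs, reward, done = self.env.step(action)
            samples.append((self.obs, action, reward, next_obs, done))
            self.obs = self.env.reset() if done else next_obs
        return samples


class Sampler:
    """Does not sample by itself; delegates to its collector."""
    def __init__(self, collector, batch_steps):
        self.collector, self.batch_steps = collector, batch_steps

    def obtain_samples(self):
        return self.collector.collect(self.batch_steps)


class Algorithm:
    """Uses the samples to train the agent (here: tabular Q-learning)."""
    def __init__(self, agent, lr=0.5, discount=0.9):
        self.agent, self.lr, self.discount = agent, lr, discount

    def optimize(self, samples):
        q = self.agent.model.q
        for obs, action, reward, next_obs, done in samples:
            future = 0.0 if done else self.discount * max(q[next_obs])
            q[obs][action] += self.lr * (reward + future - q[obs][action])


class Runner:
    """Connects the sampler and the algorithm; drives the training loop."""
    def __init__(self, sampler, algo, n_itr):
        self.sampler, self.algo, self.n_itr = sampler, algo, n_itr

    def train(self):
        for _ in range(self.n_itr):
            self.algo.optimize(self.sampler.obtain_samples())


def build_and_train(seed=0):
    rng = random.Random(seed)
    agent = Agent(Model(), EpsilonGreedy(epsilon=0.2, rng=rng))
    sampler = Sampler(Collector(Environment(), agent), batch_steps=20)
    Runner(sampler, Algorithm(agent), n_itr=200).train()
    return agent
```

Calling build_and_train() returns the trained agent, whose Q-table sits in agent.model.q. Note that in this sketch the Runner never touches the environment: it only asks the Sampler for a batch and hands it to the Algorithm, while the Collector is the only piece that calls both the Agent and the Environment. The same Agent object is shared by the Collector (to act) and the Algorithm (to be trained), which is one reason the modules can look mutually dependent at first glance.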
With this rough picture in mind, we can move on to the next step. Stay tuned for the next installment.
➤➤ Copyright notice ➤➤
Reposts must credit the source: codelast.com