高级人工智能 第十章 强化学习 史忠植 中国科学院计算技术研究所 2018/9/20 强化学习 史忠植
内容提要 引言 强化学习模型 动态规划 蒙特卡罗方法 时序差分学习 Q学习 强化学习中的函数估计 应用 2018/9/20 强化学习 史忠植
引言 人类通常从与外界环境的交互中学习。所谓强化(reinforcement)学习是指从环境状态到行为映射的学习,以使系统行为从环境中获得的累积奖励值最大。在强化学习中,我们设计算法来把外界环境转化为最大化奖励量的方式的动作。我们并没有直接告诉主体要做什么或者要采取哪个动作,而是主体通过看哪个动作得到了最多的奖励来自己发现。主体的动作的影响不只是立即得到的奖励,而且还影响接下来的动作和最终的奖励。试错搜索(trial-and-error search)和延期强化(delayed reinforcement)这两个特性是强化学习中两个最重要的特性。 2018/9/20 强化学习 史忠植
引言 强化学习技术是从控制理论、统计学、心理学等相关学科发展而来,最早可以追溯到巴甫洛夫的条件反射实验。但直到上世纪八十年代末、九十年代初强化学习技术才在人工智能、机器学习和自动控制等领域中得到广泛研究和应用,并被认为是设计智能系统的核心技术之一。特别是随着强化学习的数学基础研究取得突破性进展后,对强化学习的研究和应用日益开展起来,成为目前机器学习领域的研究热点之一。 2018/9/20 强化学习 史忠植
引言 强化思想最先来源于心理学的研究。1911年Thorndike提出了效果律(Law of Effect):一定情景下让动物感到舒服的行为,就会与此情景增强联系(强化),当此情景再现时,动物的这种行为也更易再现;相反,让动物感觉不舒服的行为,会减弱与情景的联系,此情景再现时,此行为将很难再现。换个说法,哪种行为会“记住”,会与刺激建立联系,取决于行为产生的效果。 动物的试错学习,包含两个含义:选择(selectional)和联系(associative),对应计算上的搜索和记忆。所以,1954年,Minsky在他的博士论文中实现了计算上的试错学习。同年,Farley和Clark也在计算上对它进行了研究。强化学习一词最早出现于科技文献是1961年Minsky 的论文“Steps Toward Artificial Intelligence”,此后开始广泛使用。1969年, Minsky因在人工智能方面的贡献而获得计算机图灵奖。 2018/9/20 强化学习 史忠植
引言 1953到1957年,Bellman提出了求解最优控制问题的一个有效方法:动态规划(dynamic programming) Bellman于 1957年还提出了最优控制问题的随机离散版本,就是著名的马尔可夫决策过程(MDP, Markov decision processe),1960年Howard提出马尔可夫决策过程的策略迭代方法,这些都成为现代强化学习的理论基础。 1972年,Klopf把试错学习和时序差分结合在一起。1978年开始,Sutton、Barto、 Moore,包括Klopf等对这两者结合开始进行深入研究。 1989年Watkins提出了Q-学习[Watkins 1989],也把强化学习的三条主线扭在了一起。 1992年,Tesauro用强化学习成功了应用到西洋双陆棋(backgammon)中,称为TD-Gammon 。 2018/9/20 强化学习 史忠植
内容提要 引言 强化学习模型 动态规划 蒙特卡罗方法 时序差分学习 Q学习 强化学习中的函数估计 应用 2018/9/20 强化学习 史忠植
强化学习模型 i: input r: reward s: state a: action 主体 环境 状态 si 奖励 ri 动作 ai 2018/9/20 强化学习 史忠植
描述一个环境(问题) Accessible vs. inaccessible Deterministic vs. non-deterministic Episodic vs. non-episodic Static vs. dynamic Discrete vs. continuous The most complex general class of environments are inaccessible, non-deterministic, non-episodic, dynamic, and continuous. 2018/9/20 强化学习 史忠植
强化学习问题 To define a finite MDP state and action sets : S and A Environment action state reward RL Agent Agent-environment interaction States, Actions, Rewards To define a finite MDP state and action sets : S and A one-step “dynamics” defined by transition probabilities (Markov Property): reward probabilities: 2018/9/20 强化学习 史忠植
Training Info = evaluations (“rewards” / “penalties”) 与监督学习对比 Supervised Learning – Learn from examples provided by a knowledgable external supervisor. Training Info = evaluations (“rewards” / “penalties”) RL System Inputs Outputs (“actions”) Reinforcement Learning – Learn from interaction learn from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. 2018/9/20 强化学习 史忠植
强化学习要素 Policy: stochastic rule for selecting actions Return/Reward: the function of future rewards agent tries to maximize Value: what is good because it predicts reward Model: what follows what Policy Reward Value Model of environment Is my goal Is I can get Is my method Is unknown 2018/9/20 强化学习 史忠植
在策略Π下的Bellman公式 The basic idea: So: is the discount rate So: Or, without the expectation operator: 2018/9/20 强化学习 史忠植
Bellman最优策略公式 2018/9/20 强化学习 史忠植
MARKOV DECISION PROCESS Characteristics of MDP: a set of states : S a set of actions : A a reward function : R : S x A R A state transition function: T: S x A ∏ ( S) T (s,a,s’): probability of transition from s to s’ using action a k-armed bandit gives immediate reward DELAYED REWARD? 2018/9/20 强化学习 史忠植
MDP EXAMPLE: Bellman Equation: States and rewards Transition function (Greedy policy selection) Transition function 2018/9/20 强化学习 史忠植
MDP Graphical Representation β, α : T (s, action, s’ ) Similarity to Hidden Markov Models (HMMs) 2018/9/20 强化学习 史忠植
动态规划 Dynamic Programming - Problem A discrete-time dynamic system States {1, … , n} + termination state 0 Control U(i) Transition Probability pij(u) Accumulative cost structure Policies 2018/9/20 强化学习 史忠植
动态规划 Dynamic Programming – Iterative Solution Finite Horizon Problem Infinite Horizon Problem Value Iteration 2018/9/20 强化学习 史忠植
动态规划中的策略迭代/值迭代 policy evaluation policy improvement “greedification” Policy Iteration Value Iteration 2018/9/20 强化学习 史忠植
动态规划方法 T 2018/9/20 强化学习 史忠植
自适应动态规划(ADP) Idea: use the constraints (state transition probabilities) between states to speed learning. Solve = value determination. No maximization over actions because agent is passive unlike in value iteration. using DP Large state space e.g. Backgammon: 1050 equations in 1050 variables 2018/9/20 强化学习 史忠植
Value Iteration Algorithm Stop Iteration when V(s) differs less than є. Policy difference ratio =< 2єγ / (1-γ ) ( Williams & Baird 1993b) AN ALTERNATIVE ITERATION: (Singh,1993) 2018/9/20 强化学习 史忠植 (Important for model free learning)
Policy Iteration Algorithm Policies converge faster than values. Why faster convergence? 2018/9/20 强化学习 史忠植
Reinforcement Learning … Deterministic transitions Stochastic transitions is the probability to reaching state j when taking action a in state i start 3 2 1 4 +1 -1 Move cost = 0.04 A simple environment that presents the agent with a sequential decision problem: (Temporal) credit assignment problem sparse reinforcement problem Offline alg: action sequences determined ex ante Online alg: action sequences is conditional on observations along the way; Important in stochastic environment (e.g. jet flying) 2018/9/20 强化学习 史忠植
Reinforcement Learning … M = 0.8 in direction you want to go 0.2 in perpendicular 0.1 left 0.1 right Policy: mapping from states to actions utilities of states: 3 2 1 4 +1 -1 0.705 3 2 1 4 +1 -1 0.812 0.762 0.868 0.912 0.660 0.655 0.611 0.388 An optimal policy for the stochastic environment: Environment Observable (accessible): percept identifies the state Partially observable Markov property: Transition probabilities depend on state only, not on the path to the state. Markov decision problem (MDP). Partially observable MDP (POMDP): percepts does not have enough info to identify transition probabilities. 2018/9/20 强化学习 史忠植
Model Free Methods Models of the environment: T: S x A ∏ ( S) and R : S x A R Do we know them? Do we have to know them? Monte Carlo Methods Adaptive Heuristic Critic Q Learning 2018/9/20 强化学习 史忠植
Monte Carlo策略评价 Goal: learn Vp(s) under P and R are unknown in advance Given: some number of episodes under p which contain s Idea: Average returns observed after visits to s 1 2 3 4 5 Every-Visit MC: average returns for every time s is visited in an episode First-visit MC: average returns only for first time s is visited in an episode Both converge asymptotically 2018/9/20 强化学习 史忠植
蒙特卡罗方法 Monte Carlo Methods Idea: Hold statistics about rewards for each state Take the average This is the V(s) Based only on experience Assumes episodic tasks (Experience is divided into episodes and all episodes will terminate regardless of the actions selected.) Incremental in episode-by-episode sense not step-by-step sense. 2018/9/20 强化学习 史忠植
蒙特卡罗方法 Problem: Unvisited <s, a> pairs (problem of maintaining exploration) For every <s, a> make sure that: P(<s, a> selected as a start state and action) >0 (Assumption of exploring starts ) 2018/9/20 强化学习 史忠植
Monte Carlo方法 T 2018/9/20 强化学习 史忠植
蒙特卡罗控制 How to select Policies: (Similar to policy evaluation) MC policy iteration: Policy evaluation using MC methods followed by policy improvement Policy improvement step: greedify with respect to value (or action-value) function 2018/9/20 强化学习 史忠植
时序差分学习 Temporal-Difference target: the actual return after time t target: an estimate of the return 2018/9/20 强化学习 史忠植
时序差分学习 (TD) Idea: Do ADP backups on a per move basis, not for the whole state space. Theorem: Average value of U(i) converges to the correct value. Theorem: If is appropriately decreased as a function of times a state is visited (=[N[i]]), then U(i) itself converges to the correct value 2018/9/20 强化学习 史忠植
时序差分学习 TD T 2018/9/20 强化学习 史忠植
TD(l) – A Forward View TD(l) is a method for averaging all n-step backups weight by ln-1 (time since visitation) l-return: Backup using l-return: 2018/9/20 强化学习 史忠植
时序差分学习算法 TD() Idea: update from the whole epoch, not just on state transition. Special cases: =1: Least-mean-square (LMS), Mont Carlo =0: TD Intermediate choice of (between 0 and 1) is best. Interplay with … 2018/9/20 强化学习 史忠植
时序差分学习算法 2018/9/20 强化学习 史忠植
时序差分学习算法收敛性TD() Theorem: Converges w.p. 1 under certain boundaries conditions. Decrease i(t) s.t. In practice, often a fixed is used for all i and t. 2018/9/20 强化学习 史忠植
时序差分学习 TD 2018/9/20 强化学习 史忠植
Q-Learning Watkins,1989 Estimate the Q-function using some approximator (for example, linear regression or neural networks or decision trees etc.). Derive the estimated policy as an argument of the maximum of the estimated Q-function. Allow different parameter vectors at different time points. Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss function. 2018/9/20 强化学习 史忠植
Q-learning Q (a,i) Direct approach (ADP) would require learning a model . Q-learning does not: Do this update after each state transition: 2018/9/20 强化学习 史忠植
Exploration Tradeoff between exploitation (control) and exploration (identification) Extremes: greedy vs. random acting (n-armed bandit models) Q-learning converges to optimal Q-values if * Every state is visited infinitely often (due to exploration), * The action selection becomes greedy as time approaches infinity, and * The learning rate a is decreased fast enough but not too fast (as we discussed in TD learning) 2018/9/20 强化学习 史忠植
Common exploration methods In value iteration in an ADP agent: Optimistic estimate of utility U+(i) Є-greedy method Nongreedy actions Greedy action Boltzmann exploration Exploration func R+ if n<N u o.w. 2018/9/20 强化学习 史忠植
Q-Learning Algorithm Set For The estimated policy satisfies 2018/9/20 强化学习 史忠植
What is the intuition? Bellman equation gives If and the training set were infinite, then Q-learning minimizes which is equivalent to minimizing 2018/9/20 强化学习 史忠植
A-Learning Murphy, 2003 and Robins, 2004 Estimate the A-function (advantages) using some approximator, as in Q-learning. Derive the estimated policy as an argument of the maximum of the estimated A-function. Allow different parameter vectors at different time points. Let us illustrate the algorithm with linear regression as the approximator, and of course, squared error as the appropriate loss function. 2018/9/20 强化学习 史忠植
A-Learning Algorithm (Inefficient Version) For The estimated policy satisfies 2018/9/20 强化学习 史忠植
Differences between Q and A-learning Q-learning At time t we model the main effects of the history, (St,,At-1) and the action At and their interaction Our Yt-1 is affected by how we modeled the main effect of the history in time t, (St,,At-1) A-learning At time t we only model the effects of At and its interaction with (St,,At-1) Our Yt-1 does not depend on a model of the main effect of the history in time t, (St,,At-1) 2018/9/20 强化学习 史忠植
Q-Learning Vs. A-Learning Relative merits and demerits are not completely known till now. Q-learning has low variance but high bias. A-learning has high variance but low bias. Comparison of Q-learning with A-learning involves a bias-variance trade-off. 2018/9/20 强化学习 史忠植
The optimal strategy might need to consider the history. POMDP部分感知马氏决策过程 Rather than observing the state we observe some function of the state. Ob – Observable function a random variable for each states. Problem : different states may look similar The optimal strategy might need to consider the history. 2018/9/20 强化学习 史忠植
Framework of POMDP 2018/9/20 强化学习 史忠植 [1] Ω可以是S的子集,也可以与S无关 POMDP由六元组<S, A, R, P,Ω,О>定义。其中<S, A, P, R>定义了环境潜在的马尔可夫决策模型上,Ω是观察的集合,即系统可以感知的世界状态集合,观察函数О:S×A→PD(Ω)。系统在采取动作a转移到状态s′时,观察函数О确定其在可能观察上的概率分布。记为О(s′, a, o)。 2018/9/20 强化学习 史忠植
POMDPs MDP techniques are suboptimal! Two halls are not the same. What if state information (from sensors) is noisy? Mostly the case! MDP techniques are suboptimal! Two halls are not the same. 2018/9/20 强化学习 史忠植
POMDPs – A Solution Strategy SE: Belief State Estimator (Can be based on HMM) П: MDP Techniques 2018/9/20 强化学习 史忠植
POMDP_信度状态方法 Idea: Given a history of actions and observable value, we compute a posterior distribution for the state we are in (belief state) The belief-state MDP States: distribution over S (states of the POMDP) Actions: as in POMDP Transition: the posterior distribution (given the observation) Open Problem : How to deal with the continuous distribution? 2018/9/20 强化学习 史忠植
The Learning Process of Belief MDP 2018/9/20 强化学习 史忠植
Major Methods to Solve POMDP 算法名称 基本思想 学习值函数 Memoryless policies 直接采用标准的强化学习算法 Simple memory based approaches 使用k个历史观察表示当前状态 UDM(Utile Distinction Memory) 分解状态,构建有限状态机模型 NSM(Nearest Sequence Memory) 存储状态历史,进行距离度量 USM(Utile Suffix Memory) 综合UDM和NSM两种方法 Recurrent-Q 使用循环神经网络进行状态预测 策略搜索 Evolutionary algorithms 使用遗传算法直接进行策略搜索 Gradient ascent method 使用梯度下降(上升)法搜索 2018/9/20 强化学习 史忠植
强化学习中的函数估计 Generalization of the value function to the entire state space RL FA Subset of states Value estimate as targets V (s) is the TD operator. is the function approximation operator. 2018/9/20 强化学习 史忠植
并行两个迭代过程 值函数迭代过程 值函数逼近过程 How to construct the M function? Using state cluster, interpolation, decision tree or neural network? 2018/9/20 强化学习 史忠植
w w + a [rt+1 + g Q(st+1,at+1)- Q(st,at)] w f(st,at,w) Function Approximator: V( s) = f( s, w) Update: Gradient-descent Sarsa: w w + a [rt+1 + g Q(st+1,at+1)- Q(st,at)] w f(st,at,w) Standard gradient weight vector estimated value target value Open Problem : How to design the non-liner FA system which can converge with the incremental instances? 2018/9/20 强化学习 史忠植
Semi-MDP Discrete time Homogeneous discount Continuous time Discrete events Interval-dependent discount A discrete-time SMDP overlaid on an MDP Can be analyzed at either level. One approach to Temporal Hierarchical RL 2018/9/20 强化学习 史忠植
The equations 2018/9/20 强化学习 史忠植
Multi-agent MDP Distributed RL Markov Game Best Response RL Agent Environment state action reward RL Agent 2018/9/20 强化学习 史忠植
三种观点 问题空间 主要方法 算法准则 合作多agent 强化学习 基于平衡解 多agent强化学习 最佳响应 分布、同构、 合作环境 交换状态 提高学习收敛速度 交换经验 交换策略 交换建议 基于平衡解 多agent强化学习 同构或异构、 合作或竞争环境 极小极大-Q 理性和收敛性 NASH-Q CE-Q WoLF 最佳响应 异构、竞争环境 PHC 收敛性和不遗憾性 IGA GIGA GIGA-WoLF 2018/9/20 强化学习 史忠植
马尔可夫对策 在n个agent的系统中,定义离散的状态集S(即对策集合G),agent动作集Ai的集合A, 联合奖赏函数Ri:S×A1×…×An→ℛ和状态转移函数P:S×A1×…×An→PD(S)。 2018/9/20 强化学习 史忠植
基于平衡解方法的强化学习 The optimal policy in single game is Nash equilibrium. Open Problem : Nash equilibrium or other equilibrium is enough? 2018/9/20 强化学习 史忠植
Applications of RL Checker’s [Samuel 59] TD-Gammon [Tesauro 92] World’s best downpeak elevator dispatcher [Crites at al ~95] Inventory management [Bertsekas et al ~95] 10-15% better than industry standard Dynamic channel assignment [Singh & Bertsekas, Nie&Haykin ~95] Outperforms best heuristics in the literature Cart-pole [Michie&Chambers 68-] with bang-bang control Robotic manipulation [Grupen et al. 93-] Path planning Robot docking [Lin 93] Parking Football [Stone98] Tetris Multiagent RL [Tan 93, Sandholm&Crites 95, Sen 94-, Carmel&Markovitch 95-, lots of work since] Combinatorial optimization: maintenance & repair Control of reasoning [Zhang & Dietterich IJCAI-95] 2018/9/20 强化学习 史忠植
仿真机器人足球 应用Q学习算法进行仿真机器人足球2 对1 训练,训练的目的是试图使主体学习获得到一种战略上的意识,能够在进攻中进行配合[宋志伟, 2003] 2018/9/20 强化学习 史忠植
仿真机器人足球 前锋A控球,并且在可射门的区域内,但是A已经没有射门角度了;队友B也处于射门区域,并且B具有良好的射门角度。A传球给B,射门由B来完成,那么这次进攻配合就会很成功。通过Q学习的方法来进行2 对1的射门训练,让A掌握在这种状态情况下传球给B的动作是最优的策略;主体通过大量的学习训练(大数量级的状态量和重复相同状态)来获得策略,因此更具有适应性 。 2018/9/20 强化学习 史忠植
仿真机器人足球 状态描述,将进攻禁区划分为个小区域,每个小区域是边长为2m的正方形,一个二维数组()便可描述这个区域。使用三个Agent的位置来描述2 对 1进攻时的环境状态,利用图10.11所示的划分来泛化状态。可认为主体位于同一战略区域为相似状态,这样对状态的描述虽然不精确,但设计所需的是一种战略层次的描述,可认为Agent在战略区域内是积极跑动的,这种方法满足了需求。如此,便描述了一个特定的状态;其中,是进攻队员A的区域编号,是进攻队员B的区域编号,是守门员的区域编号。区域编号计算公式为:。相应的,所保存的状态值为三个区域编号组成的对。前锋A控球,并且在可射门的区域内,但是A已经没有射门角度了;队友B也处于射门区域,并且B具有良好的射门角度。A传球给B,射门由B来完成,那么这次进攻配合就会很成功。通过Q学习的方法来进行2 对1的射门训练,让A掌握在这种状态情况下传球给B的动作是最优的策略;主体通过大量的学习训练(大数量级的状态量和重复相同状态)来获得策略,因此更具有适应性 。 19 图10.11 进攻禁区内的位置划分 7 2018/9/20 强化学习 史忠植
仿真机器人足球 可选动作集确定为 的策略通过基于概率的射门训练的学习来得到。 的策略是,始终向受到威胁小,并且射门成功率高的区域带球。为了实现这一策略目标,可划分进攻区域为多个战略区,在每个战略区进行射门评价,记录每个区域的射门成功率。 Pass策略很简单,只需在两个Agent间进行传球,即不需要选择球传送的对象,也不需要判断传球路径。如果传球失败的话,则认为在这种状态下执行Pass策略是不成功的;经过此训练后,不可能的传球路径也不会被执行了。 2018/9/20 强化学习 史忠植
仿真机器人足球 训练中的所有状态包含了四个吸收状态。假设进攻方在左半场,按照标准的Soccer server规范,这四个状态的比赛模式为play_on、goal_left、goal_kick_right和free_kick_right。当达到吸收状态时,给与主体最终奖励r。促使到达吸收状态的上一步动作获得的立即回报值为最终奖励值r,其他未直接促使到达吸收状态的动作均获得过程奖励值作为立即奖励;其中goal_left的r最大为1表示进球,其他状态下r为不同大小的负数。 2018/9/20 强化学习 史忠植
仿真机器人足球 主体在经过一定的状态和执行多个动作后获得了终态奖励(到达了吸收状态),这时就会对这个状态-动作序列分配奖励。Q学习算法的核心就是每一个状态和动作的组合都拥有一个Q值,每次获得最终回报后通过更新等式更新这个Q值。由于Robocup仿真平台在设计的时候在状态转换的时候加入了一个较小的随机噪音,所以该模型为非确定MDP, 确定Q更新等式为: 规定=0.1,=0.95。 2018/9/20 强化学习 史忠植
仿真机器人足球 在实际的训练中,初始Q表各项的值为1。经过大约2万次的训练(到达吸收状态为一次),Agent的Q表中的绝大部分项发生了变化,并且已经区分开来。下表是某场景下的训练次数和动作选择的变化示意表。 2018/9/20 强化学习 史忠植
强化学习应用_调度 Job shop scheduling Group-control elevator scheduling Dynamical channel allocation 2018/9/20 强化学习 史忠植
应用 - 分布式电梯控制 2018/9/20 强化学习 史忠植
强化学习应用_信息检索 Web crawling Domain-specific Search Engineer Web-log mining Text mining Image mining Q(“Publications”, “My research”, “Will’s FrontDoor Page”) Q(“British Comedy”, “Fun Stuff”, 2018/9/20 强化学习 史忠植
强化学习应用_自适应控制&生物信息学 Robotics behavior control Face cognition Speech recognition 2018/9/20 强化学习 史忠植
References Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning- An Introduction. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning-Data Mining, Inference and Prediction. Murphy, S.A. (2003). Optimal Dynamic Treatment Regimes. JRSS-B. Blatt, D., Murphy, S.A. and Zhu, J. (2004). A-Learning for Approximate Planning. Murphy, S.A. (2004). A Generalization Error for Q-Learning. D. P. Bertsekas and J. N. Tsitsiklis (1996). Neuro-Dynamic Programming. 高 阳.强化学习研究进展.南京大学计算机科学与技术系 宋志伟, 陈小平, 2003. 仿真机器人足球中的强化学习. 《机器人》, 24(7S):761-766. 2018/9/20 强化学习 史忠植