Approximate dynamic programming vs reinforcement learning?

Hi, I am doing a research project for my optimization class, and since I enjoyed the dynamic programming section of the class, my professor suggested researching "approximate dynamic programming".

So, no, it is not the same. Due to its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, reinforcement learning is called approximate dynamic programming or neuro-dynamic programming, so the field goes by essentially equivalent names: reinforcement learning, approximate dynamic programming, and neuro-dynamic programming. We will primarily use the most popular name, reinforcement learning. Our subject has benefited greatly from the interplay of ideas from optimal control and from artificial intelligence.

The two required properties of dynamic programming are: 1. overlapping sub-problems, meaning sub-problems recur many times, so their solutions can be cached and reused; 2. the solutions to the sub-problems can be combined to solve the overall problem. Markov decision processes satisfy both of these properties. Representative approximate value-function methods include Bellman residual minimization (BRM) [Williams and Baird, 1993], temporal-difference (TD) learning [Tsitsiklis and Van Roy, 1996], and least-squares methods such as LSTD and LSPI.
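To ground the TD learning entry in that list, here is a minimal sketch of tabular TD(0) policy evaluation. The episodic environment object with reset() and step(action) methods, the policy function, and all parameter values are hypothetical illustrations, not taken from any of the works cited above.

```python
# Minimal sketch of tabular TD(0) policy evaluation.
# Assumes a hypothetical episodic environment with:
#   env.reset() -> initial state
#   env.step(action) -> (next_state, reward, done)

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.95):
    V = {}  # state -> estimated value under `policy`
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else V.get(s_next, 0.0)
            # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')
            V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
            s = s_next
    return V
```

The same bootstrapped target r + gamma * V(s') underlies the least-squares variants: LSTD and LSPI solve for its fixed point in batch form instead of by incremental stochastic updates.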
This article provides a brief review of approximate dynamic programming, without intending to be a … Applications to date have concentrated on optimal management of assets and portfolios [4], as well as derivative pricing and trading systems [5], given the fact that they can be …

Neuro-Dynamic Programming is mainly a theoretical treatment of the field using the language of control theory; Reinforcement Learning describes the field from the perspective of …
Consider the very rich field known as approximate dynamic programming. Reinforcement learning (RL) and adaptive dynamic programming (ADP) together form one of the most critical research fields in science and engineering for modern complex systems. ADP methods tackle these problems by developing optimal control methods that adapt to uncertain systems over time, while RL algorithms take the perspective of an agent that optimizes its behavior by interacting with its environment and learning from the feedback received. A note on terminology in RL/AI versus DP/control: RL uses max/value, whereas DP uses min/cost, and the reward of a stage is (the opposite of) the cost of a stage.
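The max/value versus min/cost conventions differ only by a sign flip. The display below makes this explicit for a generic infinite-horizon discounted objective; the discount factor and the specific form of the objective are assumptions for illustration, not something fixed by the text above.

```latex
% With the stage cost defined as the negative of the stage reward,
% maximizing discounted reward and minimizing discounted cost are the
% same problem and share the same optimal policies.
\[
  c(s,a) = -\,r(s,a)
  \quad\Longrightarrow\quad
  \max_{\pi}\, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t,a_t)\right]
  = -\,\min_{\pi}\, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t,a_t)\right].
\]
```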
Approximate dynamic programming (ADP) and reinforcement learning (RL) are two closely related paradigms for solving sequential decision making problems. Dynamic programming (DP) and reinforcement learning can be used to address problems from a variety of fields, including automatic control, artificial intelligence, operations research, and economics. Many problems in these fields are described by continuous variables, whereas DP and RL can find exact solutions only in the discrete case. Therefore, approximation is essential in practical DP and RL. Value iteration, policy iteration, and policy search approaches are presented in turn. Model-based (DP) as well as online and batch model-free (RL) algorithms are discussed, and we review theoretical guarantees on the approximate solutions produced by these algorithms. Numerical examples illustrate the behavior of several representative algorithms in practice. The chapter closes with a discussion of open issues and promising research directions in approximate DP and RL. (Approximate Dynamic Programming and Reinforcement Learning. In: Interactive Collaborative Information Systems, pp. 3-44. Delft Center for Systems and Control & Marine and Transport Technology Department. https://doi.org/10.1007/978-3-642-11688-9_1) Keywords: approximate dynamic programming, reinforcement learning, policy gradient algorithms, partially observable Markov decision processes.

ADPRL is short for Approximate Dynamic Programming and Reinforcement Learning. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, edited by Frank L. Lewis and Derong Liu (ISBN 978-1-118-10420-0, hardback), describes the latest RL and ADP techniques for decision and control in human engineered systems, covering both single-player decision and control and multi-player games.

Sample chapter: Ch. 3 - Dynamic programming and reinforcement learning in large and continuous spaces. The most extensive chapter in the book, it reviews methods and algorithms for approximate dynamic programming and reinforcement learning, with theoretical results, discussion, and illustrative numerical examples.

Robert Babuška is a full professor at the Delft Center for Systems and Control of Delft University of Technology in the Netherlands. His research interests include reinforcement learning and dynamic programming with function approximation, intelligent and learning techniques for control problems, and multi-agent learning.

Figure 2.1: The roadmap we use to introduce various DP and RL techniques in a unified framework.
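As a concrete first stop on that roadmap, the sketch below implements tabular value iteration, the archetypal model-based DP method named above. The two-state MDP at the bottom and all parameter values are invented purely for illustration.

```python
# Minimal sketch of tabular value iteration for a finite MDP given by an
# explicit model. P[s][a] is a list of (probability, next_state, reward)
# triples describing the transition model for action a in state s.

def value_iteration(P, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: V(s) <- max_a sum_s' p * (r + gamma * V(s'))
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Tiny invented example: two states, two actions each.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}
print(value_iteration(P))
```

Because the transition probabilities and rewards must be supplied explicitly, this is a planning method in the sense discussed further below.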
Also, if you mean dynamic programming as in value iteration or policy iteration, it is still not the same. These algorithms are "planning" methods: you have to give them a transition and a reward function, and they will iteratively compute a value function and an optimal policy. Dynamic programming is an umbrella encompassing many algorithms, whereas Q-learning is a specific algorithm.

The foundations of learning and approximate dynamic programming have evolved from several fields: optimal control, artificial intelligence (reinforcement learning), operations research (dynamic programming), and stochastic approximation methods (neural networks). Deep reinforcement learning is responsible for the two biggest AI wins over human professionals: AlphaGo and OpenAI Five. Model-free reinforcement learning methods such as Q-learning and actor-critic methods have shown considerable success on a variety of problems. However, when combined with function approximation, these methods are notoriously brittle and often face instability during training.
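To make the planning-versus-learning contrast concrete, here is a minimal sketch of tabular Q-learning. Unlike value iteration, it never touches the transition or reward model; it only consumes sampled transitions from a hypothetical episodic environment with reset() and step(action) methods. All names and parameter values are illustrative.

```python
import random

# Minimal sketch of tabular Q-learning with epsilon-greedy exploration.
# Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).

def q_learning(env, actions, num_episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {}  # (state, action) -> estimated return
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q.get((s, act), 0.0))
            s_next, r, done = env.step(a)
            # Q-learning target bootstraps on the greedy value of the next state
            target = r if done else r + gamma * max(
                Q.get((s_next, act), 0.0) for act in actions
            )
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s = s_next
    return Q
```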
The same body of methods is referred to under the names of reinforcement learning [4], neuro-dynamic programming [5], or approximate dynamic programming [6]. Both technologies have succeeded in applications of operations research, robotics, game playing, network management, and computational intelligence. But the richer message of approximate dynamic programming is learning what to learn, and how to learn it, to make better decisions over time.

The course Approximate Dynamic Programming and Reinforcement Learning is offered by the Fakultät für Elektrotechnik und Informationstechnik at Technische Universität München. Content: ADP and RL, two closely related paradigms for solving sequential decision making problems; topics covered include (not exclusively) partially observable Markov decision processes. On completion of this course, students are able to: describe classic scenarios in sequential decision making problems; derive the ADP/RL algorithms covered in the course; characterize the convergence properties of these algorithms; compare their performance, both theoretically and practically; select proper ADP/RL algorithms in accordance with specific applications; and construct and implement ADP/RL algorithms to solve simple decision making problems. Registration is open from 07.10.2020 to 29.10.2020 via TUMonline; register for both the lecture and the exercise. The question session is a placeholder in TUMonline and will take place whenever needed. Course communication will be handled through the Moodle page (link coming soon).
We restrict attention to MDPs with countable state spaces. For such MDPs, we denote the probability of getting to state s' by taking action a in state s as P^a_{ss'}. In further work of Bertsekas (2006), neuro-dynamic programming (NDP), another term used for reinforcement learning/ADP, was discussed (see also the book by Bertsekas and Tsitsiklis, 1996).

A video from a January 2017 slide presentation covers the relation of proximal algorithms and temporal difference methods for solving large linear systems of equations. The material is based on the book Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming (ISBN-13: 978-1-886529-44-1, 712 pp., hardcover, 2012); references were also made to the contents of the 2017 edition of Vol. I, and to high-profile developments in deep reinforcement learning, which have brought approximate DP to the forefront of attention. Suggested reading: Sutton and Barto, Reinforcement Learning: An Introduction (1998; new edition 2018, online); Powell, Approximate Dynamic Programming (2011); Szepesvári, Algorithms for Reinforcement Learning (2009); Sigaud and Buffet (eds.), Markov Decision Processes in Artificial Intelligence (2008).

So I get a number of 0.9 times the old estimate plus 0.1 times the new estimate, which gives me an updated estimate of the value of being in Texas of 485. So this is my updated estimate. Now, this is classic approximate dynamic programming and reinforcement learning.
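The update quoted above is exponential smoothing of the old value estimate toward a newly observed value, with stepsize 0.1. One pair of inputs consistent with the quoted result of 485 is an old estimate of 450 and a new observation of 800; those two numbers are an assumption for illustration, not something stated above.

```python
# Exponential smoothing of a value estimate, as described in the passage above.
# The inputs 450 and 800 are assumed for illustration; only the 0.9/0.1
# weighting and the result 485 appear in the text.

def smooth(old_estimate, new_observation, stepsize=0.1):
    # v_new = (1 - alpha) * v_old + alpha * v_hat
    return (1.0 - stepsize) * old_estimate + stepsize * new_observation

print(smooth(450.0, 800.0))  # 0.9 * 450 + 0.1 * 800 = 485.0
```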
So now I'm going to illustrate fundamental methods for approximate dynamic programming and reinforcement learning, but for the setting of having large fleets, large numbers of resources, not just the one-truck problem. So let's assume that I have a set of drivers. There are actually up to three curses of dimensionality. Such techniques typically compute an approximate observation

\[
\hat{v}^n = \max_{x} \Big( C(S^n, x) + \bar{V}^{n-1}\big(S^{M,x}(S^n, x)\big) \Big) \tag{2}
\]

for the particular state S^n of the dynamic program in the nth time step, where the function \bar{V} is an approximation of V.
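A sketch of how such an approximate observation could drive one pass of an ADP loop is given below, pairing the maximization in equation (2) with the smoothing update from the previous snippet. The helpers C, transition, and feasible_actions are hypothetical placeholders standing in for the contribution function, the transition S^{M,x}(S^n, x), and the action set; none of this is an implementation taken from the works mentioned above.

```python
# One iteration of a generic approximate dynamic programming loop:
# observe v_hat via the maximization in equation (2), smooth it into the
# value-function approximation Vbar, and step the state forward.

def adp_iteration(state, Vbar, C, transition, feasible_actions, stepsize=0.1):
    # Approximate observation of the value of being in `state`
    v_hat, best_action = max(
        ((C(state, x) + Vbar.get(transition(state, x), 0.0), x)
         for x in feasible_actions(state)),
        key=lambda pair: pair[0],
    )
    # Smooth the new observation into the current approximation
    Vbar[state] = (1.0 - stepsize) * Vbar.get(state, 0.0) + stepsize * v_hat
    # Move to the state reached by the chosen decision
    return transition(state, best_action)
```

In the notation of equation (2), transition(state, x) plays the role of S^{M,x}(S^n, x), and Vbar plays the role of \bar{V}^{n-1}.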