Q-learning vs dynamic programming












5















Is the classic Q-learning algorithm, using a lookup table (instead of function approximation), equivalent to dynamic programming?






































      machine-learning dynamic-programming reinforcement-learning q-learning






      edited Nov 22 '18 at 16:01









      nbro











      asked Aug 17 '16 at 5:16









      D_Wills

























          3 Answers
























          5














          Dynamic Programming is an umbrella term encompassing many algorithms, while Q-Learning is one specific algorithm. So, no, they are not the same thing.



          Also, if by Dynamic Programming you mean Value Iteration or Policy Iteration, they are still not the same. Those algorithms are "planning" methods: you have to give them a transition function and a reward function, and they will iteratively compute a value function and an optimal policy.
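          To make the "planning" point concrete, here is a minimal value-iteration sketch in Python. The tiny two-state MDP (`P`, `R`) is entirely made up for illustration; the point is that the transition and reward model must be handed to the algorithm up front:

```python
# Value iteration: a planning method that is GIVEN the model.
# P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
# expected immediate reward; both are hypothetical inputs, not learned.
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Back up each action's value through the known model.
            q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                 for a in range(len(P[s]))]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

# Toy 2-state MDP: action 0 stays put, action 1 jumps to the other state;
# acting from state 1 yields reward 1.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]
V = value_iteration(P, R)
```

          On this toy MDP the values converge to roughly V = [9, 10] with gamma = 0.9.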



          Q-Learning is a model-free reinforcement learning method. It is called "model-free" not because it avoids machine learning models or anything like that, but because it neither requires nor uses a model of the environment (the MDP) to obtain an optimal policy. There are also "model-based" methods. Unlike Dynamic Programming methods, these are based on learning a model, not simply using one. And unlike model-free methods, they don't throw away samples after estimating values; instead, they try to reconstruct the transition and reward functions to get better performance.
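          By contrast, a tabular Q-learning sketch only ever touches sampled transitions (s, a, r, s'). The `step` environment below is a made-up stand-in; the agent never reads its internals:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical environment: the action chooses the next state, and
# landing in state 1 pays reward 1. The agent only sees its outputs.
def step(s, a):
    s_next = a
    return (1.0 if s_next == 1 else 0.0), s_next

Q = [[0.0, 0.0], [0.0, 0.0]]  # the lookup table: Q[state][action]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

s = 0
for _ in range(5000):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda x: Q[s][x])
    r, s_next = step(s, a)
    # model-free update: bootstrap from the sampled next state only
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next
```

          Action 1 ends up preferred in both states, matching the optimal policy a planner would derive from the true model, even though no transition probabilities were ever given to the agent.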



          Model-based methods combine model-free and planning algorithms: they achieve the same good results with fewer samples than model-free methods (Q-Learning) need, and without requiring a given model as Dynamic Programming methods (Value/Policy Iteration) do.
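          A rough Dyna-Q-style sketch (again with a made-up environment) shows the combination: each real sample both updates Q directly and is stored in a learned model, which then generates extra simulated updates:

```python
import random

random.seed(0)  # reproducible run

# Hypothetical deterministic environment: the action picks the
# next state, and acting into state 1 pays reward 1.
def step(s, a):
    s_next = a
    return (1.0 if s_next == 1 else 0.0), s_next

alpha, gamma, epsilon, planning_steps = 0.1, 0.9, 0.1, 10
Q = [[0.0, 0.0], [0.0, 0.0]]
model = {}  # learned model: (s, a) -> (r, s_next)

s = 0
for _ in range(2000):
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda x: Q[s][x])
    r, s_next = step(s, a)
    # 1) direct model-free update from the real sample
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    # 2) learn the model from the same sample
    model[(s, a)] = (r, s_next)
    # 3) planning: replay simulated transitions from the learned model
    for _ in range(planning_steps):
        ps, pa = random.choice(list(model))
        pr, ps2 = model[(ps, pa)]
        Q[ps][pa] += alpha * (pr + gamma * max(Q[ps2]) - Q[ps][pa])
    s = s_next
```

          Because every stored transition keeps getting replayed, the table converges with far fewer real environment steps than plain Q-learning would need.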







































            12














            From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4):




            The term dynamic programming (DP) refers to a collection of algorithms
            that can be used to compute optimal policies given a perfect model of
            the environment as a Markov decision process (MDP). Classical DP
            algorithms are of limited utility in reinforcement learning both
            because of their assumption of a perfect model and because of their
            great computational expense, but they are still important
            theoretically.




            So, although both share the same working principles (whether tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the former assumes the model is known. This basically means knowing the transition probabilities (which give the probability of moving from state s to state s' when taking action a) and the expected immediate reward function.



            RL methods, on the contrary, only require access to a set of samples, collected either online or offline (depending on the algorithm).



            Of course, there are hybrid methods that can be placed between RL and DP, for example those that learn a model from the samples and then use that model in the learning process.



            NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related to RL, is also used to refer to a "general algorithmic pattern", as pointed out in one of the comments. In both cases the fundamentals are the same, but the meaning may differ depending on the context.







































              -1














              If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.

























              • no it is not equivalent - you cannot compare these two things that way. Dynamic programming is a very general algorithmic pattern that is used in many algorithms. Your algorithm may make use of a dynamic programming strategy, but it certainly is not dynamic programming.

                – cel
                Aug 17 '16 at 6:16











              • @cel, you have to consider the context. With respect to RL, it is safe to assume that DP refers to value or policy iteration.

                – Don Reba
                Aug 17 '16 at 11:22


























                  edited Nov 16 '17 at 15:54

























                  answered Nov 16 '17 at 15:48









                  mimoralea


























                          edited Nov 22 '18 at 15:57









                          nbro











                          answered Aug 17 '16 at 8:24









                          Pablo EM
























                              answered Aug 17 '16 at 5:49









                              Don Reba








