Epsilon and learning rate decay in epsilon-greedy Q-learning











I understand that epsilon marks the trade-off between exploration and exploitation. At the beginning, you want epsilon to be high so that you take big leaps and learn things. As you learn about future rewards, epsilon should decay so that you can exploit the higher Q-values you've found.



However, does our learning rate also decay with time in a stochastic environment? The posts on SO that I've seen only discuss epsilon decay.



How do we set our epsilon and alpha such that values converge?










Tags: machine-learning reinforcement-learning q-learning decay






asked Nov 7 at 22:00 by Matt, edited Nov 7 at 22:35

1 Answer

















Accepted answer:











"At the beginning, you want epsilon to be high so that you take big leaps and learn things."

I think you have mistaken epsilon for the learning rate: that description actually fits the learning rate, not epsilon.



Learning rate decay

The learning rate determines how big a step you take toward the optimal policy. In simple Q-learning terms, it is how much you update the Q-value at each step:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$



A higher alpha means you update your Q-values in bigger steps. As the agent learns, you should decay alpha to stabilize the model's output, so that it eventually converges to an optimal policy.
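For concreteness, here is a minimal Python sketch of this update; the table sizes and the sample transition are hypothetical, chosen only for illustration:

```python
import numpy as np

n_states, n_actions = 16, 4          # hypothetical sizes, for illustration
Q = np.zeros((n_states, n_actions))  # tabular Q-values
gamma = 0.99                         # discount factor

def q_update(s, a, r, s_next, alpha):
    """One Q-learning step: move Q(s, a) a fraction alpha of the way
    toward the TD target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# alpha = 1.0 would overwrite the old estimate entirely;
# alpha = 0.1 moves it only 10% of the way to the target.
q_update(s=0, a=1, r=1.0, s_next=2, alpha=0.1)
```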



Epsilon Decay

Epsilon is used when we select actions based on the Q-values we already have. For example, with a purely greedy method (epsilon = 0) we always select the action with the highest Q-value for a given state. This causes a problem for exploration, because we can easily get stuck in a local optimum.

Therefore we introduce randomness through epsilon. For example, if epsilon = 0.3, then we select a random action with probability 0.3, regardless of the actual Q-values.
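A minimal sketch of epsilon-greedy action selection, assuming a NumPy Q-table like the one above:

```python
import random
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon take a uniformly random action;
    otherwise take the greedy action argmax_a Q[s, a]."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])  # explore
    return int(np.argmax(Q[s]))              # exploit

Q = np.zeros((16, 4))                    # hypothetical Q-table
a = epsilon_greedy(Q, s=0, epsilon=0.3)  # random action 30% of the time
```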



You can find more details on the epsilon-greedy policy here.



In conclusion, the learning rate is associated with how big a step you take, and epsilon with how randomly you act. As learning goes on, both should decay so that the agent stabilizes and exploits the learned policy, which converges to an optimal one. In theory, tabular Q-learning converges as long as every state-action pair keeps being visited and the learning rates satisfy the usual stochastic-approximation conditions (they sum to infinity while their squares sum to a finite value), which is why a slow decay of alpha, rather than cutting it to zero, is the safe choice.
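One common recipe, sketched below, is to decay both quantities per episode with a multiplicative factor and a floor, so the agent never stops exploring or learning entirely. The constants are illustrative assumptions, not values from this answer:

```python
# Illustrative decay schedules; tune the constants for your problem.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
alpha, alpha_min, alpha_decay = 0.5, 0.01, 0.997

for episode in range(2000):
    # ... run one episode here, selecting actions epsilon-greedily
    # and applying Q-updates with the current alpha ...
    epsilon = max(eps_min, epsilon * eps_decay)
    alpha = max(alpha_min, alpha * alpha_decay)

print(epsilon, alpha)  # both have decayed to their floors (0.05, 0.01)
```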






answered Nov 8 at 7:03 by Vishma Dias






























                     
