{"id":169,"date":"2025-01-30T12:00:00","date_gmt":"2025-01-30T12:00:00","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/?p=169"},"modified":"2025-01-27T18:26:54","modified_gmt":"2025-01-27T18:26:54","slug":"learning-about-q-learning-part-2-double-q-learning","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/jimmy-lin\/2025\/01\/30\/learning-about-q-learning-part-2-double-q-learning\/","title":{"rendered":"Learning about Q-Learning (Part 2): Double Q-Learning"},"content":{"rendered":"\n
In the previous blog we talked briefly about tabular Q-learning; however, this method can be prone to noise in reward realisations. In this blog we briefly cover double Q-learning, an extension of Q-learning that addresses this, and how the idea can be carried over to more complex settings.
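To see why noisy rewards are a problem, note that taking a maximum over noisy value estimates is biased upwards, since E[max] ≥ max E. The short simulation below (an illustration added here, not from the original post) estimates this bias when every action's true value is zero:

```python
import random

random.seed(0)
n_runs, n_actions = 10_000, 5
total = 0.0
for _ in range(n_runs):
    # every action's true value is 0, but each estimate carries uniform noise
    estimates = [random.uniform(-1, 1) for _ in range(n_actions)]
    # acting greedily on noisy estimates systematically overestimates the value
    total += max(estimates)

avg_max = total / n_runs
print(avg_max)  # clearly above 0, the true value of every action
```

This overestimation is exactly what the extensions below try to hedge against.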
One way we can hedge against the overestimation bias caused by noise is to use a method known as double Q-learning. This method is analogous to tabular Q-learning, with the extension that two tables are used to store Q values instead of one.

Beyond the second table, the difference from tabular Q-learning lies in the update rule: at each step we randomly pick one of the two Q functions to update, selecting the greedy future action with that table but evaluating it with the other. Decoupling action selection from action evaluation in this way gives the updates

\[
Q_1(s,a) \leftarrow Q_1(s,a) + \alpha \left[R(s,a) + \gamma Q_2\left(s', \arg\max_{a'} Q_1(s',a')\right) - Q_1(s,a) \right],
\]

\[
Q_2(s,a) \leftarrow Q_2(s,a) + \alpha \left[R(s,a) + \gamma Q_1\left(s', \arg\max_{a'} Q_2(s',a')\right) - Q_2(s,a) \right].
\]

Double Q-learning has been shown to have the same computational cost per step as tabular Q-learning, but it requires double the memory, so it scales even less well.
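The two-table update above can be sketched in a few lines of Python. This is a minimal illustration, not the post's own code: the state/action encoding and parameter names are assumptions, and the environment loop is omitted.

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One double Q-learning step: a fair coin picks which table to update."""
    if random.random() < 0.5:
        # select the greedy next action with Q1, but evaluate it with Q2
        a_star = max(actions, key=lambda ap: Q1[(s_next, ap)])
        Q1[(s, a)] += alpha * (r + gamma * Q2[(s_next, a_star)] - Q1[(s, a)])
    else:
        # symmetric update: select with Q2, evaluate with Q1
        a_star = max(actions, key=lambda ap: Q2[(s_next, ap)])
        Q2[(s, a)] += alpha * (r + gamma * Q1[(s_next, a_star)] - Q2[(s, a)])

# defaultdicts give the optimistic-free default Q(s, a) = 0 for unseen pairs
Q1 = defaultdict(float)
Q2 = defaultdict(float)
```

Because only one table is touched per step, the per-step cost matches tabular Q-learning, while the second `defaultdict` is what doubles the memory footprint.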
However, the idea has been fundamental to further extensions of Q-learning that aim to deal with large state and/or action spaces.

References:
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
Van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23.