
What is the difference between model-based and model-free reinforcement learning?

아니짜 2018. 5. 27. 19:57

https://ai.stackexchange.com/questions/4456/whats-the-difference-between-model-free-and-model-based-reinforcement-learning


Model-based reinforcement learning has the agent try to understand the world and build a model to represent it. The model aims to capture two functions: the state transition function T and the reward function R. With this model as a reference, the agent can plan accordingly.

However, learning a model is not necessary; the agent can instead learn a policy directly, using algorithms like Q-learning or policy gradient.

A simple check to see if an RL algorithm is model-based or model-free is: 

If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it’s a model-based RL algorithm.
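A rough way to picture this check in code (a minimal tabular sketch; the class and attribute names are illustrative assumptions, not from the quoted answer): a model-based agent carries estimates of T and R and can therefore say what happens next before acting, while a model-free agent carries only action values and cannot.

```python
# Minimal sketch of the check above (tabular setting assumed; names are illustrative).
from collections import defaultdict


class ModelBasedAgent:
    """Keeps estimates of T(s'|s, a) and R(s, a), so it can predict before acting."""

    def __init__(self):
        self.T_hat = defaultdict(dict)   # (s, a) -> {next_state: estimated probability}
        self.R_hat = {}                  # (s, a) -> estimated reward

    def predict(self, s, a):
        # The defining ability: next-state distribution and reward,
        # without taking a step in the real environment.
        return self.T_hat[(s, a)], self.R_hat.get((s, a), 0.0)


class ModelFreeAgent:
    """Keeps only action values Q(s, a); it has no way to say what the next state will be."""

    def __init__(self, actions):
        self.Q = defaultdict(float)      # (s, a) -> estimated return
        self.actions = actions

    def act(self, s):
        # Picks an action greedily from Q; T and R appear nowhere.
        return max(self.actions, key=lambda a: self.Q[(s, a)])
```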


Summary


Model-based RL estimates the transition function T and the reward function R, and plans from these two functions.


Model-free RL does not learn an intermediate model (the transition function T and the reward function R); instead, it decides how to act by learning a policy directly.

e.g., Q-learning, policy gradient.


A simple way to check:

After learning, if the agent can predict the next state and reward before it takes each action (that is, it holds internal estimates of T and R that it can consult), it is model-based RL.

If it cannot make such predictions (it holds no such estimates and only knows which action to take, or how good each action is), it is model-free RL.
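As a concrete sketch of the model-free examples named above, here is tabular Q-learning (the `env.reset()` / `env.step(a)` / `env.actions` interface is a hypothetical Gym-like convention I am assuming, not something from the quoted answer). Note that no estimate of T or R is ever formed:

```python
# Tabular Q-learning sketch (model-free): learns action values directly, never T or R.
import random
from collections import defaultdict


def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)        # the only place the real world is consulted
            # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```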



------------------------------------------ no need to read below this line -----------------------------------------


Original post: https://www.quora.com/What-is-the-difference-between-model-based-and-model-free-reinforcement-learning


What is the difference between model-based and model-free reinforcement learning?


To answer this question, let's revisit the components of an MDP, the most typical decision-making framework for RL.



An MDP is typically defined by a 4-tuple (S,A,R,T) where


S is the state/observation space of an environment

A is the set of actions the agent can choose between

R(s, a) is a function that returns the reward received for taking action a in state s

T(s' | s, a) is a transition probability function, specifying the probability that the environment will transition to state s' if the agent takes action a in state s.


Our goal is to find a policy π that maximizes the expected future (discounted) reward.
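For concreteness, the 4-tuple can be written out directly. The tiny two-state MDP below is made up purely for illustration (a sketch, not anything from the original post):

```python
# The (S, A, R, T) tuple above as plain Python data (toy two-state MDP, invented for illustration).
S = ["s0", "s1"]                    # state space
A = ["left", "right"]               # action set

# R(s, a): reward received for taking action a in state s
R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 0.0, ("s1", "right"): 2.0}

# T(s' | s, a): probability of transitioning to s' after taking a in s
T = {("s0", "left"):  {"s0": 1.0},
     ("s0", "right"): {"s0": 0.2, "s1": 0.8},
     ("s1", "left"):  {"s0": 1.0},
     ("s1", "right"): {"s1": 1.0}}

gamma = 0.9                         # discount factor for the expected future (discounted) reward
```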


Now if we know what all those elements of an MDP are, we can just compute the solution before ever actually executing an action in the environment.

In AI, computing the solution to a decision-making problem before actually executing a decision is typically called planning.


Some classic planning algorithms for MDPs include Value Iteration, Policy Iteration, and a whole lot more.
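For instance, Value Iteration is only a few lines once S, A, R, and T are all known; a sketch, reusing the toy MDP written out above:

```python
# Value Iteration sketch: pure planning, possible only because R and T are fully known.
def value_iteration(S, A, R, T, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman optimality backup: V(s) = max_a [ R(s, a) + gamma * sum_s' T(s'|s, a) V(s') ]
            new_v = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                        for a in A)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    pi = {s: max(A, key=lambda a: R[(s, a)] +
                 gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
          for s in S}
    return V, pi
```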


But the RL problem isn't so kind to us.

What makes a problem an RL problem, rather than a planning problem, is that the agent does *not* know all the elements of the MDP, precluding it from being able to plan a solution.

Specifically, the agent does not know how the world will change in response to its actions (the transition function T),

nor what immediate reward it will receive for doing so (the reward function R).

The agent will simply have to try taking actions in the environment, observe what happens, and somehow, find a good policy from doing so.


So, if the agent knows neither the transition function T nor the reward function R,

preventing it from planning out a solution, how can it find a good policy?

Well, it turns out there are lots of ways!


One approach that might immediately strike you, after framing the problem like this, is for the agent to learn a model of how the environment works from its observations and then plan a solution using that model.



That is, if the agent is currently in state s1, takes action a1,

and then observes the environment transition to state s2 with reward r2,

that information can be used to improve its estimate of T(s2|s1, a1) and R(s1, a1), 

which can be performed using supervised learning approaches.

Once the agent has adequately modeled the environment, it can use a planning algorithm with that learned model (for example, Value Iteration from above) to find a good policy; this is the model-based approach.
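In the tabular case, that supervised estimation can be as simple as counting transitions and averaging rewards (a minimal sketch; the maximum-likelihood counting scheme and class name below are my assumptions about one straightforward implementation):

```python
# Tabular model learning sketch: estimate T and R from observed (s, a, r, s') transitions by counting.
from collections import defaultdict


class LearnedModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
        self.reward_sum = defaultdict(float)                 # (s, a) -> total observed reward
        self.visits = defaultdict(int)                       # (s, a) -> number of visits

    def update(self, s, a, r, s_next):
        # e.g. s = s1, a = a1, observed reward r2 and next state s2, as in the text above.
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def T_hat(self, s, a):
        n = self.visits[(s, a)]
        return {s2: c / n for s2, c in self.counts[(s, a)].items()} if n else {}

    def R_hat(self, s, a):
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```

With T_hat and R_hat in hand, a planning routine like the Value Iteration sketch above can be run on the learned model, which is the learn-a-model-then-plan loop described here.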



Actor-critic and policy search methods directly search over policy space to find policies that result in better reward from the environment.
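As a rough illustration of what searching directly over policy space can look like, here is a REINFORCE-style policy-gradient sketch on a tabular softmax policy (no critic, so it is policy search rather than actor-critic; the environment interface and parameterization are my assumptions):

```python
# REINFORCE-style sketch: adjust policy parameters from sampled returns; no model of T or R, no value critic.
import math
import random
from collections import defaultdict


def softmax_probs(theta, s, actions):
    prefs = [theta[(s, a)] for a in actions]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce(env, episodes=500, alpha=0.01, gamma=0.99):
    theta = defaultdict(float)                     # policy parameters theta[(state, action)]
    for _ in range(episodes):
        # Roll out one episode with the current stochastic policy.
        s, done, trajectory = env.reset(), False, []
        while not done:
            probs = softmax_probs(theta, s, env.actions)
            a = random.choices(env.actions, weights=probs)[0]
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # Monte Carlo return G_t, then a step along grad log pi(a|s) * G_t.
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            probs = softmax_probs(theta, s, env.actions)
            for a_i, p in zip(env.actions, probs):
                grad_log = (1.0 if a_i == a else 0.0) - p   # d log softmax / d theta[(s, a_i)]
                theta[(s, a_i)] += alpha * grad_log * G
    return theta
```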
