
ADP & RL

What is MDP and how is it defined?

A Markov decision process: a tuple (S, A, p, g, T)
state space, action space, transition probabilities, stage reward function, finite horizon
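The tuple can be written down directly as data. A minimal sketch, assuming a hypothetical 2-state, 2-action MDP (none of these numbers come from the course):

```python
S = [0, 1]                                  # state space
A = [0, 1]                                  # action space
# p[s][a][s'] = probability of moving to s' after taking a in s (hypothetical)
p = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.0, 1.0]}}
# g[s][a] = stage reward/cost (hypothetical)
g = {0: {0: 1.0, 1: 2.0},
     1: {0: 0.5, 1: 0.0}}
T = 10                                      # finite horizon

# every row of p must be a probability distribution
for s in S:
    for a in A:
        assert abs(sum(p[s][a]) - 1.0) < 1e-12
```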

ADP & RL

What is the principle of optimality for finite horizon problems?

A policy is optimal if and only if each of its tail policies is optimal for the corresponding tail subproblem (whatever the initial state and the decisions made so far).

ADP & RL

How do we ensure the boundedness of the value function for infinite horizon problems?

Add a discount factor \gamma \in (0, 1) (the geometric series \sum_k \gamma^k converges to 1/(1-\gamma)) and require the stage reward to be bounded, |g(..)| <= M; the value function is then bounded by M/(1-\gamma).
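A quick numeric sanity check of that bound (gamma and M are invented numbers): summing the worst-case discounted costs gamma^k * M approaches M / (1 - gamma).

```python
# With |g| <= M, the discounted sum is at most sum_k gamma^k * M = M / (1 - gamma).
gamma, M = 0.9, 1.0

partial = sum(gamma**k * M for k in range(1000))   # worst-case discounted cost
bound = M / (1 - gamma)                            # geometric-series limit

# the partial sums approach the bound from below
assert abs(bound - partial) < 1e-12
```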

ADP & RL

What are the properties of the Bellman operator?

- Monotonicity
- Constant shift
- Contraction
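All three properties can be checked numerically. A minimal sketch, assuming a hypothetical 2-state, 2-action discounted MDP (numbers invented for illustration, not from the course):

```python
gamma = 0.9
p = [[[0.9, 0.1], [0.2, 0.8]],      # p[s][a][s'] (hypothetical MDP)
     [[0.5, 0.5], [0.0, 1.0]]]
g = [[1.0, 2.0], [0.5, 0.0]]        # g[s][a] stage cost

def T(J):
    """Bellman operator: (TJ)(s) = min_a [ g(s,a) + gamma * sum_s' p(s,a,s') J(s') ]."""
    return [min(g[s][a] + gamma * sum(p[s][a][t] * J[t] for t in (0, 1))
                for a in (0, 1)) for s in (0, 1)]

J1, J2 = [0.0, 0.0], [1.0, 2.0]     # J1 <= J2 componentwise

# Monotonicity: J1 <= J2 implies TJ1 <= TJ2
assert all(x <= y for x, y in zip(T(J1), T(J2)))

# Constant shift: T(J + c*e) = TJ + gamma*c*e for the all-ones vector e
c = 3.0
assert all(abs(x - (y + gamma * c)) < 1e-12
           for x, y in zip(T([v + c for v in J1]), T(J1)))

# Contraction: ||TJ1 - TJ2||_inf <= gamma * ||J1 - J2||_inf
lhs = max(abs(x - y) for x, y in zip(T(J1), T(J2)))
rhs = gamma * max(abs(x - y) for x, y in zip(J1, J2))
assert lhs <= rhs + 1e-12
```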

ADP & RL

When do VI and PI terminate?

VI generally requires an infinite number of iterations to converge exactly.
PI terminates after a finite number of steps, because a finite state and action space admits only finitely many stationary policies.
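The difference can be sketched on a hypothetical 2-state, 2-action discounted MDP (numbers invented): VI must be stopped by a tolerance, while PI stops exactly once the policy repeats.

```python
gamma, n = 0.9, 2
p = [[[0.9, 0.1], [0.2, 0.8]],      # p[s][a][s'] (hypothetical MDP)
     [[0.5, 0.5], [0.0, 1.0]]]
g = [[1.0, 2.0], [0.5, 0.0]]        # g[s][a] stage cost

def q(s, a, J):
    return g[s][a] + gamma * sum(p[s][a][t] * J[t] for t in range(n))

# Value iteration: only converges in the limit; stop at a tolerance.
J = [0.0] * n
for _ in range(10_000):
    Jn = [min(q(s, a, J) for a in range(2)) for s in range(n)]
    done = max(abs(x - y) for x, y in zip(J, Jn)) < 1e-12
    J = Jn
    if done:
        break

# Policy iteration: evaluation (done iteratively here, for brevity) plus
# greedy improvement; terminates exactly when the policy stops changing.
mu = [0, 0]
while True:
    Jm = [0.0] * n
    for _ in range(5_000):                            # policy evaluation
        Jm = [q(s, mu[s], Jm) for s in range(n)]
    new = [min(range(2), key=lambda a: q(s, a, Jm))   # policy improvement
           for s in range(n)]
    if new == mu:
        break
    mu = new

# Both reach the same optimal cost-to-go.
assert max(abs(x - y) for x, y in zip(J, Jm)) < 1e-8
```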

ADP & RL

What is the optimality condition?

A stationary policy \mu is optimal if and only if \mu(s) attains the minimum in Bellman's equation for every state s.

ADP & RL

What are characteristics of contraction mappings?

- They have a unique fixed point J* that satisfies J* = TJ*
- T^k J converges to J* as k -> inf, for any starting point J
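A minimal illustration with a scalar map (invented for this card, it is not the Bellman operator itself): a gamma-contraction has a unique fixed point, and iterating it from any starting point converges there.

```python
gamma = 0.5

def T(x):
    # |T(x) - T(y)| = gamma * |x - y|: a contraction whose unique fixed
    # point is x* = 1 / (1 - gamma) = 2, since T(2) = 2.
    return gamma * x + 1.0

x, y = 0.0, 100.0           # two very different starting points
for _ in range(200):        # T^k for large k
    x, y = T(x), T(y)

# both iterates have reached the same fixed point
assert abs(x - 2.0) < 1e-12 and abs(y - 2.0) < 1e-12
```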

ADP & RL

What are the characteristics of the monotonicity property?

It implies the optimality of J*:
J* = min_\mu J_\mu

ADP & RL

Why is the constant shift property important?

Together with monotonicity, it is used to establish the contraction property of the Bellman operator.
It is also relevant for deriving error bounds.

ADP & RL

How does optimistic PI differ from regular PI?

The policy evaluation step is different: the value function of the current policy is only computed approximately, by applying a finite number of steps of T_\mu (i.e., T_\mu^m J).
The policy improvement step stays the same.
In practice it often converges to the optimal policy much faster.
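A sketch on a hypothetical 2-state, 2-action discounted MDP (numbers invented): evaluation applies T_\mu only m times instead of solving for J_\mu exactly.

```python
gamma, n, m = 0.9, 2, 5
p = [[[0.9, 0.1], [0.2, 0.8]],      # p[s][a][s'] (hypothetical MDP)
     [[0.5, 0.5], [0.0, 1.0]]]
g = [[1.0, 2.0], [0.5, 0.0]]        # g[s][a] stage cost

def q(s, a, J):
    return g[s][a] + gamma * sum(p[s][a][t] * J[t] for t in range(n))

J = [0.0] * n
for _ in range(200):
    # policy improvement (unchanged from regular PI)
    mu = [min(range(2), key=lambda a: q(s, a, J)) for s in range(n)]
    # approximate policy evaluation: only m applications of T_mu
    for _ in range(m):
        J = [q(s, mu[s], J) for s in range(n)]

# converges to the optimal cost-to-go of this MDP, J* = [2/0.82, 0]
assert abs(J[0] - 2 / 0.82) < 1e-6 and abs(J[1]) < 1e-6
```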

ADP & RL

What is one issue of simulation-based PI? And how do you solve it?

Inadequate exploration: generating cost samples under the current policy can bias the simulation and underrepresent some states.
Two possible fixes:
- Break the simulation into multiple short trajectories with different initial states
- Artificially induce extra randomization in the action selection
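Both fixes can be sketched in a few lines (hypothetical 2-state MDP and parameters, invented for illustration): random restarts plus epsilon-random actions ensure every state appears in the simulation data.

```python
import random

random.seed(0)
eps = 0.1
p = [[[0.9, 0.1], [0.2, 0.8]],      # p[s][a][s'] (hypothetical MDP)
     [[0.5, 0.5], [0.0, 1.0]]]
mu = [1, 1]                          # policy being evaluated
visits = [0, 0]

for _ in range(500):
    s = random.choice([0, 1])        # fix 1: many short trajectories, random starts
    for _ in range(10):              # short trajectory
        visits[s] += 1
        # fix 2: extra randomization -- with prob. eps take a random action
        a = random.randrange(2) if random.random() < eps else mu[s]
        s = random.choices([0, 1], weights=p[s][a])[0]

# every state now shows up in the simulation data
assert all(v > 0 for v in visits)
```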

ADP & RL

What are the advantages of Dynamic Programming (as opposed to optimization algorithms)?

DP decomposes the problem into (tail) subproblems, solves each subproblem only once, and reuses its solution at the next stage.
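As a sketch, the finite-horizon backward recursion makes this concrete: each tail subproblem is solved exactly once and reused one stage earlier (hypothetical 2-state MDP with horizon 10, numbers invented):

```python
T_horizon = 10
p = [[[0.9, 0.1], [0.2, 0.8]],      # p[s][a][s'] (hypothetical MDP)
     [[0.5, 0.5], [0.0, 1.0]]]
g = [[1.0, 2.0], [0.5, 0.0]]        # g[s][a] stage cost

J = [0.0, 0.0]                       # terminal cost
for _ in range(T_horizon):           # backward in time; each tail solved once
    J = [min(g[s][a] + sum(p[s][a][t] * J[t] for t in range(2))
             for a in range(2)) for s in range(2)]

# state 1 can stay put at zero cost, so its tail cost is 0; state 0 pays >= 1
assert J[1] == 0.0 and J[0] >= 1.0
```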
