
Are the state-action values and the state value function equivalent for a given policy?


Are the state-action values and the state value function equivalent for a given policy? I would assume so, as the value function is defined as $V_{\pi}(s)=\sum_a \pi(a|s)Q_{\pi}(s,a)$. If we are following a greedy policy and hence acting optimally, doesn't this mean that the policy is in fact deterministic, so that $\pi(a|s)$ is $1$ for the optimal action and $0$ for all others? Would this then lead to an equivalence between the two?
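
To make this concrete, here is a minimal sketch in Python (the Q-values and policy probabilities are made-up numbers, purely for illustration) showing that the sum $\sum_a \pi(a|s)Q_{\pi}(s,a)$ collapses to a single Q-value when the policy is deterministic and greedy:

```python
# Minimal sketch: V(s) as the policy-weighted average of Q(s, a).
# The Q-values and policy probabilities below are arbitrary, illustrative numbers.

q = {"left": 1.0, "right": 3.0, "stay": 2.0}  # hypothetical Q_pi(s, a) for one fixed state s

def state_value(policy, q_values):
    """V(s) = sum_a pi(a|s) * Q_pi(s, a)."""
    return sum(policy[a] * q_values[a] for a in q_values)

stochastic_pi = {"left": 0.2, "right": 0.5, "stay": 0.3}
greedy_action = max(q, key=q.get)
greedy_pi = {a: (1.0 if a == greedy_action else 0.0) for a in q}

print(state_value(stochastic_pi, q))  # 2.3 -- a mixture of the Q-values
print(state_value(greedy_pi, q))      # 3.0 -- exactly max_a Q(s, a) = Q(s, pi(s))
```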

Here is my attempt to formulate some form of proof. I start from the idea that a policy $\pi^*$ is defined to be at least as good as the current policy $\pi$ if, for all states $s$, $Q_{\pi}(s,\pi^*(s))\geq V_{\pi}(s)$:

I iteratively apply the improved policy at each successive time step, unrolling the expectation until only the rewards obtained under $\pi^*$ remain:

$$V_{\pi}(s)\leq Q_{\pi}(s,\pi^*(s))$$
$$=\mathbb{E}_{\pi^*}[R_{t+1}+\gamma V_{\pi}(S_{t+1})\mid S_t=s]$$
$$\leq \mathbb{E}_{\pi^*}[R_{t+1}+\gamma Q_{\pi}(S_{t+1},\pi^*(S_{t+1}))\mid S_t=s]$$
$$\leq \mathbb{E}_{\pi^*}[R_{t+1}+\gamma R_{t+2}+\gamma^2 Q_{\pi}(S_{t+2},\pi^*(S_{t+2}))\mid S_t=s]$$
$$\leq \mathbb{E}_{\pi^*}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots\mid S_t=s]$$
$$=V_{\pi^*}(s)$$
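
To sanity-check this chain numerically, here is a small sketch on a made-up two-state, two-action MDP (the transitions, rewards, and $\gamma=0.9$ are arbitrary assumptions, not from the question). It evaluates a uniform random policy $\pi$, builds the greedy policy $\pi^*$ from $Q_{\pi}$, and verifies $V_{\pi}(s)\leq Q_{\pi}(s,\pi^*(s))\leq V_{\pi^*}(s)$ in every state:

```python
# Sketch: policy evaluation plus one greedy improvement step on a tiny,
# made-up deterministic MDP, checking V_pi(s) <= Q_pi(s, pi*(s)) <= V_pi*(s).

GAMMA = 0.9
# transition[(state, action)] = (next_state, reward) -- arbitrary illustrative dynamics
transition = {
    ("s0", "a0"): ("s0", 0.0),
    ("s0", "a1"): ("s1", 1.0),
    ("s1", "a0"): ("s0", 2.0),
    ("s1", "a1"): ("s1", 0.0),
}
states = ["s0", "s1"]
actions = ["a0", "a1"]

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: V(s) = sum_a pi(a|s) * (r + gamma * V(s'))."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_v = {}
        for s in states:
            total = 0.0
            for a in actions:
                s_next, r = transition[(s, a)]
                total += policy[s][a] * (r + GAMMA * v[s_next])
            new_v[s] = total
        v = new_v
    return v

def q_from_v(v):
    """Q(s, a) = r + gamma * V(s') for the deterministic dynamics above."""
    return {(s, a): transition[(s, a)][1] + GAMMA * v[transition[(s, a)][0]]
            for s in states for a in actions}

uniform = {s: {a: 0.5 for a in actions} for s in states}
v_pi = evaluate(uniform)
q_pi = q_from_v(v_pi)

# Greedy improvement: pi*(s) = argmax_a Q_pi(s, a), expressed as a deterministic policy.
greedy = {s: {a: (1.0 if a == max(actions, key=lambda b: q_pi[(s, b)]) else 0.0)
              for a in actions}
          for s in states}
v_star = evaluate(greedy)

for s in states:
    a_star = max(actions, key=lambda b: q_pi[(s, b)])
    assert v_pi[s] <= q_pi[(s, a_star)] <= v_star[s] + 1e-9
    # e.g. s0: 7.25 <= 7.975 <= 14.737, s1: 7.75 <= 8.525 <= 15.263
    print(s, round(v_pi[s], 3), round(q_pi[(s, a_star)], 3), round(v_star[s], 3))
```

Since $\pi^*$ here is deterministic, $V_{\pi^*}(s)=Q_{\pi^*}(s,\pi^*(s))$ in every state, which is exactly the collapse of the sum into a single Q-value discussed above.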

I would say that the final two lines in fact hold with equality, and for me this makes intuitive sense: if we are always taking a deterministic greedy action, our value function and Q function are the same. As detailed here, for a given policy and state we have $V_{\pi}(s)=\sum_a \pi(a|s)Q_{\pi}(s,a)$, and if the policy is optimal and hence greedy, then $\pi(a|s)$ is $1$ for the greedy action and $0$ for all others, so the sum collapses to a single Q-value.

