Vanilla Actor-Critic (VAC)
We have previously discussed the Actor-Critic and Soft Actor-Critic (SAC) frameworks. Today, I would like to revisit the most fundamental form of the framework to build a deeper understanding and connect it to a real-world application. Specifically, I will focus on the Vanilla Actor-Critic (VAC) algorithm, explain its details step by step, and then demonstrate how it can be applied to a simple stock prediction task.
In brief, the actor is the component that takes actions, and the critic is the component that evaluates each action.
Formal Definition of Vanilla Actor-Critic
A Markov Decision Process (MDP) is defined by the tuple $(S, A, P, R, \gamma)$, where:
- $S$: Set of states
- $A$: Set of actions
- $P(s'|s,a)$: Probability of transitioning to state $s'$ from state $s$ after taking action $a$
- $R$: Reward function
- $\gamma \in [0, 1)$: Discount factor
As before, the objective function, which maximizes the expected cumulative discounted reward, is defined as:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
- $\pi_{\theta}$: policy parameterized by $\theta$ (the actor network)
- $r_t = R(s_t, a_t, s_{t+1})$: reward received at step $t$
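To make the objective concrete, here is a minimal sketch (plain Python, purely illustrative) of the discounted return inside the expectation, computed for a single finite trajectory of rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one finite trajectory (illustrative only)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Horner-style accumulation of the discounted sum
    return g

# Example: rewards [1.0, 0.0, 2.0] with gamma = 0.9 -> 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```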
Components
- Actor: represents the policy $\pi_{\theta}(a|s)$, which maps a state $s \in S$ to a probability distribution over actions $a \in A$; it is usually updated with a policy gradient method to increase the likelihood of advantageous actions.
- Critic: estimates the state-value function $V_{\phi}(s)$, parameterized by $\phi$, which represents the expected cumulative discounted reward starting from state $s$ under policy $\pi_{\theta}$:
$$V_{\phi}(s) \approx \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
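As a concrete sketch of these two components (assuming PyTorch and a discrete action space; the layer sizes and class names here are illustrative, not the exact networks used in the experiment below):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta(a|s): maps a state to a distribution over discrete actions."""
    def __init__(self, state_dim, n_actions, hidden=64):  # hidden size is arbitrary here
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Return a categorical distribution so actions can be sampled and log-probs taken
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    """Value network V_phi(s): maps a state to a scalar value estimate."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```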
Update Rules
The update rules for both networks are summarized below.
- Policy Gradient (Actor Network)
$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \cdot A(s_t, a_t)]$$
$$A(s_t, a_t) \approx \delta_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)$$
Notice that the advantage function is approximated by the basic one-step temporal-difference error (TD(0)). Once everything is set up, simply update the parameter $\theta$:
$$\theta \leftarrow \theta + \alpha_1 \cdot \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \cdot \delta_t$$
- Value Function (Critic Network)
Apply gradient descent on the squared TD error so that $V_{\phi}$ approximates the value function under the current policy $\pi_{\theta}$ (which gradually improves toward the optimal policy $\pi^{*}$):
$$\delta_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)$$
$$\phi \leftarrow \phi + \alpha_2 \cdot \delta_t \cdot \nabla_{\phi} V_{\phi}(s_t)$$
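Putting the two rules together, a single-transition update can be sketched as follows (PyTorch again; the `Actor`/`Critic` modules from the sketch above and the optimizers `actor_opt`/`critic_opt`, e.g. `torch.optim.Adam`, are assumptions for illustration):

```python
import torch

def vac_update(actor, critic, actor_opt, critic_opt,
               s, a, r, s_next, done, gamma=0.99):
    """One TD(0) actor-critic update for a single transition (s, a, r, s_next)."""
    # TD target uses the next-state value; it is treated as a constant (semi-gradient).
    with torch.no_grad():
        v_next = torch.tensor(0.0) if done else critic(s_next)
        td_target = r + gamma * v_next

    # Critic update: gradient descent on the squared TD error.
    v_s = critic(s)
    td_error = td_target - v_s
    critic_loss = td_error.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: increase log pi_theta(a|s) in proportion to the TD error (advantage).
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * td_error.detach())
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return td_error.item()
```

The TD error is detached before it weights the actor's log-probability, so it acts as a fixed advantage coefficient rather than a differentiable target.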
Check the details of the Policy Gradient Theorem in the following posts.
Real-World Application - Stock Prediction
I implemented stock prediction using the VAC algorithm with a fixed observation window of 30 days, trained on two years of data. For this experiment, I selected SOFI and NVDA as the target stocks. The dataset contains a total of 502 trading days, which I split into 386 days for training and 97 days for testing.
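The preprocessing code itself is not shown in this post, but the 30-day observation window and the chronological split can be sketched roughly as follows (NumPy; `make_windows`, `chronological_split`, and the 80/20 fraction are illustrative assumptions):

```python
import numpy as np

def make_windows(features: np.ndarray, window: int = 30) -> np.ndarray:
    """Stack rolling windows of daily features into observations.

    features: shape (n_days, n_features) -> output shape (n_days - window, window, n_features).
    """
    return np.stack([features[t - window:t] for t in range(window, len(features))])

def chronological_split(observations: np.ndarray, train_frac: float = 0.8):
    """Split in time order (no shuffling): earlier days for training, later days for testing."""
    cut = int(len(observations) * train_frac)
    return observations[:cut], observations[cut:]
```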
The feature set I used for training includes:
- Returns
- High–Low Percentage (High_Low_Pct)
- Open–Close Percentage (Open_Close_Pct)
- Relative Strength Index (RSI)
- Bollinger Band Position (BB_Position)
- Volume Ratio
- Price-to-MA5 Ratio
- Price-to-MA10 Ratio
- Price-to-MA20 Ratio
Of course, additional indices or factors can be incorporated if doing so improves both the accuracy and interpretability of the model in capturing trend fluctuations.
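The exact feature-engineering code is omitted here, but as a rough sketch (assuming a pandas DataFrame with `Open`, `High`, `Low`, `Close`, and `Volume` columns; the 14-day RSI, 20-day Bollinger Bands, and 20-day volume average are common defaults, not confirmed settings), the indicators above can be derived like this:

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the listed indicators from OHLCV columns (illustrative parameter choices)."""
    out = pd.DataFrame(index=df.index)
    out["Returns"] = df["Close"].pct_change()
    out["High_Low_Pct"] = (df["High"] - df["Low"]) / df["Close"]
    out["Open_Close_Pct"] = (df["Close"] - df["Open"]) / df["Open"]

    # Price relative to short/medium moving averages
    for n in (5, 10, 20):
        out[f"Price_MA{n}_Ratio"] = df["Close"] / df["Close"].rolling(n).mean()

    # Volume relative to its 20-day average
    out["Volume_Ratio"] = df["Volume"] / df["Volume"].rolling(20).mean()

    # 14-day RSI (simple rolling-mean variant)
    delta = df["Close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["RSI"] = 100 - 100 / (1 + gain / loss)

    # Position within 20-day Bollinger Bands (0 = lower band, 1 = upper band)
    ma20 = df["Close"].rolling(20).mean()
    std20 = df["Close"].rolling(20).std()
    out["BB_Position"] = (df["Close"] - (ma20 - 2 * std20)) / (4 * std20)

    return out.dropna()
```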
The following results illustrate the overall portfolio changes and the corresponding trading actions.
Figure A: SOFI (portfolio changes and trading actions)
Figure B: NVDA (portfolio changes and trading actions)
However, for SOFI, we observe that the majority of actions involve simply holding the position. This is largely a result of its trending behavior, as such a non-stationary environment is inherently difficult to predict. Nevertheless, this implies that for similar stocks, the optimal strategy may often be to hold rather than trade frequently.
In contrast, NVDA exhibits a distinctly different distribution of actions. Understanding the differences between these two cases is crucial for interpreting how the model adapts to varying market dynamics.
