However, standard adaptive methods rely on exponential moving averages (EMA) with fixed decay hyperparameters ($\beta_1, \beta_2$). A fixed decay rate cannot track how the optimization landscape changes over the course of training. Early in training, gradient variance is high and calls for heavier averaging over past gradients; later, fine-tuning calls for greater sensitivity to recent gradients.

In the stochastic setting, we observe a random function $F(\theta, \xi)$ where $\xi$ is a random variable representing data samples. At iteration $t$, we compute the stochastic gradient $g_t = \nabla_\theta F(\theta_t, \xi_t)$.
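As a concrete illustration of this setup, the sketch below assumes a squared-error loss $F(\theta, \xi) = \tfrac{1}{2}(\theta^\top x - y)^2$ with a sample $\xi = (x, y)$. The loss and the function name `stochastic_gradient` are hypothetical choices made for illustration only; the text does not fix a particular $F$.

```python
def stochastic_gradient(theta, sample):
    """Gradient of the assumed loss F(theta, xi) = 0.5 * (theta . x - y)^2.

    `sample` plays the role of xi_t and is a pair (x, y), where x is a list
    of floats with the same length as theta and y is a float.
    """
    x, y = sample
    # Residual of the linear prediction on this single sample.
    residual = sum(t_i * x_i for t_i, x_i in zip(theta, x)) - y
    # d/d(theta) of 0.5 * residual^2 is residual * x.
    return [residual * x_i for x_i in x]


# One stochastic gradient g_t at theta_t for a single drawn sample xi_t.
g_t = stochastic_gradient([1.0, 2.0], ([3.0, 4.0], 10.0))
```

Each call uses a single sample, so $g_t$ is a noisy estimate of the gradient of the expected loss rather than the full gradient.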

Standard EMA-based optimizers update the first moment $m_t$ and second moment $v_t$ as:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
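The two recursions can be sketched in a few lines of plain Python. This is a minimal illustration, not a full optimizer: it assumes the moments are initialized to zero and that $g_t^2$ is taken elementwise, neither of which is stated above, and the function name `ema_moments` and the default decay values are hypothetical.

```python
def ema_moments(grads, beta1=0.9, beta2=0.999):
    """Run the fixed-decay EMA recursions over a sequence of gradients.

    `grads` is a non-empty list of gradients, each a list of floats of equal
    length. Returns a list of (m_t, v_t) pairs, one per gradient.
    Assumption: m_0 = v_0 = 0 and the square of g_t is elementwise.
    """
    dim = len(grads[0])
    m = [0.0] * dim
    v = [0.0] * dim
    history = []
    for g in grads:
        # First moment: m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
        m = [beta1 * m_i + (1.0 - beta1) * g_i for m_i, g_i in zip(m, g)]
        # Second moment: v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
        v = [beta2 * v_i + (1.0 - beta2) * g_i * g_i for v_i, g_i in zip(v, g)]
        history.append((m, v))
    return history


# Two gradient steps with a deliberately short memory (beta = 0.5).
moments = ema_moments([[1.0, -2.0], [3.0, 0.0]], beta1=0.5, beta2=0.5)
```

Because `beta1` and `beta2` are constants here, every iteration weights past gradients the same way, which is the fixed-decay behavior the text describes.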
