[논문 리뷰] Sample-Efficient Reinforcement Learning with Action Chunking

논문 정보

제목: Sample-Efficient Reinforcement Learning with Action Chunking

저자: Qiyang Li, Zhiyuan Zhou, Sergey Levine, UC Berkeley.

학회: NeurIPS 2025

링크: https://arxiv.org/abs/2507.07969

Overview

Target task: Offline2Online, sample efficient RL, Exploration
Algorithm class: FQL (Flow Q-learning; [1]) with an extended action space
Motivation
1. slow 1-step TD backups
2. temporally incoherent actions for exploration
  - [Comment] Offline2online에서 발생하는 문제점을 해결하는 논문은 아니다. 1번은 sample efficient RL과 연관되고, 2번은 exploration과 연관되기 때문에 일반적인 online RL 분야에서 발생하는 문제를 다루는 논문
Solution: Action chunking
- 현재 시점의 행동만 출력하는 기존 정책 대신 현재부터 $h$ steps 동안 사용할 action sequence를 출력하는 정책 사용:
  \[\pi_\psi(\mathbf{a}_{t:t+h-1} \mid s_t) \quad\text{where}\quad \mathbf{a}_{t:t+h-1}= [\overbrace{a_t \;\; a_{t+1} \;\; \ldots \;\; a_{t+h-1}}^{h \;\text{times}}]^\top\]
- Open-loop control: $s_t$만 보고 $a_t$부터 $a_{t+h-1}$을 순서대로 환경과 상호작용
$h=5$ 예시.

Introduction

가장 큰 motivation

최근 imitation learning (IL) 분야에서 action chunking을 사용한 연구 등장 (ALOHA 논문; [2])

❓ Action chunking이 IL 분야에서는 사용되지만 RL에서는 사용되지 않은 이유

IL 데이터셋 (특히 인간이 직접 수집한 데이터셋)은 non-Markovian behavior을 포함하고 있을 수 있음
하지만 MDP로 문제를 formulate하는 RL의 경우 optimal policy가 Markovian poilcy이기 때문에 action chunking policy는 suboptimal하다.

💡 하지만 online RL에서 action chunking이 효과적일 수도 있는 이유

Optimal policy을 찾기 위해 꼭 필요한 과정인 exploration의 경우 non-Markovian 성격을 띠며 temporally extended actions가 더 효과적일 수 있다고 주장
1-step TD의 경우 value backup가 느려 sample inefficient하다.

Q-Learning with Action Chunking (Q-LAC)

현재 시점의 행동만 출력하는 기존 정책 대신 현재부터 $h$ steps 동안 사용할 action sequence를 출력하는 정책 사용:
\[\pi_\psi(\mathbf{a}_{t:t+h-1} \mid s_t) \quad\text{where}\quad \mathbf{a}_{t:t+h-1}= [\overbrace{a_t \;\; a_{t+1} \;\; \ldots \;\; a_{t+h-1}}^{h \;\text{times}}]^\top\]

Open-loop control: $s_t$만 보고 $a_t$부터 $a_{t+h-1}$을 순서대로 환경과 상호작용

이 decision process를 기존 Markov decision process (MDP)로부터 다시 formulation 가능

기존 MDP $\mathcal{M}=(\mathcal{S}, \mathcal{A}, r, p, \gamma)$ ⇒ 새로운 MDP $\bar{\mathcal{M}}=(\mathcal{S}, \bar{\mathcal{A}}, \bar{r}, \bar{p}, \gamma^h)$
원래 행동공간 $\mathcal{A}\subseteq\mathbb{R}^{d_a}$ 대신 extended action space
\[\bar{\mathcal{A}}\subseteq\mathbb{R}^{\overbrace{d_a \times d_a \times \ldots \times d_a}^{h \, \text{times}}}\]
보상함수 ($h$ steps 동안 받은 보상의 discounted sum)
\[\bar{r}(s_t, \mathbf{a}_{t:t+h-1})=\sum_{k=0}^{h-1}\gamma^{k} r(s_{t+k},a_{t+k})\]
전이확률분포 ($s_t$의 다음 상태가 $s_{t+h}$이 됨)

\[\bar{p}(s_{t+h} \mid s_t, \mathbf{a}_{t:t+h-1})=\prod_{k=0}^{h-1}p(s_{t+k+1} \mid s_{t+k},a_{t+k})\]

매 $h$ steps마다 transition $(s, a, r, s’)$ 을 저장하는 방식으로 쉽게 구현됨
\[\mathcal{D}=\lbrace \left(s_t,\mathbf{a}_{t:t+h-1},\sum_{k=0}^{h-1}\gamma^{k} r_{t+k}, s_{t+h}\right) : t=1, h,2h,\ldots, T.\rbrace\]

위 transition을 사용하면 다음과 같은 critic loss와 actor loss 를 얻음
\[\mathcal{L}(\theta)=\mathbb{E}_{\mathcal{D}}\left[ \left( Q_\theta(s_t,\mathbf{a}_{t:t+h-1}) - \sum_{k=0}^{h-1}\gamma^{k} r_{t+k} -\gamma^hQ_{\bar{\theta}}(s_{t+h},\mathbf{a}_{t+h:t+2h-1}) \right)^2 \right]\] \[\mathcal{L}(\psi)= - \operatorname*{\mathbb{E}}_{s_t\sim\mathcal{D}, \mathbf{a}_{t:t+h-1}^\pi\sim\pi(\cdot \mid s_t)} \left[ Q_\theta(s_t, \mathbf{a}_{t:t+h-1}^\pi) \right]\]

Offline2online을 할 때 일반적인 online RL알고리즘을 선택하고 위처럼 학습하면 성능이 오히려 떨어진다. (문제의 원인 설명 부족)
- RLPD (Reinforcement Learning with Prior Data): replay buffer에 offline dataset 넣고 online RL하는 offline2online 기법.

$\pi_\psi(\mathbf{a}_{t:t+h-1} \mid s_t)$이 충분히 복잡한 분포면 좋겠음 ⇒ FQL 손실함수 사용
- FQL 손실함수: offline dataset으로부터 behavior cloning한 pretrained flow matching network와 $\pi_\psi(\mathbf{a}_{t:t+h-1} \mid s_t)$의 behavior constrained term 추가된 형태

Experiment

Benchmark environment:

OGBench에서 5개 domain $\times$ 각 domain에서 2개 tasks
Robomimic에서 2개 tasks

Learning curves

Ablations studies

왼쪽: Chunk size에 따른 ablation
오른쪽: Action chunking이 exploration을 잘했다는 증거
- cube-triple 도메인에서 FQL 알고리즘과 Q-LAC 알고리즘을 사용하여 각각 1M steps 동안 offline pretraining한 후 데이터 1M 개 수집. $\mathcal{D}_1, \mathcal{D}_5$
- $\mathcal{D}_1^{\text{comp}}=$ Offline dataset + $\mathcal{D}_1$, $\mathcal{D}_5^{\text{comp}}=$ Offline dataset + $\mathcal{D}_5$ 으로 다시 offline ptraining한 결과

Reference

[1] Park, Seohong, Qiyang Li, and Sergey Levine. “Flow q-learning.” arXiv preprint arXiv:2502.02538 (2025).

[2] Zhao, Tony Z., et al. “Learning fine-grained bimanual manipulation with low-cost hardware.” arXiv preprint arXiv:2304.13705 (2023). https://tonyzhaozh.github.io/aloha/