Generating Animated Layouts
as Structured Text Representations

Yeonsang Shin1*, Jihwan Kim1*, Yumin Song1, Kyungseung Lee2, Hyunhee Chung2, Taeyoung Na2
1Seoul National University, 2SK telecom
*Equal contribution

CVPR 2025 Workshop on AI for Content Creation (AI4CC)
Teaser: Given text prompts that specify the content and style, VAKER creates animated layouts for video advertisements.

Abstract

Despite the remarkable progress in text-to-video models, achieving precise control over text elements and animated graphics remains a significant challenge, especially in applications such as video advertisements. To address this limitation, we introduce Animated Layout Generation, a novel approach that extends static graphic layouts with temporal dynamics. We propose a Structured Text Representation that enables fine-grained video control through hierarchical visual elements.

To demonstrate the effectiveness of our approach, we present VAKER (Video Ad maKER), a text-to-video advertisement generation pipeline that combines a three-stage generation process with Unstructured Text Reasoning for seamless integration with LLMs. VAKER fully automates video advertisement generation by incorporating dynamic layout trajectories for objects and graphics across specific video frames. Through extensive evaluations, we demonstrate that VAKER significantly outperforms existing methods in generating video advertisements.
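
To make the representation concrete, the following is a minimal, hypothetical sketch of how an animated layout could be expressed as structured text; the field names (kind, bbox, keyframes) are illustrative assumptions, not the paper's actual ST schema.

    # Illustrative sketch only: field names and structure are assumptions,
    # not VAKER's actual Structured Text (ST) schema.
    from dataclasses import dataclass, field

    @dataclass
    class Keyframe:
        frame: int                                 # video frame index
        bbox: tuple[float, float, float, float]    # (x, y, w, h), normalized

    @dataclass
    class Element:
        kind: str                                  # e.g., "text", "logo", "graphic"
        content: str                               # text string or asset reference
        keyframes: list[Keyframe] = field(default_factory=list)

    @dataclass
    class AnimatedLayout:
        banner: list[Element]                      # persistent banner elements
        mainground: list[Element]                  # main visual elements

    # Example: a hypothetical logo that slides right over 30 frames.
    logo = Element(kind="logo", content="brand_logo.png",
                   keyframes=[Keyframe(0, (0.0, 0.1, 0.2, 0.1)),
                              Keyframe(30, (0.4, 0.1, 0.2, 0.1))])

Because each element is plain text with explicit per-frame geometry, such a representation is both editable by an LLM and renderable programmatically.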

Results

Overview


VAKER transforms text prompts into video ads through a sequential pipeline in which specialized LoRA-adapted LLMs generate Structured Text representations for each component (Banner, Mainground, and Animation). Together, these stages convert user inputs into detailed specifications that define the layout, visual elements, and motion sequences of a cohesive, programmatically generated video advertisement.
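
A minimal sketch of this sequential flow, assuming a hypothetical StageLM interface around the three LoRA-adapted LLMs; that each stage conditions on the previous stages' outputs is an assumption consistent with the pipeline described above, not a specified interface.

    # Hypothetical sketch; StageLM is a stand-in for the three
    # LoRA-adapted LLMs, not an actual released API.
    from typing import Protocol

    class StageLM(Protocol):
        def generate(self, *context: str) -> str: ...

    def make_video_ad(prompt: str, banner_lm: StageLM,
                      mainground_lm: StageLM, animation_lm: StageLM):
        # Stage 1: Banner ST from the user prompt.
        banner_st = banner_lm.generate(prompt)
        # Stage 2: Mainground ST, conditioned on the banner.
        mainground_st = mainground_lm.generate(prompt, banner_st)
        # Stage 3: Animation ST, conditioned on the full static layout.
        animation_st = animation_lm.generate(prompt, banner_st, mainground_st)
        # The three ST blocks are then rendered programmatically into the ad.
        return banner_st, mainground_st, animation_st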

Dataset Construction Pipeline


Given a video advertisement, the pipeline (1) extracts the last frame, (2) detects and classifies objects using fine-tuned detection models, (3) generates the Banner ST and Mainground ST from the spatial layout, (4) tracks object movements using tracking models to generate the Animation ST, and (5) uses template-based prompting with LLMs to convert these ST representations into UT Reasonings and natural-language prompts.
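
Sketched in code, with each helper passed in as a stand-in for the fine-tuned detector, tracker, and templated LLM calls described above; none of these names are released APIs.

    # Hypothetical sketch of the annotation pipeline; every callable below
    # stands in for a model or templated LLM call, not an actual library.
    from typing import Callable

    def build_example(video_path: str,
                      extract_last_frame: Callable,
                      detect_objects: Callable,
                      layout_to_st: Callable,
                      track_objects: Callable,
                      llm_rewrite: Callable) -> dict:
        last_frame = extract_last_frame(video_path)      # (1) final-frame snapshot
        boxes = detect_objects(last_frame)               # (2) detect and classify elements
        banner_st, mainground_st = layout_to_st(boxes)   # (3) spatial layout -> Banner/Mainground ST
        animation_st = track_objects(video_path, boxes)  # (4) motion tracks -> Animation ST
        ut_reasoning, prompt = llm_rewrite(              # (5) templated LLM rewrite
            banner_st, mainground_st, animation_st)
        return {"prompt": prompt, "ut_reasoning": ut_reasoning,
                "banner_st": banner_st, "mainground_st": mainground_st,
                "animation_st": animation_st}

Each dictionary produced this way pairs a natural-language prompt with its target ST representations, yielding the supervision needed to train the stage-specific LLMs.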