Generating Animated Layouts
as Structured Text Representations

Yeonsang Shin1*, Jihwan Kim1*, Yumin Song1, Kyungseung Lee2, Hyunhee Chung2, Taeyoung Na2
1Seoul National University, 2SK telecom
*Equal contribution

CVPR 2025 Workshop on AI for Content Creation (AI4CC)
Teaser: Given text prompts that specify the content and style, VAKER creates animated layouts for video advertisements.

Abstract

Despite the remarkable progress in text-to-video models, achieving precise control over text elements and animated graphics remains a significant challenge, especially in applications such as video advertisements. To address this limitation, we introduce Animated Layout Generation, a novel approach that extends static graphic layouts with temporal dynamics. We propose a Structured Text Representation that enables fine-grained video control through hierarchical visual elements.

To demonstrate the effectiveness of our approach, we present VAKER (Video Ad maKER), a text-to-video advertisement generation pipeline that combines a three-stage generation process with Unstructured Text Reasoning for seamless integration with LLMs. VAKER fully automates video advertisement generation by incorporating dynamic layout trajectories for objects and graphics across specific video frames. Through extensive evaluations, we demonstrate that VAKER significantly outperforms existing methods in generating video advertisements.
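
To make the representation concrete, the following is a minimal, hypothetical sketch of how an animated layout could be expressed as structured text; the field names (kind, bbox, keyframes) are illustrative assumptions, not the paper's actual ST schema.

    # Illustrative sketch only: field names and structure are assumptions,
    # not VAKER's actual Structured Text (ST) schema.
    from dataclasses import dataclass, field

    @dataclass
    class Keyframe:
        frame: int                                 # video frame index
        bbox: tuple[float, float, float, float]    # (x, y, w, h), normalized

    @dataclass
    class Element:
        kind: str                                  # e.g., "text", "logo", "graphic"
        content: str                               # text string or asset reference
        keyframes: list[Keyframe] = field(default_factory=list)

    @dataclass
    class AnimatedLayout:
        banner: list[Element]                      # persistent banner elements
        mainground: list[Element]                  # main visual elements

    # Example: a hypothetical logo that slides right over 30 frames.
    logo = Element(kind="logo", content="brand_logo.png",
                   keyframes=[Keyframe(0, (0.0, 0.1, 0.2, 0.1)),
                              Keyframe(30, (0.4, 0.1, 0.2, 0.1))])

Because each element is plain text with explicit per-frame geometry, such a representation is both editable by an LLM and renderable programmatically.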

Results

Overview


VAKER transforms text prompts into video ads through a sequential pipeline in which specialized LoRA-adapted LLMs generate Structured Text representations for each component (Banner, Mainground, and Animation). Together, these stages convert user inputs into detailed specifications that define the layout, visual elements, and motion sequences of a cohesive, programmatically generated video advertisement.
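
A minimal sketch of this sequential flow, assuming a hypothetical StageLM interface around the three LoRA-adapted LLMs; that each stage conditions on the previous stages' outputs is an assumption consistent with the pipeline described above, not a specified interface.

    # Hypothetical sketch; StageLM is a stand-in for the three
    # LoRA-adapted LLMs, not an actual released API.
    from typing import Protocol

    class StageLM(Protocol):
        def generate(self, *context: str) -> str: ...

    def make_video_ad(prompt: str, banner_lm: StageLM,
                      mainground_lm: StageLM, animation_lm: StageLM):
        # Stage 1: Banner ST from the user prompt.
        banner_st = banner_lm.generate(prompt)
        # Stage 2: Mainground ST, conditioned on the banner.
        mainground_st = mainground_lm.generate(prompt, banner_st)
        # Stage 3: Animation ST, conditioned on the full static layout.
        animation_st = animation_lm.generate(prompt, banner_st, mainground_st)
        # The three ST blocks are then rendered programmatically into the ad.
        return banner_st, mainground_st, animation_st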

Dataset Construction Pipeline


Given a video advertisement, the pipeline (1) extracts the last frame, (2) detects and classifies objects using fine-tuned detection models, (3) generates the Banner ST and Mainground ST from the spatial layout, (4) tracks object movements using tracking models to generate the Animation ST, and (5) uses template-based prompting with LLMs to convert these ST representations into UT Reasonings and natural-language prompts.
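
Sketched in code, with each helper passed in as a stand-in for the fine-tuned detector, tracker, and templated LLM calls described above; none of these names are released APIs.

    # Hypothetical sketch of the annotation pipeline; every callable below
    # stands in for a model or templated LLM call, not an actual library.
    from typing import Callable

    def build_example(video_path: str,
                      extract_last_frame: Callable,
                      detect_objects: Callable,
                      layout_to_st: Callable,
                      track_objects: Callable,
                      llm_rewrite: Callable) -> dict:
        last_frame = extract_last_frame(video_path)      # (1) final-frame snapshot
        boxes = detect_objects(last_frame)               # (2) detect and classify elements
        banner_st, mainground_st = layout_to_st(boxes)   # (3) spatial layout -> Banner/Mainground ST
        animation_st = track_objects(video_path, boxes)  # (4) motion tracks -> Animation ST
        ut_reasoning, prompt = llm_rewrite(              # (5) templated LLM rewrite
            banner_st, mainground_st, animation_st)
        return {"prompt": prompt, "ut_reasoning": ut_reasoning,
                "banner_st": banner_st, "mainground_st": mainground_st,
                "animation_st": animation_st}

Each dictionary produced this way pairs a natural-language prompt with its target ST representations, yielding the supervision needed to train the stage-specific LLMs.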