Doodle Agent: Exploring Freeform Visual Generation with Multimodal LLMs

Motivation

How do LLMs engage in a creative act as unstructured and instinctive as doodling? We explore this question with Doodle Agent, a system in which a large multimodal language model turns natural language prompts into drawing actions. Given access to a drawing environment, the agent iteratively selects brushes, colors, and coordinates, doodling without explicit instructions on what to depict.

What is Doodle Agent

Method

🎨 Drawing Environment

Doodle Agent uses a web-based canvas built in p5.js with five brush types and a palette of 36 colors. At each step, the multimodal LLM receives the current canvas and instructions, and outputs a JSON command specifying brush, color, and stroke coordinates. These commands are rendered onto the canvas, and the process repeats for up to 15 strokes.
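The render-and-repeat loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `run_session`, `query_model`, and `scripted_model` are hypothetical names, the canvas is stood in for by a plain list of commands rather than a rendered p5.js image, and treating a `None` reply as "stop early" is an assumption.

```python
import json
from typing import Callable, List, Optional

MAX_STROKES = 15  # per-doodle stroke budget stated in the text


def run_session(
    query_model: Callable[[List[dict], str], Optional[str]],
    instructions: str,
    max_strokes: int = MAX_STROKES,
) -> List[dict]:
    """Collect up to max_strokes JSON commands from the model."""
    canvas: List[dict] = []  # hypothetical stand-in for the rendered canvas
    for _ in range(max_strokes):
        reply = query_model(canvas, instructions)
        if reply is None:  # assumed convention: None means the model stops early
            break
        canvas.append(json.loads(reply))
    return canvas


def scripted_model(canvas: List[dict], instructions: str) -> Optional[str]:
    """Hypothetical stub that returns one short horizontal line per step."""
    y = 50 + 20 * len(canvas)
    return json.dumps({
        "thinking": f"stroke {len(canvas) + 1}",
        "brush": "marker",
        "color": "#6BB9A4",
        "strokes": [{"x": [100, 200], "y": [y, y]}],
    })


doodle = run_session(scripted_model, "Draw whatever feels fun.")
print(len(doodle))  # 15
```

In a real setup, `query_model` would send the current canvas image and the instructions to the multimodal LLM; the stub only makes the loop runnable on its own.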

🤖 LLM Backbone

We explore both Claude 3.5 Sonnet and GPT-4o as frozen LLM agents. The models are prompted with open-ended instructions and canvas feedback, encouraging freeform exploration rather than predefined goals.

Doodle Agent pipeline overview

🔄 Iterative Process

The agent generates up to 15 strokes in sequence. Each stroke is defined as a tuple containing:

Brush type (marker, crayon, wiggle, spray, or fountain)
Color (selected from the 36-color palette)
Coordinates (sequence of x,y points defining the stroke path)

The JSON output format allows the agent to express both its reasoning process and precise drawing instructions:

{
  "thinking": "I want to draw a simple flower...",
  "brush": "marker",
  "color": "#6BB9A4",
  "strokes": [{
    "x": [100, 120, 140],
    "y": [200, 180, 200]
  }]
}
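Before rendering, a command in this format could be checked and converted into point paths. The sketch below is one possible way to do that, under stated assumptions: `parse_command` is a hypothetical helper, colors are assumed to arrive as six-digit hex strings as in the example, and membership in the 36-color palette is not checked because the palette values are not listed here.

```python
import json
import re
from typing import Dict, List, Tuple

# Brush names taken from the list above.
BRUSHES = {"marker", "crayon", "wiggle", "spray", "fountain"}
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def parse_command(raw: str) -> Dict[str, object]:
    """Validate one JSON command and pair its x/y lists into (x, y) points."""
    cmd = json.loads(raw)
    if cmd["brush"] not in BRUSHES:
        raise ValueError(f"unknown brush: {cmd['brush']!r}")
    if not HEX_COLOR.fullmatch(cmd["color"]):
        raise ValueError(f"color is not a #RRGGBB string: {cmd['color']!r}")
    paths: List[List[Tuple[int, int]]] = []
    for stroke in cmd["strokes"]:
        xs, ys = stroke["x"], stroke["y"]
        if not xs or len(xs) != len(ys):
            raise ValueError("x and y must be non-empty lists of equal length")
        paths.append(list(zip(xs, ys)))
    return {"brush": cmd["brush"], "color": cmd["color"], "paths": paths}


example = """{
  "thinking": "I want to draw a simple flower...",
  "brush": "marker",
  "color": "#6BB9A4",
  "strokes": [{"x": [100, 120, 140], "y": [200, 180, 200]}]
}"""
parsed = parse_command(example)
print(parsed["paths"])  # [[(100, 200), (120, 180), (140, 200)]]
```

The "thinking" field is ignored here because it carries the agent's reasoning rather than drawing parameters.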


Drawing Modes

1. Unconstrained Doodling

The agent is prompted with: “You are a creative artist who loves to doodle! Draw whatever feels fun and interesting to you right now.” This mode evaluates spontaneous, casual doodling without restrictions.

2. Mood-Constrained Doodling

The agent receives an additional emotional context: “You express your emotions through your doodles. You are feeling very {mood} today.” where {mood} ∈ {happy, sad, angry}. This encourages the agent to align color and form choices with emotional states.
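The two modes differ only in whether the mood sentence is added to the prompt. A minimal sketch of how the prompts might be assembled is shown below; `build_prompt` is a hypothetical helper, and joining the two quoted sentences with a single space is an assumption, since the exact prompt layout is not given here.

```python
from typing import Optional

# Prompt text quoted from the two drawing modes.
BASE_PROMPT = (
    "You are a creative artist who loves to doodle! "
    "Draw whatever feels fun and interesting to you right now."
)
MOOD_TEMPLATE = (
    "You express your emotions through your doodles. "
    "You are feeling very {mood} today."
)
MOODS = ("happy", "sad", "angry")


def build_prompt(mood: Optional[str] = None) -> str:
    """Return the unconstrained prompt, or append the mood context if given."""
    if mood is None:
        return BASE_PROMPT
    if mood not in MOODS:
        raise ValueError(f"mood must be one of {MOODS}, got {mood!r}")
    # Assumed layout: base prompt, one space, then the mood sentence.
    return BASE_PROMPT + " " + MOOD_TEMPLATE.format(mood=mood)


print(build_prompt("sad"))
```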

Gallery

Claude 3.5 Sonnet

GPT-4o

Mood-Constrained

We conducted quantitative and qualitative analyses comparing agent doodles to human and random baselines, which are displayed below.

Human Baseline

Random Baseline

BibTeX


@inproceedings{cao2025doodleagent,
  title     = {Doodle Agent: Exploring Freeform Visual Generation with Multimodal LLMs},
  author    = {Cao, Dingning and Kang, Yifan and Torralba, Antonio and Vinker, Yael},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)},
  year      = {2025},
  note      = {AI4VA Workshop},
}