GPT-4 DRIVEN CINEMATIC MUSIC GENERATION THROUGH TEXT PROCESSING

Muhammad Taimoor Haseeb*, Ahmad Hammoudeh*, Gus Xia

Abstract This paper presents Herrmann-1, a multi-modal framework that generates background music tailored to movie scenes by integrating state-of-the-art vision, language, music, and speech processing models. Our pipeline begins by extracting visual and speech information from a movie scene, performing emotional analysis on them, and converting the results into descriptive text. Then, GPT-4 translates these high-level descriptions into low-level music conditions. Finally, these text-based music conditions guide a text-to-music model to generate music that resonates with the input movie scene. Comprehensive objective and subjective evaluations attest to the high synthesis quality and congruence of our pipeline, and to its superiority over the baseline.
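The three-stage flow described above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: the function names, the scene-description format, and the condition string are illustrative assumptions, not the authors' actual implementation, and the model calls are replaced by stubs.

```python
# Hypothetical sketch of the Herrmann-1 pipeline stages.
# All names and data formats are illustrative assumptions; real vision,
# speech, GPT-4, and text-to-music models would replace the stubs.

def describe_scene(frames, transcript):
    """Stage 1 (assumed): turn visual frames and speech into a high-level
    textual description plus an emotion label."""
    # A real system would run vision, speech, and emotion models here.
    return {"description": "rain-soaked farewell at a train station",
            "emotion": "melancholic"}

def to_music_conditions(scene):
    """Stage 2 (assumed): GPT-4 maps the high-level description to
    low-level, text-based music conditions."""
    # A real system would prompt GPT-4; this stub returns a fixed mapping.
    return f"slow tempo, minor key, sparse strings, {scene['emotion']} mood"

def generate_music(conditions):
    """Stage 3 (assumed): a text-to-music model renders audio from the
    conditions string."""
    # A real system would call a text-to-music model and return audio.
    return f"<audio generated from: {conditions}>"

def herrmann1(frames, transcript):
    """Compose the three stages: scene -> description -> conditions -> music."""
    scene = describe_scene(frames, transcript)
    conditions = to_music_conditions(scene)
    return generate_music(conditions)
```

The key design point the sketch illustrates is that every stage communicates through text, so each model can be swapped independently as long as it consumes or produces the same textual interface.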

Herrmann-1 Pipeline
Fig. 1 Our background music generation pipeline takes a movie scene as input and generates a tailored audio music file.

Scroll down to explore tailored background music for movie scenes generated by Herrmann-1. Compare it with the original music for each scene, as well as with music generated by the Controllable Music Transformer (CMT; Di et al., 2021).

*The first two authors contributed equally.

The Lion King

Original

Herrmann-1

CMT

Titanic

Original

Herrmann-1

CMT

The Social Network

Original

Herrmann-1

CMT

Shot on iPhone 14 Pro | Cinematic Mode 4K

Original

Herrmann-1

CMT

Troy

Original

Herrmann-1

CMT

The Grand Budapest Hotel

Original

Herrmann-1

CMT

The Royal Tenenbaums

Original

Herrmann-1

CMT

The Pursuit of Happyness

Original

Herrmann-1

CMT

Psycho

Original

Herrmann-1

CMT

Mamma Mia!

Original

Herrmann-1

CMT

The Lone Ranger

Original

Herrmann-1

CMT