Jash Mota

Running the HunyuanVideo Text-to-Video Model on AWS

Video generation has been a hot topic, with several models released in the last couple of weeks, including HunyuanVideo from Tencent, Veo 2 from Google DeepMind, and Sora from OpenAI.

Conceptually, video generation works much like text and image generation: the prompt and previously generated frames are taken as input to repeatedly generate the next frame.

This post walks through running HunyuanVideo's text-to-video model on AWS EC2.

What instance to choose?

HunyuanVideo needs a GPU with at least 60 GB of VRAM for 720x1280 generation and at least 45 GB for 544x960.

I used a g6e.12xlarge instance with the Ubuntu 20.04 Deep Learning AMI (Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20240521) to run this model. The g6e.12xlarge has 4x NVIDIA L40S GPUs with 48 GB of VRAM each, so it can run 544x960 video generation without any issue.

[Screenshot: AWS g6e.12xlarge instance details showing 4x NVIDIA L40S GPUs]
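
Once the instance is up, you can confirm the GPU count and per-GPU memory before going further. This is just a quick sanity check of my own, not part of the official setup:

nvidia-smi --query-gpu=name,memory.total --format=csv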

First, install CUDA 11.8 on the instance. The instance by default might come with the nvidia-535 driver and CUDA 12.1, but it's worth installing 11.8 alongside it, since many of the projects you'll want to run still target 11.8. Multiple CUDA toolkit versions can coexist on the same machine; you choose the active one through environment variables.

wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run --toolkit --toolkitpath=/usr/local/cuda-11.8 --override

When prompted, accept the license and install only the CUDA toolkit; keep the existing NVIDIA driver that ships with the AMI.

To switch between the two toolkits, export the matching paths:

# CUDA 12.1 (current)
export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# CUDA 11.8 (new)
export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
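
After exporting the CUDA 11.8 paths, confirm that the right toolkit is actually on your PATH (a quick check of my own, not from the official instructions):

which nvcc
nvcc --version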

Next, clone the HunyuanVideo repository and set up its conda environment:

# 1. Clone the repository
git clone https://github.com/tencent/HunyuanVideo
cd HunyuanVideo

# 2. Create and activate a conda environment
conda create -n HunyuanVideo python==3.10.9
conda activate HunyuanVideo

# 3. Install PyTorch built against CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r requirements.txt

# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3

# 6. Install xDiT for parallel inference (It is recommended to use torch 2.4.0 and flash-attn 2.6.3)
python -m pip install xfuser==0.4.0
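
Before downloading the weights, it's worth checking that PyTorch sees all four GPUs and that flash-attn imports cleanly. These one-liners are my own optional checks, not part of the upstream instructions:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import flash_attn; print(flash_attn.__version__)"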

Download the pre-trained HunyuanVideo model and the text encoder models by following the checkpoint instructions in the repository.

If you're in Singapore like me, use the mirror endpoint for a faster download:

HF_ENDPOINT=https://hf-mirror.com huggingface-cli download tencent/HunyuanVideo --local-dir ./ckpts
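
The text encoders are fetched the same way. The repo IDs below are the ones the upstream checkpoint docs point to as far as I can tell, but the target directories are my assumption; follow the checkpoint README in the repository for the exact layout and any preprocessing steps:

# Target paths are assumptions; verify against the repo's checkpoint instructions
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/llava-llama-3-8b-v1_1-transformers
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/text_encoder_2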

With the checkpoints in place, run a first generation from the repo root:

cd HunyuanVideo
python3 sample_video.py \
    --video-size 720 1280 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results
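
On 48 GB L40S cards, the 544x960 setting is the more comfortable fit. Here is the same command with only the resolution changed; this is my own variant of the command above, not a separate official recipe:

python3 sample_video.py \
    --video-size 544 960 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "A cat walks on the grass, realistic style." \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results

With four GPUs and xfuser installed, the repo also supports multi-GPU parallel inference via xDiT; I'll leave the exact launch flags to the repo's parallel-inference documentation.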

Simple comparison between different models

Shortly after these models came out, I came across a generation from Veo 2 and ran the same prompt through HunyuanVideo. I think Veo 2 gets the physics right the best, while Sora's output looks the weakest of the three. HunyuanVideo usually depicts an Asian person when prompted to generate a human, which might be because the model comes from Tencent, which is based in China.

Sora also seems to produce slow-motion videos.