Generating videos conditioned on text with GANs (honours thesis). This repository contains implementations of the following papers:
- To Create What You Tell
- TGAN
- TGANv2
The last two are modified to condition on text. Text is encoded with a Bi-LSTM pretrained to predict the next token, which (from memory) is the same methodology as "To Create What You Tell".
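For illustration, here is a minimal sketch (assuming PyTorch; the class name, sizes, and mean-pooling are illustrative, not the repo's actual encoder) of a Bi-LSTM sentence encoder with a next-token prediction head for pretraining:

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Head used only for the next-token pretraining objective.
        self.next_token = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer token ids
        h, _ = self.lstm(self.embed(tokens))      # (batch, seq_len, 2 * hidden_dim)
        caption_embedding = h.mean(dim=1)         # pooled sentence vector for conditioning
        return caption_embedding, self.next_token(h)
```

After pretraining, only the pooled caption embedding is needed to condition the GAN.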
Additionally, to capture motion in the discriminator more effectively, non-local blocks (self-attention) are used.
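A minimal sketch (assuming PyTorch) of a non-local block over video features, in the spirit of Wang et al.'s non-local neural networks; the exact block used here may differ:

```python
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)
        self.g = nn.Conv3d(channels, inner, kernel_size=1)
        self.out = nn.Conv3d(inner, channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T, H, W) discriminator feature map
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2)                     # (B, C/2, THW)
        k = self.phi(x).flatten(2)                       # (B, C/2, THW)
        v = self.g(x).flatten(2)                         # (B, C/2, THW)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, THW, THW)
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)   # (B, C/2, THW)
        return x + self.out(y.reshape(b, c // 2, t, h, w))  # residual connection
```

Every spatio-temporal position attends to every other position, which is what lets the discriminator relate motion across frames.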
Conditional information is introduced similarly to StackGAN++.
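For context, a minimal sketch (assuming PyTorch) of StackGAN++-style conditioning augmentation: the caption embedding is mapped to a Gaussian, a condition vector is sampled from it, and a KL term keeps the distribution close to N(0, I). This follows StackGAN++; the conditioning used here may differ in detail:

```python
import torch
import torch.nn as nn

class ConditionAugment(nn.Module):
    def __init__(self, embed_dim, cond_dim):
        super().__init__()
        self.fc = nn.Linear(embed_dim, 2 * cond_dim)

    def forward(self, caption_embedding):
        mu, logvar = self.fc(caption_embedding).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)   # sampled condition vector
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl
```

The sampled condition `c` is typically concatenated with the noise vector in the generator (and joined with intermediate features in the discriminator).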
Relativistic losses are used. For the discriminator, we compare the pairs:
$\{(x_r, c_r), (x_f, c_r)\}$ and $\{(x_r, c_f), (x_f, c_r)\}$
For the generator, only the first pair above is compared.
- $x_r$ is a real video
- $x_f$ is a fake (generated) video
- $c_r$ is a caption correctly associated with the video
- $c_f$ is a caption not associated with the video
The standard GAN loss is preferred, since it works well with one discriminator step per generator step.
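A minimal sketch (assuming PyTorch; `D` is assumed to score a (video, caption) pair, and the helper names are illustrative) of the relativistic standard-GAN comparison for the first pair; the second discriminator pair follows the same pattern:

```python
import torch.nn.functional as F

def relativistic_loss(score_a, score_b):
    # -log sigmoid(score_a - score_b): pushes score_a above score_b.
    return F.softplus(score_b - score_a).mean()

def d_loss_first_pair(D, x_r, x_f, c_r):
    # Pair {(x_r, c_r), (x_f, c_r)}: the real video should outscore the fake
    # one under the correct caption.
    return relativistic_loss(D(x_r, c_r), D(x_f.detach(), c_r))

def g_loss_first_pair(D, x_r, x_f, c_r):
    # The generator flips the comparison so the fake outscores the real.
    return relativistic_loss(D(x_f, c_r), D(x_r, c_r))
```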
Alternatively, I experimented with a non-relativistic loss, based on the following intuition:
- $(x_r, c_r)$ => should be associated
- $(x_f, c_r)$ => should not be associated
- $(x_r, c_f)$ => should not be associated
- $(x_f, c_f)$ => not used
The last pair could optionally be used for learning, but it does not seem to be necessary (at least empirically).
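For comparison, a minimal sketch (assuming PyTorch; names are illustrative) of this matching-aware, non-relativistic objective using binary cross-entropy on the three pairs listed above:

```python
import torch
import torch.nn.functional as F

def d_loss(D, x_r, x_f, c_r, c_f):
    bce = F.binary_cross_entropy_with_logits
    s_real = D(x_r, c_r)             # (x_r, c_r): should be associated -> target 1
    s_fake = D(x_f.detach(), c_r)    # (x_f, c_r): should not be        -> target 0
    s_mismatch = D(x_r, c_f)         # (x_r, c_f): should not be        -> target 0
    return (bce(s_real, torch.ones_like(s_real))
            + bce(s_fake, torch.zeros_like(s_fake))
            + bce(s_mismatch, torch.zeros_like(s_mismatch)))

def g_loss(D, x_f, c_r):
    # The generator wants its (fake video, matching caption) pair classified
    # as associated.
    s = D(x_f, c_r)
    return F.binary_cross_entropy_with_logits(s, torch.ones_like(s))
```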
Three datasets are used:
- Synthetic MNIST for moving digits
- MSR Video to Text (MSRVDC) dataset
- Custom dataset with videos scraped from reddit
The synthetic MNIST data is generated with txt2vid/data/synthetic/generate.py (an illustrative sketch follows the example captions below).
From top to bottom, the captions for the generated samples are:
'<start> digit 9 is left and right<end>'
'<start> digit 8 is right and left<end>'
'<start> digit 8 is bottom and top<end>'
'<start> digit 4 is top and bottom<end>'
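Purely illustrative (this is not the actual generate.py): a sketch of how a moving-digit clip and a caption in this format could be produced:

```python
import numpy as np

def make_clip(digit_img, digit_label, frames=16, size=64):
    """digit_img: (28, 28) array; returns a (frames, size, size) clip and a caption."""
    clip = np.zeros((frames, size, size), dtype=np.float32)
    y = (size - 28) // 2
    for t in range(frames):
        # Move the digit left for the first half of the clip, then back right.
        offset = t if t < frames // 2 else frames - 1 - t
        x = max((size - 28) // 2 - 2 * offset, 0)
        clip[t, y:y + 28, x:x + 28] = digit_img
    caption = f"<start> digit {digit_label} is left and right <end>"
    return clip, caption
```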
The bottom row is the ground truth for both of the samples below:
'<start> a woman is saying about how to make vegetable tofu <unk> <end>'
'<start> the person is cooking <end>'
'<start> the man poured preserves over the chicken<end>'
'<start> a person is dicing and onion<end>'
'<start> a woman is peeling a large shrimp in a glass bowl of water<end>'
See https://github.com/miguelmartin75/reddit-videos
Didn't end up training on this dataset :/
Please see thesis.pdf for more details, references, etc.