# PyTorch Implementation of MoCoGAN

## Usage

We use [this dataset](http://www.wisdom.weizmann.ac.il/%7Evision/SpaceTimeActions.html); download and extract it, and place all the files in a directory named `data`.

```bash
$ python3 main.py --epochs 40000
```
> **_NOTE:_** on a Colab notebook, use the following commands:
```python
!git clone link-to-repo
%run main.py --epochs 40000
```

```
usage: main.py [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
               [--pre-train PRE_TRAIN] [--img_size IMG_SIZE]
               [--channel CHANNEL] [--hidden HIDDEN] [--dc DC] [--de DE]
               [--lr LR] [--beta BETA] [--trained_path TRAINED_PATH] [--T T]

Start training MoCoGAN.....

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        set batch_size
  --epochs EPOCHS       set num of iterations
  --pre-train PRE_TRAIN
                        set 1 when you use pre-trained models
  --img_size IMG_SIZE   set the input image size of a frame
  --channel CHANNEL     set the no. of channels of a frame
  --hidden HIDDEN       set the hidden layer size for the GRU
  --dc DC               set the size of the motion vector
  --de DE               set the size of the randomly generated epsilon
  --lr LR               set the learning rate
  --beta BETA           set the beta for the optimizer
  --trained_path TRAINED_PATH
                        set the path where the trained models are saved
  --T T                 set the no. of frames to be selected
```

## Contributed by:
* [Ayush Gupta](https://github.com/ayush12gupta)

## References

* **Title**: MoCoGAN: Decomposing Motion and Content for Video Generation
* **Authors**: Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
* **Link**: https://arxiv.org/pdf/1707.04993.pdf
* **Year**: 2017

# Summary

## Introduction

Visual signals in a video can be divided into content and motion. Content specifies which objects are in the video, while motion describes their dynamics. Based on this observation, the MoCoGAN framework was proposed. It generates a video by mapping a sequence of randomly generated vectors to a sequence of video frames, where each randomly generated vector consists of a content part and a motion part.

To learn motion and content in an unsupervised manner, MoCoGAN introduces an adversarial learning scheme utilizing both an image discriminator and a video discriminator.

## GANs

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two ‘adversarial’ models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. Both G and D can be non-linear mapping functions, such as multi-layer perceptrons.

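As a minimal sketch of this adversarial game (a generic illustration, not this repository's exact code; `G` and `D` are any generator/discriminator modules, with `D` returning probabilities of shape `(b, 1)`):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_G, opt_D, z_dim):
    """One alternating GAN update (illustrative helper, not the repo's API)."""
    b = real.size(0)
    ones = torch.ones(b, 1, device=real.device)
    zeros = torch.zeros(b, 1, device=real.device)

    # Update D: push D(real) towards 1 and D(G(z)) towards 0, with G fixed.
    fake = G(torch.randn(b, z_dim, device=real.device)).detach()
    loss_D = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Update G: push D(G(z)) towards 1, with D fixed.
    fake = G(torch.randn(b, z_dim, device=real.device))
    loss_G = F.binary_cross_entropy(D(fake), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```
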
## Motion And Content Decomposition GAN

In MoCoGAN, we assume a latent space of images Z<sub>I</sub> ≡ R<sup>d</sup> where each point z ∈ Z<sub>I</sub> represents an image, and a video of K frames is represented by a path of length K in the latent space, [z<sup>(1)</sup>, ..., z<sup>(K)</sup>]. By adopting this formulation, videos of different lengths can be generated by paths of different lengths. We further assume that Z<sub>I</sub> is decomposed into a content subspace Z<sub>C</sub> and a motion subspace Z<sub>M</sub>. The content subspace models motion-independent appearance in videos, while the motion subspace models motion-dependent appearance in videos.

## Framework

For a video, the content vector z<sub>C</sub> is sampled once and fixed. Then, a series of random variables [e<sup>(1)</sup>, ..., e<sup>(K)</sup>] is sampled and mapped to a series of motion codes [z<sup>(1)</sup><sub>M</sub>, ..., z<sup>(K)</sup><sub>M</sub>] via the recurrent neural network R<sub>M</sub>, which we implement using a one-layer GRU network. A generator G<sub>I</sub> produces the k-th frame, x̃<sup>(k)</sup>, from the content and motion vectors {z<sub>C</sub>, z<sup>(k)</sup><sub>M</sub>}. The discriminators D<sub>I</sub> and D<sub>V</sub> are trained on real and fake images and videos, respectively, sampled from the training set v and the generated set ṽ. The function S<sub>1</sub> samples a single frame from a video, while S<sub>T</sub> samples T consecutive frames.

<img src="https://github.com/ayush12gupta/model-zoo/blob/master/generative_models/MoCoGAN_PyTorch/assets/framework.jpg" height="390" width="400">

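The sampling procedure can be sketched as follows (the dimensions and names are illustrative assumptions, not the repository's exact defaults; `G_I` stands for the image generator summarized below):

```python
import torch
import torch.nn as nn

d_C, d_E, d_M, K = 50, 10, 10, 16  # illustrative sizes, not the repo's defaults

R_M = nn.GRU(d_E, d_M, num_layers=1, batch_first=True)  # one-layer GRU

def sample_video(G_I, batch=1):
    z_C = torch.randn(batch, d_C)               # content code: sampled once, fixed
    eps = torch.randn(batch, K, d_E)            # [e^(1), ..., e^(K)]
    z_M, _ = R_M(eps)                           # motion codes [z_M^(1), ..., z_M^(K)]
    frames = []
    for k in range(K):
        z = torch.cat([z_C, z_M[:, k]], dim=1)       # {z_C, z_M^(k)}
        frames.append(G_I(z.view(batch, -1, 1, 1)))  # frame x̃^(k)
    return torch.stack(frames, dim=2)           # video of shape (batch, C, K, H, W)
```
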
We train MoCoGAN using the alternating gradient update algorithm, as in standard GAN training. In one step, we update D<sub>I</sub> and D<sub>V</sub> while fixing G<sub>I</sub> and R<sub>M</sub>. In the alternating step, we update G<sub>I</sub> and R<sub>M</sub> while fixing D<sub>I</sub> and D<sub>V</sub>, playing a min-max game with value function F<sub>V</sub>(D<sub>I</sub>, D<sub>V</sub>, G<sub>I</sub>, R<sub>M</sub>)

<img src="https://github.com/ayush12gupta/model-zoo/blob/master/generative_models/MoCoGAN_PyTorch/assets/Value_fn.jpg" height="180" width="400">

In this objective function, the first and second terms help train the image discriminator so that it outputs 1 for frames sampled from real videos and 0 for frames from generated videos. Similarly, the third and fourth terms help train the video discriminator.

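A sketch of how these four terms become BCE losses during the discriminator step (`S1` and `ST` are hypothetical helpers standing in for the frame samplers S<sub>1</sub> and S<sub>T</sub>):

```python
import torch
import torch.nn.functional as F

def S1(video):
    """S_1: sample a single random frame, (B, C, K, H, W) -> (B, C, H, W)."""
    k = torch.randint(video.size(2), (1,)).item()
    return video[:, :, k]

def ST(video, T):
    """S_T: sample T consecutive frames, (B, C, K, H, W) -> (B, C, T, H, W)."""
    s = torch.randint(video.size(2) - T + 1, (1,)).item()
    return video[:, :, s:s + T]

def discriminator_losses(D_I, D_V, real, fake, T):
    # Terms 1 and 2: image discriminator on single frames.
    p_r, p_f = D_I(S1(real)), D_I(S1(fake.detach()))
    loss_DI = F.binary_cross_entropy(p_r, torch.ones_like(p_r)) + \
              F.binary_cross_entropy(p_f, torch.zeros_like(p_f))
    # Terms 3 and 4: video discriminator on T consecutive frames.
    q_r, q_f = D_V(ST(real, T)), D_V(ST(fake.detach(), T))
    loss_DV = F.binary_cross_entropy(q_r, torch.ones_like(q_r)) + \
              F.binary_cross_entropy(q_f, torch.zeros_like(q_f))
    return loss_DI, loss_DV
```
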
## Implementation and Model Architecture

We train this model on the Weizmann actions dataset.

- We train our model for 40000 epochs
- We use BCE (binary cross-entropy) loss with a learning rate of 0.0002 (see the setup sketch below)
- We test the model by generating videos from a randomly sampled set of epsilons and Z<sub>C</sub>

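A minimal sketch of the corresponding optimizer setup, reusing the module names from the surrounding sketches (`G_I`, `R_M`, `D_I`, `D_V`); only the BCE loss and the 0.0002 learning rate are stated above, and the Adam betas are an assumed DCGAN-style choice:

```python
import torch

lr = 0.0002
criterion = torch.nn.BCELoss()  # binary cross-entropy, as stated above

# One optimizer per network; betas=(0.5, 0.999) is an assumption, not a repo value.
opt_GI = torch.optim.Adam(G_I.parameters(), lr=lr, betas=(0.5, 0.999))
opt_RM = torch.optim.Adam(R_M.parameters(), lr=lr, betas=(0.5, 0.999))
opt_DI = torch.optim.Adam(D_I.parameters(), lr=lr, betas=(0.5, 0.999))
opt_DV = torch.optim.Adam(D_V.parameters(), lr=lr, betas=(0.5, 0.999))
```
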
### Generator
```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
   ConvTranspose2d-1            [-1, 512, 6, 6]       1,105,920
       BatchNorm2d-2            [-1, 512, 6, 6]           1,024
              ReLU-3            [-1, 512, 6, 6]               0
   ConvTranspose2d-4          [-1, 256, 12, 12]       2,097,152
       BatchNorm2d-5          [-1, 256, 12, 12]             512
              ReLU-6          [-1, 256, 12, 12]               0
   ConvTranspose2d-7          [-1, 128, 24, 24]         524,288
       BatchNorm2d-8          [-1, 128, 24, 24]             256
              ReLU-9          [-1, 128, 24, 24]               0
  ConvTranspose2d-10           [-1, 64, 48, 48]         131,072
      BatchNorm2d-11           [-1, 64, 48, 48]             128
             ReLU-12           [-1, 64, 48, 48]               0
  ConvTranspose2d-13            [-1, 3, 96, 96]           3,072
             Tanh-14            [-1, 3, 96, 96]               0
================================================================
Total params: 3,863,424
Trainable params: 3,863,424
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 6.75
Params size (MB): 14.74
Estimated Total Size (MB): 21.49
----------------------------------------------------------------
```

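The summary above is consistent with a DCGAN-style stack of 4×4, stride-2 transposed convolutions after an initial 6×6 projection. A sketch reconstructed from the shapes and parameter counts (the 60-channel input is inferred from 60 × 512 × 6 × 6 = 1,105,920 and would correspond to the concatenated content-plus-motion code):

```python
import torch.nn as nn

def up_block(c_in, c_out):
    # 4x4 transposed conv doubles spatial size; no bias, since BatchNorm follows.
    return [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(True)]

generator = nn.Sequential(
    # (60, 1, 1) -> (512, 6, 6): 6x6 kernel projects the latent code to a 6x6 map.
    nn.ConvTranspose2d(60, 512, 6, stride=1, padding=0, bias=False),
    nn.BatchNorm2d(512), nn.ReLU(True),
    *up_block(512, 256),  # -> (256, 12, 12)
    *up_block(256, 128),  # -> (128, 24, 24)
    *up_block(128, 64),   # -> (64, 48, 48)
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1, bias=False),
    nn.Tanh(),            # -> (3, 96, 96)
)
```
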
### Image Discriminator
```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 64, 48, 48]           3,072
         LeakyReLU-2           [-1, 64, 48, 48]               0
            Conv2d-3          [-1, 128, 24, 24]         131,072
         LeakyReLU-4          [-1, 128, 24, 24]               0
            Conv2d-5          [-1, 256, 12, 12]         524,288
         LeakyReLU-6          [-1, 256, 12, 12]               0
            Conv2d-7            [-1, 512, 6, 6]       2,097,152
         LeakyReLU-8            [-1, 512, 6, 6]               0
            Conv2d-9              [-1, 1, 1, 1]          18,432
        Sigmoid-10              [-1, 1, 1, 1]               0
================================================================
Total params: 2,774,016
Trainable params: 2,774,016
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.11
Forward/backward pass size (MB): 4.22
Params size (MB): 10.58
Estimated Total Size (MB): 14.91
----------------------------------------------------------------
```

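This mirrors the generator: 4×4, stride-2 convolutions with LeakyReLU, ending in a 6×6 convolution to a single sigmoid score. A sketch reconstructed from the table (the LeakyReLU slope of 0.2 is an assumed DCGAN-style value, not stated in the summary):

```python
import torch.nn as nn

image_discriminator = nn.Sequential(
    # (3, 96, 96) -> (64, 48, 48); biases omitted to match the parameter counts.
    nn.Conv2d(3, 64, 4, stride=2, padding=1, bias=False),    nn.LeakyReLU(0.2, True),
    nn.Conv2d(64, 128, 4, stride=2, padding=1, bias=False),  nn.LeakyReLU(0.2, True),
    nn.Conv2d(128, 256, 4, stride=2, padding=1, bias=False), nn.LeakyReLU(0.2, True),
    nn.Conv2d(256, 512, 4, stride=2, padding=1, bias=False), nn.LeakyReLU(0.2, True),
    nn.Conv2d(512, 1, 6, stride=1, padding=0, bias=False),   # (512,6,6) -> (1,1,1)
    nn.Sigmoid(),
)
```
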
### Video Discriminator
```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv3d-1        [-1, 64, 8, 48, 48]          12,288
         LeakyReLU-2        [-1, 64, 8, 48, 48]               0
            Conv3d-3       [-1, 128, 4, 24, 24]         524,288
       BatchNorm3d-4       [-1, 128, 4, 24, 24]             256
         LeakyReLU-5       [-1, 128, 4, 24, 24]               0
            Conv3d-6       [-1, 256, 2, 12, 12]       2,097,152
       BatchNorm3d-7       [-1, 256, 2, 12, 12]             512
         LeakyReLU-8       [-1, 256, 2, 12, 12]               0
            Conv3d-9         [-1, 512, 1, 6, 6]       8,388,608
     BatchNorm3d-10         [-1, 512, 1, 6, 6]           1,024
        LeakyReLU-11         [-1, 512, 1, 6, 6]               0
           Linear-12                    [-1, 1]          18,433
          Sigmoid-13                    [-1, 1]               0
================================================================
Total params: 11,042,561
Trainable params: 11,042,561
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.69
Forward/backward pass size (MB): 26.86
Params size (MB): 42.12
Estimated Total Size (MB): 70.67
----------------------------------------------------------------
```

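The video discriminator applies the same pattern with 3D convolutions over 16-frame clips (the 1.69 MB input size corresponds to a 3×16×96×96 tensor), downsampling space and time together before a linear sigmoid head. A sketch reconstructed from the table (again, the 0.2 LeakyReLU slope is assumed):

```python
import torch
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        def down_block(c_in, c_out, norm=True):
            # 4x4x4 conv, stride 2: halves depth, height, and width.
            layers = [nn.Conv3d(c_in, c_out, 4, stride=2, padding=1, bias=False)]
            if norm:
                layers.append(nn.BatchNorm3d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.conv = nn.Sequential(
            *down_block(3, 64, norm=False),  # (3,16,96,96) -> (64,8,48,48)
            *down_block(64, 128),            # -> (128,4,24,24)
            *down_block(128, 256),           # -> (256,2,12,12)
            *down_block(256, 512),           # -> (512,1,6,6)
        )
        self.fc = nn.Linear(512 * 6 * 6, 1)  # 18,432 weights + 1 bias = 18,433

    def forward(self, v):
        h = self.conv(v).flatten(1)
        return torch.sigmoid(self.fc(h))
```
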
## Results

Some samples of the generated videos are as follows:

    