Commit 125ee33

Added MoCoGAN (pclubiitk#28)
* Add files via upload
* Update main.py
* Update main.py
* Update README.md

1 parent cf306f8 commit 125ee33

File tree

9 files changed: +617 −0 lines changed

README.md: +189 lines

# PyTorch Implementation of MoCoGAN

## Usage

We are using [this dataset](http://www.wisdom.weizmann.ac.il/%7Evision/SpaceTimeActions.html), which you need to extract, placing all the files in a folder named `data`.

```bash
$ python3 main.py --epochs 40000
```

> **_NOTE:_** On a Colab notebook, use the following commands:

```python
!git clone link-to-repo
%run main.py --epochs 40000
```

```
usage: main.py [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS]
               [--pre-train PRE_TRAIN] [--img_size IMG_SIZE]
               [--channel CHANNEL] [--hidden HIDDEN] [--dc DC] [--de DE]
               [--lr LR] [--beta BETA] [--trained_path TRAINED_PATH] [--T T]

Start training MoCoGAN.....

optional arguments:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        set batch_size
  --epochs EPOCHS       set num of iterations
  --pre-train PRE_TRAIN
                        set 1 when you use pre-trained models
  --img_size IMG_SIZE   set the input image size of frame
  --channel CHANNEL     set the no. of channels of the frame
  --hidden HIDDEN       set the hidden layer size for the GRU
  --dc DC               set the size of the motion vector
  --de DE               set the size of randomly generated epsilon
  --lr LR               set the learning rate
  --beta BETA           set the beta for the optimizer
  --trained_path TRAINED_PATH
                        set the path where the trained models are saved
  --T T                 set the no. of frames to be selected
```

## Contributed by:

* [Ayush Gupta](https://github.com/ayush12gupta)

## References

* **Title**: MoCoGAN: Decomposing Motion and Content for Video Generation
* **Authors**: Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
* **Link**: https://arxiv.org/pdf/1707.04993.pdf
* **Year**: 2017

# Summary

## Introduction

Visual signals in a video can be divided into content and motion: content specifies which objects are in the video, while motion describes their dynamics. The MoCoGAN framework was proposed based on this decomposition. It generates a video by mapping a sequence of randomly generated vectors to a sequence of video frames, where each random vector consists of a content part and a motion part.

To learn motion and content in an unsupervised manner, we introduce an adversarial learning scheme utilizing both an image discriminator and a video discriminator.

## GANs

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two ‘adversarial’ models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. Both G and D can be non-linear mapping functions, such as multi-layer perceptrons.
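
Concretely, G and D are trained with alternating gradient updates on a binary cross-entropy objective. Below is a minimal sketch of one such update; `G`, `D`, the optimizers, and `z_dim` are hypothetical placeholders, and `D` is assumed to output a probability of shape `(N, 1)`.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, real, opt_G, opt_D, z_dim):
    """One alternating GAN update (sketch with hypothetical arguments)."""
    n = real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: push D(real) -> 1 and D(G(z)) -> 0,
    # with G held fixed (hence .detach()).
    fake = G(torch.randn(n, z_dim)).detach()
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: push D(G(z)) -> 1 with D held fixed.
    fake = G(torch.randn(n, z_dim))
    loss_G = bce(D(fake), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```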

## Motion And Content Decomposition GAN

In MoCoGAN, we assume a latent space of images Z<sub>I</sub> ≡ R<sup>d</sup> where each point z ∈ Z<sub>I</sub> represents an image, and a video of K frames is represented by a path of length K in the latent space, [z<sup>(1)</sup>, ..., z<sup>(K)</sup>]. By adopting this formulation, videos of different lengths can be generated by paths of different lengths. We further assume that Z<sub>I</sub> is decomposed into a content subspace Z<sub>C</sub> and a motion subspace Z<sub>M</sub>. The content subspace models motion-independent appearance in videos, while the motion subspace models motion-dependent appearance.

## Framework

For a video, the content vector z<sub>C</sub> is sampled once and fixed. Then, a series of random variables [e<sup>(1)</sup>, ..., e<sup>(K)</sup>] is sampled and mapped to a series of motion codes [z<sup>(1)</sup><sub>M</sub>, ..., z<sup>(K)</sup><sub>M</sub>] via the recurrent neural network R<sub>M</sub>, which we implement using a one-layer GRU network. A generator G<sub>I</sub> produces a frame, x˜<sup>(k)</sup>, using the content and motion vectors {z<sub>C</sub>, z<sup>(k)</sup><sub>M</sub>}. The discriminators, D<sub>I</sub> and D<sub>V</sub>, are trained on real and fake images and videos, respectively, sampled from the training set v and the generated set v˜. The function S<sub>1</sub> samples a single frame from a video, while S<sub>T</sub> samples T consecutive frames.

<img src="https://github.com/ayush12gupta/model-zoo/blob/master/generative_models/MoCoGAN_PyTorch/assets/framework.jpg" height="390" width="400">
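
A minimal sketch of this sampling scheme in PyTorch. The dimensions (d<sub>C</sub> = 50, d<sub>E</sub> = d<sub>M</sub> = 10, K = 16) are assumptions chosen to match the 60-dimensional generator input in the summaries below, not values read from the code:

```python
import torch
import torch.nn as nn

d_C, d_E, d_M, K = 50, 10, 10, 16   # assumed sizes of z_C, epsilon, z_M, and frame count

# R_M: a one-layer GRU mapping per-frame epsilons to motion codes
R_M = nn.GRU(input_size=d_E, hidden_size=d_M, num_layers=1, batch_first=True)

def sample_latent_path(batch_size):
    z_C = torch.randn(batch_size, d_C)            # content: sampled once per video
    eps = torch.randn(batch_size, K, d_E)         # epsilon: one sample per frame
    z_M, _ = R_M(eps)                             # motion codes, (batch, K, d_M)
    z_C = z_C.unsqueeze(1).expand(-1, K, -1)      # repeat content code for every frame
    return torch.cat([z_C, z_M], dim=2)           # latent path, (batch, K, d_C + d_M)
```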

We train MoCoGAN using the alternating gradient update algorithm used for standard GANs. In one step, we update D<sub>I</sub> and D<sub>V</sub> while fixing G<sub>I</sub> and R<sub>M</sub>. In the alternating step, we update G<sub>I</sub> and R<sub>M</sub> while fixing D<sub>I</sub> and D<sub>V</sub>, playing a min-max game with the value function F<sub>V</sub>(D<sub>I</sub>, D<sub>V</sub>, G<sub>I</sub>, R<sub>M</sub>) shown below.

<img src="https://github.com/ayush12gupta/model-zoo/blob/master/generative_models/MoCoGAN_PyTorch/assets/Value_fn.jpg" height="180" width="400">

In this objective function, the first and second terms train the image discriminator to output 1 for frames sampled from real videos and 0 for frames from generated videos. Similarly, the third and fourth terms train the video discriminator.
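
Putting the two objectives together, one alternating update can be sketched as below. The sampling functions S<sub>1</sub> and S<sub>T</sub> operate on video tensors of shape (N, C, K, H, W); all names and shapes are assumptions for illustration, not the ones used in `main.py`.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def S_1(video):                      # video: (N, C, K, H, W) -> one random frame
    k = torch.randint(video.size(2), (1,)).item()
    return video[:, :, k]

def S_T(video, T):                   # T consecutive frames from a random start
    s = torch.randint(video.size(2) - T + 1, (1,)).item()
    return video[:, :, s:s + T]

def mocogan_step(real, fake, D_I, D_V, T, opt_D, opt_G):
    n = real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Step 1: update D_I and D_V while G_I and R_M stay fixed (.detach()).
    loss_D = (bce(D_I(S_1(real)).view(n, 1), ones)
              + bce(D_I(S_1(fake.detach())).view(n, 1), zeros)
              + bce(D_V(S_T(real, T)).view(n, 1), ones)
              + bce(D_V(S_T(fake.detach(), T)).view(n, 1), zeros))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: update G_I and R_M while D_I and D_V stay fixed.
    loss_G = (bce(D_I(S_1(fake)).view(n, 1), ones)
              + bce(D_V(S_T(fake, T)).view(n, 1), ones))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```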

## Implementation and Model Architecture

We train this model on the Weizmann action database.

- We train our model for 40,000 iterations
- We use BCE (binary cross-entropy) loss with a learning rate of 0.0002
- We test the model by generating videos from a randomly generated set of epsilon and z<sub>C</sub>, as sketched below
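
The last point can be sketched as follows, reusing the hypothetical `sample_latent_path` from the framework section and the `Generator` sketched below; these names are illustrative, not the actual `main.py` API:

```python
import torch

# Sketch of test-time video generation. Assumes sample_latent_path() and
# Generator from the sketches elsewhere in this README (hypothetical names).
G_I = Generator(latent_dim=60)
G_I.eval()

with torch.no_grad():
    z = sample_latent_path(batch_size=1)   # (1, K, 60): one latent path
    frames = G_I(z.view(-1, 60))           # (K, 3, 96, 96), Tanh output in [-1, 1]
    video = (frames + 1) / 2               # rescale to [0, 1] for saving as a GIF
```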

### Generator

```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
   ConvTranspose2d-1            [-1, 512, 6, 6]       1,105,920
       BatchNorm2d-2            [-1, 512, 6, 6]           1,024
              ReLU-3            [-1, 512, 6, 6]               0
   ConvTranspose2d-4          [-1, 256, 12, 12]       2,097,152
       BatchNorm2d-5          [-1, 256, 12, 12]             512
              ReLU-6          [-1, 256, 12, 12]               0
   ConvTranspose2d-7          [-1, 128, 24, 24]         524,288
       BatchNorm2d-8          [-1, 128, 24, 24]             256
              ReLU-9          [-1, 128, 24, 24]               0
  ConvTranspose2d-10           [-1, 64, 48, 48]         131,072
      BatchNorm2d-11           [-1, 64, 48, 48]             128
             ReLU-12           [-1, 64, 48, 48]               0
  ConvTranspose2d-13            [-1, 3, 96, 96]           3,072
             Tanh-14            [-1, 3, 96, 96]               0
================================================================
Total params: 3,863,424
Trainable params: 3,863,424
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 6.75
Params size (MB): 14.74
Estimated Total Size (MB): 21.49
----------------------------------------------------------------
```
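
The summary above can be reproduced by a DCGAN-style stack of fractionally-strided convolutions. In the following sketch, the kernel sizes, strides, and paddings are inferred from the printed shapes and parameter counts (e.g. 60 × 512 × 6 × 6 = 1,105,920 for the first layer), not read from the repository code:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Per-frame generator G_I (sketch). latent_dim = d_C + d_M = 60 is assumed."""
    def __init__(self, latent_dim=60, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 512, 6, 1, 0, bias=False),  # (60,1,1) -> (512,6,6)
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),         # -> (256,12,12)
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),         # -> (128,24,24)
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),          # -> (64,48,48)
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),     # -> (3,96,96)
            nn.Tanh(),
        )

    def forward(self, z):                        # z: (N, latent_dim)
        return self.net(z.view(z.size(0), -1, 1, 1))
```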

### Image Discriminator

```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 64, 48, 48]           3,072
         LeakyReLU-2           [-1, 64, 48, 48]               0
            Conv2d-3          [-1, 128, 24, 24]         131,072
         LeakyReLU-4          [-1, 128, 24, 24]               0
            Conv2d-5          [-1, 256, 12, 12]         524,288
         LeakyReLU-6          [-1, 256, 12, 12]               0
            Conv2d-7            [-1, 512, 6, 6]       2,097,152
         LeakyReLU-8            [-1, 512, 6, 6]               0
            Conv2d-9              [-1, 1, 1, 1]          18,432
          Sigmoid-10              [-1, 1, 1, 1]               0
================================================================
Total params: 2,774,016
Trainable params: 2,774,016
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.11
Forward/backward pass size (MB): 4.22
Params size (MB): 10.58
Estimated Total Size (MB): 14.91
----------------------------------------------------------------
```
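
A matching sketch of the image discriminator D<sub>I</sub>: the kernel sizes (4×4, stride 2, then a final 6×6) are inferred from the shapes and parameter counts above, and the LeakyReLU slope of 0.2 is a DCGAN-convention assumption:

```python
import torch.nn as nn

class ImageDiscriminator(nn.Module):
    """D_I (sketch): frame -> probability that the frame is real."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),   # (3,96,96) -> (64,48,48)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),        # -> (128,24,24)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),       # -> (256,12,12)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),       # -> (512,6,6)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 6, 1, 0, bias=False),         # -> (1,1,1)
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```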

### Video Discriminator

```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv3d-1        [-1, 64, 8, 48, 48]          12,288
         LeakyReLU-2        [-1, 64, 8, 48, 48]               0
            Conv3d-3       [-1, 128, 4, 24, 24]         524,288
       BatchNorm3d-4       [-1, 128, 4, 24, 24]             256
         LeakyReLU-5       [-1, 128, 4, 24, 24]               0
            Conv3d-6       [-1, 256, 2, 12, 12]       2,097,152
       BatchNorm3d-7       [-1, 256, 2, 12, 12]             512
         LeakyReLU-8       [-1, 256, 2, 12, 12]               0
            Conv3d-9         [-1, 512, 1, 6, 6]       8,388,608
      BatchNorm3d-10         [-1, 512, 1, 6, 6]           1,024
        LeakyReLU-11         [-1, 512, 1, 6, 6]               0
           Linear-12                    [-1, 1]          18,433
          Sigmoid-13                    [-1, 1]               0
================================================================
Total params: 11,042,561
Trainable params: 11,042,561
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 1.69
Forward/backward pass size (MB): 26.86
Params size (MB): 42.12
Estimated Total Size (MB): 70.67
----------------------------------------------------------------
```
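
The video discriminator D<sub>V</sub> applies the same pattern with 3-D convolutions over clips of T = 16 frames (the 1.69 MB input size corresponds to 3 × 16 × 96 × 96 floats). A hedged reconstruction under the same inferred-hyperparameter assumptions:

```python
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    """D_V (sketch): clip of T frames -> probability that the clip is real."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 64, 4, 2, 1, bias=False),   # (3,16,96,96) -> (64,8,48,48)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(64, 128, 4, 2, 1, bias=False),        # -> (128,4,24,24)
            nn.BatchNorm3d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(128, 256, 4, 2, 1, bias=False),       # -> (256,2,12,12)
            nn.BatchNorm3d(256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(256, 512, 4, 2, 1, bias=False),       # -> (512,1,6,6)
            nn.BatchNorm3d(512),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.head = nn.Sequential(nn.Linear(512 * 6 * 6, 1), nn.Sigmoid())

    def forward(self, x):                  # x: (N, C, T, H, W)
        h = self.net(x)
        return self.head(h.view(h.size(0), -1))
```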

## Results

Some samples of the generated videos are as follows:

![gif](assets/gif2.gif) ![gif](assets/gif1.gif) ![gif](assets/gif3.gif)