predict VIDEO #10

fzd9752 · 2017-11-09T07:19:38Z

Purpose: ?? (It is unclear how this relates to the company's direction, or what the result could be used for...)

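zdx: Sequence prediction is a core capability of intelligence and very important for AI. It fits the company's goal of general intelligence, and a working version would enhance the intelligence of existing neural networks.
Concrete scenarios: let's brainstorm together! Obstacle avoidance, predicting the intent of other vehicles, validation in the TORCS game? Predicting the robot's own actions, common-sense learning.
Once the prototype is validated, we will keep looking for application scenarios and refine the product as the work matures.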

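Goal: build a video generation network
Requirements: pix2pix framework, GAN-based

Note: the above reflects subjective factors

Basic structure:

G: a simple 3D U-Net, initially 64 x 64, target size 128 x 128
D: a discriminator with a C3D-like structure

Expected behavior: 10 input video frames, 5 output video frames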
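
As a minimal sketch of the 10-in / 5-out setup, a video can be cut into sliding windows of 15 frames, with the first 10 as the conditioning input and the last 5 as the prediction target. The function name `make_windows` and the `step` parameter are hypothetical, not part of the plan above.

```python
def make_windows(num_frames, n_in=10, n_out=5, step=1):
    """Return (input_indices, target_indices) pairs for one video."""
    window = n_in + n_out
    pairs = []
    # Each start index yields n_in input frames followed by n_out target frames.
    for start in range(0, num_frames - window + 1, step):
        inputs = list(range(start, start + n_in))
        targets = list(range(start + n_in, start + window))
        pairs.append((inputs, targets))
    return pairs


pairs = make_windows(num_frames=20)
```

A video shorter than 15 frames yields no pairs, so such clips would have to be dropped or padded.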

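Estimated time: 8 weeks in total

Basic network construction, 4 weeks:
- W1: work through the paper list and related code
- W2 - W3: build the basic G and D structure
- W4: trial training to check whether it converges
Network tuning, 2 weeks: once the network shows potential, increase its complexity
- W5: expand the network
- W6: large-scale data and testing; make sure the pipeline runs smoothly
Demo training + testing, 2 weeks:
- W7: training and debugging
- W8: test on the target dataset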

Latest update on the 15th:
Per Mr. Zhang's direction, switch to the PyTorch framework and build on the original pix2pix code. References:

pix2pix PyTorch source code:
https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

Some examples of 3D applications in PyTorch:
https://github.com/shiba24/3d-unet
https://github.com/kenshohara/video-classification-3d-cnn-pytorch
https://github.com/kenshohara/3D-ResNets-PyTorch

Official PyTorch documentation:
http://pytorch.org/docs/master/nn.html
Key operations:
3D deconvolution - torch.nn.ConvTranspose3d
3D convolution - torch.nn.Conv3d
3D maxpooling - torch.nn.MaxPool3d
3D dropout - torch.nn.Dropout3d
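
When stacking these operators it helps to know the output size along each axis in advance. The sketch below applies the output-size formulas from the PyTorch documentation (dilation 1, no `output_padding`) in plain Python; the helper names and the example clip size are assumptions made for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of Conv3d or MaxPool3d along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Output length of ConvTranspose3d along one axis (no output_padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def apply_3d(shape, fn, kernel, stride, padding=(0, 0, 0)):
    """Apply a per-axis size formula to a (frames, height, width) shape."""
    return tuple(fn(s, k, st, p) for s, k, st, p in zip(shape, kernel, stride, padding))


clip = (10, 64, 64)  # hypothetical input: 10 frames of 64 x 64
after_conv = apply_3d(clip, conv_out, (3, 3, 3), (1, 1, 1), (1, 1, 1))
after_pool = apply_3d(after_conv, conv_out, (1, 2, 2), (1, 2, 2))
after_deconv = apply_3d(after_pool, deconv_out, (1, 2, 2), (1, 2, 2))
```

Pooling with a `(1, 2, 2)` kernel halves height and width but leaves the 10 frames untouched, which is one way to keep the temporal axis intact in a short clip.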

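Keras implementation: cancelled

Notes on the plan:

The schedule lists minimum times; actual progress may push it back

Possible reasons for failure

1. The current target dataset does not fit the distribution assumptions behind pix2pix's conditional GAN, so the generated images may be worthless
2. 3D convolution increases memory use, so the final model may not run on our current hardware
3. Insufficient technical ability; integrating the components fails
4. The company changes direction and drops the project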

fzd9752 · 2017-11-10T04:03:34Z

Required reading and related notes:

Image-to-Image Translation with Conditional Adversarial Nets
Base framework for conditional GANs
https://arxiv.org/pdf/1611.07004v1.pdf
https://github.com/costapt/vess2ret
https://github.com/createamind/pytorch-CycleGAN-and-pix2pix
Learning Spatiotemporal Features with 3D Convolutional Networks
3D convolution
1412.0767
https://gist.github.com/albertomontesg/d8b21a179c1e6cca0480ebdf292c34d2
https://github.com/harvitronix/five-video-classification-methods/blob/master/models.py
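We will use the popular UCF101 dataset. I find that it has a good balance of classes and training data, plus many well-documented benchmarks to judge ourselves against. Unlike some newer video datasets (see YouTube-8M), the amount of data is manageable on modern systems.
UCF summarizes their dataset well:
With 13,320 videos from 101 action categories, UCF101 gives the largest diversity in terms of actions, with large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, and so on. It is the most challenging dataset to date.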
Generating Videos with Scene Dynamics
VideoGAN, video generation with 3D convolutions
1609.02612
U-Net: Convolutional Networks for Biomedical Image Segmentation
Base network structure
1505.04597
3D U-net: Learning dense volumetric segmentation from sparse annotation
3D version of U-Net; code reference
https://github.com/ellisdg/3DUnetCNN/blob/master/unet3d/model.py

zdx6 mocogan https://github.com/sergeytulyakov/mocogan Our hardware is sufficient to train this one.
https://github.com/akuzeee/MoCoGAN/blob/master/models.py

7 Dual Motion GAN for Future-Flow Embedded Video Prediction

yushenxiang · 2017-11-10T10:01:26Z

2017-11-10
1)Image-to-Image Translation with Conditional Adversarial Nets
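Reading summary:
Training input: an image plus random noise (supplied through dropout). Training output: a realistic image related to the input image
Test input: an image plus random noise (supplied through dropout). Test output: a realistic image related to the input image
G: U-Net, an encoder-decoder with skip connections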
(encoder:
C64-C128-C256-C512-C512-C512-C512-C512
U-Net decoder:
CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128
After the last layer in the decoder, a convolution is applied to map to the number of output channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.)
D: the paper uses a 70x70 PatchGAN
(C64-C128-C256-C512,After the last layer, a convolution is applied to map to a 1 dimensional output, followed by a Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first C64 layer. All ReLUs are leaky, with slope 0.2.)
Advantage: it can be applied to arbitrarily large images
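
The 70x70 figure is the receptive field of one discriminator output. A minimal sketch of that calculation follows, assuming the layer layout of the reference pix2pix implementation as I recall it: 4x4 kernels throughout, stride 2 for C64, C128 and C256, and stride 1 for C512 and the final 1-channel convolution. The strides are not stated in the notes above, so treat them as an assumption.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf = 1
    # Walk backwards from one output unit to the input pixels it depends on.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed layout: C64, C128, C256 (stride 2), then C512 and the output conv (stride 1).
patchgan_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(patchgan_70)
```

Because the discriminator is fully convolutional and each output only sees one patch, the same weights can score images of any size.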
Loss function: G* = arg min_G max_D L_cGAN(G, D) + λ · L_L1(G)
(Adding both terms together (with λ = 100) reduces these artifacts.)
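
To make the two terms concrete, here is a minimal plain-Python sketch of the generator side of this objective. It assumes the commonly used non-saturating form of the adversarial term (the generator maximizes log D on its own outputs) and averages both terms; the function name and the toy inputs are hypothetical.

```python
import math


def generator_loss(d_fake, fake, real, lam=100.0):
    """d_fake: discriminator probabilities for generated patches;
    fake / real: flattened pixel values of generated and target images."""
    # Adversarial term: push D's output on generated patches toward 1.
    adv = -sum(math.log(p) for p in d_fake) / len(d_fake)
    # L1 term: mean absolute difference to the ground-truth image.
    l1 = sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)
    return adv + lam * l1


loss = generator_loss(d_fake=[0.5, 0.5], fake=[0.0, 0.2], real=[0.0, 0.0])
```

With λ = 100 the L1 term dominates even for small pixel errors, which keeps the output close to the target while the adversarial term sharpens details.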
Training details:
1) Weights were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.
2) Apply batch normalization; use batch size 1 for certain experiments and 4 for others, noting little difference.

yushenxiang · 2017-11-10T11:24:46Z

2017-11-10
VideoGAN reading summary:
Capabilities: 1) generate videos from scratch (not conditioned on the past)
2) generate a sequence of frames (32 frames)
Training input: a large amount of unlabeled video, plus noise. Training output: video
Test input: noise. Test output: video
Data processing:
1) Reduce the effect of camera shake: We extract SIFT keypoints [22], use RANSAC to estimate a homography (rotation, translation, scale) between adjacent frames, and warp frames to minimize background motion.
2) The only other pre-processing we do is normalizing the videos to be in the range [−1, 1].
3) We extract frames at native frame rate (25 fps). We use 32-frame videos of spatial resolution 64 × 64.
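
The normalization step can be sketched as below, assuming 8-bit pixel values in [0, 255] as the starting point (the notes do not say what the raw range is). The range [−1, 1] matches a Tanh output layer.

```python
def normalize(pixels):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def denormalize(values):
    """Map floats in [-1, 1] back to integer pixel values in [0, 255]."""
    return [round((v + 1.0) * 127.5) for v in values]


scaled = normalize([0, 255])
restored = denormalize(normalize([0, 128, 255]))
```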
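Structure:
G: two independent streams share a 100-dimensional Gaussian noise vector as the latent variable. One stream generates the moving foreground, the other the static background, and the two are combined through a mask to produce the video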
D: We design the architecture to be reverse of the foreground stream in the generator, replacing fractionally strided convolutions with strided convolutions (to down-sample instead of up-sample), and replacing the last layer to output a binary classification (real or not).
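
The mask-based combination in the generator can be written as m · f + (1 − m) · b per element. The sketch below shows it on flat lists of values; the function name and the sample numbers are made up for illustration.

```python
def composite(mask, foreground, background):
    """Blend per element: m * f + (1 - m) * b, with mask values in [0, 1]."""
    return [m * f + (1.0 - m) * b for m, f, b in zip(mask, foreground, background)]


frame = composite(
    mask=[1.0, 0.0, 0.5],
    foreground=[0.8, 0.8, 0.8],
    background=[-0.2, -0.2, -0.2],
)
```

A mask value of 1 takes the pixel from the moving foreground, 0 takes it from the static background, and values in between blend the two.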
Parameters:
1) We use the Adam optimizer and a fixed learning rate of 0.0002 and momentum term of 0.5
2) The latent code has 100 dimensions, which we sample from a normal distribution
3) A batch size of 64.
4) We initialize all weights with zero mean Gaussian noise with standard deviation 0.01
Results and shortcomings:
The model usually learns to put motion on the right objects.
One common failure mode is that the objects lack resolution.
Evaluation methods:
1) We quantitatively evaluate our generation using a psychophysical two-alternative forced choice with workers on Amazon Mechanical Turk. We show a worker two random videos, and ask them “Which video is more realistic?”
2) Baseline: train an autoencoder for comparison
We train an autoencoder over our data. The encoder is similar to the discriminator network (except producing 100 dimensional code), while the decoder follows the two-stream generator network.

Extended application: generating a video from a single static image
Architecture changes:
We utilize the same model as our two-stream model, however we must make one change in order to input the static image instead of the latent code. We can do this by attaching a five-layer convolutional network to the front of the generator which encodes the image into the latent space, similar to a conditional generative adversarial network.
The rest of the generator and discriminator networks remain the same.
Loss function changes:
We add an additional loss term that minimizes the L1 distance between the input and the first frame of the generated image.
Results:
1) Although the extrapolations are rarely correct, they often have fairly plausible motions.
2) The most common failure is that the generated video has a scene similar but not identical to the input image, such as by changing colors or dropping/hallucinating objects.
Directions for improvement:
The former could be solved by a color histogram normalization in post-processing; the latter will require building more powerful generative models.

yushenxiang · 2017-11-11T07:53:16Z

2017-11-11
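C3D reading summary:
A 3D convolutional network can process both temporal and spatial information, whereas a 2D convolutional network processes only spatial information. As a result, a 2D network can only produce a single image rather than a video, while the convolution layers of a 3D network preserve temporal information and can therefore produce video.
Contributions of the paper:
1) C3D models object appearance and motion simultaneously
2) Based on experiments on UCF101, the authors conclude that using 3x3x3 kernels in all layers works best
Input: Videos are split into non-overlapped 16-frame clips; all video frames are resized into 128 × 171.
Experiments the authors ran to find the best architecture:
1) All layers share the same temporal depth: 1, 3, 5, or 7
2) Temporal depth varies across layers, in two models: 3-3-5-5-7 or 7-5-5-3-3
For the network settings used in these experiments, see the "Common network settings" part of the paper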

The resulting C3D architecture:
We design our 3D ConvNet to have 8 convolution layers, 5 pooling layers, followed by two fully connected layers, and a softmax output layer.
Details:
All of 3D convolution filters are 3×3×3 with stride 1×1×1. All 3D pooling layers are 2×2×2 with stride 2×2×2 except for pool1 which has kernel size of 1×2×2 and stride 1×2×2 with the intention of preserving the temporal information in the early phase.
Each fully connected layer has 4096 output units.
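
The effect of the 1×2×2 pool1 is easiest to see by tracing a clip through the five pooling layers. This sketch assumes a 16 × 112 × 112 input clip, convolutions padded so that they keep the size, and pooling with floor rounding and no padding; implementations that pad the last pooling layer will end with a different spatial size.

```python
def pool(shape, kernel):
    """Non-overlapping max pooling (stride == kernel), floor rounding, no padding."""
    return tuple(s // k for s, k in zip(shape, kernel))


shape = (16, 112, 112)  # (frames, height, width) of one input clip
kernels = [(1, 2, 2)] + [(2, 2, 2)] * 4  # pool1, then pool2 to pool5
trace = [shape]
for kernel in kernels:
    shape = pool(shape, kernel)
    trace.append(shape)

temporal_sizes = [t[0] for t in trace]
```

All 16 frames survive pool1; the temporal axis is only collapsed by the four later pooling layers.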
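Training set:
Sports-1M dataset:
The dataset consists of 1.1 million sports videos. Each video belongs to one of 487 sports categories.
Training input:
1) Five 2-second clips are randomly extracted from each training video and resized to a frame size of 128 × 171,
2) clips are also randomly cropped to 16 × 112 × 112, which adds spatial and temporal jitter,
3) and each clip is flipped horizontally with 50% probability.
Hyperparameters:
Training is done by SGD with mini-batch size of 30 examples.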
Initial learning rate is 0.003, and is divided by 2 every 150K iterations.
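
The step schedule described here can be sketched as a small function of the iteration count. The function and parameter names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.003, step=150_000, factor=2.0):
    """Step decay: divide the base rate by `factor` once per `step` iterations."""
    return base_lr / factor ** (iteration // step)


rates = [learning_rate(i) for i in (0, 149_999, 150_000, 300_000)]
```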

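Finally, by comparing against other methods, the authors argue that C3D has advantages in action recognition, scene and object recognition, and runtime speed
Extended application:
After training, C3D can be used as a feature extractor for other video analysis tasks.
Input:
To extract C3D features, a video is split into 16-frame-long clips with an 8-frame overlap between two consecutive clips.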
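
Both clip layouts in these notes (non-overlapping 16-frame clips for training, 8-frame overlap for feature extraction) come down to a stride over frame indices. A minimal sketch follows; dropping trailing frames that do not fill a whole clip is my assumption, not something the notes specify.

```python
def clip_ranges(num_frames, clip_len=16, overlap=0):
    """Return [start, end) frame ranges; a trailing partial clip is dropped."""
    stride = clip_len - overlap
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]


non_overlapping = clip_ranges(40)
overlapping = clip_ranges(40, overlap=8)
```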
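Findings:
Deconvolution is used to visualize how C3D learns features.
C3D first focuses on object appearance in the first few frames, then on object motion in the subsequent frames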

yushenxiang · 2017-11-11T10:02:26Z

2017-11-11
3D U-Net reading summary:
1) All 2D operations in the 2D U-Net are replaced with their 3D counterparts;

Architecture:
1) Two parts: an analysis and a synthesis path, each with four resolution steps
2) In the analysis path, each layer contains two 3 × 3 × 3 convolutions each followed by a rectified linear unit (ReLU), and then a 2 × 2 × 2 max pooling with strides of two in each dimension.
3) In the synthesis path, each layer consists of an upconvolution of 2×2×2 by strides of two in each dimension, followed by two 3×3×3 convolutions each followed by a ReLU.
4) In the last layer a 1×1×1 convolution reduces the number of output channels to the number of labels, which is 3 in our case.
Input: The input to the network is a 132 × 132 × 116 voxel tile of the image with 3 channels.
Output: Our output in the final layer is 44×44×28 voxels in x, y, and z directions respectively.
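
The drop from 132 × 132 × 116 to 44 × 44 × 28 follows from the architecture if the 3×3×3 convolutions are unpadded, so each one trims 2 voxels per axis. The notes do not state the padding, so this is an assumption; the sketch below traces one axis through four resolution steps and reproduces the stated sizes under it.

```python
def unet3d_output_size(size, levels=4, convs_per_level=2, kernel=3):
    """Trace one axis through an unpadded U-Net with 2x pooling and 2x up-convolution."""
    shrink = convs_per_level * (kernel - 1)  # voxels lost to the convs at one level
    for _ in range(levels - 1):  # analysis path: convs, then max pooling
        size = (size - shrink) // 2
    size -= shrink  # convs at the lowest resolution step
    for _ in range(levels - 1):  # synthesis path: up-convolution, then convs
        size = size * 2 - shrink
    return size


output_shape = tuple(unet3d_output_size(s) for s in (132, 132, 116))
```

The final 1×1×1 convolution changes only the channel count, so it is left out of the trace.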
Advantages of the architecture:
The weighted softmax loss function allows us to train on sparse annotations.
Setting the weights of unlabeled pixels to zero makes it possible to learn from only the labelled ones and, hence, to generalize to the whole volume.
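
A minimal plain-Python sketch of this idea: a softmax cross-entropy in which every voxel carries a weight and unlabeled voxels get weight 0, so they contribute nothing to the loss. Normalizing by the sum of weights is my choice for the sketch, not something the notes specify.

```python
import math


def weighted_softmax_loss(logits, labels, weights):
    """logits: per-voxel lists of class scores; labels: class index per voxel;
    weights: per-voxel weight, 0.0 for unlabeled voxels."""
    total, weight_sum = 0.0, 0.0
    for scores, label, w in zip(logits, labels, weights):
        if w == 0.0:
            continue  # unlabeled voxel: ignored entirely
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += w * (log_norm - scores[label])
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


loss_a = weighted_softmax_loss([[2.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [0, 2], [1.0, 0.0])
loss_b = weighted_softmax_loss([[2.0, 0.0, 0.0], [9.0, 1.0, 3.0]], [0, 2], [1.0, 0.0])
```

The two calls differ only in the scores of the zero-weight voxel, so they return the same loss.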
Training:
Besides rotation, scaling and gray value augmentation, we apply a smooth dense deformation field on both data and ground truth label.
Evaluation:
Intersection over Union (IoU) is used as accuracy measure to compare dropped out ground truth slices to the predicted 3D volume.
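
For reference, IoU for a single class can be computed as below on flattened label volumes. Returning 1.0 when the class is absent from both volumes is a convention chosen for this sketch.

```python
def iou(pred, truth, label):
    """Intersection over Union for one class over flattened label volumes."""
    inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
    return inter / union if union else 1.0


score = iou(pred=[1, 1, 0, 0], truth=[1, 0, 1, 0], label=1)
```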