Hi imelekhov, training fails with "input_ and gradOutput_ shapes do not match" #3

Open
TheBloodthirster opened this issue Aug 4, 2019 · 0 comments


TheBloodthirster commented Aug 4, 2019

When I try to train the network with:
th main.lua -weights <path/to/downloaded_weights/model_snapshot_7scenes.t7> -dataset_src_path </path/to/7Scenes>
(that is, without -do_evaluation), I run into a problem.

Here is the error:

{
val_batch_size : 40
beta1 : 0.9
do_evaluation : false
use_dropout : false
dataset_src_path : "/data/code/camera-relocalisation/7Scenes"
gamma : 0.001
image_size : 224
epoch_number : 1
weights : "/data/code/camera-relocalisation/downloaded_weights/model_snapshot_7scenes.t7"
train_batch_size : 64
validation_dataset_size : 10402
max_epoch : 250
dataset_name : "7-Scenes"
nGPU : 1
momentum : 0.9
logs : "./logs/7scenes.log"
beta : 1
manualSeed : 333
learning_rate : 0.1
beta2 : 0.999
model_zoo_path : "./pretrained_models"
precomputed_data_path : "./data"
results_filename : "./results/7scenes_res.bin"
snapshot_dir : "./snapshots"
GPU : 1
weight_decay : 1e-05
power : 0.5
training_dataset_size : 39999
}
this is a test for load_training_data
==> Training GT labels have been loaded successfully
==> Validation GT labels have been loaded successfully
==> loading model from pretained weights from file: /data/code/camera-relocalisation/downloaded_weights/model_snapshot_7scenes.t7
==> configuring optimizer
==> number of batches: 624
==> learning rate: 0.1
==> Number of parameters in the model: 22350215
==> online epoch # 1 [batchSize = 64]
==> time taken to randomize input training data: 2.7921199798584 ms
/torch/install/bin/luajit: /torch/install/share/lua/5.1/nn/Container.lua:67: ...........] ETA: 0ms | Step: 0ms
In 1 module of nn.Sequential:
In 1 module of nn.ParallelTable:
In 2 module of nn.Sequential:
/torch/install/share/lua/5.1/nn/THNN.lua:110: input_ and gradOutput_ shapes do not match: input_ [2 x 64 x 112 x 112], gradOutput_ [64 x 64 x 112 x 112] at /torch/extra/cunn/lib/THCUNN/generic/BatchNormalization.cu:74
stack traceback:
[C]: in function 'v'
/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_backward'
/torch/install/share/lua/5.1/nn/BatchNormalization.lua:154: in function </torch/install/share/lua/5.1/nn/BatchNormalization.lua:140>
[C]: in function 'xpcall'
/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/torch/install/share/lua/5.1/nn/Sequential.lua:70: in function </torch/install/share/lua/5.1/nn/Sequential.lua:63>
[C]: in function 'xpcall'
/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/torch/install/share/lua/5.1/nn/ParallelTable.lua:27: in function 'accGradParameters'
/torch/install/share/lua/5.1/nn/Module.lua:32: in function </torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
/data/code/camera-relocalisation/cnn_part/train.lua:68: in function 'opfunc'
/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
/data/code/camera-relocalisation/cnn_part/train.lua:72: in function 'train'
main.lua:97: in main chunk
[C]: in function 'dofile'
/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
/data/code/camera-relocalisation/cnn_part/train.lua:68: in function 'opfunc'
/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
/data/code/camera-relocalisation/cnn_part/train.lua:72: in function 'train'
main.lua:97: in main chunk
[C]: in function 'dofile'
/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I think the problem is located here:
for t, v in ipairs(indices) do
    xlua.progress(t, #indices)

    local mini_batch_info = make_training_minibatch(v)
    local mini_batch_data = mini_batch_info.data:cuda()
    local orientation_gt = mini_batch_info.quaternion_labels:cuda()
    local translation_gt = mini_batch_info.translation_labels:cuda()

    cutorch.synchronize()
    collectgarbage()

    feval = function(x)
        if x ~= parameters then parameters:copy(x) end
        model:zeroGradParameters()

        local outputs = model:forward({mini_batch_data[{{}, 1, {}, {}, {}}], mini_batch_data[{{}, 2, {}, {}, {}}]})
        local err = criterion:forward(outputs, {translation_gt, orientation_gt})
        meter_train_t:add(criterion.weights[1] * criterion.criterions[1].output)
        meter_train_q:add(criterion.weights[2] * criterion.criterions[2].output)

        local df_do = criterion:backward(outputs, {translation_gt, orientation_gt})
        model:backward(mini_batch_data, df_do)

        return err, gradParameters
    end
    optim.adam(feval, parameters, optimState)

============================================
In particular, when I comment out optim.adam(feval, parameters, optimState), the script runs without this error.

I don't know what's going on. Could you please help me?
Thanks in advance!
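
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: in the loop above, model:forward receives a table of two tensors, while model:backward receives the full mini_batch_data tensor. Torch's nn modules expect backward to be called with the same input that was passed to forward. If nn.ParallelTable indexes the 5D tensor directly, mini_batch_data[1] is a single image pair whose first dimension is 2, which would be consistent with the input_ [2 x 64 x 112 x 112] in the error. The sketch below shows the pattern on a toy model. The helper make_branch, the layer sizes, the MSE criterion and the random data are all hypothetical stand-ins, not the network or loss from this repository.

```lua
require 'nn'

-- Hypothetical stand-in for one branch of the real network.
local function make_branch()
    local branch = nn.Sequential()
    branch:add(nn.SpatialConvolution(3, 4, 3, 3, 1, 1, 1, 1))
    branch:add(nn.SpatialBatchNormalization(4))
    return branch
end

-- Toy two-branch model: a ParallelTable followed by a channel-wise join.
local model = nn.Sequential()
model:add(nn.ParallelTable():add(make_branch()):add(make_branch()))
model:add(nn.JoinTable(2))

-- Hypothetical criterion and data: batch x pair x channels x height x width.
local criterion = nn.MSECriterion()
local batch = torch.randn(8, 2, 3, 16, 16)
local target = torch.randn(8, 8, 16, 16)

-- Build the input table once and reuse it for forward and backward.
local inputs = {
    batch[{{}, 1, {}, {}, {}}],
    batch[{{}, 2, {}, {}, {}}],
}

local outputs = model:forward(inputs)
local err = criterion:forward(outputs, target)
local df_do = criterion:backward(outputs, target)

-- Same table as in forward, not the raw 5D tensor.
local grad_inputs = model:backward(inputs, df_do)
```

Applied to the training loop, this would mean storing the two-element table in a local variable and passing that variable to both model:forward and model:backward.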
