strange view() operation in ReduceState module #11

jamestang0219 · 2018-10-18T08:06:25Z

We wanna reduce the forward and backward state, so we need to concatenate the forward state and backward state, then go through a nn.Linear module to let 2 * hidden_dim become hidden_dim.

In your code, h is 2 x B x hidden_dim, and you use view() operation directly on h and c. The results should be concatenating first example forward state and second example forward state, not concatenating first example forward state and first example backward state.

In my opinion, we should use h.transpose(0, 1).contiguous().view(-1, config.hidden_dim*2).

pointer_summarizer/training_ptr_gen/model.py

Line 75 in 454a2f6

hidden_reduced_h = F.relu(self.reduce_h(h.view(-1, config.hidden_dim * 2)))

atulkum · 2018-10-19T05:10:21Z

Thanks for reviewing the code. You are correct I fixed the code. Would update the result after the re run.

jamestang0219 · 2018-10-23T09:18:33Z

@atulkum
One more question, in your code, step_coverage_loss is the sum of the minimum of attn_dist and coverage in each element.

pointer_summarizer/training_ptr_gen/train.py

Line 99 in 5e51169

step_coverage_loss = torch.sum(torch.min(attn_dist, coverage), 1)

And coverage is coverage + attn_dist.

pointer_summarizer/training_ptr_gen/model.py

Line 124 in 5e51169

coverage = coverage + attn_dist

So, the step_converage_loss should always be the sum of attn_dist(1.0)?

atulkum · 2018-10-23T18:00:32Z

I think you are right. The order of updating the coverage is not correct. The coverage should be updated after coverage loss has been calculated. I think I might have made a mistake here while code refactoring. I will fix it.

        if config.is_coverage:
            coverage = coverage.view(-1, t_k)
            coverage = coverage + attn_dist

atulkum · 2018-10-23T18:51:38Z

Thanks for reviewing the code. I have fixed the bug.
https://github.com/atulkum/pointer_summarizer/blob/master/training_ptr_gen/train.py#L91
https://github.com/atulkum/pointer_summarizer/blob/master/training_ptr_gen/train.py#L100

jamestang0219 · 2018-10-30T03:52:47Z

@atulkum have you ever try to set is_coverage as True, it's extremely easy to cause loss became NaN, less learning rate is useless for this issue.

atulkum · 2018-10-30T06:01:58Z

I have turned on is_coverage=True after training for 500k iteration. Making is_coverage=True from the beginning makes the training unstable.

jamestang0219 · 2018-10-30T07:40:54Z

@atulkum I think this operation may cause NaN

pointer_summarizer/training_ptr_gen/model.py

Line 95 in fd8dda3

encoder_feature = self.W_h(h)

calculating the memory of attention in each decoder step may create many computation graph branch in torch backend, but in fact , we only need to calculate it once after get encoder_outputs.

atulkum · 2018-10-30T16:44:02Z

You are right about increasing branches computation graph but it won't cause NaN. If you are getting NaN then it might be somewhere else. I tested it on pytorch 0.4 and it does not give NaN.

I am changing this to be called only once (thanks for catching this) and test it again.

jamestang0219 · 2018-10-31T07:52:15Z

@atulkum I set is_coverage as True after 500k step, but at the beginning of retraining, it always gives NaN. I'll continue to test, thank you again.

atulkum · 2018-10-31T14:02:25Z

After how many iteration (with is_coverage = True) you are getting NaN?
Did you initialize the model_file_path in the code?
https://github.com/atulkum/pointer_summarizer/blob/master/training_ptr_gen/train.py#L141

You can try to debug it on CPU.

My GPU is busy right now, but I just tested it on CPU for around 20 epoch I did not get NaN or any exploding gradient etc.

jamestang0219 · 2018-11-01T08:41:58Z

Thanks for suggestion. I initialized the model_file_path, but after no more than 100 iter, it get NaN:(

atulkum · 2018-11-03T14:38:53Z

I have uploaded a model here. I retrain it with with is_coverage = True for 200k iteration but did not get NaN

For retraining you should do 3 things:

in the config.py make is_coverage = True
in the config.py make max_iterations = 600000
make sure in train.py you are calling trainIters with full absolute path model_file_path

I am also curious why you get NaN, can you debug it and pinpoint the code which is causing NaN?

BigwhiteRookie · 2018-12-20T09:29:56Z

Thanks for your code, but I have a question: when I run the train.py , one error : AttributeError: 'generator' object has no attribute 'next', i dont understand it , the system show it at the batcher.py 209.

karansaxena · 2019-03-20T13:11:48Z

@jamestang0219 were you able to point out the nan problem?
I trained without coverage till 200k and beyond that, I am using the coverage but running into nan soon after.

atulkum closed this as completed Oct 23, 2018

atulkum reopened this Oct 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

strange view() operation in ReduceState module #11

strange view() operation in ReduceState module #11

jamestang0219 commented Oct 18, 2018

atulkum commented Oct 19, 2018

jamestang0219 commented Oct 23, 2018 •

edited

Loading

atulkum commented Oct 23, 2018

atulkum commented Oct 23, 2018

jamestang0219 commented Oct 30, 2018

atulkum commented Oct 30, 2018

jamestang0219 commented Oct 30, 2018

atulkum commented Oct 30, 2018

jamestang0219 commented Oct 31, 2018

atulkum commented Oct 31, 2018

jamestang0219 commented Nov 1, 2018

atulkum commented Nov 3, 2018 •

edited

Loading

BigwhiteRookie commented Dec 20, 2018

karansaxena commented Mar 20, 2019

strange view() operation in ReduceState module #11

strange view() operation in ReduceState module #11

Comments

jamestang0219 commented Oct 18, 2018

atulkum commented Oct 19, 2018

jamestang0219 commented Oct 23, 2018 • edited Loading

atulkum commented Oct 23, 2018

atulkum commented Oct 23, 2018

jamestang0219 commented Oct 30, 2018

atulkum commented Oct 30, 2018

jamestang0219 commented Oct 30, 2018

atulkum commented Oct 30, 2018

jamestang0219 commented Oct 31, 2018

atulkum commented Oct 31, 2018

jamestang0219 commented Nov 1, 2018

atulkum commented Nov 3, 2018 • edited Loading

BigwhiteRookie commented Dec 20, 2018

karansaxena commented Mar 20, 2019

jamestang0219 commented Oct 23, 2018 •

edited

Loading

atulkum commented Nov 3, 2018 •

edited

Loading