Consistent with the paper, the two _trainval_ datasets are to be used for training, while the VOC 2007 _test_ will serve as our test data.
Make sure you extract both the VOC 2007 _trainval_ and 2007 _test_ data to the same location, i.e. merge them.
#### PyTorch DataLoader
The `Dataset` described above, `PascalVOCDataset`, will be used by a PyTorch [`DataLoader`](https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) in `train.py` to **create and feed batches of data to the model** for training or evaluation.
Since the number of objects varies across different images, their bounding boxes, labels, and difficulties cannot simply be stacked together in the batch, as there would be no way of knowing which objects belong to which image.
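A common way to handle this is a custom collate function that stacks the fixed-size images into a single tensor but keeps the per-image annotations as lists. A minimal sketch of the idea (the tutorial implements something similar as a `collate_fn` on the dataset; the names below are illustrative):

```python
import torch

def collate_fn(batch):
    """Combine samples of (image, boxes, labels, difficulties) into a batch.

    Images are all the same size, so they can be stacked into one tensor;
    boxes, labels, and difficulties vary in length per image, so they are
    kept as lists of tensors, one entry per image.
    """
    images, boxes, labels, difficulties = [], [], [], []
    for image, b, l, d in batch:
        images.append(image)
        boxes.append(b)
        labels.append(l)
        difficulties.append(d)
    return torch.stack(images, dim=0), boxes, labels, difficulties
```

The lists preserve the image-to-object correspondence that naive stacking would destroy.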
# Training
Before you begin, make sure to save the required data files for training and evaluation. To do this, run the contents of [`create_data_lists.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/create_data_lists.py) after pointing it to the `VOC2007` and `VOC2012` folders in your [downloaded data](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection#download).
See [`train.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/train.py).
To **train your model from scratch**, run this file.
To **resume training at a checkpoint**, point to the corresponding file with the `checkpoint` parameter at the beginning of the code.
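For example, near the top of `train.py` (the checkpoint filename shown is illustrative):

```python
checkpoint = None  # None means train from scratch
# checkpoint = 'checkpoint_ssd300.pth.tar'  # or point to a saved checkpoint to resume
```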
### Remarks
In the paper, they recommend using **Stochastic Gradient Descent** in batches of `32` images, with an initial learning rate of `1e-3`, momentum of `0.9`, and `5e-4` weight decay.
I ended up using a batch size of `8` images for increased stability.
The authors also doubled the learning rate for bias parameters. As you can see in the code, this is easy to do in PyTorch, by passing [separate groups of parameters](https://pytorch.org/docs/stable/optim.html#per-parameter-options) to the `params` argument of its [SGD optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD).
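A sketch of the idea, using a single convolution as a stand-in model (the tutorial separates biases from weights the same way, by parameter name):

```python
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3)  # stand-in for the SSD model
lr = 1e-3

# Split parameters into biases and everything else by name
biases = [p for name, p in model.named_parameters() if name.endswith('bias')]
not_biases = [p for name, p in model.named_parameters() if not name.endswith('bias')]

# Biases get twice the base learning rate; other parameters use the default
optimizer = torch.optim.SGD(
    [{'params': biases, 'lr': 2 * lr},
     {'params': not_biases}],
    lr=lr, momentum=0.9, weight_decay=5e-4)
```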
The paper recommends training for 80000 iterations at the initial learning rate. Then, it is decayed by 90% (i.e. to a tenth) for an additional 20000 iterations, _twice_. With the paper's batch size of `32`, this means that the learning rate is decayed by 90% once at the 155th epoch and once more at the 194th epoch, and training is stopped at 232 epochs. I followed the same schedule.
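The iteration-to-epoch arithmetic can be checked directly, assuming the combined VOC 2007 + 2012 _trainval_ sets total 16,551 images:

```python
import math

n_images = 16551        # VOC 2007 trainval (5,011) + VOC 2012 trainval (11,540)
batch_size = 32
iters_per_epoch = n_images / batch_size  # about 517 iterations per epoch

# The paper's decay points and stopping point, expressed in epochs
first_decay = math.ceil(80000 / iters_per_epoch)    # epoch 155
second_decay = math.ceil(100000 / iters_per_epoch)  # epoch 194
stop = round(120000 / iters_per_epoch)              # about 232 epochs
```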
On a TitanX (Pascal), each epoch of training required about 6 minutes.
### Model checkpoint
We will use `calculate_mAP()` in [`utils.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py) for this purpose. As is the norm, we will ignore _difficult_ detections in the mAP calculation. Nevertheless, it is important to include them in the evaluation dataset, because if the model does detect an object that is considered _difficult_, it must not be counted as a false positive.
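How _difficult_ ground truths affect the counts can be illustrated with a simplified, single-class matching routine (a sketch of the bookkeeping only, not the tutorial's `calculate_mAP()`):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def count_tp_fp(detections, gts, difficult, thresh=0.5):
    """Greedily match detections (score, box) to ground truths in
    descending score order. A detection that matches a difficult ground
    truth is simply ignored: it is neither a true nor a false positive."""
    matched = [False] * len(gts)
    tp = fp = 0
    for score, box in sorted(detections, reverse=True):
        best, best_i = 0.0, -1
        for i, g in enumerate(gts):
            o = iou(box, g)
            if o > best:
                best, best_i = o, i
        if best >= thresh:
            if difficult[best_i]:
                continue  # matched a difficult object: skip entirely
            if not matched[best_i]:
                matched[best_i] = True
                tp += 1
            else:
                fp += 1  # duplicate detection of an already-matched object
        else:
            fp += 1  # no ground truth overlaps enough
    return tp, fp
```

This is why difficult objects must stay in the evaluation data: removing them would turn correct detections of those objects into false positives.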
The model scores **77.2 mAP**, same as the result reported in the paper.
Class-wise average precisions (not scaled to 100) are listed below.
| Class | Average Precision |
| :-----: | :------: |
| _aeroplane_ | 0.7888 |
| _bicycle_ | 0.8352 |
| _bird_ | 0.7623 |
| _boat_ | 0.7218 |
| _bottle_ | 0.4598 |
| _bus_ | 0.8705 |
| _car_ | 0.8656 |
| _cat_ | 0.8829 |
| _chair_ | 0.5917 |
| _cow_ | 0.8256 |
| _diningtable_ | 0.7569 |
| _dog_ | 0.8563 |
| _horse_ | 0.8778 |
| _motorbike_ | 0.8317 |
| _person_ | 0.7884 |
| _pottedplant_ | 0.5072 |
| _sheep_ | 0.7937 |
| _sofa_ | 0.7998 |
| _train_ | 0.8656 |
| _tvmonitor_ | 0.7492 |
You can see that some objects, like bottles and potted plants, are considerably harder to detect than others.
The learning parameters at the top of [`train.py`](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/train.py) are –

```python
# Learning parameters
checkpoint = None  # path to model checkpoint, None if none
batch_size = 8  # batch size
iterations = 120000  # number of iterations to train
workers = 4  # number of workers for loading data in the DataLoader
print_freq = 200  # print training status every __ batches
lr = 1e-3  # learning rate
decay_lr_at = [80000, 100000]  # decay learning rate after these many iterations
decay_lr_to = 0.1  # decay learning rate to this fraction of the existing learning rate
momentum = 0.9  # momentum
weight_decay = 5e-4  # weight decay
grad_clip = None  # clip if gradients are exploding, which may happen at larger batch sizes (sometimes at 32) - you will recognize it by a sorting error in the MultiBox loss calculation
```
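Decaying the learning rate at the iterations in `decay_lr_at` then amounts to scaling each parameter group's `lr` in place. A minimal sketch of such a helper (the tutorial keeps a similar function in `utils.py`; the name here is illustrative):

```python
def adjust_learning_rate(optimizer, scale):
    """Scale the learning rate of every parameter group by `scale`.

    For example, scale=0.1 decays the learning rate by 90%, matching
    the decay_lr_to fraction above.
    """
    for param_group in optimizer.param_groups:
        param_group['lr'] = param_group['lr'] * scale
```

Because it scales per group, the doubled bias learning rate keeps its 2:1 ratio to the base rate after every decay.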