Design decision query: recommended pattern for auxiliary loss terms that should be ignored during evaluation? #276
Hi Chirag, thank you for your interest in Texar-PyTorch! I'm the main developer behind the Executor module, and I can address your questions.

For your first question, I'm not sure I completely understood, but I guess your intention is to perform evaluation (under eval mode) on the training set after each epoch:

```python
executor = Executor(train_data=datasets["train"], ...)

@executor.on(cond.epoch(1))
def eval_on_train(executor):
    executor.test(datasets["train"])
    # obtain results from the status dictionary
    metrics = executor.status["eval_metric"]
    # compare it with your stored validation results ...
```

However, please pay attention to your validation/testing settings.

For your second question, the design principle of the Executor is to leave all model-related computations to the model, so that the Executor itself can be as general as possible. Our recommended approach is to create a "wrapper" model class that takes care of everything, including re-structuring of inputs/outputs, and the regularization you mentioned. This is also what we did in the examples (e.g., XLNet). This might seem like a bit of boilerplate, which I agree, but I currently don't see any way of avoiding it. Most built-in (PyTorch or Texar-PyTorch) and third-party modules do not take a batch as input, nor do they return a dictionary, so it's probably necessary to have this wrapper to at least re-structure inputs and outputs.

I would say that Executor is still in its early stages of development and far from complete. I encounter many tough spots while using the module myself, and I think it can be further improved. If you have any suggestions or are willing to contribute, please feel free to ask questions and create pull requests!
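For illustration, here is a minimal sketch of such a wrapper, assuming a simple classifier and hypothetical batch keys (the XLNet example shows the real thing):

```python
import torch.nn.functional as F
from torch import nn

class ModelWrapper(nn.Module):
    """Wraps a model so that `forward` takes a batch and returns a
    dictionary containing the loss, as the Executor expects."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, batch):
        # Re-structure the batch into the model's expected inputs.
        logits = self.model(batch["input_ids"])
        loss = F.cross_entropy(logits, batch["labels"])
        return {"loss": loss, "preds": logits.argmax(dim=-1)}
```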
Zecong! Small world! I think I ran into you at the LTI just before I left, haha; wonderful to know you are developing the Executor :) Thanks for getting back to me so quickly!

I think the two problems are related; even if I wrap the model, the loss would still need to be computed by the model, which is the larger issue. I'd argue that this is just an incorrect class design. I agree that the Executor shouldn't be responsible for this, but neither should the model. I think a better abstraction would be to have another object responsible for computing the loss; the base behaviour would be to do this using the predictions and labels, but one could extend this behaviour to accept the model itself too, and compute the regularization terms, for instance. The executor's behaviour is then to call this loss object.

As an example use case, say you are doing something related to domain generalization and need to induce orthogonality in the representation space, and you choose to do this by adding some cosine-similarity terms to the loss. As it stands, this needs to be the responsibility of the model wrapper; but conceptually, it isn't really a wrapper, and there ought to be some point at which the end user can extend the behaviour without opening up the model and changing its behaviour.

Which brings me to the XLNet example. While you can indeed implement this by using inheritance, it seems like an incorrect usage. A far better example of a model wrapper is in the BERT example, where the wrapper accepts the model as a dependency. I think the docs might benefit from a similar XLNet wrapper that doesn't need to call super(), but can still perform its computations by wrapping the loss into a dictionary. Thoughts?

On Return Types

It's great to know you are willing to accept pull requests; I can modify the parts of the executor to fix this design assumption and issue a pull request. We can figure out the final design in this thread if that's the best way to keep track of things.
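As a sketch of the kind of injectable loss object I have in mind for that use case (entirely hypothetical; the class and its interface are made up for illustration):

```python
import torch
from torch import nn

class CosineOrthogonalityPenalty(nn.Module):
    """Hypothetical auxiliary-loss object: penalizes pairwise cosine
    similarity between two feature groups to encourage orthogonal
    representations. Injected into training, not baked into the model."""

    def __init__(self, weight=1e-3):
        super().__init__()
        self.weight = weight

    def forward(self, features_a, features_b):
        sim = nn.functional.cosine_similarity(features_a, features_b, dim=-1)
        return self.weight * sim.abs().mean()
```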
Wow, I didn't realize that! It is a small world indeed :D

On the topic of whether the model should compute the loss, I guess it's more of a personal preference. I've seen quite a lot of code (albeit research code) that computes the loss within the model, so I would say habit-wise this is acceptable to most people. Your example of adding additional terms outside the model is a very reasonable use case, but there are also cases that favor having the model compute the loss, such as losses that consist of several parts.

Indeed, the XLNet example is not a good example in terms of class hierarchy, because, as you said, it's weird to inherit the wrapper from the model. I think the better term to describe what we want here is an "adapter" that somehow connects the model and the Executor. Since the user will probably need to write such an adapter anyway, I think it would be better to keep the current design, instead of creating a new general adapter module. What do you think?

On return types: I've used the `TypedDict` type to annotate the return dictionaries, but it doesn't allow extra keys, which is quite restrictive.

I'm happy to know that you'd be willing to contribute! Let's keep the discussion within this issue. In the meantime, you're more than welcome to take a look at other outstanding issues as well 😄
Ah, I should probably have explained it better; I am not against the model computing the loss if the programmer so intends, I am mostly against the framework being prescriptive about requiring the model to compute it, and offering no other obvious opportunity to inject the logic. The current way of doing this would indeed require the user to write a wrapper for the model.

For a running example, say the loss consists of three terms, followed by a regularization term, and we want to compare L1/L2/L1+L2 regularization. Let the model compute the first three terms. In the current design I could write one model class and three regularization modules that accept the model, compute the forward pass, add the regularization term to the loss term, and return the dictionary with the composite loss. The only nuance is that this module is masquerading as the model itself, while it isn't. So in essence, what I am proposing is a training step like:

```python
outputs = model(batch, **model_kwargs)
# Currently subsumed by the previous step. Here, one could even follow
# the pred_name, label_name setup for metrics and pass in only the relevant
# items, allowing for a Metric object to serve as a criterion, but the following
# allows the user to pass in other items to the criterion by forwarding the outputs
# and the batch. This seems like the safest abstraction for now.
loss = criterion(outputs, batch)
if regularizer:
    loss += regularizer(model)
loss.backward()
```

Indeed, this essentially performs the steps that an adapter would, like you said, without needing the user to write an adapter in the most common use case; they would just pass in some default criterion and regularizer. I'm advocating for the criterion to be a first-class entity in the pipeline, since it conceptually matches the training process in my opinion, and allows for the kind of extension described above.

As for the XLNet example, my comment was mostly about updating the docs with a non-inheritance example, so as to not encourage people to adopt that pattern when the examples already execute a better pattern. Thoughts?

[Edit] Note that this division of responsibility would solve the problem of keeping the regularization term out of the validation loss, because the regularizer simply wouldn't be called during the validation/test loop. Currently I don't think there is any simple way to achieve this at all. I'd say this is important for correctly comparing the training curves.

Also, to make things a little clearer using the GPT-2 example, the criterion would encapsulate the following:

```python
outputs = model(inputs=input_ids, decoding_strategy='train_greedy')
# Following encapsulated by the criterion object
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=batch['text_ids'][:, 1:],
    logits=outputs.logits[:, :-1, :],
    sequence_length=batch['length'] - 1,
    average_across_timesteps=True,
    sum_over_timesteps=False)
```

Notice how the model does not output the loss, but only the logits. The abstraction here would be that the criterion object accepts the batch and the outputs, gets the appropriate items from these, and returns the dictionary with the loss. Porting the above code to the executor in its current state would require this loss computation to be part of the model.

Ah yes, I agree that the inability to have extra keys is extremely restrictive, but I found that the expected behaviour is supported for function arguments. For convenience, from that link:

```python
from mypy_extensions import TypedDict

A = TypedDict('A', {'x': int})
B = TypedDict('B', {'x': int, 'y': str})

def f(x: A) -> None: ...

b: B = {'x': 1, 'y': 'foo'}
f(b)  # Ok
```

The restriction on extra keys applies when creating the dict, but I imagine the way to use it in the executor loop would be similar to the above example. There are probably a few other considerations I am missing, but this one seems manageable in terms of the extra keys?
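For illustration, a minimal sketch of such a criterion object under the assumptions above (a hypothetical class, not an existing Texar-PyTorch API):

```python
import texar.torch as tx

class SequenceCrossEntropyCriterion:
    """Hypothetical criterion: extracts the relevant items from the model
    outputs and the batch, and returns a dictionary containing the loss."""

    def __call__(self, outputs, batch):
        loss = tx.losses.sequence_sparse_softmax_cross_entropy(
            labels=batch['text_ids'][:, 1:],
            logits=outputs.logits[:, :-1, :],
            sequence_length=batch['length'] - 1)
        return {"loss": loss}
```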
I understand your concern, but in my opinion this is more of a naming problem than a design problem. If you think the name "model" is confusing, I'm willing to change it to "module" or something else.

The problem with the criterion approach that you're proposing is that the user would probably have to write their own "criterion adapter" as well. I know PyTorch provides module-style losses, but in order to use them you'd need to pass in the label tensor, not the batch object. The metric approach could work for simple losses, but not for stuff like the sequence losses in your GPT-2 example, which need additional arguments such as `sequence_length`. My point is, we want to make this simple for the user, and having one custom class is probably better than having two.

For switching between training and validation modes, the model adapter can check the `self.training` flag and skip the regularization terms in eval mode.

For the XLNet example, I agree with you that a non-inheritance, adapter-style wrapper is better, and I will create a PR to change that.

The support for function arguments does not completely solve the problem. Take this snippet as an example:

```python
from mypy_extensions import TypedDict
import torch
from torch import nn

class ReturnDict(TypedDict, total=False):
    loss: torch.Tensor

class ModelAdapter(nn.Module):
    def forward(self, input: torch.Tensor) -> ReturnDict:
        return {
            "loss": torch.sum(input),
            "something_else": input,  # mypy rejects this extra key
        }
```
Hey Zecong, thanks for getting back to me.

On side-effects of the model adapter

So I'm assuming we have the overarching goal that simple things should be simple, but complex things should still be doable, preferably in uncomplicated ways. Ideally, in the simplest use cases, the user shouldn't have to write any adapter at all. This is covered by having the model's forward method return the loss. So far so good.

Now if you're writing an adapter, my concern is not so much to do with the name but with the responsibility. As far as possible, I think it's better for a class to have a single well-defined responsibility. So what is the responsibility of a model adapter? Currently, it is to wrap any outputs into a dictionary that has a loss item. That's well-defined. If you start using it as a catch-all to add branching logic about training and validation, its purpose is going to quickly start bloating. Also, if you recommend that the adapter can change model state, it's harder to have any guarantees about the side-effects of the adapter down the road. Will dependent methods be able to work with the guarantee that the model adapter does not alter the model state as a side effect? Besides, it seems appropriate that the executor should encapsulate any branching logic of that nature, because it has to do with execution anyway. As of now, the executor handles the model mode via calls to `train()`/`eval()`.

For these reasons, I am of the opinion that the adapter shouldn't be responsible for any logic more complex than transforming the output data of the forward pass into an appropriate data structure. If you believe it's alright for the adapter to alter model state, I'd request you to consider how these side effects might affect the building of future behaviour on top of these interfaces.

On regularization

I'm a little confused about what you're referring to as a not commonly used feature; is it turning off regularization during computation of the validation loss? If so, I think how commonly it is used depends on whether people want to compare the validation and training losses. If only the training loss is being looked at, sure, it's not as important. But the difference between the curves is something that's taught in Intro to ML courses, so it ought to be supported without having to write custom logic to turn off regularization. To me this is a similar responsibility to calling `train()`/`eval()` at the right times.

On losses as a separate module

I agree that the user shouldn't have to write a criterion wrapper in the simple cases, and now that I think of it more, they needn't be modules as such; what I was proposing is that the executor accept any callable object. This could be a function itself or an `nn.Module`. In the simple cases, any metric can be passed in. As for the arguments being labels or a batch, this is easily solved in the simple cases by having a default decorator that extracts the labels from the batch object and passes them in. But having a general interface for passing in the batch as well as the outputs allows for easier extension with custom behaviour that is not currently possible, since you might need more information than the labels and preds when computing losses; think of variational inference applications as an example, where you need to compute KL terms between things that are not just the labels or preds.

Yes, of course, that's what I meant by the problem only arising during creation of the dictionary. The reason I said this isn't a problem during usage is something like this:

```python
# ... imports ...

## LIBRARY CODE ----
# Somewhere in the Executor
# Not quite sure why you have `total=False`; I believe we don't want to omit the loss key
DictWithLoss = TypedDict("DictWithLoss", {"loss": torch.Tensor}, total=True)
...
def some_executor_method(d: DictWithLoss):
    loss = d["loss"]  # should be okay according to the example below
    loss.backward()
    ...
## -----

## USER CODE ----
UserReturnDict = TypedDict("UserReturnDict",
                           {"loss": torch.Tensor, "something_else": torch.Tensor})

class ModelAdapter(nn.Module):
    def forward(self, input: torch.Tensor) -> UserReturnDict:
        return {
            "loss": torch.sum(input),
            "something_else": input,
        }  # ---> should be okay
## ----
```

So I think in terms of usage, this should be fine. I don't have a strong opinion on TypedDict being used, but coming from a C++ background I liked the typing constraints and having users think about what they put in their dictionaries to avoid bloat. But I'll admit this is probably me reacting to code changes where people start sticking more things into dictionaries just because they can.
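For what it's worth, the "default decorator" mentioned above could be as simple as this (a hypothetical sketch; the names are made up):

```python
from functools import wraps

def labels_from_batch(label_key="labels"):
    """Hypothetical default adapter: lifts a (preds, labels) loss function
    into the (outputs, batch) criterion interface proposed above."""
    def decorate(loss_fn):
        @wraps(loss_fn)
        def criterion(outputs, batch):
            return loss_fn(outputs, batch[label_key])
        return criterion
    return decorate

# Usage: an ordinary (preds, labels) callable becomes a criterion.
@labels_from_batch(label_key="target_text_ids")
def my_loss(preds, labels):
    ...
```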
Sorry for getting back to you late; I've been busy during the holidays.

I think the word "statefulness" describes the situation better than "having side-effects". I don't think the adapter being stateful is a bad thing, since all PyTorch modules have state built into them (e.g., the train/eval flag). However, I do agree that the example I showcased is not the best design, as it requires users to manually switch states outside of the executor, and they might forget to do so. The purpose of the example was to provide an easy solution for the goal of "computing the training loss under eval mode but keeping regularization terms", without modifying any of the executor internals.

After reading your replies again, I realize that I have probably misunderstood your requirements. Let me try to rephrase the questions you were asking:

1. How can regularization terms be excluded from the loss during validation/testing?
2. How can the training loss be computed in a separate pass over the training set after each epoch, instead of being averaged over the training batches?
3. Shouldn't loss computation and regularization live outside the model?

For 1., I guess the better way to do it would be to check for the `self.training` flag within the adapter.

For 2., my argument that this is not a very commonly used feature is mainly limited to deep learning applications, where the training set is often too large for this to be practical. I'm less familiar with applications outside of the NLP domain, but I can see how this would be useful on smaller datasets or computationally cheaper models. But, as you said, this can be done in the current framework by manually evaluating on the training set in a custom validation action. What could be improved here is to allow the custom action to return metric values, so the training losses (w/o regularization) can also be registered as a validation metric.

For 3., I stand by my previous opinion that coupling the loss/regularization with the model makes more sense. My major points are:

1. The user will probably need to write an adapter anyway to re-structure the model's inputs and outputs, so moving the loss out of the model would not get rid of the adapter.
2. Computing the loss outside the model requires the user to write additional code to adapt between the model and the loss.
3. Some regularization techniques (e.g., dropout) are part of the model's own computation and cannot be moved outside it.

Of course, it is also possible, for instance, to allow the user to add regularization in the executor arguments. However, this gives the user less control over the training process. For example, the user might want to also return the loss value without the regularization terms as a separate dictionary entry, but it's not possible (or less straightforward) to do so with regularizers handled by the executor. Similarly, with the loss computed in the model, it wouldn't be difficult to observe individual loss components of a complex loss module, or to enable/disable parts of the loss computation for evaluation.

I see your point with the `TypedDict` example. I'll look into this and submit a separate issue, but you're also more than welcome to propose changes!
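For 1., a minimal sketch of what that check might look like in an adapter (hypothetical code; the class and key names are assumptions):

```python
import torch
from torch import nn

class RegularizedAdapter(nn.Module):
    """Hypothetical adapter: adds an L2 penalty only in train mode, so the
    validation/test loss stays free of regularization terms."""

    def __init__(self, model, weight=1e-4):
        super().__init__()
        self.model = model
        self.weight = weight

    def forward(self, batch):
        out = self.model(batch)  # assumed to return {"loss": ...}
        if self.training:  # branch on the model state
            l2 = sum(p.pow(2).sum() for p in self.model.parameters())
            out["loss"] = out["loss"] + self.weight * l2
        return out
```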
No worries about the delay in response; enjoy your holidays! I think we are ultimately saying the same thing on this subpoint, but for clarity, the two options on the table are:

1. The model (or an adapter that subclasses it) computes its own loss and regularization terms.
2. An adapter that wraps the model computes those terms outside the core model.

Before I try to distill the essence of why I think the former is the poorer choice, let me restate the principle of the executor from your comment above:

> the design principle of the Executor is to leave all model-related computations to the model, so that the Executor itself can be as general as possible

With this in mind, I think the essence of the design flaw, as I see it, is this: training the model is not part of the model's own task, so the computations required for training are not "model-related computations".

The exception to this is that the model holds some gradients, sure, but it shouldn't need to know how those are computed. That is the job of the training apparatus, so at the very least it should happen outside the model. If you can concede this, then it makes sense that the measures required for making a model better at its task (the loss, regularization, metrics, etc.) live outside the model, because while they relate to the model, they aren't inherently a part of the model's task, and so are not "model-related computations". This is more than a naming problem; if you think of the adapter as a model subclass, it's a behavioural semantic. I agree these computations shouldn't be within the executor, and that it should be possible to modify the executor's behaviour by injecting such logic. So option 2 above, with the adapter wrapping the model, is the better option, and still in line with the design principle.

For ease of consumption, I'll respond to your specific proposed solutions in a follow-up comment, but I wanted to leave the core argument here and use it as context for what comes next.
Right, so hopefully we agree that the computations pertaining to training the model should be outside the core model, and outside the executor itself, with the adapter being the candidate to hold this logic. With that, responding to your specific points:

On checking the model state within the adapter: I agree. If this is done within the adapter, though, it needs to branch on the model state, which isn't too bad, but not the most elegant, I think, since the check needs to be performed in every forward pass.

On registering the loss w/o regularization as a validation metric: that's an interesting idea. The way I had envisioned it, the regularization bit comes from a separate callable, so the same core loss function could be reused across the train and val loops.

On coupling the loss/regularization with the model: I can partially agree, if you mean the coupling occurs with the adapter and not the core model, for the reasons covered in my previous comment. However, even with the adapter, this solution condemns the user to writing an adapter in almost all cases, which to me is a symptom of a design flaw. In the simplest case, the user ought to be able to write the core model, specify a core loss, an optional regularizer, and an optimization procedure, and let the executor handle the rest. No adapters needed, since the criterion returns the loss anyway. The need for an adapter is itself, in some ways, boilerplate code to me.

On module-style losses: I should clarify that I'm not advocating for module losses. In my pseudocode above I assumed any callable object that accepts the labels and outputs and computes the loss. This could be a function, an object, or an `nn.Module`.

On dropout-style regularization: true, although there are some papers that argue that dropout doesn't necessarily perform the regularization it is thought to perform. However, this is irrelevant to the spirit of your argument, which holds. I was indeed referring to regularization terms that are computed from model parameters and used in the optimization.

On the amount of user code: I think I have to disagree here. In the most common case with what I am proposing, the user doesn't have to write any adapter at all, or at most a thin adapter that wraps the loss in a dictionary. The branching logic, etc., is handled by the executor. The user only has to write a custom regularizer if there is a need, such as maintaining orthogonality of RNN weights, and pass that in. Even with a similar amount of work, I'd argue configuring the executor is a more elegant solution than branching logic in the adapter, with the user having to write one in most situations.

On regularization via executor arguments giving less control: this statement seems a little self-contradictory, perhaps? An optional argument to inject logic ought to afford the user more flexibility without being prescriptive. The other use cases you describe are still supported by the library in my proposed alteration, and the user can choose to completely ignore the optional regularizer and compute all those terms in an adapter if they choose to write one. It's just not required in most cases, due to relaxing the requirement that the model wrapper return a loss.

Sorry for the long comments; I thought it best to err on the side of being more specific to avoid confusion. In any case, thanks for engaging in this discussion!
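Concretely, the executor could own steps along these lines (a hypothetical sketch, not the current Texar-PyTorch implementation):

```python
import torch

def train_step(model, criterion, regularizer, batch):
    # Executor-owned training step: core loss plus optional regularizer.
    outputs = model(batch)
    loss = criterion(outputs, batch)
    if regularizer is not None:
        loss = loss + regularizer(model)
    return loss

def eval_step(model, criterion, batch):
    # Executor-owned evaluation step: the regularizer is never called,
    # so the validation loss stays comparable to the unregularized
    # training loss.
    with torch.no_grad():
        outputs = model(batch)
        return criterion(outputs, batch)
```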
Thanks for the reply. I appreciate you putting so much thought into the executor design!

I think your core argument is that loss computation and regularization are not part of the model's responsibility, and should not be done in the model class. However, I believe this is more of a personal preference than a rule to follow. True, in machine-learning terms the model is separate from the loss, but in practice they're often coupled together. Forcing them to be separate modules or functions seems, to me, a bit like abstraction for abstraction's sake.

I can give examples where changing loss functions would require modifying the model's internal computations. Consider word-level language models with a large vocabulary. At each time step, the model predicts the next word given the previous context. The most straightforward way to do this is to compute the full logits as a tensor of vocabulary size and apply a softmax cross-entropy loss, but for large vocabularies this is expensive, so approximation techniques such as sampled softmax, hierarchical softmax, or adaptive softmax are commonly used instead.
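PyTorch's own `nn.AdaptiveLogSoftmaxWithLoss` is a concrete instance of this coupling: the adaptive-softmax technique changes how the logits are computed, and the module returns the loss straight from the forward pass (sketch with made-up sizes):

```python
import torch
from torch import nn

# Adaptive softmax couples the output projection and the loss in one
# module: the full [batch, vocab] logits are never materialized.
asm = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=512, n_classes=50000, cutoffs=[500, 5000, 30000])

hidden = torch.randn(32, 512)           # decoder states for 32 positions
targets = torch.randint(0, 50000, (32,))
output, loss = asm(hidden, targets)     # the loss comes out of the "model"
```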
As you can see, each different technique requires the model to compute the logits differently. These losses can be used interchangeably in the same model, but would require non-trivial modifications to the model itself. If you're interested, check out this article for more information on these techniques.

And let me address your refutations to my claims in the previous reply:
On the adapter being unavoidable: this is true, and is something I'm trying to get rid of. The problem is mainly with the return values: most existing modules just return the output as a tensor or a tuple of tensors, not as a dictionary. However, the adapter is not there for the sole purpose of computing the loss, so separating model and loss would not get rid of the adapter.

On module-style losses: my argument is not limited to module-like losses. My point was that having the loss computed outside of the model would require the user to write additional code to adapt between the model and the loss, be it in the form of a class or a function.

I'm not sure what you mean by "the branching logic is handled by the executor". I would assume that if the loss or regularizer is set in the executor arguments, it would be something like this:

```python
executor = Executor(
    ...,
    loss=tx.losses.sequence_sparse_softmax_cross_entropy,
    regularizer=tx.core.L1L2(l1=0.5),
    ...)
```

Then if the user were to choose between multiple losses, they would still have to write something like:

```python
if args.loss_type == "mle":
    loss_func = tx.losses.sequence_sparse_softmax_cross_entropy
elif args.loss_type == "adversarial":
    loss_func = some_adversarial_loss_function
# or, something like:
# loss_func = tx.utils.get_function(args.loss_func_name, tx.losses)
executor = Executor(loss=loss_func, ...)
```

I don't see how this would be too different from doing things in the model class. And this is assuming the losses can be used as-is, without adapting them to your model or setting additional arguments.

On allowing regularization in the executor arguments: what I was talking about is if we only allowed specifying the regularizer through executor arguments. I agree that if we allow both choices, it would be more flexible for the user. But in my opinion it might also become a source of confusion: the user might apply regularization in both places and not realize it.
Thanks for the link! I am aware of some of these, and the thrust of the argument seems to be that the model computations might have to change depending on the loss computation. This is true, and in these cases the model ought to change anyway, so it's fine to write multiple model classes. However, this doesn't seem to me to be a reason to couple the responsibilities, because it makes the simpler cases significantly more convoluted than needed.

I also don't think I explained my point very clearly. To have a common reference, here's a complete-ish example, suitably adapted from the ConditionalGPT2Model example for illustration. Say I would like to train the model with two losses (sequence_sparse_softmax_cross_entropy, sequence_softmax_cross_entropy) and orthogonal regularization (an arbitrary choice that might not make the most sense, but works for illustration):

```python
# ... imports ...

## ------------ Ideal minimum user required code ---------------
class ConditionalGPT2Model(nn.Module):
    """An encoder-decoder model with GPT-2 as the decoder."""

    def __init__(self, vocab_size):
        super().__init__()
        # Use hyperparameter dict for model configuration
        self.embedder = tx.modules.WordEmbedder(vocab_size, hparams=...)
        self.encoder = tx.modules.TransformerEncoder(hparams=...)
        self.decoder = tx.modules.GPT2Decoder("gpt2-small")  # With pre-trained weights

    def _get_decoder_output(self, batch, train=True):
        """Perform model inference, i.e., decoding."""
        enc_states = self.encoder(inputs=self.embedder(batch['source_text_ids']),
                                  sequence_length=batch['source_length'])
        if train:  # Teacher-forcing decoding at training time
            return self.decoder(
                inputs=batch['target_text_ids'], sequence_length=batch['target_length'] - 1,
                memory=enc_states, memory_sequence_length=batch['source_length'])
        else:  # Beam search decoding at prediction time
            start_tokens = torch.full_like(batch['source_text_ids'][:, 0], BOS)
            return self.decoder(
                beam_width=5, start_tokens=start_tokens,
                memory=enc_states, memory_sequence_length=batch['source_length'])

    def forward(self, batch):  ## modified to return only the outputs
        """Compute model outputs."""
        outputs = self._get_decoder_output(batch)
        return {"outputs": outputs.logits}

    def predict(self, batch):
        """Compute model predictions."""
        sequence, _ = self._get_decoder_output(batch, train=False)
        return {"gen_text_ids": sequence}


# Adapted from github.com/kevinzakka/pytorch-goodies into a module for illustration
class OrthogonalRegularizer(nn.Module):
    def __init__(self, reg=1e-6):
        super().__init__()
        self.reg = reg

    def forward(self, model):
        orth_loss = torch.zeros(1)
        for name, param in model.named_parameters():
            if 'bias' not in name:
                param_flat = param.view(param.shape[0], -1)
                sym = torch.mm(param_flat, torch.t(param_flat))
                sym -= torch.eye(param_flat.shape[0])
                orth_loss = orth_loss + (self.reg * sym.abs().sum())
        return orth_loss


# Losses
loss_func = tx.losses.sequence_sparse_softmax_cross_entropy
loss_func_dense = tx.losses.sequence_softmax_cross_entropy

# Instantiate
model = ConditionalGPT2Model(...)
regularizer = OrthogonalRegularizer()
```

Ideal Scenario

```python
executor = tx.run.Executor(
    model=model,
    loss=loss_func,
    regularizer=regularizer,
    keys={"pred_name": "outputs", "labels": "target_text_ids"},  # something like this
    ....
)

executor_dense = tx.run.Executor(
    model=model,
    loss=loss_func_dense,
    regularizer=regularizer,
    keys={"pred_name": "outputs", "labels": "target_text_ids"},  # something like this
    ....
)
```

I'm using keys as some dictionary that specifies how the executor can unpack the data returned by the model and the batch, and pass those into the loss, in the simplest case for provided losses. Of course, the user can write a custom loss and pass that in too. Alternatively, the functionality can be extended via helpers provided by the Executor, without needing to modify the Executor class (Open-Closed Principle).

Current reality

```python
class GPT2WithSoftmaxCrossEntropy(nn.Module):
    def __init__(self, gpt2, regularizer=None):
        super().__init__()
        self.gpt2 = gpt2
        self.regularizer = regularizer

    def forward(self, batch):
        logits = self.gpt2(batch)["outputs"]
        loss = tx.losses.sequence_softmax_cross_entropy(
            labels=batch['target_text_ids'][:, 1:], logits=logits,
            sequence_length=batch['target_length'] - 1)
        if self.training:  ## Branching
            if self.regularizer:
                loss += self.regularizer(self.gpt2)
        return {"loss": loss}

    def predict(...):
        ....


class GPT2WithSparseSoftmaxCrossEntropy(nn.Module):
    def __init__(self, gpt2, regularizer=None):
        super().__init__()
        self.gpt2 = gpt2
        self.regularizer = regularizer

    def forward(self, batch):
        logits = self.gpt2(batch)["outputs"]
        loss = tx.losses.sequence_sparse_softmax_cross_entropy(
            labels=batch['target_text_ids'][:, 1:], logits=logits,
            sequence_length=batch['target_length'] - 1)
        if self.training:  ## Branching
            if self.regularizer:
                loss += self.regularizer(self.gpt2)
        return {"loss": loss}

    def predict(...):
        ....
```

Issues

- The two wrapper classes duplicate nearly all of their code; only the loss call differs.
- Each wrapper has to carry the train/eval branching logic in its forward pass.
- All of this wrapper code is user code that has to be tested.

From a user's perspective, the overhead of adding a dependency to a project ought to be overshadowed by its benefits. Currently, if I advocate adding this library to the repositories I am responsible for, I would have to invest in a more robust testing suite for the user code, owing to easily avoidable issues in an otherwise excellent library interface. Hopefully I have done a slightly better job of explaining the bigger picture, the considerations that go into adopting a new library, and the potential issues of dropping texar-pytorch into the workflows I am responsible for.
Technically, you have answered my original questions within the current interface design, so feel free to close this issue. However, we have wandered into the territory of proposed changes, and there might be valuable discussion here, so perhaps it's easier to continue this discussion, if you find it useful, by changing the labels? I might test a few interface changes in my fork, and am happy to issue a PR if you find that useful too.
Hey, so sorry for the late reply. I've been caught up by work that piled up during the holidays. I'm fine with continuing the discussion here, and I've changed the label from "question" to "discussion".

Thank you for explaining your point in detail with code examples. However, I'm afraid I can't agree with you on all these points.

In your example you assumed that each choice of loss requires its own wrapper class, but you could just as well write a single wrapper that switches between them:

```python
class GPT2WithSoftmaxCrossEntropy(nn.Module):
    def __init__(self, gpt2, loss_func: str):
        super().__init__()
        self.gpt2 = gpt2
        self.loss_func = loss_func

    def forward(self, batch):
        logits = self.gpt2(batch)["outputs"]
        if self.loss_func == "sparse":
            loss = tx.losses.sequence_sparse_softmax_cross_entropy(
                labels=batch['target_text_ids'][:, 1:], logits=logits,
                sequence_length=batch['target_length'] - 1,
                label_smoothing=0.1)
        else:
            loss = tx.losses.sequence_softmax_cross_entropy(
                labels=batch['target'][:, 1:], logits=logits,
                sequence_length=batch['target_length'] - 1)
        return {"loss": loss}
```

In your alternative design, you'd have to write (or put these in a config):

```python
if args.loss_func == "sparse":
    loss_func = tx.losses.sequence_sparse_softmax_cross_entropy
    keys = {"logits": "logits", "labels": "target_text_ids",
            "sequence_length": "target_length",
            "label_smoothing": 0.1}  # additional kwargs too
else:
    loss_func = tx.losses.sequence_softmax_cross_entropy
    keys = {"logits": "logits", "labels": "target",
            "sequence_length": "target_length"}

executor = tx.run.Executor(
    model=model,
    loss=loss_func,
    regularizer=regularizer,
    keys=keys,
    ....
)
```

which I think is more or less the same. It doesn't really enable you to test less code: you'd still need to cover both branches, and if you're loading everything from a config, you'll need to test that the config is written correctly. Either way, even a well-tested library doesn't give you that benefit; it's the user code you'd still have to test.

Also, the code you showcased was not exactly a fair comparison. In the current design the model outputs cannot be directly used in the loss; they need to be sliced, or have offsets subtracted. So, in order to specify the loss in the executor arguments, you'd need to return a bunch more in the model adapter:

```python
def forward(self, batch):
    outputs = self.gpt2(batch)
    return {"outputs": outputs["logits"],
            "target_text_ids": batch["target_text_ids"][:, 1:],
            "target_length": batch["target_length"] - 1}
```

which also makes it longer, or at least not much shorter, compared to the current version.

To summarize, as I said before, I think your design is reasonable and would make the code shorter and perhaps more readable in certain cases, and those are probably the cases that you often work with. However, the design would make other cases harder to implement, and does not really serve the purpose it's supposed to serve. Since we want Texar-PyTorch to be a general-purpose library, we probably have to choose the current design over yours. But still, thank you for bringing up the topic and engaging in the discussion. It made me think about the current design and gave me ideas for improvement.
Closing due to lack of activity.
Hello!

I came across texar-pytorch while in the process of writing my own version of the Executor, and I'm really happy that someone's already done the work for it, so firstly, thanks for the excellent repository.

Broad Query

One of the main questions I have pertains to the design requirement of including the loss as an item in the dictionary returned by the `forward` method of the model. Essentially, I'm wondering what the recommended pattern is for including terms in the loss (e.g., regularization terms) that ought not to be part of the validation loss. Conceptually, I think of the forward pass as being complete once the model has made its predictions, which is what I believe the `predict()` method is for. However, making the computation of the loss a responsibility of the forward pass can lead to certain problems.

Context

Ideally, the training loss ought to be computed in a separate forward pass after the training epoch is completed. I'm aware that most people use an average over the training batches as an approximation of the training loss. However, this becomes an issue when comparing against a validation loss curve, where the difference between the training and validation curves indicates generalization error. This is for two reasons in the typical case:

1. The model parameters change after every batch, so the averaged batch losses are computed against different intermediate versions of the weights, rather than the weights at the end of the epoch.
2. The batch losses are computed with the model in train mode, whereas the validation loss is computed in eval mode.

As far as I can tell, the training loss is computed within the training loop in the executor, as opposed to in an additional forward pass over the training set with `model.eval()`. Is my understanding correct?

On regularization

Typically, any regularization terms over the model parameters are added to the loss before the call to `backward()`. With the given interface, it seems like the right place to do this would be in the forward pass. However, I find it a little weird to make the model responsible for regularizing itself. I usually have a separate `nn.Module` subclass responsible for regularizing a model, and dependency-inject the model into this class. That way I can swap out different regularizers without changing the model, in compliance with the Open-Closed SOLID principle.

Could you please explain how to achieve these two things (loss computation over the training set after the train loop, and computing auxiliary loss terms outside the model) with the current setup of the executor? It seems like this is a direct consequence of requiring the model to compute the loss, which seems a little problematic. These are currently the two main problems precluding my use of this otherwise awesome repo, so I'd appreciate any insight.
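To illustrate the dependency-injection pattern I mean (a minimal sketch with made-up names, not Texar code):

```python
import torch
from torch import nn

class L2Regularizer(nn.Module):
    """Dependency-injected regularizer: takes the model it regularizes as
    an argument instead of living inside the model."""

    def __init__(self, weight: float = 1e-4):
        super().__init__()
        self.weight = weight

    def forward(self, model: nn.Module) -> torch.Tensor:
        return self.weight * sum(p.pow(2).sum() for p in model.parameters())

# The same model can be trained with different regularizers, unchanged:
# loss = task_loss + L2Regularizer(1e-4)(model)
```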
Thanks!