Commit

fix: make dataset robust to empty samples
Signed-off-by: Mehant Kammakomati <[email protected]>
kmehant committed Sep 5, 2024
1 parent 6448afb commit ef3191d
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions tuning/utils/data_loaders.py
@@ -109,6 +109,11 @@ def __iter__(self):
     sample[self.tokens_field] = self.tokenizer.encode(
         sample[self.text_field]
     )
+    if not sample[self.tokens_field]:
+        logger.warning(
+            f"skipping an empty sample : {sample[self.tokens_field]}"
+        )
+        continue
 except Exception as e:  # pylint: disable=broad-exception-caught
     logger.warning(
         "failed to tokenize the data {} of type {}.".format(
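The change in this hunk can be tried in isolation with a minimal sketch like the one below. It is not the code from `tuning/utils/data_loaders.py`: the function `iter_tokenized`, the stand-in tokenizer `toy_encode`, and the default field names are hypothetical, and skipping a sample after a tokenization failure is an assumption, since the `except` branch is cut off in the diff.

```python
import logging
from typing import Callable, Dict, Iterable, Iterator, List

logger = logging.getLogger(__name__)


def iter_tokenized(
    samples: Iterable[Dict[str, object]],
    encode: Callable[[str], List[int]],
    text_field: str = "text",      # hypothetical default field name
    tokens_field: str = "tokens",  # hypothetical default field name
) -> Iterator[Dict[str, object]]:
    """Yield samples with a tokens field, skipping failed or empty ones."""
    for index, sample in enumerate(samples):
        try:
            tokens = encode(sample[text_field])
        except Exception as e:  # broad on purpose, mirroring the diff
            # Assumption: a sample that fails to tokenize is skipped.
            logger.warning("failed to tokenize sample %d: %s", index, e)
            continue
        if not tokens:
            # An empty token list would give downstream code nothing to
            # train on, so drop the sample instead of yielding it.
            logger.warning("skipping empty sample at index %d", index)
            continue
        yield {**sample, tokens_field: tokens}


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: one 'token' (the word length) per word."""
    return [len(word) for word in text.split()]


if __name__ == "__main__":
    data = [{"text": "hello world"}, {"text": ""}, {"text": "a"}]
    for item in iter_tokenized(data, toy_encode):
        print(item)
```

The sketch logs the sample's index instead of the token list, because an empty list carries no information about which record was dropped. Whether the real loader has an index or identifier available at that point is not visible in the diff.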
