Can Orion handle training a 2TB dataset? #567

Open
bigmisspanda opened this issue Sep 19, 2024 · 4 comments
Labels
question Further information is requested

Comments

@bigmisspanda

bigmisspanda commented Sep 19, 2024

  • Orion version: latest
  • Python version: 3.11
  • Operating System: Linux

Description

In my case, the training data is very large and cannot be loaded into memory all at once. It seems that time_segments_aggregate, SimpleImputer, MinMaxScaler, and rolling_window_sequences in the pipeline all require the data to be stored in memory. Can Orion handle training a 2-10TB dataset?

@sarahmish
Collaborator

Hi @bigmisspanda – thank you for your question!

You are right, all the preprocessing primitives require the data to be in memory.

One workaround is to replace these primitives with your own scalable functions and then start the Orion pipeline directly from the modeling primitive. Another is to chunk up your training data and train the pipeline on each chunk.
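
For the chunking route, a minimal sketch, assuming the training data lives in a single large CSV in Orion's expected timestamp/value format (the file name, chunk size, and pipeline name below are placeholders, not something prescribed by Orion):

```python
import pandas as pd
from orion import Orion

CSV_PATH = "train.csv"       # hypothetical file with Orion's timestamp/value columns
ROWS_PER_CHUNK = 5_000_000   # pick a size that fits comfortably in memory

orion = Orion(pipeline="tadgan")  # pipeline name assumed; use whichever pipeline you run

# Stream the large file chunk by chunk and fit the pipeline on each chunk in turn.
# Whether successive fit() calls continue from the previous weights depends on the
# underlying primitives, so validate the behaviour on a small sample first.
for chunk in pd.read_csv(CSV_PATH, chunksize=ROWS_PER_CHUNK):
    orion.fit(chunk)
```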

@sarahmish sarahmish added the question Further information is requested label Sep 20, 2024
@bigmisspanda
Author

bigmisspanda commented Sep 24, 2024


Yes, thank you for your help. I understand what you mean. My plan is to use TadGAN to train an anomaly detection model. My data comes from power equipment sensors and has over 20 features. If I train in chunks, the MinMaxScaler will be fitted per chunk rather than on the global distribution. I referred to the information in this document,
[image: screenshot of the referenced document]
and my plan is (a rough sketch follows below):

  1. Use MinMaxScaler's partial_fit to compute the global statistics over the whole dataset in advance.
  2. Split the data into chunks.
  3. Remove the MinMaxScaler (the third step) from the pipeline primitives.
  4. Train the model on each chunk.

Is my approach feasible? Can TadGAN perform similar partial_fit-style training on continuous streaming data?
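
A rough two-pass sketch of steps 1-3, assuming the sensor data sits in one large CSV with a timestamp column plus the feature columns (the file name, chunk size, column names, and feature_range are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

CSV_PATH = "sensor_data.csv"   # hypothetical file: timestamp + 20+ sensor features
ROWS_PER_CHUNK = 1_000_000

# Pass 1: fit the scaler incrementally so the min/max are global, not per chunk.
scaler = MinMaxScaler(feature_range=(-1, 1))  # range assumed to match the pipeline's scaler
for chunk in pd.read_csv(CSV_PATH, chunksize=ROWS_PER_CHUNK):
    scaler.partial_fit(chunk.drop(columns=["timestamp"]))

# Pass 2: transform each chunk with the globally fitted scaler, then train the
# pipeline (with its own MinMaxScaler primitive removed) on the scaled chunk.
for chunk in pd.read_csv(CSV_PATH, chunksize=ROWS_PER_CHUNK):
    feature_cols = [c for c in chunk.columns if c != "timestamp"]
    chunk[feature_cols] = scaler.transform(chunk[feature_cols])
    # ... fit the scaler-free pipeline on `chunk` here
```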

@sarahmish
Collaborator

Your plan looks logical to me!

I'm not too familiar with what partial_fit does under the hood; however, calling fit multiple times on different data chunks seems analogous to their concept of "incremental learning".
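
For reference, a small standalone check (random data, nothing from this thread) illustrating that MinMaxScaler.partial_fit over chunks recovers the same statistics as a single fit over the full array, which is what scikit-learn calls incremental learning:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

full = np.random.rand(1000, 20)          # stand-in for a full dataset

scaler_full = MinMaxScaler().fit(full)   # fit on everything at once

scaler_inc = MinMaxScaler()              # fit incrementally on two chunks
scaler_inc.partial_fit(full[:500])
scaler_inc.partial_fit(full[500:])

# The running min/max end up identical to the single-pass fit.
assert np.allclose(scaler_full.data_min_, scaler_inc.data_min_)
assert np.allclose(scaler_full.data_max_, scaler_inc.data_max_)
```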

@bigmisspanda
Author


The concept of partial_fit is consistent with incremental learning. I will follow this approach for testing and training. Thank you for your great work!
