This project contains a set of examples that demonstrate how raw data in various formats can be loaded, split and preprocessed to build serializable (and hence reproducible) ETL pipelines using the DataVec library.
Go back to the main repository page to explore other features/functionality of the Eclipse Deeplearning4J ecosystem. File an issue here to request new features.
The examples in this project and what they demonstrate are briefly described below. This is also the recommended order to explore them in.
InputSplit and its implementations are utility classes for defining and managing a catalog of loadable locations (paths/files), in memory, that can later be exposed through an Iterator. In simple terms, they define where your data comes from or should be saved to, when building a data pipeline with DataVec.
- Ex01_FileSplitExample.java
  Uses FileSplit, which loads files from a given location. Constructor overloads support variations such as filtering which files to load and loading recursively.
- Ex02_CollectionSplitExample.java
  Creates a split from a collection of URIs.
- Ex03_NumberedFileInputSplitExample.java
  Creates a split from numbered files that follow a common pattern, such as file1.txt, file2.txt ... file100.txt.
- Ex04_TransformSplitExample.java
  Maps the URIs of a given split to new URIs. Useful when features and labels are in different files that share a naming scheme, so that the output file name can be derived from the input file name, e.g. a-in.csv and a-out.csv.
- Ex05_SamplingBaseInputSplitExample.java
  Generates several splits from the main split, for example for training, validation and testing.
- Ex06_KFoldIteratorFromDataSet.java
  Generates a K-Fold iterator from a dataset.
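To make the numbered-file and URI-mapping ideas above concrete, here is a minimal sketch in plain Java. It deliberately does not use DataVec: the class `SplitSketch` and its methods are hypothetical names chosen for illustration, and the `file:///data/...` paths are made up. It only shows the underlying idea of expanding a numbered pattern into a catalog of locations and deriving label locations from feature locations.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DataVec: illustrates the concepts only.
final class SplitSketch {

    /** Expands a pattern containing one %d into URIs for minIdx..maxIdx (inclusive). */
    static List<URI> numbered(String pattern, int minIdx, int maxIdx) {
        List<URI> out = new ArrayList<>();
        for (int i = minIdx; i <= maxIdx; i++) {
            out.add(URI.create(String.format(pattern, i)));
        }
        return out;
    }

    /** Maps each input URI to a new URI by literal search and replace. */
    static List<URI> searchReplace(List<URI> inputs, String search, String replace) {
        List<URI> out = new ArrayList<>();
        for (URI u : inputs) {
            out.add(URI.create(u.toString().replace(search, replace)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Feature files 1-in.csv .. 3-in.csv, with label files derived by name.
        List<URI> features = numbered("file:///data/%d-in.csv", 1, 3);
        List<URI> labels = searchReplace(features, "-in.csv", "-out.csv");
        for (int i = 0; i < features.size(); i++) {
            System.out.println(features.get(i) + " -> " + labels.get(i));
        }
    }
}
```

The numbered examples listed above show how DataVec's own split classes provide this behaviour, so they remain the reference for the actual API.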
- IrisCSVTransform.java
  A basic example that introduces important concepts like Schema and TransformProcess, using categoricalToInteger.
- CSVMixedDataTypesLocal.java
  Common preprocessing steps such as removing unnecessary columns, filtering on column values, replacing invalid values and parsing date-time values.
- CSVMixedDataTypes.java
  Same as the above, but with Apache Spark.
- PrintSchemasAtEachStep.java
  How to print the schema at each step, which is useful for debugging transforms in a complicated pipeline.
- IrisAnalysis.java
  Basic analysis of the dataset, saved and presented as an HTML file.
- IrisNormalizer.java
  Proper usage of preprocessors, using a min-max scaler.
- JoinExample.java
  Performs joins on datasets.
- PivotExample.java
  Combines multiple independent records by key.
- WebLogDataExample.java
  Preprocessing and aggregation operations on some web log data.
- CustomReduceExample.java
  Operations on some simple CSV data that involve a custom reduction.
- MultiOpReduceExample.java
  A reduce example with multiple operations on one column.
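As a rough illustration of two of the transforms mentioned above (categorical-to-integer conversion and min-max scaling), the following plain-Java sketch shows the arithmetic involved. It is not the DataVec implementation: `TransformSketch`, its method names and the category names used in `main` are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of DataVec: shows the underlying arithmetic only.
final class TransformSketch {

    /** Returns the position of a category within the ordered list of known categories. */
    static int categoricalToInteger(List<String> categories, String value) {
        int idx = categories.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        return idx;
    }

    /** Linearly rescales a column to [0, 1] using its own minimum and maximum. */
    static double[] minMaxScale(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            // A constant column has no range; map it to 0 to avoid dividing by zero.
            out[i] = range == 0.0 ? 0.0 : (column[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> species = Arrays.asList("setosa", "versicolor", "virginica");
        System.out.println(categoricalToInteger(species, "versicolor"));
        System.out.println(Arrays.toString(minMaxScale(new double[] {2.0, 4.0, 6.0})));
    }
}
```

In a real pipeline the minimum and maximum would be computed on the training data and reused for validation and test data; the sketch scales a single column in isolation to keep the idea visible.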
- CSVtoMapFileConversion.java
  A simple example of converting a CSV text file to the Hadoop MapFile format, for better performance and for the convenient randomization supported by MapFileRecordReader.
- SVMLightExample.java
  MNIST SVMLight example.
- ImagePipelineExample.java
  An image pipeline that also demonstrates using transforms to augment a small dataset.
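For readers unfamiliar with the SVMLight text format used by one of the examples above, each line holds a label followed by sparse `index:value` pairs, with feature indices starting at 1 and an optional trailing `#` comment. The sketch below parses one such line into a label and a dense feature array. It is a plain-Java illustration, not DataVec's record reader: `SvmLightSketch` and its methods are hypothetical, and details such as `qid:` tokens are not handled.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DataVec: parses a single SVMLight-style line.
final class SvmLightSketch {

    /** Drops any trailing '#' comment and splits the rest on whitespace. */
    private static String[] tokens(String line) {
        int hash = line.indexOf('#');
        String data = hash >= 0 ? line.substring(0, hash) : line;
        return data.trim().split("\\s+");
    }

    /** Returns the label, which is the first token on the line. */
    static double label(String line) {
        return Double.parseDouble(tokens(line)[0]);
    }

    /** Expands the sparse index:value pairs into a dense array of numFeatures entries. */
    static double[] features(String line, int numFeatures) {
        String[] t = tokens(line);
        double[] out = new double[numFeatures];
        for (int i = 1; i < t.length; i++) {
            int colon = t[i].indexOf(':');
            int index = Integer.parseInt(t[i].substring(0, colon));
            // Indices in the file are 1-based; the array is 0-based.
            out[index - 1] = Double.parseDouble(t[i].substring(colon + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "1 1:0.5 3:2.0 # example row";
        System.out.println(label(line));
        System.out.println(Arrays.toString(features(line, 4)));
    }
}
```

Features that do not appear on a line stay at zero, which is what makes the format compact for sparse data such as flattened images.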