
This tutorial shows how to use the Extended Vector Framework (EVF) to create a simple format plugin. The EVF has also been called the "row set framework" and the "new scan framework". Here we focus on using the framework. Other pages in this section provide background information for when you need features beyond those shown here.

The Log Plugin

The Drill log plugin is the focus of this tutorial. A simplified version of this plugin is explained in the Learning Apache Drill book. The version used here is the one which ships with Drill.

Plugin Design

Most format plugins are based on the "easy" framework. Plugins converted to the EVF still use the "easy" framework, but the implementation differs.

"Legacy" plugins are based on the idea of a "record reader" (a concept borrowed from Hive.) Unlike the hive record readers, Drill's never read a single record: they all read a batch of records. In EVF, the reader changes to be a "row batch reader" which implements a new interface.

In Drill 1.16 and earlier, the LogRecordReader uses the typical approach: it writes to value vectors through each vector's associated Mutator class.

Other readers are more clever: the "V2" text reader (Drill 1.16 and earlier) worked with direct memory itself, handling its own buffer allocation, offset vector calculations and so on.

With the EVF, we'll replace the Mutator with a ColumnWriter. We'll first do the simplest possible conversion, then look at how to use advanced features, such as type conversions, schema and table properties.

Let's work through the needed changes one by one.

Revise the Plugin Definition

Prior to the EVF, Easy format plugins were based on the original ScanBatch. At the start of execution, the plugin creates a set of record readers which are passed to the scan batch. With EVF, we use the new scan framework. The new framework focuses on batches, and uses a new type of reader called a "batch reader." We provide a factory method to create batch readers on the fly. The batch reader itself does what the old record reader used to do, but using the EVF.

Simplify the Plugin Definition

Without the EVF, plugins must pass a bunch of options to the EasyFormatPlugin base class, and must define a number of methods to further define behavior. Here is the Drill 1.16 LogFormatPlugin version:

  public LogFormatPlugin(String name, DrillbitContext context,
                         Configuration fsConf, StoragePluginConfig storageConfig,
                         LogFormatConfig formatConfig) {
    super(name, context, fsConf, storageConfig, formatConfig,
        true,  // readable
        false, // writable
        true,  // blockSplittable
        true,  // compressible
        Lists.newArrayList(formatConfig.getExtension()),
        DEFAULT_NAME);
    this.formatConfig = formatConfig;
  }

  @Override
  public boolean supportsPushDown() {
    return true;
  }

  @Override
  public int getReaderOperatorType() {
    return UserBitShared.CoreOperatorType.REGEX_SUB_SCAN_VALUE;
  }

  @Override
  public int getWriterOperatorType() {
    throw new UnsupportedOperationException("unimplemented");
  }

  @Override
  public boolean supportsStatistics() {
    return false;
  }

So, the first step is to convert this to the EVF version:

  public LogFormatPlugin(String name, DrillbitContext context,
                         Configuration fsConf, StoragePluginConfig storageConfig,
                         LogFormatConfig formatConfig) {
    super(name, easyConfig(fsConf, formatConfig), context, storageConfig, formatConfig);
  }

  private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) {
    EasyFormatConfig config = new EasyFormatConfig();
    config.readable = true;
    config.writable = false;
    config.blockSplittable = true;
    config.compressible = true;
    config.supportsProjectPushdown = true;
    config.extensions = Lists.newArrayList(pluginConfig.getExtension());
    config.fsConf = fsConf;
    config.defaultName = DEFAULT_NAME;
    config.readerOperatorType = CoreOperatorType.REGEX_SUB_SCAN_VALUE;
    return config;
  }

The other methods shown earlier (and a few more "boilerplate" methods) can be removed: the base class now handles them, driven by the EasyFormatConfig created above. The only other method that should still exist is getRecordReader().

This change is orthogonal to the other changes: we should now be able to compile and run the unit tests for the plugin.

Tidy Up the Plugin Class

For this plugin, we can also remove the formatConfig member, since the base class already tracks that for us.
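
If any remaining code referenced the removed field, it can use the accessor the base class provides instead. A minimal illustration, assuming EasyFormatPlugin<LogFormatConfig> exposes the typed config through getConfig(), as it does in Drill 1.16:

  // Before: the plugin kept its own copy of the config:
  //   this.formatConfig.getExtension();
  // After: use the copy the base class already tracks:
  getConfig().getExtension();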

This is also a good time to tidy up the plugin name. In LogFormatPlugin change

  public static final String DEFAULT_NAME = "logRegex";

to

  public static final String PLUGIN_NAME = "logRegex";

Then, in LogFormatConfig, change

@JsonTypeName("logRegex")

to

@JsonTypeName(LogFormatPlugin.PLUGIN_NAME)

Define the Record Batch Creator

We'll now start making some breaking changes; the code won't build during the next several steps.

The Drill 1.16 plugin creates record readers as follows:

  @Override
  public RecordReader getRecordReader(FragmentContext context,
      DrillFileSystem dfs, FileWork fileWork, List<SchemaPath> columns,
      String userName) throws ExecutionSetupException {
    return new LogRecordReader(context, dfs, fileWork,
        columns, userName, formatConfig);
  }

The EVF instead uses batch readers, which the plugin creates on the fly through a factory method.
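
Here is a sketch of what that factory method can look like; the signature comes from the EVF version of EasyFormatPlugin, so check the version that ships with your Drill release, and treat the constructor argument as illustrative:

  @Override
  public ManagedReader<? extends FileSchemaNegotiator> newBatchReader(
      EasySubScan scan, OptionManager options) throws ExecutionSetupException {
    // The EVF calls this once per reader, rather than taking a
    // pre-built list of record readers as the old scan batch did.
    return new LogBatchReader(getConfig());  // hypothetical constructor
  }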

The LogBatchReader doesn't exist yet, so our next step is to create it from our record reader.

Convert the Record Reader to a Batch Reader

Rename the existing record reader from LogRecordReader to LogBatchReader. Then, since ManagedReader is an interface rather than a base class, change the class declaration from:

public class LogRecordReader extends AbstractRecordReader {

To

public class LogBatchReader implements ManagedReader<FileSchemaNegotiator> {

The FileSchemaNegotiator is a new concept: we will use it to define the schema of our file as we learn the schema on read.
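
For reference, here is the shape of the ManagedReader contract, sketched from the Drill 1.16 EVF sources (see org.apache.drill.exec.physical.impl.scan.framework for the authoritative version). The EVF calls open() once, next() once per batch until it returns false, and close() at the end of the scan:

  public interface ManagedReader<T extends SchemaNegotiator> {
    boolean open(T negotiator);  // set up the reader; false means "no data"
    boolean next();              // load one batch; false means EOF
    void close();                // release files, buffers and so on
  }

In open(), the batch reader declares its schema to the negotiator and obtains the ResultSetLoader that feeds the column writers described below. A sketch, assuming hypothetical schema and loader fields:

  @Override
  public boolean open(FileSchemaNegotiator negotiator) {
    // The log reader's schema is known up front from the regex config.
    negotiator.setTableSchema(schema, true);
    loader = negotiator.build();
    return true;  // real code would also open the input file here
  }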

Convert to Use Column Writers

Current Implementation

The log reader code uses a ColumnDefn class to convert from the String value provided by the regex parser to the Java type needed by the Mutator. For example, for a nullable VarChar vector:

  private static class VarCharDefn extends ColumnDefn {

    private NullableVarCharVector.Mutator mutator;

    public VarCharDefn(String name, int index) {
      super(name, index);
    }

    @Override
    public void define(OutputMutator outputMutator) throws SchemaChangeException {
      MaterializedField field = MaterializedField.create(getName(),
          Types.optional(MinorType.VARCHAR));
      mutator = outputMutator.addField(field, NullableVarCharVector.class).getMutator();
    }

    @Override
    public void load(int rowIndex, String value) {
      byte[] bytes = value.getBytes();
      mutator.setSafe(rowIndex, bytes, 0, bytes.length);
    }
  }

Revised Implementation

The simplest way to convert this reader is to keep the ColumnDefn structure shown above, but replace the implementation to use the row set mechanisms:

  private static class VarCharDefn extends ColumnDefn {

    private ScalarWriter colWriter;

    public VarCharDefn(String name, int index) {
      super(name, index);
    }

    @Override
    public void define(ResultSetLoader loader) {
      int colIndex = loader.addColumn(getName(), Types.optional(MinorType.VARCHAR));
      colWriter = loader.writer().scalar(colIndex);
    }

    @Override
    public void load(int rowIndex, String value) {
      colWriter.setString(value);
    }
  }

Some things to note:

  • We use a ScalarWriter in place of the Mutator. (There are various kinds of writers for arrays, for maps and so on. Here, we only need simple scalars.)
  • We use the ResultSetLoader to define a new column on the fly and to obtain the writer for that column.
  • We use the setString() method of the writer to set the string. Note that we no longer need to specify the row position; the EVF tracks that for us. (A sketch of the batch-reader loop that drives these writers appears below.)
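
For context, here is a sketch of the batch reader's next() method that drives these writers (rowWriter comes from the ResultSetLoader; nextLine() and loadRow() are hypothetical helpers standing in for the log reader's regex matching):

  @Override
  public boolean next() {
    RowSetLoader rowWriter = loader.writer();
    while (!rowWriter.isFull()) {  // the EVF decides when the batch is full
      String line = nextLine();    // hypothetical: read one line of input
      if (line == null) {
        return false;              // EOF: this is the last batch
      }
      rowWriter.start();           // begin a new row
      loadRow(line);               // calls load() on each ColumnDefn
      rowWriter.save();            // commit the row
    }
    return true;                   // batch full; more data remains
  }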

Next, we make the same changes for the other column defns; they are not shown here for brevity. You can see the new form in [this branch](need link).

Test

Next Steps
