Paul Rogers edited this page May 16, 2019 · 16 revisions

This tutorial shows how to use the Extended Vector Framework (EVF) to create a simple format plugin. The EVF has also been called the "row set framework" and the "new scan framework". Here we focus on using the framework. Other pages in this section provide background information for when you need features beyond those shown here.

The Log Plugin

The Drill log plugin is the focus of this tutorial. A simplified version of this plugin is explained in the Learning Apache Drill book. The version used here is [the one which ships with Drill|https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/log].

Current Design

In Drill 1.16 and earlier, the LogRecordReader writes to value vectors in the typical way: through the {{Mutator}} class associated with each vector. For example, for a nullable VarChar vector:

  private static class VarCharDefn extends ColumnDefn {

    private NullableVarCharVector.Mutator mutator;

    public VarCharDefn(String name, int index) {
      super(name, index);
    }

    @Override
    public void define(OutputMutator outputMutator) throws SchemaChangeException {
      MaterializedField field = MaterializedField.create(getName(),
          Types.optional(MinorType.VARCHAR));
      mutator = outputMutator.addField(field, NullableVarCharVector.class).getMutator();
    }

    @Override
    public void load(int rowIndex, String value) {
      byte[] bytes = value.getBytes();
      mutator.setSafe(rowIndex, bytes, 0, bytes.length);
    }
  }

Other readers are more clever: the "V2" text reader (Drill 1.16 and earlier) worked with direct memory itself, handling its own buffer allocation, offset vector calculations and so on.

The log reader code uses a {{ColumnDefn}} class to convert from the String value provided by the regex parser to the Java type needed by the {{Mutator}}.
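The conversion pattern can be illustrated without any Drill classes. The following is a minimal sketch, not the plugin's actual code: `ColumnConverter`, `TextConverter`, `IntConverter` and `LogLineConverter` are hypothetical stand-ins for the {{ColumnDefn}} idea, and values are collected into a plain list rather than written to vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for ColumnDefn: one converter per regex capture group.
abstract class ColumnConverter {
  final String name;
  final int index;  // zero-based column index; regex groups are one-based

  ColumnConverter(String name, int index) {
    this.name = name;
    this.index = index;
  }

  // Convert the raw text captured by the regex into the column's Java type.
  abstract Object convert(String value);
}

class TextConverter extends ColumnConverter {
  TextConverter(String name, int index) { super(name, index); }

  @Override
  Object convert(String value) { return value; }
}

class IntConverter extends ColumnConverter {
  IntConverter(String name, int index) { super(name, index); }

  @Override
  Object convert(String value) { return Integer.parseInt(value); }
}

class LogLineConverter {
  // Returns one converted value per column, or null if the line does not match.
  static List<Object> convertLine(Pattern pattern, List<ColumnConverter> columns, String line) {
    Matcher m = pattern.matcher(line);
    if (!m.matches()) {
      return null;
    }
    List<Object> row = new ArrayList<>();
    for (ColumnConverter col : columns) {
      String raw = m.group(col.index + 1);
      // An optional group that did not participate in the match yields null.
      row.add(raw == null ? null : col.convert(raw));
    }
    return row;
  }
}
```

Each column owns its own conversion, so the line-parsing loop never needs to know the column types. In the real reader, the converted value is handed to the vector's {{Mutator}} instead of being added to a list.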

Revised Design

With the EVF, we'll replace the {{Mutator}} with a {{ColumnWriter}}. We'll first do the simplest possible conversion, then look at how to use advanced features, such as type conversions, schemas and table properties.

To use the EVF, we must also restructure the plugin: we switch to the new version of the "easy" plugin implementation, which defines the reader in terms of the new scan framework.

Revise the Plugin Definition

Prior to the EVF, Easy format plugins were based on the original ScanBatch. At the start of execution, the plugin creates a set of record readers which are passed to the scan batch. With EVF, we use the new scan framework. The new framework focuses on batches, and uses a new type of reader called a "batch reader." We provide a factory method to create batch readers on the fly. The batch reader itself does what the old record reader used to do, but using the EVF.
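To make the division of labor concrete, here is a minimal sketch of the batch-oriented pattern using plain Java collections. All names (`SimpleBatchReader`, `ReaderFactory`, `LinesBatchReader`, `SimpleScan`) are hypothetical and greatly simplified; they are not the EVF interfaces. The sketch assumes only what is described above: the scan asks a factory for readers one at a time, and each reader fills one batch per call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified reader contract: each next() call fills one batch.
interface SimpleBatchReader {
  boolean open();                    // false means "nothing to read, skip me"
  boolean next(List<String> batch);  // true if rows were added to the batch
  void close();
}

// Hypothetical factory: creates readers on the fly; null means no more readers.
interface ReaderFactory {
  SimpleBatchReader nextReader();
}

// A reader over an in-memory "file" of lines, emitting fixed-size batches.
class LinesBatchReader implements SimpleBatchReader {
  private final List<String> lines;
  private final int batchSize;
  private int pos;

  LinesBatchReader(List<String> lines, int batchSize) {
    this.lines = lines;
    this.batchSize = batchSize;
  }

  @Override
  public boolean open() {
    pos = 0;
    return true;
  }

  @Override
  public boolean next(List<String> batch) {
    int start = pos;
    int end = Math.min(pos + batchSize, lines.size());
    while (pos < end) {
      batch.add(lines.get(pos++));
    }
    return pos > start;
  }

  @Override
  public void close() { }
}

class SimpleScan {
  // Drives readers to completion one after another, collecting their batches.
  static List<List<String>> run(ReaderFactory factory) {
    List<List<String>> batches = new ArrayList<>();
    SimpleBatchReader reader;
    while ((reader = factory.nextReader()) != null) {
      if (!reader.open()) {
        continue;
      }
      try {
        while (true) {
          List<String> batch = new ArrayList<>();
          if (!reader.next(batch)) {
            break;
          }
          batches.add(batch);
        }
      } finally {
        reader.close();
      }
    }
    return batches;
  }
}
```

The point of the design is that the reader only knows how to fill a batch, while the scan loop owns reader creation, lifecycle and cleanup.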

Simplify the Plugin Definition

Without the EVF, plugins must pass a long list of options to the EasyFormatPlugin base class, and must override a number of methods to further define behavior. Here is the Drill 1.16 LogFormatPlugin version:

  public LogFormatPlugin(String name, DrillbitContext context,
                         Configuration fsConf, StoragePluginConfig storageConfig,
                         LogFormatConfig formatConfig) {
    super(name, context, fsConf, storageConfig, formatConfig,
        true,  // readable
        false, // writable
        true,  // blockSplittable
        true,  // compressible
        Lists.newArrayList(formatConfig.getExtension()),
        DEFAULT_NAME);
    this.formatConfig = formatConfig;
  }

  @Override
  public boolean supportsPushDown() {
    return true;
  }

  @Override
  public boolean supportsStatistics() {
    return false;
  }

So, the first step is to convert this to the EVF version:

((Insert code here))

Note that the code won't compile yet; we need to do the other conversion steps as well.

Define the Record Batch Creator

The Drill 1.16 plugin creates record readers as follows:

  @Override
  public RecordReader getRecordReader(FragmentContext context,
                                      DrillFileSystem dfs, FileWork fileWork, List<SchemaPath> columns,
                                      String userName) throws ExecutionSetupException {
    return new LogRecordReader(context, dfs, fileWork,
        columns, userName, formatConfig);
  }

The EVF uses batch readers, created using a batch reader creator:

((Insert code))

The LogBatchReader doesn't exist yet, so our next step is to create it from our record reader.

Convert the Record Reader to a Batch Reader

Rename the existing record reader from LogRecordReader to LogBatchReader. Then, replace the base class with the batch reader interface. Change:

public class LogRecordReader extends AbstractRecordReader {

to

public class LogBatchReader implements ManagedReader<FileSchemaNegotiator> {

The FileSchemaNegotiator is a new concept: we will use it to define the schema of our file as we learn the schema on read.

Convert to Use Column Writers

The simplest way to convert this reader is to continue to use the {{ColumnDefn}} structure shown earlier, but replace the implementation to use the row set mechanisms:

  private static class VarCharDefn extends ColumnDefn {

    private ScalarWriter colWriter;

    public VarCharDefn(String name, int index) {
      super(name, index);
    }

    @Override
    public void define(ResultSetLoader loader) {
      int colIndex = loader.addColumn(getName(), Types.optional(MinorType.VARCHAR));
      colWriter = loader.writer().scalar(colIndex);
    }

    @Override
    public void load(int rowIndex, String value) {
      colWriter.setString(value);
    }
  }

Some things to note:

  • We use a ScalarWriter in place of the Mutator. (There are various kinds of writers for arrays, for maps and so on. Here, we only need simple scalars.)
  • We use the ResultSetLoader to define a new column on the fly and to obtain the writer for that column.
  • We use the setString() method of the writer to set the string. Note that we no longer need to specify the row position; the EVF tracks that for us.
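The last point, that the writer rather than the caller tracks the row position, can be shown with a small stand-alone sketch. `SketchRowWriter` and `SketchScalarWriter` are hypothetical classes written for this illustration; they store strings in plain arrays and are not the EVF writers. The `start()` and `save()` calls that bracket each row are likewise an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row writer: it, not the caller, owns the current row position.
class SketchRowWriter {
  private final int columnCount;
  private final List<String[]> rows = new ArrayList<>();
  private String[] current;

  SketchRowWriter(int columnCount) {
    this.columnCount = columnCount;
  }

  // Begin a new row; columns that are never set stay null.
  void start() {
    current = new String[columnCount];
  }

  // Commit the current row and advance the row position.
  void save() {
    rows.add(current);
    current = null;
  }

  // Obtain a column writer once, then reuse it for every row.
  SketchScalarWriter scalar(int colIndex) {
    return new SketchScalarWriter(this, colIndex);
  }

  void set(int colIndex, String value) {
    if (current == null) {
      throw new IllegalStateException("call start() before writing a value");
    }
    current[colIndex] = value;
  }

  int rowCount() {
    return rows.size();
  }

  String get(int rowIndex, int colIndex) {
    return rows.get(rowIndex)[colIndex];
  }
}

// Hypothetical column writer: note that setString() takes no row index.
class SketchScalarWriter {
  private final SketchRowWriter row;
  private final int colIndex;

  SketchScalarWriter(SketchRowWriter row, int colIndex) {
    this.row = row;
    this.colIndex = colIndex;
  }

  void setString(String value) {
    row.set(colIndex, value);
  }
}
```

Compare this with the {{Mutator}} version of load(), where every call had to pass rowIndex explicitly and an off-by-one error in the caller would silently write to the wrong row.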

Next, we make the same changes to the other {{ColumnDefn}} subclasses, which are not shown here for brevity. You can see the new form in [this branch](need link) NEED LINK.

Test

Next Steps
