| Status | Obsolete |
|---|---|
| RFC # | 266 |
| Author(s) | Anna Revinskaya (annarev@google.com), Jeremy Lau (lauj@google.com) |
| Sponsor | Jeremy Lau (lauj@google.com) |
| Updated | 2020-09-09 |
This proposal focuses on getting a majority of "well-behaved" TensorFlow ops running efficiently on mobile devices by removing the need to execute them via the TensorFlow eager runtime, instead calling kernels directly from the new TFRT TensorFlow runtime.
Note that there is an effort to call existing kernels by delegating to TensorFlow eager runtime instead. This approach is called Runtime Fallback. The goals of the two fallback mechanisms are as follows:
- Runtime Fallback aims to reuse all current TensorFlow kernels in TFRT.
- Kernel Fallback (focus of this document) aims to get a large number of existing kernels working in TFRT while reducing binary size to support mobile devices.
[Figure: side-by-side diagrams of Runtime Fallback and Kernel Fallback.]
High level goals of the project:
- Call existing kernels from new TensorFlow runtime
- Reduce size and overhead to make this a feasible option for mobile
We address the first goal by implementing a new fallback mechanism that directly calls TensorFlow kernels without going through Eager runtime first. We plan to address the second high level goal by trimming down dependencies, switching to more compact proto representation, etc.
Note that TensorFlow's current mobile solution is called TensorFlow Lite. At the same time, there is a work-in-progress effort to enable TFRT to run on mobile. This document focuses on the way TFRT would call kernels when running on mobile devices. Details of the way TFRT itself would be executed on mobile platforms are outside of the scope of this document.
First of all, we plan to target all the easier-to-support ops that don’t require implementing extensive pieces of infrastructure.
We analysed how many kernels we can support in the future and include our
findings in the following spreadsheets. As we describe in
Design Proposal below, Kernel Fallback depends on
customizing
OpKernelConstruction
and
OpKernelContext
classes. Number of supported kernels will depend on the surface we manage to
customize. (Note that I have already started prototyping the implementation that
includes a few common methods such as input, output. The spreadsheet below
considers these methods to be already supported).
- List of kernels and `OpKernelConstruction`/`OpKernelContext` methods they require: here
- Proposed implementation order for these methods: here
Based on these estimates, we can support >= 423 kernels. Note that this number
is just based on the OpKernelConstruction/OpKernelContext coverage that we
can provide. It doesn't take into consideration other issues we might face.
We want to support executing a BEF file on mobile device that calls kernels using Kernel Fallback mechanism. Users will be able to generate a BEF file based on a saved model and we will provide a script to create it.
We might also want to support running ops using TFRT eager mode (that is, add a custom OpHandler).
- Supporting all existing ops. `OpKernelContext` surface is quite large and implementing all of it would require a significant amount of time. Instead, we will start by adding most common and easy functionality. If certain functionality is only used by a handful of kernels, it might make sense to implement TFRT native kernels or rely on runtime fallback instead. One notable example is ResourceMgr. We might support it later, but it is definitely not first priority due to extra effort required.
- Gradients would not be supported by the first iteration of Kernel Fallback, but we might revisit it later.
- Exact details of TFRT integration are still being worked out by TFRT and TensorFlow mobile teams. Since these teams might change the plan, exact details are not a part of this doc. The takeaway is that we will integrate kernel fallback following the approach they decide on.
Currently, TF Lite supports a limited set of ops. As the range and variety of applications grows, it becomes essential to grow the pool of available ops on mobile devices, ideally supporting everything that fully-fledged TensorFlow supports now.
However, supporting TensorFlow ops on mobile devices presents some challenges. Specifically, binary size on mobile platforms should be restricted. TensorFlow mobile team provided us with the following ideal numbers:
- 100-200k overhead to call TF kernels
- 20k / kernel marginal size
To get closer to the size restrictions we plan to define a call path from TFRT to TensorFlow kernels that minimizes the amount of generated code.
Running more kernels on mobile devices would allow TensorFlow users to implement a wider range of models for mobile devices. Reduced binary size will also benefit users that currently use TensorFlow Lite's experimental [TensorFlow Select ops](https://www.tensorflow.org/lite/guide/ops_select), as well as users that avoid the experimental feature because of its binary size.
We propose to call the kernel’s Compute method directly from TFRT without going through TensorFlow Eager C API first. We introduce kernel context and registration implementation that support core kernel functionality with minimal dependencies.
High-level diagram of the proposed design:
We will use a separate registry for kernels supported by TFRT forwarding. To do
so, we will define a TFRTOpKernelFactories class that would keep a map from
kernel name to a list of registrations.
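As an illustration of the lookup behaviour such a registry needs, here is a minimal, self-contained sketch. Every name in it (`ToyKernel`, `ToyConstruction`, `ToyKernelReg`, `ToyKernelFactories`, `ToyAddN`, `MakeToyRegistry`) is a hypothetical stand-in rather than part of the prototype, and attributes are reduced to a name-to-type-string map. It only demonstrates the two failure cases: an unknown kernel name, and type constraints that no registration satisfies (including a constrained attribute that is missing).

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a kernel object.
struct ToyKernel {
  virtual ~ToyKernel() = default;
  virtual std::string Name() const = 0;
};

// Hypothetical stand-in for construction data: attribute name -> type name.
struct ToyConstruction {
  std::map<std::string, std::string> attrs;
};

// One registration: a factory plus the type constraints it was registered with.
struct ToyKernelReg {
  std::function<std::unique_ptr<ToyKernel>(const ToyConstruction&)> factory;
  std::map<std::string, std::string> type_constraints;
};

class ToyKernelFactories {
 public:
  void RegisterFactory(const std::string& name, ToyKernelReg reg) {
    factories_[name].push_back(std::move(reg));
  }

  // Returns nullptr if the name is unknown or if no registration's
  // constraints match. A constraint on an absent attribute does not match.
  std::unique_ptr<ToyKernel> CreateKernel(const std::string& name,
                                          const ToyConstruction& c) const {
    auto it = factories_.find(name);
    if (it == factories_.end()) return nullptr;
    for (const ToyKernelReg& reg : it->second) {
      if (Matches(reg, c)) return reg.factory(c);
    }
    return nullptr;
  }

 private:
  static bool Matches(const ToyKernelReg& reg, const ToyConstruction& c) {
    for (const auto& constraint : reg.type_constraints) {
      auto attr = c.attrs.find(constraint.first);
      if (attr == c.attrs.end() || attr->second != constraint.second) {
        return false;
      }
    }
    return true;
  }

  std::map<std::string, std::vector<ToyKernelReg>> factories_;
};

// Toy kernel used to exercise the registry.
struct ToyAddN : ToyKernel {
  std::string Name() const override { return "AddN<i32>"; }
};

// Builds a registry with a single AddN registration constrained to T == i32.
inline ToyKernelFactories MakeToyRegistry() {
  ToyKernelFactories registry;
  ToyKernelReg reg;
  reg.factory = [](const ToyConstruction&) -> std::unique_ptr<ToyKernel> {
    return std::make_unique<ToyAddN>();
  };
  reg.type_constraints["T"] = "i32";
  registry.RegisterFactory("AddN", std::move(reg));
  return registry;
}
```

With this toy registry, `CreateKernel("AddN", ...)` succeeds only when the construction data carries `T` set to `i32`. The declaration proposed for TFRT has the same shape, with `llvm::StringMap` as the container: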
class TFRTOpKernelFactories {
public:
TFRTOpKernelFactories();
void RegisterFactory(StringPiece kernel_class_name,
TFRTOpKernelReg kernel_info);
// Creates a kernel with the given name and passes op_kernel_construction
// to kernel constructor.
// Returns the constructed kernel on success.
// In case of failure, returns a nullptr. Kernel creation can fail in one
// of the following cases:
// 1. Kernel with the given name is not found.
// 2. Attributes in op_kernel_construction don't match type constraints
// for any of the kernels with this name.
// Note that we consider a constraint to be "not matched" if the attribute
// it applies to is not in op_kernel_construction.
std::unique_ptr<TFRTOpKernel> CreateKernel(
StringPiece kernel_class_name,
TFRTOpKernelConstruction* op_kernel_construction) const;
private:
llvm::StringMap<std::vector<TFRTOpKernelReg>> factories_;
};
extern llvm::ManagedStatic<TFRTOpKernelFactories> fallback_kernel_factories;

Similar to the current TensorFlow kernel registration, we will introduce a
registration macro that adds a kernel to TFRTOpKernelFactories.
#define REGISTER_FALLBACK_KERNEL(name, ...) \
REGISTER_FALLBACK_KERNEL_UNIQ_HELPER(__COUNTER__, name, __VA_ARGS__)
#define REGISTER_FALLBACK_KERNEL_UNIQ_HELPER(ctr, name, ...) \
REGISTER_FALLBACK_KERNEL_UNIQ(ctr, name, __VA_ARGS__)
#define REGISTER_FALLBACK_KERNEL_UNIQ(ctr, name, ...) \
static bool global_fallback_kernel_##ctr##_registered_ = []() { \
::tensorflow::fallback_kernel_factories->RegisterFactory( \
name, TFRTOpKernelReg([](TFRTOpKernelConstruction* construction) \
-> std::unique_ptr<TFRTOpKernel> { \
return std::make_unique<__VA_ARGS__>(construction); \
})); \
return true; \
}();

To support type specification, we will also provide a minimal Op registry and
corresponding macro REGISTER_KERNEL_FALLBACK_OP. Sample implementation:
// TFRTOpMetaBuilder class will provide ways to set input, output and
// attribute specifications.
class TFRTOpMetaBuilder {
public:
explicit TFRTOpMetaBuilder(StringPiece op_name);
TFRTOpMetaBuilder& Output(StringPiece output_spec);
...
};
// Registration will add the op to a static map.
class TFRTOpRegisterer {
public:
TFRTOpRegisterer(const TFRTOpMetaBuilder& op_builder);
};
#define REGISTER_KERNEL_FALLBACK_OP(name) \
REGISTER_KERNEL_FALLBACK_OP_UNIQ(__COUNTER__, name)
#define REGISTER_KERNEL_FALLBACK_OP_UNIQ(ctr, name) \
static TFRTOpRegisterer global_fallback_op_meta_builder_##ctr##_ = \
TFRTOpMetaBuilder(name)

Usage example:

REGISTER_KERNEL_FALLBACK_OP("AddN").Output("out: int32");

TensorFlow kernels inherit from the OpKernel class and depend on two key classes: OpKernelConstruction and OpKernelContext. We want to provide custom implementations of these two classes in terms of data we get from TFRT (e.g. inputs, attributes).
There are two main approaches to customize class implementations:
- Use inheritance and define common interfaces.
- Use templates.
We ran multiple benchmarks to get an idea of the trade offs between inheritance and templating approaches. Key findings are summarized below:
- Time difference negligible for full model benchmarks.
- A simple scalar op benchmark with Kernel Fallback (runs scalar multiplication, division, addition) was only 0.3% slower on mobile with inheritance compared to templates. The benchmark was run on a real device (Pixel 3) with ABI: arm64-v8a and SDK version: 29.
- basic_ops_benchmark with inheritance was originally measured to be significantly slower: ~7% (median). However, we determined that the regression goes away if we use `final` keywords. (More details in Appendix 2.)
- Binary size increase when using templates compared to inheritance is estimated at 2.6% (based on adding the `AddN` op).

Right now, we are leaning towards using inheritance, since the latency increase does not appear to be significant. (See more details in Appendix 2.)
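The two options can be compared on a toy example. The sketch below is not the benchmarked code: `ToyContextInterface`, `ToyTfrtContext`, `SumViaInterface` and `SumViaTemplate` are hypothetical names, and the "kernel" merely sums its inputs. It shows where the virtual dispatch sits in the inheritance variant and why the template variant compiles one kernel body per context type.

```cpp
#include <utility>
#include <vector>

// Hypothetical, heavily reduced context interface (inheritance approach).
class ToyContextInterface {
 public:
  virtual ~ToyContextInterface() = default;
  virtual int num_inputs() const = 0;
  virtual float input(int index) const = 0;
};

// `final` tells the compiler no further overrides exist, so calls made
// through this concrete type (including calls one member makes to another)
// can be devirtualized.
class ToyTfrtContext final : public ToyContextInterface {
 public:
  explicit ToyTfrtContext(std::vector<float> inputs)
      : inputs_(std::move(inputs)) {}
  int num_inputs() const override { return static_cast<int>(inputs_.size()); }
  float input(int index) const override { return inputs_[index]; }

 private:
  std::vector<float> inputs_;
};

// Inheritance: a single compiled kernel body that reaches the context
// through virtual calls on the interface.
inline float SumViaInterface(const ToyContextInterface& ctx) {
  float total = 0.0f;
  for (int i = 0; i < ctx.num_inputs(); ++i) total += ctx.input(i);
  return total;
}

// Templates: the kernel body is instantiated once per context type, which
// avoids the interface but duplicates code when several contexts are linked.
template <typename ContextT>
float SumViaTemplate(const ContextT& ctx) {
  float total = 0.0f;
  for (int i = 0; i < ctx.num_inputs(); ++i) total += ctx.input(i);
  return total;
}
```

Both functions compute the same result; the difference is only in how many copies of the kernel body end up in the binary and whether the calls go through a vtable.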
To use inheritance, we will define OpKernelConstructionInterface and
OpKernelContextInterface interfaces. Ideally, these interfaces should be pure
virtual. However, there will be some exceptions, e.g. the templated eigen_device method
that calls per-device pure-virtual implementations.
We will then introduce TFRTOpKernelConstruction and TFRTOpKernelContext
subclasses that implement OpKernelConstructionInterface and
OpKernelContextInterface in terms of TFRT data structures. Here's an example of what
TFRTOpKernelConstruction might look like:
class TFRTOpKernelConstruction final : public OpKernelConstructionInterface {
public:
explicit TFRTOpKernelConstruction(AttrMap attributes);
~TFRTOpKernelConstruction() override {};
Status GetAttr(StringPiece attr_name, int32* value) const override;
Status GetAttr(StringPiece attr_name, DataType* value) const override;
void CtxFailure(const Status& s);
void CtxFailureWithWarning(const Status& s);
void CtxFailure(const char* file, int line, const Status& s);
void CtxFailureWithWarning(const char* file, int line, const Status& s);
...
};

When running Kernel Fallback, we instantiate the kernel interfaces with TFRT’s lightweight OpKernel definitions rather than TensorFlow’s heavyweight OpKernel definitions.
Example AddN kernel implementation using these new interfaces:
class AddNOp : public OpKernelBase {
public:
explicit AddNOp(OpKernelConstructionInterface* construction) :
OpKernelBase(construction) {}
void Compute(OpKernelContextInterface* ctx) override {
if(!ctx->ValidateInputsAreSameShape(this)) return;
...

Here, OpKernelBase implementation will be minimal:
class OpKernelBase {
public:
explicit OpKernelBase(OpKernelConstructionInterface* context) {
}
virtual ~OpKernelBase() {}
virtual void Compute(OpKernelContextInterface* context) = 0;
};

(For details on how extending from OpKernelBase instead of OpKernel would work
with the current TensorFlow runtime, see Appendix 1.)
Corresponding .cc file then registers the kernel using the correct kernel and
context classes. For example, this is how we register AddN kernel with TFRT:
REGISTER_FALLBACK_KERNEL("AddN", AddNOp<CPUDevice, int32>);

We add a new TFRT BEF kernel called tfrt_fallback.kernel_fallback. This kernel directly
calls a TF kernel’s Compute method by creating TFRTOpKernel* data structures
that forward to corresponding TFRT concepts. For example, the following code
accesses an input in llvm::ArrayRef<tfrt::RCReference<tfrt::AsyncValue>> which
we get from TFRT:
const Tensor& TFRTOpKernelContext::input(int index) {
return inputs_[index]->get<Tensor>();
}

Simplified definition of tfrt_fallback.kernel_fallback:
// Instantiate a kernel. This would be a TensorFlow kernel converted to inherit
// from `OpKernelBase` instead of `OpKernel`.
std::unique_ptr<OpKernelBase> op = …;
// Create TFRTOpKernelContext. The variable exec_ctx here is the tfrt::ExecutionContext passed to the kernel handler.
TFRTOpKernelContext op_kernel_context(inputs, outputs.size(), op_meta, exec_ctx.host());
// Directly invoke the TF kernel's Compute() method.
op->Compute(&op_kernel_context);

We will be using the following conventions (essentially, these are based on Runtime Fallback work):
- Attributes are passed as key-value pairs, where both key and value are represented as strings.
- Types have a specific string representation. We are trying to use names consistent with BEF syntax as much as possible (e.g. `f32` represents `float`).
- Inputs and outputs have type `tensorflow::Tensor`. We will provide BEF kernels to construct these from BEF data (e.g. constant values).
Example of invoking Conv3D kernel:
%tft_c = "tfrt_fallback.kernel_fallback"(%tft_a, %tft_b) {
_op_name = "Conv3D",
attr1_name="data_format", attr1_value="string$NDHWC",
attr2_name="strides", attr2_value="list(i32)$1,1,1,1,1",
attr3_name="dilations", attr3_value="list(i32)$1,1,1,1,1",
attr4_name="padding", attr4_value="padding$SAME"}: (!tfd.tensor, !tfd.tensor) -> !tfd.tensor
For example, dilations attribute here has a value of [1, 1, 1, 1, 1].
Note: TFRT orders attributes by name, alphabetically, which is why we use attrN_value and attrN_name pattern pair.
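The `type$value` encoding used in the example can be decoded with very little code. The sketch below assumes that the first `$` separates the type name from the value and that list elements are comma-separated, as the Conv3D example suggests; `ParsedAttr`, `ParseAttr` and `ParseIntList` are hypothetical helpers, not the prototype's parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of an attribute value such as "list(i32)$1,1,1,1,1".
struct ParsedAttr {
  std::string type;   // text before the first '$'
  std::string value;  // text after the first '$'
};

// Splits "type$value" at the first '$'. Returns false when no '$' is present,
// in which case *out is left untouched.
inline bool ParseAttr(const std::string& encoded, ParsedAttr* out) {
  const std::string::size_type pos = encoded.find('$');
  if (pos == std::string::npos) return false;
  out->type = encoded.substr(0, pos);
  out->value = encoded.substr(pos + 1);
  return true;
}

// Parses a comma-separated list of integers, e.g. "1,1,1,1,1".
// std::stoi throws on malformed items; a real parser would report a Status.
inline std::vector<int> ParseIntList(const std::string& value) {
  std::vector<int> result;
  std::stringstream stream(value);
  std::string item;
  while (std::getline(stream, item, ',')) {
    result.push_back(std::stoi(item));
  }
  return result;
}
```

Applied to the `dilations` attribute above, `ParseAttr` yields the type `list(i32)` and `ParseIntList` yields five ones.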
TensorFlow currently reuses kernels instantiated for a particular node in a graph. It would be nice to have this optimization for Kernel fallback as well.
BEF executor keeps track of offsets within a BEF file. We can use this offset to cache corresponding kernel objects.
We should make sure that Kernel Fallback is thread safe when reusing kernel
objects since Compute for the same kernel can be called from multiple threads.
We can take a simple approach and support kernel cache only for stateless
kernels. Stateless kernels only update OpKernelContext and not OpKernel
state itself.
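A cache along these lines could look like the following sketch. It assumes the BEF offset is available as a 64-bit integer and that cached kernels are stateless; `ToyStatelessKernel` and `ToyKernelCache` are hypothetical names. The mutex protects only the map, and kernels are handed out as pointers to const so that concurrent `Compute` calls cannot mutate them.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a stateless kernel: Compute is const, so it
// cannot modify the kernel object itself.
struct ToyStatelessKernel {
  int Compute(int x) const { return x + 1; }
};

// Caches one kernel object per BEF offset.
class ToyKernelCache {
 public:
  using KernelPtr = std::shared_ptr<const ToyStatelessKernel>;
  using Factory = std::function<KernelPtr()>;

  // Returns the cached kernel for bef_offset, creating it on first use.
  KernelPtr GetOrCreate(std::uint64_t bef_offset, const Factory& factory) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = cache_.find(bef_offset);
    if (it != cache_.end()) return it->second;
    KernelPtr kernel = factory();
    cache_.emplace(bef_offset, kernel);
    return kernel;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mu_);
    return cache_.size();
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<std::uint64_t, KernelPtr> cache_;
};
```

Holding the lock while the factory runs keeps the sketch simple and guarantees a single construction per offset; a production cache might prefer finer-grained locking.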
Modular TensorFlow effort aims to break up giant monolithic TensorFlow binaries into smaller shared libraries. Specifically, James (@sjamesr) and Gunhan (@gunhan) looked at splitting kernels out of TensorFlow core. The initial Kernel C API definition is at kernel.h and its implementation is at kernel.cc.
Kernel Fallback should support kernels migrated to C API as well. We can implement this support behind the C API, so that we don’t have to update individual kernels.
There are a few important takeaways from current kernel C API implementation that will impact decisions in the document:
- We register a COpKernel object (with TensorFlow op kernel registry) for any kernel defined using the C API.
- `OpKernelContext` and `OpKernelConstruction` are passed around as opaque pointers on the C API surface (they get cast to `TF_OpKernelContext` and `TF_OpKernelConstruction` aliases).
- Most of the functions just provide accessors into `OpKernelContext`/`OpKernelConstruction` types.
Given current API structure, we can consider two approaches going forward:
- TFRT fully supports all functionality available in the C API. This way any kernel defined using the C API would be automatically available using either full TensorFlow or the TFRT kernel fallback.
- Certain functionality is only available with TF backend. TFRT C API implementation falls back to full TensorFlow in these cases.
I recommend that we prioritize option 1 and try to get it working (i.e. support all functionality with both TensorFlow and TFRT C API backend). It already takes a significant effort to support more kernels with C API, so we can put a little extra effort and make sure it is supported by both runtimes.
We propose to provide two implementations of the kernel C API. First implementation is the current one - implemented in terms of TensorFlow runtime. Second implementation will use TFRT Kernel Fallback instead. We can select between the two kernel C API implementations by adding a build config setting:
# Whether to use TFRT-based implementation of the kernel C API.
config_setting(
name = "tfrt_kernel_c_api",
define_values = {
"tfrt_kernel_c_api": "True",
},
)
Most of the kernel C API implementation will be the same between the two with a few notable exceptions:
- TFRT Kernel Fallback implementation will cast `TF_OpKernelContext` and `TF_OpKernelConstruction` to `TFRTOpKernelContext` and `TFRTOpKernelConstruction` respectively.
- TFRT Kernel Fallback implementation will use Kernel Fallback registration mechanism.
We plan to implement C API for TFRT kernel registration that calls TFRT Kernel Fallback registration mechanism. Note that this is analogous to TF Lite providing their own C API registration mechanism.
TF_KernelBuilder* TF_NewKernelBuilder(
const char* op_name, const char* device_name,
void* (*create_func)(TF_OpKernelConstruction*),
void (*compute_func)(void*, TF_OpKernelContext*),
void (*delete_func)(void*)) {
TF_KernelBuilder* result = new TF_KernelBuilder;
result->create_function = create_func;
result->compute_function = compute_func;
result->delete_function = delete_func;
return result;
}
void TF_RegisterKernelBuilder(const char* name,
TF_KernelBuilder* builder,
TF_Status* status) {
auto* create_fn = builder->create_function;
auto* compute_fn = builder->compute_function;
auto* delete_fn = builder->delete_function;
auto create_kernel = [create_fn, compute_fn, delete_fn](
TFRTOpKernelConstruction* construction) {
return std::make_unique<tensorflow::TFRTCOpKernel>(
construction, create_fn, compute_fn, delete_fn);
};
::tensorflow::TFRTOpKernelReg kernelinfo(create_kernel);
kernelinfo.type_constraints = builder->attr_to_type;
::tensorflow::fallback_kernel_factories->RegisterFactory(
name, kernelinfo);
tensorflow::TFRTOpRegisterer(tensorflow::TFRTOpMetaBuilder(name));
TF_DeleteKernelBuilder(builder);
TF_SetStatus(status, TF_OK, "");
}

The currently preferred direction is to generate a BEF file in advance and then run that file on a mobile device. The generated BEF file would have to call either native, TF Lite, runtime fallback or kernel fallback kernels and provide any glue logic in between (such as tensor conversions).
We also need to consider how kernel or runtime fallback will be selected. This could be a parameter at the BEF file creation step. It might also be good to package both runtime and kernel fallback implementations in a BEF file to be selected at runtime (packaging both is only relevant for non-mobile use cases since it would prevent us from reducing binary size).
Since we want to run on a mobile platform, we need to look for any opportunity
to cut down size. First of all, we remove dependency on current TensorFlow
runtime (for e.g. we no longer depend on NodeDef and OpDef protos). We are
also looking at ways to reduce large size contributions of
absl libraries and
protos.
We are currently investigating the following options:
- Switch to micropb. This proto implementation provides C interfaces and is more compact.
- Remove dependency on protos.
We can hide ABSL references behind aliases (see tensorflow::StringPiece for example) to make it easier to replace all references to save binary size.
@gunhan is also starting an effort to define a library of STL utilities that helps us cut down on binary size.
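The aliasing idea can be sketched in a few lines. The example uses `std::string_view` as a stand-in for the underlying type (an assumption made to keep the example free of external dependencies) and a hypothetical `toy` namespace with a hypothetical `CountChar` helper.

```cpp
#include <cstddef>
#include <string_view>

namespace toy {

// Single point of indirection: replacing the underlying string-view type
// only requires editing this alias, not every call site.
using StringPiece = std::string_view;

// Call sites depend on the alias rather than on the concrete library type.
inline std::size_t CountChar(StringPiece text, char c) {
  std::size_t count = 0;
  for (char ch : text) {
    if (ch == c) ++count;
  }
  return count;
}

}  // namespace toy
```

Code written against `toy::StringPiece` keeps compiling when the alias is pointed at a different, more compact implementation that offers the same operations.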
We want to add a script to build configurations that can determine required kernels based on a model. We would then only build these kernels. For now, we will only support selective registration when building from source.
Script details still need to be worked out.
The main alternative to TFRT Kernel Fallback is TFRT Runtime Fallback. TFRT Runtime Fallback will call TensorFlow Eager C API (corresponding RFC should be published soon). The main trade-offs between the two fallbacks are described in the table below:
| Property | TFRT Kernel Fallback | TFRT Runtime Fallback |
|---|---|---|
| Generality | Support subset of ops (for e.g. no resources*) | Support all ops |
| Scalability | Requires per-kernel updates | No kernel changes |
| Performance | Lower overhead | Higher overhead |
| Binary size | Lower (no TF runtime) | Higher |
* Long term we might support resources, but we consider them lower priority due to significant work involved.
- Slow down due to adding inheritance for `OpKernelContext` and `OpKernelConstruction`.
- Speed up for lighter weight kernel calls.
No new dependencies.
- Build / startup time / binary size will be impacted by additional code added to implement Kernel Fallback. At the same time one of the goals of Kernel Fallback is to provide a lower-binary-size way to run existing TensorFlow kernels on mobile platforms.
- Code will be maintained by TensorFlow DevInfra and TFRT teams.
- We have a Kernel Fallback prototype
- Prototype support for two kernels: `AddN` and `Conv3D`
- Current binary size estimates (based on Android arm64 build): 900k for framework and 200k per kernel per type (see Appendix 3).
- Finalize integration with TFRT.
- Convert a subset of TensorFlow kernels to support Kernel Fallback.
- Binary size small enough to run on mobile platforms.
- Increased kernel coverage on mobile platforms.
- Primarily geared towards mobile platforms but should work on non-mobile platforms as well.
- It might be preferable to implement future kernels so that they extend `OpKernelBase` and take `OpKernelConstructionInterface`/`OpKernelContextInterface`. This would allow new kernels to be used by Kernel Fallback. Currently, there is no plan to enforce it beyond providing advice at code review time.
- Would be useful to update Create an op documentation.
This proposal should not impact compatibility.
- There will be a new way to implement a kernel, but it will be optional. Current APIs should still work.
As discussed above, we want to convert (some) kernels to extend from
OpKernelBase instead of OpKernel. This lets us remove runtime-specific
information from kernel subclasses and lets us support both current and new
TensorFlow runtime.
However, TensorFlow runtime assumes that kernels extend OpKernel and support
all of its functionality. In other words, we want kernels to extend
OpKernelBase but be added to the existing TensorFlow registry as OpKernel
objects.
It seems easiest to me to wrap OpKernelBase with some class that extends OpKernel (I call this wrapper WrappedOpKernel below):
class WrappedOpKernel : public OpKernel {
public:
explicit WrappedOpKernel(OpKernelConstruction* context,
std::unique_ptr<OpKernelBase> impl)
: OpKernel(context), impl_(std::move(impl)) {}
void Compute(OpKernelContext* context) override {
impl_->Compute(context);
}
private:
std::unique_ptr<OpKernelBase> impl_;
};

Kernels of type WrappedOpKernel will be created with corresponding WrappedOpKernelFactory in TensorFlow:
struct WrappedOpKernelFactory : public OpKernelFactory {
explicit WrappedOpKernelFactory(
OpKernelBase* (*create_func)(OpKernelConstructionInterface*))
: create_func_(create_func) {}
OpKernel* Create(OpKernelConstruction* context) override;
OpKernelBase* (*create_func_)(OpKernelConstructionInterface*);
};
OpKernel* OpKernelRegistrar::WrappedOpKernelFactory::Create(
OpKernelConstruction* context) {
std::unique_ptr<OpKernelBase> impl((*create_func_)(context));
return new WrappedOpKernel(context, std::move(impl));
}

This approach has several benefits:
- Existing, non-converted kernels still extend `OpKernel`, no code change needed.
- Converted kernels registered with TensorFlow are still wrapped with OpKernel and therefore, TensorFlow runtime can access all fields currently supported by OpKernel.
- Converted kernels registered with TFRT only depend on `OpKernelBase` (for example, they do not have `NodeDef`-related properties that are not supported by TFRT).
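The wrapping pattern can be exercised end to end with toy types. In the sketch below every class is a hypothetical, reduced stand-in for its TensorFlow counterpart (`ToyOpKernel` for `OpKernel`, `ToyOpKernelContext` for `OpKernelContext`, and so on); it only demonstrates that a kernel written against the interface can be driven through the existing kernel type.

```cpp
#include <memory>
#include <utility>

// Reduced context interface that runtime-agnostic kernels are written against.
struct ToyContextInterface {
  virtual ~ToyContextInterface() = default;
  virtual void set_output(int value) = 0;
};

// Stand-in for the full runtime's context, which implements the interface.
struct ToyOpKernelContext final : ToyContextInterface {
  void set_output(int value) override { output = value; }
  int output = 0;
};

// Runtime-agnostic kernel base.
struct ToyOpKernelBase {
  virtual ~ToyOpKernelBase() = default;
  virtual void Compute(ToyContextInterface* ctx) = 0;
};

// Stand-in for the kernel type the existing runtime expects.
struct ToyOpKernel {
  virtual ~ToyOpKernel() = default;
  virtual void Compute(ToyOpKernelContext* ctx) = 0;
};

// Adapter: looks like ToyOpKernel to the runtime and forwards Compute to the
// wrapped runtime-agnostic implementation.
class ToyWrappedOpKernel final : public ToyOpKernel {
 public:
  explicit ToyWrappedOpKernel(std::unique_ptr<ToyOpKernelBase> impl)
      : impl_(std::move(impl)) {}
  void Compute(ToyOpKernelContext* ctx) override { impl_->Compute(ctx); }

 private:
  std::unique_ptr<ToyOpKernelBase> impl_;
};

// A converted kernel: it only knows about the interface.
struct ToyConstantKernel final : ToyOpKernelBase {
  void Compute(ToyContextInterface* ctx) override { ctx->set_output(7); }
};
```

The converted kernel never mentions `ToyOpKernel` or `ToyOpKernelContext`, which is the property that lets the same kernel source be registered with either runtime.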
This document proposes to have custom versions of OpKernel, OpKernelContext and OpKernelConstruction classes implemented in terms of TFRT primitives.
There are a few ways we can approach this implementation. OpKernel* classes can be customized using inheritance or templates.
Inheritance involves defining OpKernelBase base class and OpKernelConstructionInterface/OpKernelContextInterface interfaces. This approach is described in detail in the Kernel implementation section above.
Alternatively, we can customize kernel implementation using templates by adding a template header to each kernel (consequently, moving kernel implementations to header files).
Example of AddN kernel implementation with templates:
template <typename Device, typename T, class OpKernelT,
class OpKernelConstructionT, class OpKernelContextT>
class AddNOp : public OpKernelT {
public:
explicit AddNOp(OpKernelConstructionT* construction)
: OpKernelT(construction) {}
void Compute(OpKernelContextT* ctx) override {
if (!ctx->ValidateInputsAreSameShape(this)) return;
...

Note that this is the approach we originally planned to use; the actual AddN kernel implementation already follows this pattern.
Templates will be specialized at registration time:
REGISTER_FALLBACK_KERNEL(
"AddN",
AddNOp<CPUDevice, int32, TFRTOpKernel, TFRTOpKernelConstruction, TFRTOpKernelContext>);

| | Templates | Inheritance |
|---|---|---|
| Latency | Same | We expect increase due to vtable lookups. However, increase is negligible (0-2%) in our benchmarks when using `final` keywords * |
| Binary size (one implementation linked in) | Same | Same |
| Binary size (two implementations linked in) | Increase the most (2.6% estimate for AddN) | Increase in some cases** |
| Requires kernel changes | Yes (move to header, add template declaration) | Yes (add include, change OpKernel to OpKernelBase, OpKernel* to OpKernel*Interface) |
| Requires kernel changes for kernels *unsupported* by TFRT Kernel Fallback | No | No |
| Affects unconverted kernels | No | Yes (OpKernelConstruction/OpKernelContext now implement interfaces) |
* We initially measured a ~7% increase in latency for basic_ops_benchmark. This benchmark runs a series of scalar multiplications and divisions and primarily measures kernel overhead. However, we determined that declaring OpKernelContext and OpKernelConstruction final gets rid of this regression. final helps because a call made by a kernel is the tip of the iceberg: the called functions then make multiple calls to other functions in the same class. For example, the OpKernelContext::forward_input_or_allocate_output implementation calls >10 other functions in OpKernelContext.
** Increase will happen when we have intermediate subclass of OpKernel. For example, AveragePoolingOp extends UnaryOp and UnaryOp extends OpKernel. In this case, UnaryOp is the intermediate subclass. Now that a kernel can inherit either from OpKernel or OpKernelBase, we would need two implementations: UnaryOp and UnaryOpBase respectively. Kernels that support Kernel Fallback and inherit UnaryOp now will instead switch to inherit UnaryOpBase. Addition of UnaryOpBase increases binary size.
Currently we are thinking of proceeding with the inheritance approach as it doesn't seem to cause a significant performance regression based on our benchmarks.
Therefore, we expect that using inheritance would not add a noticeable overhead in most real world models. At the same time, inheritance can simplify code structure and debugging.
To benchmark size, we created a git branch that contains Kernel Fallback prototype: https://github.com/annarev/tensorflow/tree/kernel_fallback/tensorflow/core/tfrt_fallback/kernel (Note we had to make some other changes: branch comparison).
Android settings used when running ./configure:
- NDK: r18b
- NDK API level: 19
- Android build tools version: 30.0.1
- Android SDK API level: 28
We check size of a dependency by adding it to //tensorflow/lite/java:libtensorflowlite_jni.so target and running
bazel build -c opt tensorflow/lite/java/libtensorflowlite_jni.so --config=android_arm64 --define=disable_rtti_and_exceptions=true --define disable_eigen_mkldnn_contraction_kernel=true --define=TENSORFLOW_PROTOS=lite
ls -lh bazel-bin/tensorflow/lite/java/libtensorflowlite_jni.so
Findings are presented in the table below:
| Deps | Size |
|---|---|
| Existing TF Lite | 2.3M |
| Existing TF Lite + Kernel Fallback framework | 3.2M |
| Existing TF Lite + Kernel Fallback framework + 2 kernels* | 3.6M |
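* Kernels used for benchmarking: AddN registered for int32, Conv3D registered for int32.
Subtracting successive rows gives 3.2M - 2.3M = 0.9M for the framework and (3.6M - 3.2M) / 2 = 0.2M per kernel. Therefore, we estimate the following current size measurements: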
- Kernel Fallback framework: 900k
- Per-kernel per-type: 200k


