Architecture

Istemi Ekin Akkus edited this page Mar 30, 2021 · 3 revisions

This page describes the architecture of KNIX. At a high level, the KNIX platform consists of the following components.

1. Sandbox

This component provides the application-level sandbox for KNIX. Each application runs in one or more sandbox instances. Inside each sandbox, there are additional components that are responsible for various functionalities.

1.1 Sandbox agent

The sandbox agent is the main component that runs inside the application sandbox. It is initiated with relevant parameters, such as the owner of the sandbox (i.e., the user id), the sandbox and workflow ids, the addresses of other components that need to be contacted (e.g., the data layer service and ElasticSearch) and the endpoint key, so that it can signal the other components that it has successfully started.

The sandbox agent is responsible for downloading the workflow description as well as the required resources (i.e., function code) to start the application sandbox. Based on these resources, the sandbox agent also installs the necessary dependencies (e.g., pip dependencies for Python, Maven dependencies for Java). Next, it starts up the following components:

  • the sandbox frontend that is responsible for serving as an entry point to the sandbox,
  • for each state in the workflow, a function worker that is responsible for initializing the function code and its libraries and handling requests,
  • the queue service that is responsible for enabling local communication mechanisms for the functions,
  • the fluent-bit process that is responsible for collecting the logs of the sandbox and sending them to ElasticSearch.
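The startup sequence above can be sketched roughly as follows. This is an illustrative outline, not the actual sandbox agent code; all names (`start_sandbox`, the shape of the workflow description, the addresses) are hypothetical.

```python
# Hypothetical sketch of the sandbox agent's startup order: frontend,
# one function worker per workflow state, queue service, fluent-bit.
def start_sandbox(workflow_description):
    components = []

    # 1. Frontend: the HTTP entry point of the sandbox.
    components.append(("frontend", workflow_description["entry"]))

    # 2. One function worker per state in the workflow.
    for state in workflow_description["states"]:
        components.append(("function_worker", state["name"]))

    # 3. Local queue service for intra-sandbox communication.
    components.append(("queue_service", "localhost"))

    # 4. fluent-bit to ship the sandbox's logs to ElasticSearch.
    components.append(("fluent-bit", "elasticsearch"))

    return components

wf = {"entry": "start", "states": [{"name": "start"}, {"name": "end"}]}
started = start_sandbox(wf)
print([name for name, _ in started])
```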

1.2 Sandbox frontend

The sandbox frontend is responsible for providing an entry point to the workflow execution via HTTP requests. It receives an HTTP request, transforms it into a local queue service message and publishes it to the entry point of the workflow. It also subscribes to the "end" topic, where the final result is published when the workflow execution finishes. The frontend then transforms the result and sends it to the HTTP client that triggered the execution. In addition, the frontend performs similar actions on the control messages sent to session function instances running inside this particular sandbox instance. In Knative, the collection of sandboxes belonging to the same workflow has a single HTTP entry point, and requests are load balanced among all sandbox instances. On bare metal, each sandbox instance gets its own externally visible address and the clients are responsible for picking one at random (for now).
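The request/response round trip through the frontend can be illustrated with the following toy sketch. It is not the actual KNIX frontend; a plain dict of lists stands in for the local queue service, and the "workflow" is faked by copying the input to the "end" topic.

```python
# Illustrative sketch: HTTP request -> queue message -> "end" topic -> HTTP response.
from collections import defaultdict

topics = defaultdict(list)  # stand-in for the local queue service

def publish(topic, message):
    topics[topic].append(message)

def handle_http_request(entry_topic, body):
    # 1. Transform the HTTP body into a queue message and publish it
    #    to the workflow's entry point.
    publish(entry_topic, {"value": body})

    # (In the real system, function workers would now process the message;
    # here we fake the workflow by publishing a transformed value to "end".)
    publish("end", {"value": body.upper()})

    # 2. The frontend is subscribed to "end"; the final result appears there.
    result = topics["end"].pop(0)

    # 3. Transform the result back into an HTTP response for the client.
    return {"status": 200, "body": result["value"]}

response = handle_http_request("wf_entry", "hello")
print(response)  # {'status': 200, 'body': 'HELLO'}
```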

1.3 Function worker(s)

Each function in an application has a dedicated function worker. To ensure low-latency startup, the function worker first loads the user code and its libraries. It then subscribes to its own topic in the local queue service and waits for requests. When a request is received, the function worker forks itself and goes back to waiting for more requests. Meanwhile, the forked process (i.e., the function instance) pre-processes the user input, sets up the necessary objects (e.g., API objects, KNIX-internal objects) and calls the user code. After the user code finishes execution, the function instance performs post-processing and publishes the result according to the workflow description (e.g., to other functions or the frontend). Once the result has been published, the forked process exits.
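The per-request life cycle described above (pre-process, call user code, post-process, publish) can be sketched as follows. For portability the actual fork is replaced by a direct call; all names are illustrative, not the KNIX function worker API.

```python
# Rough sketch of what happens inside a forked function instance.
# In KNIX the worker forks per request and keeps listening; here the
# "fork" is simulated by a plain function call for clarity.
def make_worker(user_function):
    def handle_request(raw_input):
        # --- the forked function instance starts here ---
        # 1. Pre-process the user input and set up API/internal objects.
        event = {"input": raw_input, "context": {}}

        # 2. Call the user code.
        result = user_function(event["input"])

        # 3. Post-process and publish the result to the next state(s)
        #    or the frontend, per the workflow description.
        published = {"topic": "next_state", "value": result}
        # --- the forked process would exit here; the parent worker
        #     has been waiting for further requests all along ---
        return published
    return handle_request

worker = make_worker(lambda x: x * 2)
print(worker(21))  # {'topic': 'next_state', 'value': 42}
```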

1.4 Local queue service

The local queue service is responsible for providing local shortcuts among the functions. It follows a simple publisher-subscriber model, using topic names to represent the queues of the function workers inside the sandbox. Each function has its own dedicated topic that it subscribes to via its function worker; other functions interacting with it publish messages to this topic. The local queue service also carries control messages between the sandbox agent and the function workers (e.g., updating local functions, graceful shutdown), as well as the control messages sent to session function instances via their own dedicated topics. As of #111, the local queue service is implemented using Redis streams.
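The topic-per-function model can be mimicked with a minimal in-memory sketch. Note that the real implementation uses Redis streams (per #111); the class below is only a stand-in to show the publish/subscribe shape, and its names are hypothetical.

```python
# Minimal in-memory stand-in for the local queue service: one topic
# (queue) per function worker, publish appends, consume pops.
from collections import defaultdict, deque

class LocalQueueService:
    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def consume(self, topic):
        """Blocking read in the real service; here, pop the next message or None."""
        q = self._topics[topic]
        return q.popleft() if q else None

lqs = LocalQueueService()
lqs.publish("func_A", {"value": 1})          # another function's output
lqs.publish("func_A", {"control": "stop"})   # a sandbox agent control message
print(lqs.consume("func_A"))  # {'value': 1}
```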

1.5 Fluent-bit

This is a small program that is responsible for ensuring the logs created by the components inside the sandbox are sent to ElasticSearch.

2. Hierarchical Storage

The hierarchical storage provides ephemeral and persistent data storage for workflows. The hierarchy consists of two layers: a local data layer service runs on each host/node and serves the sandboxes running on that node for ephemeral data sharing, while the global data layer, which can run as a separate cluster, persists data accessible from all hosts/nodes, so that multiple instances of the same application can still share data.

2.1 Local data layer

Functions of an application running on a host can share data via the local data layer service, which also asynchronously sends any item written to it to the global data layer (i.e., Riak).

By design, the data layer service is host-local, so that accessing the data stored in it is fast. There is also ongoing work to improve the access latency for non-local data items (see #10).
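The two-layer write path (synchronous local write, asynchronous propagation to the global layer) can be sketched as below. Dicts stand in for both layers and a thread plays the role of the replication path; none of these names come from the KNIX codebase.

```python
# Toy sketch of the write path: a put() lands in the local data layer
# immediately and is replicated to the global layer asynchronously.
import queue
import threading

local_store, global_store = {}, {}
replication_queue = queue.Queue()

def put(key, value):
    local_store[key] = value             # fast, host-local write
    replication_queue.put((key, value))  # enqueue for async replication

def replicator():
    while True:
        item = replication_queue.get()
        if item is None:                 # sentinel: stop the demo replicator
            break
        key, value = item
        global_store[key] = value        # eventually visible to all hosts

t = threading.Thread(target=replicator)
t.start()
put("session", {"user": "alice"})
replication_queue.put(None)              # flush and stop for the demo
t.join()
print(global_store["session"])  # {'user': 'alice'}
```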

2.2 Global data layer

The global data layer is provided by Riak. Riak provides persistent storage that is globally accessible by the functions that might be running on different hosts (e.g., as part of another sandbox instance created for scalability). It supports key-value operations as well as conflict-free replicated data types (CRDTs), namely maps, sets and counters.
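To illustrate why CRDTs merge without conflicts, here is a from-scratch sketch of a grow-only counter (one of the simplest CRDTs). This is not the Riak client API; each replica only increments its own slot, and merging takes the per-replica maximum, so concurrent updates are never lost.

```python
# G-Counter sketch: a dict mapping replica id -> that replica's count.
def increment(counter, replica, amount=1):
    counter = dict(counter)
    counter[replica] = counter.get(replica, 0) + amount
    return counter

def merge(a, b):
    # Per-replica max: commutative, associative, idempotent.
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Two sandbox instances increment concurrently...
c1 = increment({}, "node1", 3)
c2 = increment({}, "node2", 4)
# ...and merging their states loses neither update.
print(value(merge(c1, c2)))  # 7
```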

While the synchronization between the local data layer and the global data layer currently applies only to key-value storage, work is ongoing to provide the same mechanism for CRDTs.

3. ElasticSearch

The logs of the components are collected in a standard ElasticSearch instance. The fluent-bit component at each sandbox instance sends the logs to ElasticSearch. These logs are then made available via the management service, and can be accessed by the GUI, the SDK and the CLI. On bare metal, the logs are also available from the DataLayerService.

4. Management service

The management service is an application workflow that is written just like any other KNIX workflow. It consists of functions that enable users to sign up, log in and manage functions as well as workflows. The management service runs under the user admin@knix and mostly works as a regular application, accessing its own storage to manage application data. However, it stores function code and workflow descriptions in the logged-in user's storage, and is therefore privileged to have this access.

5. GUI

The GUI contains the dashboard for users to manage their functions and workflows from a browser. For this purpose, it interacts with the management service. It also provides access to the storage service (currently just KV store), where users can manage their storage objects.

6. nginx

nginx serves as a web server for the static files of the GUI. It also acts as a proxy for the management service and the storage frontend, so that the GUI can make requests for the management of functions and workflows, and of storage objects, respectively.