Asynchronous Processing #129
Replies: 3 comments
-
One of the possible ways to set up reliable communication between processes and servers could be sockets. The socket dependency could be treated as a suggested one, required only when async processing is used, which would keep the whole async concept optional.

Single Server Implementation

Here it would be ideal to keep the complexity of launching this implementation to an absolute minimum, so the developer only decides how many processes to spawn and lets the ETL do the rest. With this approach, worker processes would die after processing is finished, making space for new workers. Pool size and capacity would be calculated from the number of running worker processes against the maximum number of workers (a rough sketch of this idea is included at the end of this comment).

Multiple Servers Implementation

In this case it might be a bit more tricky, since we are not dealing with one server but with a collection of servers. The first thing that came to my mind is the concept of a cluster, where multiple nodes can be registered as … So in this case, orchestration would look like this: …

Common Parts

In both cases …
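A minimal sketch of the Single Server pool-capacity idea above, assuming workers are plain OS processes tracked by pid (the class and method names are made up for illustration, not an actual flow-php API):

```php
<?php

// Hypothetical sketch - names are made up, this is not the flow-php API.
// Capacity is simply the maximum number of workers minus the ones still running.
final class WorkerPool
{
    /** @var array<int, bool> pid => still running */
    private array $running = [];

    public function __construct(private readonly int $maxWorkers)
    {
    }

    public function capacity(): int
    {
        return $this->maxWorkers - \count($this->running);
    }

    public function register(int $pid): void
    {
        $this->running[$pid] = true;
    }

    public function release(int $pid): void
    {
        // A worker dies after finishing its batch, making space for a new one.
        unset($this->running[$pid]);
    }
}
```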
-
Proof of Concept is available as flow-php/etl-async
-
Done! In order to make this happen, pretty much the entire internal structure of the library was redesigned and new features were added along the way. Finally, flow-php provides an asynchronous processing abstraction out of the box, backed by two adapters.

🎉
-
Currently, the biggest limitation of this ETL is that there is absolutely no support for parallel/async processing.
So even if the process is launched on a 12-core machine with a massive amount of RAM, processing a 10GB CSV file would not be much faster than on a smaller machine, because everything runs in a single process and a single thread.
This discussion is here to investigate whether async/parallel processing is even possible.
Below are a few things that we need to consider/implement/decide before we even start implementing any async processing.
Things to think about:
Serialization
flow-php loads a batch of `Rows` at once and later those `Rows` are passed through all registered Transformers and Loaders, which are added to the pipeline under the common interface `Pipeline/Element`. In order to even think about async processing, we would need to be able to synchronize both: `Rows` and `Pipeline/Element`s.

So instead of processing rows in the main process, the ETL could launch a subprocess and pass serialized `Rows` and `Pipeline/Element`s to it.
Serialization of `Rows` seems to be pretty straightforward: we need to serialize each element of `Rows`.
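A rough sketch of what that could look like, assuming rows are plain value objects holding named entries (the classes below are illustrative only, not the actual flow-php `Row`/`Rows` implementation):

```php
<?php

// Hypothetical sketch - these classes only illustrate the idea,
// they are not the actual flow-php Row/Rows implementation.
final class Row
{
    /** @param array<string, mixed> $entries */
    public function __construct(public readonly array $entries)
    {
    }
}

final class Rows
{
    /** @var Row[] */
    private readonly array $rows;

    public function __construct(Row ...$rows)
    {
        $this->rows = $rows;
    }

    public function serialize(): string
    {
        // Rows are plain value objects, so a straight JSON encode
        // of each row's entries is enough.
        return \json_encode(
            \array_map(static fn (Row $row): array => $row->entries, $this->rows),
            \JSON_THROW_ON_ERROR
        );
    }

    public static function deserialize(string $payload): self
    {
        $entries = \json_decode($payload, true, 512, \JSON_THROW_ON_ERROR);

        return new self(...\array_map(static fn (array $e): Row => new Row($e), $entries));
    }
}
```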
In this case, since those are only Value objects, there should not be any surprises.
`Pipeline/Element` is a bit more tricky, so let's look at it through an example. With a single process, the results of transformations should be loaded into a single file, `~/file.csv`. If we choose, for example, a JSON serialization strategy, then the serialized loader could look like this:
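Purely as an illustration of the idea (this is not an actual flow-php serialization format, just one possible shape):

```json
{
    "loader": "csv_file",
    "parameters": {
        "path": "~/file.csv",
        "with_header": true
    }
}
```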
And here we get to the first issue: how do we load results from multiple processes into a single file? One of the solutions would be to work around the problem by loading results into a folder instead, as sketched below. It is not the most elegant solution, but it is acceptable for the proof of concept. The loader would need to be aware of the runtime, so it would know whether it is running in a single-process or a multi-process ETL and pick its saving strategy accordingly (or always follow the multi-process convention).
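A sketch of that workaround, assuming every process (the main one or a worker) gets a unique identifier and always writes its own file inside the target directory (hypothetical names again, not the flow-php Loader interface):

```php
<?php

// Hypothetical sketch - not the actual flow-php Loader interface.
// Every process writes to its own file inside the target directory,
// so concurrent writes never collide.
final class CsvDirectoryLoader
{
    public function __construct(
        private readonly string $directory, // e.g. "/tmp/etl-output"
        private readonly string $processId  // e.g. "main" or "worker_3"
    ) {
    }

    /** @param array<int, array<string, mixed>> $rows */
    public function load(array $rows): void
    {
        $path = \rtrim($this->directory, '/') . '/' . $this->processId . '.csv';

        $handle = \fopen($path, 'ab');

        foreach ($rows as $row) {
            \fputcsv($handle, \array_values($row));
        }

        \fclose($handle);
    }
}
```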
Communication
In order to run the ETL process, we need to execute the `ETL::run()` method, which would be a really nice entry point to the whole multiprocessing.
In the proof-of-concept async processing, we can start with subprocesses, but we can try to design it in a way that would allow spawning a processing cluster across multiple machines (yeah, probably overkill, and there are better tools for that, but let's at least keep an open mind).
So going back to the `run()` method: in this step the ETL could try to initialize processes, connect to workers, etc. We can call the main process the coordinator and all subprocesses workers, which should not suggest any particular implementation. In a single-process configuration, the coordinator and the worker would be the same thing, without any need to initialize communication: it would just take the `Pipeline/Element`s and use them to process `Rows`.

In multiprocess communication, the coordinator would first need to get the worker pool size and pass serialized `Rows` with all `Pipeline/Element`s to each available worker.
Once again, this is a bit limiting, because each worker will get a single `Rows` element and all registered `Pipeline/Element`s, which won't let us optimize anything by grouping Transformers, but for the proof of concept it should be more than enough.

Assuming that our Source comes with 10 rows, this is how the coordinator would allocate them across 4 workers:
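For illustration, a simple one-row-per-worker hand-out would start like this:

| Worker   | First row assigned |
| -------- | ------------------ |
| worker 1 | row 1              |
| worker 2 | row 2              |
| worker 3 | row 3              |
| worker 4 | row 4              |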
Now it would need to wait for any worker to finish processing in order to pass row 5 to it.
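Ignoring the transport for a moment, that allocation loop could be sketched roughly like this; the two callables stand in for whatever communication mechanism ends up being chosen (subprocess pipes, sockets, ...), and nothing here is real flow-php code:

```php
<?php

/**
 * Hypothetical coordinator loop - not flow-php code.
 *
 * @param string[] $serializedBatches  serialized Rows, one batch per hand-out
 * @param string   $serializedElements all serialized Pipeline/Elements
 * @param int[]    $workerIds          available workers, e.g. [1, 2, 3, 4]
 * @param callable $sendToWorker       fn (int $workerId, string $rows, string $elements): void
 * @param callable $waitForAnyWorker   fn (): int - blocks until some worker finishes, returns its id
 */
function coordinate(
    array $serializedBatches,
    string $serializedElements,
    array $workerIds,
    callable $sendToWorker,
    callable $waitForAnyWorker
): void {
    $busy = 0;

    // Initial hand-out: one batch per available worker.
    foreach ($workerIds as $workerId) {
        if (null === ($batch = \array_shift($serializedBatches))) {
            break;
        }
        $sendToWorker($workerId, $batch, $serializedElements);
        $busy++;
    }

    // Keep feeding whichever worker finishes first until everything is processed.
    while ($busy > 0) {
        $freeWorkerId = $waitForAnyWorker();
        $busy--;

        if (null !== ($batch = \array_shift($serializedBatches))) {
            $sendToWorker($freeWorkerId, $batch, $serializedElements);
            $busy++;
        }
    }
}
```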
This is where we get to the problematic part: what communication protocol should we use?
Using subprocesses, the coordinator can spawn them, pass them the serialized `Rows` and `Pipeline/Element`s, and wait for each worker to report its results (or kill it after a timeout). But what about cross-server communication? In that case, the coordinator would need to connect to a worker, which would be a long-running process that could be used to process chunks of data.

So maybe we should think about workers as long-running processes that would listen for a connection (sockets maybe?) and that would be able to tell the coordinator if they are already busy, maybe send back some heartbeats?
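A bare-bones sketch of such a long-running worker using plain PHP sockets (nothing flow-php specific here; busy signalling and heartbeats would need a proper event loop on top of this):

```php
<?php

// Hypothetical long-running worker - listens on a TCP socket, reads one
// newline-terminated payload (serialized Rows + Pipeline/Elements) per
// connection, "processes" it and reports back to the coordinator.
$server = \stream_socket_server('tcp://127.0.0.1:6001', $errno, $errstr);

if (false === $server) {
    exit("Could not start worker: {$errstr} ({$errno})\n");
}

while ($connection = \stream_socket_accept($server, -1)) {
    $payload = \fgets($connection);

    if (false !== $payload) {
        // Here the worker would deserialize the payload and run the
        // transformations/loaders; this sketch only acknowledges it.
        \fwrite($connection, "DONE\n");
    }

    \fclose($connection);
}
```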
There are 2 scenarios: a Single Server and Many Servers. In both cases, we can launch multiple worker processes and let the ETL know how to communicate with them: on a Single Server it could be localhost on a given port, on Many Servers through an IP address?
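A sketch of how the ETL could be told where its workers live in each scenario; the exact format is made up, only the idea matters:

```php
<?php

// Hypothetical configuration - not an actual flow-php format.

// Single Server: workers are local processes listening on different ports.
$singleServerWorkers = [
    'tcp://127.0.0.1:6001',
    'tcp://127.0.0.1:6002',
    'tcp://127.0.0.1:6003',
    'tcp://127.0.0.1:6004',
];

// Many Servers: workers are long-running processes on other machines.
$manyServersWorkers = [
    'tcp://10.0.0.11:6001',
    'tcp://10.0.0.12:6001',
    'tcp://10.0.0.13:6001',
];
```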
There are still more questions than answers here, but if anyone is interested in discussing this topic we can jump on a call or continue the conversation here. There is a very good chance it won't get implemented at all, or only in a very limited version (just subprocesses, for example). There is also absolutely no timeline for when (if at all) it will get implemented.

I will also try to use this thread as a scratchpad and note any thoughts I might have around this topic, feel free to do the same.