[FLINK-36794] [cdc-composer/cli] pipeline cdc connector support multiple data sources #3844

Open: linjianchang wants to merge 1 commit into master from master-36794

linjianchang

pipeline cdc connector support multiple data sources

github-actions bot added the docs, composer, cli, and mysql-pipeline-connector labels on Jan 8, 2025
stream = stream.union(streamBranch);
}
}
boolean isParallelMetadataSource = dataSource.isParallelMetadataSource();
Contributor

I think multiple data sources should be regarded as parallelized.
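
One way to read this suggestion, as a minimal sketch: treat the metadata source as parallel whenever more than one source is configured, and defer to the single source's own flag otherwise. `MetadataSource` and `ParallelismSketch` are hypothetical stand-ins, not Flink CDC types; only the method name `isParallelMetadataSource()` is taken from the diff above, and the "more than one source means parallel" rule is an assumption about what the reviewer intends.

```java
import java.util.List;

// Hypothetical stand-in for the data source type in the diff; only the
// method name isParallelMetadataSource() comes from the PR.
interface MetadataSource {
    boolean isParallelMetadataSource();
}

final class ParallelismSketch {
    // Assumption: once several sources are unioned into one stream, the
    // combined source is treated as parallel regardless of each member's flag.
    static boolean isParallelMetadataSource(List<MetadataSource> sources) {
        if (sources.isEmpty()) {
            throw new IllegalArgumentException("at least one source is required");
        }
        if (sources.size() > 1) {
            return true;
        }
        // Single source: keep its own answer, so existing pipelines behave as before.
        return sources.get(0).isParallelMetadataSource();
    }

    public static void main(String[] args) {
        MetadataSource serial = () -> false;
        System.out.println(isParallelMetadataSource(List.of(serial)));
        System.out.println(isParallelMetadataSource(List.of(serial, serial)));
    }
}
```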

Author

Already modified


```yaml
source:
  type: mysql_mutiple
```
Contributor

Should we use a new key like 'sources' to describe multiple sources? The '_multiple' suffix in the value seems a bit odd, because the YAML content then does not correspond one-to-one with the PipelineDef.

Author

Already modified

Contributor

sources:
  - type: mysql
    name: mysql-instance-00
    hostname: localhost
    port: 3306
    ....
  - type: mysql
    name: mysql-instance-01
    hostname: localhost
    port: 3307
    ....

And the corresponding PipelineDef looks like this:

public class PipelineDef {
    @Nullable private List<SourceDef> sources;
    private final SourceDef source;
    ...
}

If `sources` is not null, we use those data sources; otherwise we fall back to `source` to build up the DataStream. This way, existing usage is not affected.
I'd like to hear your opinion, @yuxiqian.
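
The fallback described above could be sketched roughly as follows. This is only an illustration of the proposal, not the PR's code: `SourceDefSketch`, `PipelineDefSketch`, and `effectiveSources()` are hypothetical names, and rejecting a definition with neither key set is an added assumption.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, trimmed-down stand-in for SourceDef; the real class has more fields.
final class SourceDefSketch {
    final String type;
    final String name;

    SourceDefSketch(String type, String name) {
        this.type = type;
        this.name = name;
    }
}

final class PipelineDefSketch {
    private final List<SourceDefSketch> sources; // nullable: the proposed `sources` key
    private final SourceDefSketch source;        // the existing `source` key

    PipelineDefSketch(List<SourceDefSketch> sources, SourceDefSketch source) {
        this.sources = sources;
        this.source = source;
    }

    // Prefer `sources` when it is set; otherwise wrap the single `source`
    // so callers can always iterate over a list.
    List<SourceDefSketch> effectiveSources() {
        if (sources != null) {
            return sources;
        }
        if (source == null) {
            // Assumption: a pipeline with neither key is invalid.
            throw new IllegalStateException("neither `sources` nor `source` is defined");
        }
        return Collections.singletonList(source);
    }

    public static void main(String[] args) {
        SourceDefSketch single = new SourceDefSketch("mysql", "mysql-instance-00");
        PipelineDefSketch legacy = new PipelineDefSketch(null, single);
        System.out.println(legacy.effectiveSources().size());
    }
}
```

Callers that build the DataStream would then loop over `effectiveSources()` in both cases, which keeps the single-source path unchanged.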

Contributor

I do like @ChaomingZhangCN's proposed syntax for fully general multiple data sources. It is intuitive and expressive, but it might be a chore if users just want to connect to a MySQL cluster with multiple servers, since they would have to copy all the identical configuration into every source definition.

@linjianchang's current solution seems MySQL-specific, suited mainly to multi-host clusters. It could not be extended to heterogeneous sources (like concatenating data from different DBMSs), or to cases where one wants different configs for each node. These cases don't exist for now since all we have is the MySQL source connector, but as we're modifying the composer and the YAML API (instead of the MySQL connector itself), such possibilities should be discussed more carefully.

As for multiple sources in a pipeline itself, I remember the idea was informally discussed with @leonardBang and @PatrickRen a long time ago, and the conclusion was that running multiple sources in one single job actually makes the pipeline more fragile, since any single-point failure would easily escalate and cause a global failover. Things might have changed since then, but we still need to hear from senior developers on this.

Author

sources:
  - type: mysql
    name: mysql-instance-00
    hostname: localhost
    port: 3306
    ....
  - type: mysql
    name: mysql-instance-01
    hostname: localhost
    port: 3307
    ....

And the corresponding PipelineDef looks like this:

public class PipelineDef {
    @Nullable private List<SourceDef> sources;
    private final SourceDef source;
    ...
}

If `sources` is not null, we use those data sources; otherwise we fall back to `source` to build up the DataStream. This way, existing usage is not affected. I'd like to hear your opinion, @yuxiqian @ChaomingZhangCN.

I have updated the PR according to the comments. Please review it again, thanks!

private static final String HOST_NAME = "hostname";
private static final String PORT = "port";
private static final String COLON = ":";
private static final String MUTIPLE = "_mutiple";
Contributor

Should be _multiple.

Author

Already modified

linjianchang force-pushed the master-36794 branch 3 times, most recently from 5a9bd3e to f8524d7, on January 15, 2025 01:06
Labels: cli, composer, docs, mysql-pipeline-connector
3 participants