Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-20625] Implement V2 Google Cloud PubSub Source in accordance with FLIP-27 #32

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

clmccart
Copy link

@clmccart clmccart commented Oct 8, 2024

TLDR;

  • use StreamingPull instead of a synchronous pull to improve performance
  • implemented using the new Flip-27 interfaces (Source, SplitEnumerator, and SourceReader)

PubSubSplitEnumerator:

  • Since there is no limit in PubSub to the number of subscriber clients, there is no real concept of “split discovery”.
  • Therefore, a "split" in this implementation is simply represented by a single subscriber client for a single subscription.
  • The number of splits is determined by the parallelism of the job.
  • The PubSubSplitEnumerator monitors the number of SourceReaders that are added. It assigns a single split to every SourceReader, meaning each SourceReader will be pulling messages from Pub/Sub in parallel.

PubSubSourceReader:

  • The PubSubSplitReader wraps a Subscriber client from the Cloud Pub/Sub Java client library to asynchronously receive messages from Pub/Sub.
  • Single Subscriber client per source reader
  • Responsible for acknowledging messages
  • There are two main benefits to using the client library:
    • The Subscriber client uses StreamingPull to asynchronously receive messages and provide maximum throughput
    • The client library handles ack management, and automatically extends ack deadlines while processing messages
  • Uses checkpointing to acknowledge messages.
    • When a checkpoint completes, all outstanding messages will be acknowledged. Because acknowledgements are required, the v2 source requires checkpointing to be enabled.
  • flow control
    • max outstanding bytes: 100MB
    • max outstanding elements: defaults to 1000

Adds an example pipeline using the new source and adds the "Deprecated" tag to the old source implementation

Additional context:

Note that documentation will be updated in a separate PR

hannahrogers-google and others added 3 commits October 8, 2024 20:51
fix: remove unwanted changes

feat: create basic split enumerator

fix: requested changes

Update split.proto

Update PubSubSink.java

Update PubSubNotifyingPullSubscriber.java

Update PubSubSinkTest.java

Update PubSubCheckpointSerializerTest.java

Update PubSubSplitEnumeratorTest.java

feat: create pubsub source

fix: comment

Fix a typo in the ordering key prober readme (#353)

Create a parent pom.xml and restructure the flink-connector source code to support directories for integration tests and sample code. Also, replace Java's Optional class with Guava's Optional, which is serializable, to enable starting a Flink job that uses CPS as a source.

Add a WIP example of using CPS as a Flink source.

Clean up POM files.

Change example code to use both a PubSub sink and source.

Fix interrupt not returning empty messages

Add flow control setting to source builder.

Remove limitExceededBehavior source option

Create separate unit tests

Add PubSubSource integration test using CPS emulator

Remove emulatorEndpoint source builder option

Replace PubSubEmulatorTestBase with PubSubEmulatorHelper

Add initial documentation for Flink connector

Address PR comments

Add sink/it testing doc

Add builder options to source/sink

Document source/sink builder options

Add at-least-once delivery guarantee for CPS sink

Update README with connector status

Improve example Flink job and documentation

Update library version to 1.0.0-SNAPSHOT

Use docker to start CPS emulator in it tests

Add PubSubSinkEmulatorTest

Do not add source messages to a checkpoint until after it is emitted (#369)

Add contribution guide for CPS flink connector (#370)
gitignore

move examples into streaming folder

move sourcev2 example into right folder. does not compile

big move + proto compile

move to correct packages

import FixedHeaderProvider

add autovalue to pom

compiles

tests compile

add vscode files to gitignore

fix example pipeline

builds

revert flink-examples-streaming-gcp-pubsub pom

java.lang.NoClassDefFoundError: org/apache/flink/shaded/io/netty/channel/ChannelFactory

pipeline works
Copy link

boring-cyborg bot commented Oct 8, 2024

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

@clmccart clmccart force-pushed the final branch 8 times, most recently from 84cc42f to 3af0b29 Compare October 9, 2024 15:59
@clmccart clmccart changed the title Final Donate PubSub source and sink rewrite back to Apache Oct 9, 2024
.gitignore Outdated Show resolved Hide resolved
@clmccart clmccart changed the title Donate PubSub source and sink rewrite back to Apache [FLINK-20625] Implement V2 Google Cloud PubSub Source in accordance with FLIP-27 Oct 9, 2024
@clmccart
Copy link
Author

clmccart commented Oct 9, 2024

looks like the sink rewrite was completed in FLINK-24298. will need to take a look at that and probably remove the sink changes from this PR

@clmccart
Copy link
Author

looks like the sink rewrite was completed in FLINK-24298. will need to take a look at that and probably remove the sink changes from this PR

the main reason we wrote a new sink implementation was because the previous implementation was deprecated. we didnt make any significant performance improvements so i went ahead and remove the sink rewrite from this PR. we can followup with changes to the sink implementation in a subsequent PR but no changes are pressing.

@clmccart clmccart marked this pull request as ready for review November 18, 2024 17:32
@clmccart
Copy link
Author

@snuyanzin would you mind taking a look at this PR? or is there someone else who might be a better fit?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants