-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[native] Add Apache Arrow Flight Connector #24504
base: master
Are you sure you want to change the base?
[native] Add Apache Arrow Flight Connector #24504
Conversation
Thanks for the release note! New release note guidelines. Please remove the manual PR link in the following format from the release note entries for this PR.
I have updated the Release Notes Guidelines to remove the examples of manually adding the PR link. |
c9b9df2
to
2418e65
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc! One nit of formatting.
The `tls_certs` directory contains placeholder TLS certificates generated for unit testing the Arrow Flight Connector with TLS enabled. These certificates are not intended for production use and should only be used in the context of unit tests. | ||
|
||
### Generating TLS Certificates | ||
To create the TLS certificates and keys inside the `tls_certs` folder, run the following command: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To create the TLS certificates and keys inside the `tls_certs` folder, run the following command: | |
To create the TLS certificates and keys inside the `tls_certs` folder, run the following command: | |
Add a line after this blank line to separate this text and the command. Without the line, it looks like this:
data:image/s3,"s3://crabby-images/761ae/761ae6ff32986011f5196424c415afcf95f71a3e" alt="Screenshot 2025-02-05 at 4 51 27 PM"
} | ||
} | ||
|
||
void FlightDataSource::addSplit(std::shared_ptr<ConnectorSplit> split) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some minor cleanup in this method to simply use of the serialized FlightEndpoint
# limitations under the License. | ||
find_package(Arrow REQUIRED) | ||
find_package(PkgConfig REQUIRED) | ||
pkg_check_modules(ARROW_FLIGHT REQUIRED IMPORTED_TARGET GLOBAL arrow-flight) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to add GLOBAL
for using PkgConfig to prevent an -larrow-flight
linker error. This seems like a better general solution to reference the libraries, see https://stackoverflow.com/questions/29191855/what-is-the-proper-way-to-use-pkg-config-from-cmake for more details
return config_->get<bool>(kServerVerify, true); | ||
} | ||
|
||
std::optional<std::string> FlightConfig::serverSslCertificate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was folly::Optional
, changed to be consistent
@@ -0,0 +1,251 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added these tests to verify adding a custom authenticator works
|
||
target_link_libraries( | ||
presto_flight_connector_infra_test presto_protocol | ||
presto_flight_connector_test_lib GTest::gtest GTest::gtest_main ${GLOG}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added G stuff due to linker error
virtual arrow::flight::Location getServerLocation(); | ||
|
||
virtual void setFlightServerOptions( | ||
arrow::flight::FlightServerOptions* serverOptions) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Server options now specified by subclasses
f6eefb6
to
4b83c44
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
#include "velox/common/base/Exceptions.h" | ||
|
||
namespace facebook::presto::connector::arrow_flight::auth { | ||
namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to be in an anonymous namespace, instead of static
@@ -351,7 +351,8 @@ std::unique_ptr<velox::connector::ConnectorSplit> | |||
SystemPrestoToVeloxConnector::toVeloxSplit( | |||
const protocol::ConnectorId& catalogId, | |||
const protocol::ConnectorSplit* const connectorSplit, | |||
const protocol::SplitContext* splitContext) const { | |||
const protocol::SplitContext* splitContext, | |||
const std::map<std::string, std::string>& extraCredentials) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rijin-N @elbinpallimalilibm is there a specific requirement to use these extraCredentials
to authenticate a Flight client? I would think most users would prefer to authenticate the client with a token in the header, rather than use this. The Java Arrow Connector does not have this either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler the base Presto (Java) Flight Connector is a template and not a full connector, and elects not to do authentication at all. The idea is that the authentication mechanism is largely dependent on the specific Flight server, so the whole business of authentication should be dealt with by the specialized connectors.
The problem we need to solve is that different Presto users may have different sets of access permissions on the connected Arrow Flight server. In this case we need to "pass through" the Presto user's authentication to the connected Flight Server, which means we somehow need to send the credentials from the Presto coordinator to the Presto workers.
In Java, there is an Identity API which we can use through connectorSession.getIdentity().getExtraCredentials()
.
IBM's specialization of the base Flight connector uses this API to pass the user's token to the workers. This isn't possible in Prestissimo because Velox connectors have no concept of Identity at all, so we need to pass the token in the split.
This issue provides more context:
#22849
There's also a corresponding Velox issue:
facebookincubator/velox#10107
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the background @ashkrisk . This related issues you linked are proposing adding an Identity data structure that all prestissimo connectors can access, but it is so far unresolved. I think this is a better solution that simply passing through a map, so we should hold off of adding these extraCredentials
until we can get the right API with an Identity
. WDYT?
cc @majetideepak @aditi-pandit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler we need this to get the multi-user setup working (IBM's flavor of the Flight connector relies on it). You can leave this off for now as long as you're able to maintain the diff in IBM code until there's a better solution in OSS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashkrisk we need to make sure we have a discussion and proper design before adding APIs like this, since this also affects other connectors. I believe it was mentioned there could be a security concern if the credentials are added to the split, which is then serialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler do we need to get a meeting to talk about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed with @majetideepak and @aditi-pandit , and have a possible way forward that I'm looking into, and I'll report back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler With regards to security, any method used to transfer the credentials from the coordinator to the worker requires serialization at some point. This includes the Identity
framework available in Java. To ensure security, you need to ensure encrypted connections between all Presto nodes.
Obviously, it's better to have dedicated infra for this, so hopefully you can get together and work something out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to incorporate #22859 which will add the extraCredentials to the connector session properties. This will be available to the Flight data source when it is processing the split and can use them for authentication. I'll try to add a test for this if I'm able to.
4b83c44
to
49a3234
Compare
} | ||
|
||
bool FlightConfig::serverVerify() { | ||
return config_->get<bool>(kServerVerify, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed default to "false" to be consistent with Java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want the default to be true
for both Java and C++, otherwise by default there's no protection against server impersonation. Having this off is good for testing, but in production it's much better to leave it on unless there's a VERY good reason not to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, the default behavior for the Flight client is to raise an error if using TLS and a certificate is provided but verify is disabled, and you need to explicitly disable it when using TLS.
I didn't catch that the Java connector defaults the config to NULL, which translates to true. I think that's confusing and we should not make it nullable. I'll change Java and here to default to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the Java side in #24518 and changed the default here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler. Did a quick round of the code.
@@ -365,4 +369,7 @@ JavaClasses: | |||
- presto-main/src/main/java/com/facebook/presto/connector/system/SystemTableLayoutHandle.java | |||
- presto-main/src/main/java/com/facebook/presto/connector/system/SystemTransactionHandle.java | |||
- presto-spi/src/main/java/com/facebook/presto/spi/function/AggregationFunctionMetadata.java | |||
- presto-function-namespace-managers-common/src/main/java/com/facebook/presto/functionNamespace/JsonBasedUdfFunctionMetadata.java |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did JsonBasedUdfFunctionMetadata get dropped ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must have been my mistake, I'll put it back
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
if(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition can be moved below and combined with that at line 18.
|
||
namespace facebook::presto::connector { | ||
|
||
void registerAllPrestoConnectors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : Rename registerOptionalConnectors ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registerConnectors
will do as well.
We should also move the existing connector registrations here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do away with the connector
namespace since the function name has Connectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file names can also be Registration.h/.cpp
since they are in the connectors folder.
presto-native-execution/presto_cpp/main/connectors/Registration.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the SystemConnector here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the SystemConnector here as well.
@majetideepak should the SystemConnector be under namespace facebook::presto::connector
like ArrowFlightConnector?
Could you clarify if you mean the source files SystemConnector.cpp/h or just the registration?
edit: confirmed moving sources for connectors also
} while (false) | ||
|
||
#define AFC_ASSIGN_OR_RAISE_IMPL(result_name, lhs, rexpr) \ | ||
auto&& result_name = (rexpr); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This macro should be wrap in a do.. while loop as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that will work, the lhs
variable gets scoped in the loop. So something like this you can't do anymore
AFC_ASSIGN_OR_RAISE(auto client, FlightClient::Connect(loc, *clientOpts_));
I believe this was taken from https://github.com/facebookincubator/velox/blob/d3f7b3d8c5c67bdcdcad05d9946ae6b87f88d760/velox/functions/prestosql/json/SIMDJsonUtil.h#L36
Which doesn't wrap in a loop either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it comes from the Arrow source:
https://github.com/apache/arrow/blob/152eb879b4fd4476dadd3cd497bb8d78d3013cef/cpp/src/arrow/result.h#L460
See also the comments around ARROW_ASSIGN_OR_RAISE
in the source:
https://github.com/apache/arrow/blob/152eb879b4fd4476dadd3cd497bb8d78d3013cef/cpp/src/arrow/result.h#L467-L487
std::unique_ptr<velox::connector::ColumnHandle> | ||
ArrowPrestoToVeloxConnector::toVeloxColumnHandle( | ||
const protocol::ColumnHandle* column, | ||
const TypeParser& typeParser) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap unused parameters in comments... /typeParser/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file names should match the class names. Might be good to use all classes with ArrowFlight consistently,
|
||
std::unique_ptr<TestFlightServer> TestFlightServerTest::server; | ||
|
||
TEST_F(TestFlightServerTest, basicTest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't understand the purpose of this test. It doesn't seem to be related to the Connector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure either, it's part of presto_flight_connector_infra_test
tests, which is separate. It starts a server, adds data, then makes it's own client to get the flight stream - not using the connector. So it seems to be testing that the test server can serve Flight clients correctly, but that is also covered with the regular connector tests. @Rijin-N could you comment on this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is to run a simple test to be confident that the Test Flight Server is working correctly. Now, any failures that happen in the actual connector tests can be safely assumed to be a problem with the connector and not with the server, potentially making things easier to debug.
This is like the difference between testing each component vs having only integration tests.
{1, 12, 2, std::numeric_limits<int64_t>::max()})})); | ||
|
||
auto idVec = makeFlatVector<int64_t>( | ||
{1, 12, 2, std::numeric_limits<int64_t>::max()}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abstract a variable for {1, 12, 2, std::numeric_limits<int64_t>::max()} so that it can be used in the input and result vector.
|
||
namespace { | ||
|
||
namespace velox = facebook::velox; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't seen this style of shorthand for namespaces in Prestissimo code. Can these be removed ?
"missing columnHandle for column 'value'"); | ||
} | ||
|
||
TEST_F(FlightConnectorTest, dataSourceTest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test names shouldn't have the word "Test" in them. Please fix the names.
presto-native-execution/Makefile
Outdated
@@ -45,6 +45,9 @@ endif | |||
ifneq ($(PRESTO_MEMORY_CHECKER_TYPE),) | |||
EXTRA_CMAKE_FLAGS += -DPRESTO_MEMORY_CHECKER_TYPE=$(PRESTO_MEMORY_CHECKER_TYPE) | |||
endif | |||
ifneq ($(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR),) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this.
Users can set EXTRA_CMAKE_FLAGS
directly. Exiting options will be removed from this Makefile.
presto-native-execution/README.md
Outdated
@@ -115,6 +115,15 @@ follow these steps: | |||
* For development, use `make debug` to build a non-optimized debug version. | |||
* Use `make unittest` to build and run tests. | |||
|
|||
#### Arrow Flight Connector | |||
To enable Arrow Flight connector support, set the environment variable: | |||
`PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR = "ON"`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix to use EXTRA_CMAKE_FLAGS = -DPRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR=ON
|
||
namespace facebook::presto::connector { | ||
|
||
void registerAllPrestoConnectors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registerConnectors
will do as well.
We should also move the existing connector registrations here.
|
||
namespace facebook::presto::connector { | ||
|
||
void registerAllPrestoConnectors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do away with the connector
namespace since the function name has Connectors.
|
||
namespace facebook::presto::connector { | ||
|
||
void registerAllPrestoConnectors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file names can also be Registration.h/.cpp
since they are in the connectors folder.
presto-native-execution/presto_cpp/main/connectors/Registration.h
|
||
namespace facebook::presto::connector { | ||
|
||
void registerAllPrestoConnectors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the SystemConnector here as well.
namespace facebook::presto::connector::arrow_flight { | ||
|
||
using namespace arrow::flight; | ||
using namespace velox; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove these.
Explicitly using velox:: where needed makes the code readable.
@@ -35,15 +35,73 @@ function install_prometheus_cpp { | |||
cmake_install -DBUILD_SHARED_LIBS=ON -DENABLE_PUSH=OFF -DENABLE_COMPRESSION=OFF | |||
} | |||
|
|||
function install_abseil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abseil and grpc are installed by the velox setup script already.
We can align the versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak they are the same versions, but they are done in the call to install_gcs-sdk-cpp
so we can't call them directly unless we update the velox script.
wget_and_untar https://github.com/apache/arrow/archive/apache-arrow-${ARROW_VERSION}.tar.gz arrow | ||
cmake_install_dir arrow/cpp \ | ||
-DARROW_FLIGHT=ON \ | ||
-DARROW_BUILD_BENCHMARKS=ON \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need Arrow benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @BryanCutler @majetideepak,
This was included due to a known issue in Arrow version 15.0.0. Without setting ARROW_BUILD_BENCHMARKS=ON, a compilation error occurs while building the Arrow package with arrow flight. This issue has been fixed in later versions of Arrow.
Please see the PR link for details:
PR: https://github.com/apache/arrow/pull/41622/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Let's add a comment here about this so that this can be removed when Arrow is updated.
} | ||
|
||
function install_arrow_flight { | ||
ARROW_VERSION="15.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should get this from the Velox script. We can skip here to align the versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak the version is defined in the OS setup scripts, we could do something like
ARROW_VERSION="${ARROW_VERSION:-15.0.0}"
92b5668
to
cc96d0a
Compare
@aditi-pandit @majetideepak I've addressed most of the feedback, just had a couple questions above. The connectors have been moved, namespaces cleaned and files named consistently. Please take another look, thanks! |
7043345
to
5a5a886
Compare
The native Arrow Flight connector can be used to connect to any Arrow Flight enabled Data Source. The metadata layer is handled by the Presto coordinator and does not need to be re-implemented in C++. Any Java connector that inherits from `presto-base-arrow-flight` can use this connector as it's counterpart for the Prestissimo layer. Different Arrow-Flight enabled data sources can differ in authentication styles. A plugin-style interface is provided to handle such cases with custom authentication code by extending `arrow_flight::auth::Authenticator`. RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0004-arrow-flight-connector.md#prestissimo-implementation Co-authored-by: Ashwin Kumar <[email protected]> Co-authored-by: Rijin-N <[email protected]> Co-authored-by: Nischay Yadav <[email protected]>
5a5a886
to
9aa8360
Compare
Description
Add Prestissimo support for Apache Arrow Flight connectors.
The native Arrow Flight connector can be used to connect to any Arrow Flight enabled Data Source. The metadata layer is handled by the Presto coordinator and does not need to be re-implemented in C++. Any Java connector that inherits from
presto-base-arrow-flight
can use this connector as it's counterpart for the Prestissimo layer.Different Arrow-Flight enabled data sources can differ in authentication styles. A plugin-style interface is provided to handle such cases with custom authentication code by extending
arrow_flight::auth::Authenticator
.Motivation and Context
RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0004-arrow-flight-connector.md#prestissimo-implementation
Continues from #24082
Impact
Arrow Flight based connector will be supported in Prestissimo.
Test Plan
Unit Tests set up a testing Arrow Flight server and exchange data using the new connector.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.