Skip to content

Commit 405ff80

Browse files
JelteFY--
andauthored
Remove code related to http caching (#644)
We had implemented our own http caching in #88 when nothing like that existed yet. Now there are two different threads that are implementing something like this in a better way than we have done. So this PR removes our custom caching code from our codebase. Instead in a future release we'll make sure that pg_duckdb works well with these at least one of these other caching solutions. For reference the two projects that are currently implementing caching are: 1. An official in-memory cache by DuckDB, which will eventually improved to also support on-disk caching: duckdb/duckdb#16463 2. A community provided extension called `cache_httpfs`: https://github.com/dentiny/duck-read-cache-fs --------- Co-authored-by: Yves <yves@motherduck.com>
1 parent 8289cf1 commit 405ff80

29 files changed

Lines changed: 27 additions & 4189 deletions

docs/extensions.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
The following extensions are installed by default:
44

5-
* httpfs - note that httpfs was forked to add [`duckdb.cache`](functions.md#cache)
5+
* httpfs
66
* json
77

88
Supported extensions for installation are:
@@ -16,7 +16,7 @@ Installing other extensions may work, but is at your own risk.
1616

1717
By default known extensions are allowed to be automatically installed and loaded when a DuckDB query depends on them. This behaviour can be configured using the [`duckdb.autoinstall_known_extensions`](settings.md#duckdbautoinstall_known_extensions) and [`duckdb.autoload_known_extensions`](settings.md#duckdbautoload_known_extensions) settings.
1818

19-
It's also possible to manually install an extension. This can be useful when this autoinstall/autoaload behaviour is disabled, or when DuckDB fails to realise an extension is necessary to execute the query. Installing an extension requires superuser.
19+
It's also possible to manually install an extension. This can be useful when this autoinstall/autoload behaviour is disabled, or when DuckDB fails to realise an extension is necessary to execute the query. Installing an extension requires superuser.
2020

2121
```sql
2222
SELECT duckdb.install_extension('extname');

docs/functions.md

Lines changed: 0 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -26,14 +26,6 @@ All of the DuckDB [json functions and aggregates](https://duckdb.org/docs/data/j
2626
| :--- | :---------- |
2727
|[`approx_count_distinct`](https://duckdb.org/docs/sql/functions/aggregates.html#approximate-aggregates)|Gives the approximate count of distinct elements using HyperLogLog|
2828

29-
## Cache Management Functions
30-
31-
| Name | Description |
32-
| :--- | :---------- |
33-
| [`duckdb.cache`](#cache) | Caches a Parquet or CSV file to disk |
34-
| [`duckdb.cache_info`](#cache_info) | Returns metadata about cached files |
35-
| [`duckdb.cache_delete`](#cache_delete) | Deletes a file from the cache |
36-
3729
## DuckDB Administration Functions
3830

3931
| Name | Description |
@@ -257,48 +249,6 @@ Further information:
257249
| :--- | :--- | :---------- |
258250
| path | text | The path, either to a remote httpfs location or a local location (if enabled) of the delta dataset to read. |
259251

260-
#### <a name="cache"></a>`duckdb.cache(path TEXT, type TEXT) -> bool`
261-
262-
Caches a parquet or CSV file to local disk. The file is downloaded synchronously during the execution of the function. Once cached, the cached file is used automatically whenever that URL is used in other httpfs calls, provided that the remote data has not changed. Data is stored based on the eTag of the remote file.
263-
264-
Note that cache management is not automated. Cached data must be deleted manually.
265-
266-
##### Required Arguments
267-
268-
| Name | Type | Description |
269-
| :--- | :--- | :---------- |
270-
| path | text | The path to a remote httpfs location to cache. |
271-
| type | text | File type, either `parquet` or `csv` |
272-
273-
#### <a name="cache_info"></a>`duckdb.cache_info() -> (remote_path text, cache_key text, cache_file_size BIGINT, cache_file_timestamp TIMESTAMPTZ)`
274-
275-
Inspects which remote files are currently cached in DuckDB. The returned data is as follows:
276-
277-
| Name | Type | Description |
278-
| :--- | :--- | :---------- |
279-
| remote_path | text | The original path to the remote httpfs location that was cached |
280-
| cache_key | text | The cache key (eTag) used to store the file |
281-
| cache_file_size | bigint | File size in bytes |
282-
| cache_file_timestamp | timestamptz | Creation time of the cached file |
283-
284-
#### <a name="cache_delete"></a>`duckdb.cache_delete(cache_key text) -> bool`
285-
286-
Deletes a file from the DuckDB cache using the `unique cache_key` of the file.
287-
288-
Example: To delete any copies of a particular remote file:
289-
290-
```sql
291-
SELECT duckdb.cache_delete(cache_key)
292-
FROM duckdb.cache_info()
293-
WHERE remote_path = '...';
294-
```
295-
296-
##### Required Arguments
297-
298-
| Name | Type | Description |
299-
| :--- | :--- | :---------- |
300-
| cache_key | text | The cache key (eTag) to delete |
301-
302252
#### <a name="install_extension"></a>`duckdb.install_extension(extension_name TEXT, repository TEXT DEFAULT 'core') -> bool`
303253

304254
Installs a DuckDB extension and configures it to be loaded automatically in

sql/pg_duckdb--0.3.0--0.4.0.sql

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,10 @@
11
-- Add "url_style" column to "secrets" table
22
ALTER TABLE duckdb.secrets ADD COLUMN url_style TEXT;
33

4+
DROP FUNCTION duckdb.cache_delete;
5+
DROP FUNCTION duckdb.cache_info;
6+
DROP FUNCTION duckdb.cache;
7+
DROP TYPE duckdb.cache_info;
48

59
DROP FUNCTION duckdb.install_extension(TEXT);
610
CREATE FUNCTION duckdb.install_extension(extension_name TEXT, source TEXT DEFAULT 'core') RETURNS void

src/pgduckdb_duckdb.cpp

Lines changed: 0 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -287,19 +287,6 @@ DuckDBManager::LoadExtensions(duckdb::ClientContext &context) {
287287
continue;
288288
}
289289

290-
/*
291-
* Skip the httpfs extension. It conflicts with our cached_httpfs
292-
* extension which is loaded by default, but people might still try to
293-
* install it manually. We could also throw an error by doing so, but
294-
* it seems nicer to just skip it, because it is clear what they meant
295-
* to do and this way we don't need to educate people about the
296-
* existence of our cached_httpfs extension (which might be removed in
297-
* the near future anyway).
298-
*/
299-
if (extension.name == "httpfs") {
300-
continue;
301-
}
302-
303290
DuckDBQueryOrThrow(context, "LOAD " + extension.name);
304291
}
305292
}
@@ -319,10 +306,6 @@ DuckDBManager::RefreshConnectionState(duckdb::ClientContext &context) {
319306
UpdateSecretSeq(secret_table_last_seq);
320307
}
321308

322-
auto http_file_cache_set_dir_query =
323-
duckdb::StringUtil::Format("SET http_file_cache_dir TO '%s';", CreateOrGetDirectoryPath("duckdb_cache"));
324-
DuckDBQueryOrThrow(context, http_file_cache_set_dir_query);
325-
326309
if (duckdb_disabled_filesystems != NULL && !superuser()) {
327310
/*
328311
* DuckDB does not allow us to disable this setting on the

src/pgduckdb_options.cpp

Lines changed: 10 additions & 127 deletions
Original file line numberDiff line numberDiff line change
@@ -262,87 +262,6 @@ DuckdbInstallExtension(const std::string &extension_name, const std::string &rep
262262
SPI_finish();
263263
}
264264

265-
static bool
266-
CanCacheRemoteObject(std::string remote_object) {
267-
return (remote_object.rfind("https://", 0) == 0) || (remote_object.rfind("http://", 0) == 0) ||
268-
(remote_object.rfind("s3://", 0) == 0) || (remote_object.rfind("s3a://", 0) == 0) ||
269-
(remote_object.rfind("s3n://", 0) == 0) || (remote_object.rfind("gcs://", 0) == 0) ||
270-
(remote_object.rfind("gs://", 0) == 0) || (remote_object.rfind("r2://", 0) == 0);
271-
}
272-
273-
static bool
274-
DuckdbCacheObject(Datum object, Datum type) {
275-
auto object_path = DatumToString(object);
276-
if (!CanCacheRemoteObject(object_path)) {
277-
elog(WARNING, "(PGDuckDB/DuckdbCacheObject) Object path '%s' can't be cached.", object_path.c_str());
278-
return false;
279-
}
280-
281-
auto object_type = DatumToString(type);
282-
if (object_type != "parquet" && object_type != "csv") {
283-
elog(WARNING, "(PGDuckDB/DuckdbCacheObject) Cache object type should be 'parquet' or 'csv'.");
284-
return false;
285-
}
286-
287-
/* Use a separate connection to cache the objects, so we are sure not to
288-
* leak the value change of enable_http_file_cache in case of an error */
289-
auto con = DuckDBManager::CreateConnection();
290-
auto &context = *con->context;
291-
292-
DuckDBQueryOrThrow(context, "SET enable_http_file_cache TO true;");
293-
294-
auto object_type_fun = object_type == "parquet" ? "read_parquet" : "read_csv";
295-
auto cache_object_query =
296-
duckdb::StringUtil::Format("SELECT 1 FROM %s('%s');", object_type_fun, object_path.c_str());
297-
DuckDBQueryOrThrow(context, cache_object_query);
298-
299-
return true;
300-
}
301-
302-
typedef struct CacheFileInfo {
303-
std::string cache_key;
304-
std::string remote_path;
305-
int64_t file_size;
306-
Timestamp file_timestamp;
307-
} CacheFileInfo;
308-
309-
static std::vector<CacheFileInfo>
310-
DuckdbGetCachedFilesInfos() {
311-
std::string ext(".meta");
312-
std::vector<CacheFileInfo> cache_info;
313-
for (auto &p : std::filesystem::recursive_directory_iterator(CreateOrGetDirectoryPath("duckdb_cache"))) {
314-
if (p.path().extension() == ext) {
315-
std::ifstream cache_file_metadata(p.path());
316-
std::string metadata;
317-
std::vector<std::string> metadata_tokens;
318-
while (std::getline(cache_file_metadata, metadata, ',')) {
319-
metadata_tokens.push_back(metadata);
320-
}
321-
if (metadata_tokens.size() != 4) {
322-
elog(WARNING, "(PGDuckDB/DuckdbGetCachedFilesInfos) Invalid '%s' cache metadata file",
323-
p.path().c_str());
324-
break;
325-
}
326-
cache_info.push_back(CacheFileInfo {metadata_tokens[0], metadata_tokens[1], std::stoll(metadata_tokens[2]),
327-
std::stoll(metadata_tokens[3])});
328-
}
329-
}
330-
return cache_info;
331-
}
332-
333-
static bool
334-
DuckdbCacheDelete(Datum cache_key_datum) {
335-
auto cache_key = DatumToString(cache_key_datum);
336-
if (!cache_key.size()) {
337-
elog(WARNING, "(PGDuckDB/DuckdbGetCachedFilesInfos) Empty cache key");
338-
return false;
339-
}
340-
auto cache_filename = CreateOrGetDirectoryPath("duckdb_cache") + "/" + cache_key;
341-
bool removed = !std::remove(cache_filename.c_str());
342-
std::remove(std::string(cache_filename + ".meta").c_str());
343-
return removed;
344-
}
345-
346265
namespace pg {
347266

348267
std::string
@@ -377,59 +296,23 @@ DECLARE_PG_FUNCTION(pgduckdb_raw_query) {
377296
PG_RETURN_BOOL(true);
378297
}
379298

299+
/*
300+
* We need these dummy cache functions so that people are able to load the
301+
* new pg_duckdb.so file with an old SQL version (where these functions still
302+
* exist). People should then upgrade the SQL part of the extension using the
303+
* command described in the error message. Once we believe no-one is on these old
304+
* version anymore we can remove these dummy functions.
305+
*/
380306
DECLARE_PG_FUNCTION(cache) {
381-
Datum object = PG_GETARG_DATUM(0);
382-
Datum type = PG_GETARG_DATUM(1);
383-
bool result = pgduckdb::DuckdbCacheObject(object, type);
384-
PG_RETURN_BOOL(result);
307+
elog(ERROR, "duckdb.cache is not supported anymore, please run 'ALTER EXTENSION pg_duckdb UPDATE'");
385308
}
386309

387310
DECLARE_PG_FUNCTION(cache_info) {
388-
pgduckdb::RequireDuckdbExecution();
389-
ReturnSetInfo *rsinfo = (ReturnSetInfo *)fcinfo->resultinfo;
390-
Tuplestorestate *tuple_store;
391-
TupleDesc cache_info_tuple_desc;
392-
393-
auto result = pgduckdb::DuckdbGetCachedFilesInfos();
394-
395-
cache_info_tuple_desc = CreateTemplateTupleDesc(4);
396-
TupleDescInitEntry(cache_info_tuple_desc, (AttrNumber)1, "remote_path", TEXTOID, -1, 0);
397-
TupleDescInitEntry(cache_info_tuple_desc, (AttrNumber)2, "cache_file_name", TEXTOID, -1, 0);
398-
TupleDescInitEntry(cache_info_tuple_desc, (AttrNumber)3, "cache_file_size", INT8OID, -1, 0);
399-
TupleDescInitEntry(cache_info_tuple_desc, (AttrNumber)4, "cache_file_timestamp", TIMESTAMPTZOID, -1, 0);
400-
401-
// We need to switch to long running memory context
402-
MemoryContext oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
403-
tuple_store = tuplestore_begin_heap(rsinfo->allowedModes & SFRM_Materialize_Random, false, work_mem);
404-
MemoryContextSwitchTo(oldcontext);
405-
406-
for (auto &cache_info : result) {
407-
Datum values[4] = {0};
408-
bool nulls[4] = {0};
409-
410-
values[0] = CStringGetTextDatum(cache_info.remote_path.c_str());
411-
values[1] = CStringGetTextDatum(cache_info.cache_key.c_str());
412-
values[2] = cache_info.file_size;
413-
/* We have stored timestamp in *seconds* from epoch. We need to convert this to PG timestamptz. */
414-
values[3] = (cache_info.file_timestamp * 1000000) - pgduckdb::PGDUCKDB_DUCK_TIMESTAMP_OFFSET;
415-
416-
HeapTuple tuple = heap_form_tuple(cache_info_tuple_desc, values, nulls);
417-
tuplestore_puttuple(tuple_store, tuple);
418-
heap_freetuple(tuple);
419-
}
420-
421-
rsinfo->returnMode = SFRM_Materialize;
422-
rsinfo->setResult = tuple_store;
423-
rsinfo->setDesc = cache_info_tuple_desc;
424-
425-
return (Datum)0;
311+
elog(ERROR, "duckdb.cache_info is not supported anymore, please run 'ALTER EXTENSION pg_duckdb UPDATE'");
426312
}
427313

428314
DECLARE_PG_FUNCTION(cache_delete) {
429-
pgduckdb::RequireDuckdbExecution();
430-
Datum cache_key = PG_GETARG_DATUM(0);
431-
bool result = pgduckdb::DuckdbCacheDelete(cache_key);
432-
PG_RETURN_BOOL(result);
315+
elog(ERROR, "duckdb.cache_delete is not supported anymore, please run 'ALTER EXTENSION pg_duckdb UPDATE'");
433316
}
434317

435318
DECLARE_PG_FUNCTION(pgduckdb_recycle_ddb) {

test/regression/expected/extensions.out

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -3,8 +3,8 @@ SET duckdb.allow_community_extensions = true;
33
SELECT * FROM duckdb.query($$ SELECT extension_name, loaded, installed, installed_from FROM duckdb_extensions() WHERE loaded and extension_name != 'jemalloc' $$);
44
extension_name | loaded | installed | installed_from
55
----------------+--------+-----------+----------------
6-
cached_httpfs | t | t |
76
core_functions | t | t |
7+
httpfs | t | t |
88
icu | t | t |
99
json | t | t |
1010
parquet | t | t |
@@ -35,8 +35,8 @@ SELECT last_value FROM duckdb.extensions_table_seq;
3535
SELECT * FROM duckdb.query($$ SELECT extension_name, loaded, installed, installed_from FROM duckdb_extensions() WHERE loaded and extension_name != 'jemalloc' $$);
3636
extension_name | loaded | installed | installed_from
3737
----------------+--------+-----------+----------------
38-
cached_httpfs | t | t |
3938
core_functions | t | t |
39+
httpfs | t | t |
4040
icu | t | t |
4141
json | t | t |
4242
parquet | t | t |
@@ -61,8 +61,8 @@ SELECT last_value FROM duckdb.extensions_table_seq;
6161
SELECT * FROM duckdb.query($$ SELECT extension_name, loaded, installed, installed_from FROM duckdb_extensions() WHERE loaded and extension_name != 'jemalloc' $$);
6262
extension_name | loaded | installed | installed_from
6363
----------------+--------+-----------+----------------
64-
cached_httpfs | t | t |
6564
core_functions | t | t |
65+
httpfs | t | t |
6666
icu | t | t |
6767
json | t | t |
6868
parquet | t | t |
@@ -85,8 +85,8 @@ SELECT * FROM duckdb.query($$ SELECT extension_name, loaded, installed, installe
8585
extension_name | loaded | installed | installed_from
8686
----------------+--------+-----------+----------------
8787
aws | t | t | core
88-
cached_httpfs | t | t |
8988
core_functions | t | t |
89+
httpfs | t | t |
9090
icu | t | t |
9191
json | t | t |
9292
parquet | t | t |
@@ -106,8 +106,8 @@ SELECT * FROM duckdb.query($$ SELECT extension_name, loaded, installed, installe
106106
extension_name | loaded | installed | installed_from
107107
----------------+--------+-----------+----------------
108108
aws | t | t | core
109-
cached_httpfs | t | t |
110109
core_functions | t | t |
110+
httpfs | t | t |
111111
icu | t | t |
112112
json | t | t |
113113
parquet | t | t |
@@ -130,9 +130,9 @@ SELECT * FROM duckdb.query($$ SELECT extension_name, loaded, installed, installe
130130
extension_name | loaded | installed | installed_from
131131
----------------+--------+-----------+----------------
132132
aws | t | t | core
133-
cached_httpfs | t | t |
134133
core_functions | t | t |
135134
duckpgq | t | t | community
135+
httpfs | t | t |
136136
icu | t | t |
137137
json | t | t |
138138
parquet | t | t |

test/regression/expected/non_superuser.out

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -40,10 +40,6 @@ ERROR: DuckDB execution is not allowed because you have not been granted the du
4040
SET duckdb.force_execution = false;
4141
SELECT * FROM duckdb.raw_query($$ SELECT * FROM pgduckdb.public.t $$);
4242
ERROR: DuckDB execution is not allowed because you have not been granted the duckdb.postgres_role
43-
SELECT * FROM duckdb.cache_info();
44-
ERROR: DuckDB execution is not allowed because you have not been granted the duckdb.postgres_role
45-
SELECT * FROM duckdb.cache_delete('some file');
46-
ERROR: DuckDB execution is not allowed because you have not been granted the duckdb.postgres_role
4743
CALL duckdb.recycle_ddb();
4844
ERROR: DuckDB execution is not allowed because you have not been granted the duckdb.postgres_role
4945
SET duckdb.force_execution = true;

test/regression/sql/non_superuser.sql

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,8 +32,6 @@ SELECT * FROM t;
3232
-- actually open a duckdb connection.
3333
SET duckdb.force_execution = false;
3434
SELECT * FROM duckdb.raw_query($$ SELECT * FROM pgduckdb.public.t $$);
35-
SELECT * FROM duckdb.cache_info();
36-
SELECT * FROM duckdb.cache_delete('some file');
3735
CALL duckdb.recycle_ddb();
3836
SET duckdb.force_execution = true;
3937

third_party/cached_httpfs/.clang-format

Lines changed: 0 additions & 2 deletions
This file was deleted.

0 commit comments

Comments
 (0)