Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

m0000800_alter_product_version_range_scheme is way too slow #1142

Open
ctron opened this issue Jan 15, 2025 · 4 comments
Open

m0000800_alter_product_version_range_scheme is way too slow #1142

ctron opened this issue Jan 15, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@ctron
Copy link
Contributor

ctron commented Jan 15, 2025

Having the standard data imported, m0000800_alter_product_version_range_scheme fails to complete withing 2 days.

The problem is the update statement, which has to process the following two tables:

trustify=# select count(*) from version_range;
   count   
-----------
 243556450
trustify=# select count(*) from product_version_range;
   count   
-----------
 145018502

The statement is:

UPDATE version_range SET version_scheme_id = 'rpm' WHERE id IN (SELECT version_range_id FROM product_version_range)
@ctron ctron added the bug Something isn't working label Jan 15, 2025
@ctron ctron added this to Trustify Jan 15, 2025
@ctron ctron moved this to Backlog in Trustify Jan 15, 2025
@JimFuller-RedHat
Copy link
Collaborator

JimFuller-RedHat commented Jan 15, 2025

the problem is that that sql query is getting executed for each row
something like a join

UPDATE version_range
SET version_scheme_id = 'rpm'
FROM version_range
JOIN product_version_range ON version_range.id = product_version_range.version_range_id;

is likely to be a lot better performing.

@dejanb
Copy link
Contributor

dejanb commented Jan 16, 2025

That query is not valid in postgres

ERROR: table name "version_range" specified more than once

I played a bit with its variations, like

CREATE INDEX idx_product_version_range_id ON product_version_range(version_range_id);

with ranges_to_update as (
    select version_range.* from 
    version_range
    join product_version_range on version_range.id = product_version_range.version_range_id
)
update version_range
set version_scheme_id = 'rpm'
from ranges_to_update 
where version_range.id = ranges_to_update.id

But from the analysis it doesn't look it helps much. The bottleneck is updating 18M rows in the end.
Some of the solutions I found proposed are doing batch updates, but I don't see yet how that would fit our migrations.

@JimFuller-RedHat
Copy link
Collaborator

ah right, sorry I prob was writing psuedo psql ... let me write and test something ...

@dejanb
Copy link
Contributor

dejanb commented Jan 16, 2025

Just FYI, the query from the previous comment, executed for me locally in 38min

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Backlog
Development

No branches or pull requests

3 participants