Skip to content

cut over not retrying ? #971

@babinomec

Description

@babinomec

Hi,

When running gh-ost, the cut-over part often fails due to timeout (unfortunately our slaves often have some lags / spikes of lag). This is not a problem in itself but it does not seem to wait and retry at a later time which I think it is supposed to do.

I am running version 1.1.1

(redacted command and log)

This is what I am running (alter just adds some columns) :

./gh-ost \
  --user="ghost" \
  --password="********" \
  --host="slave" \
  --allow-master-master \
  --assume-master-host="master" \
  --database="db" \
  --table="table" \
  --alter="alter" \
  --verbose \
  --skip-foreign-key-checks \
  --skip-strict-mode \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --exact-rowcount \
  --ok-to-drop-table \
  --max-lag-millis=10000 \
  --execute

Here is the output at the cut-over phase :

Copy: 14327000/14362256 99.8%; Applied: 12041; Backlog: 0/1000; Time: 1h16m10s(total), 1h15m58s(copy); streamer: bin-log.005490:906474118; Lag: 2.68s, State: migrating; ETA: 11s
Copy: 14334000/14362256 99.8%; Applied: 12047; Backlog: 2/1000; Time: 1h16m11s(total), 1h15m59s(copy); streamer: bin-log.005490:920606200; Lag: 2.38s, State: migrating; ETA: 8s
Copy: 14342000/14362256 99.9%; Applied: 12053; Backlog: 0/1000; Time: 1h16m12s(total), 1h16m0s(copy); streamer: bin-log.005490:934403718; Lag: 1.78s, State: migrating; ETA: 6s
Copy: 14350000/14362256 99.9%; Applied: 12055; Backlog: 0/1000; Time: 1h16m13s(total), 1h16m1s(copy); streamer: bin-log.005490:947459950; Lag: 1.48s, State: migrating; ETA: 3s
Copy: 14359000/14362256 100.0%; Applied: 12058; Backlog: 0/1000; Time: 1h16m14s(total), 1h16m2s(copy); streamer: bin-log.005490:961146997; Lag: 1.18s, State: migrating; ETA: 1s
2021-05-18 18:18:18 INFO Row copy complete
Copy: 14362256/14362256 100.0%; Applied: 12060; Backlog: 0/1000; Time: 1h16m14s(total), 1h16m2s(copy); streamer: bin-log.005490:966837215; Lag: 1.08s, State: migrating; ETA: due
2021-05-18 18:18:18 INFO Grabbing voluntary lock: gh-ost.1024388992.lock
2021-05-18 18:18:18 INFO Setting LOCK timeout as 6 seconds
2021-05-18 18:18:18 INFO Looking for magic cut-over table
2021-05-18 18:18:18 INFO Creating magic cut-over table `db`.`_table_del`
2021-05-18 18:18:18 INFO Magic cut-over table created
2021-05-18 18:18:18 INFO Locking `db`.`table`, `db`.`_table_del`
2021-05-18 18:18:18 INFO Tables locked
2021-05-18 18:18:18 INFO Session locking original & magic tables is 1024388992
2021-05-18 18:18:18 INFO Writing changelog state: AllEventsUpToLockProcessed:1621354698945390596
2021-05-18 18:18:18 INFO Waiting for events up to lock
Copy: 14362256/14362256 100.0%; Applied: 12060; Backlog: 0/1000; Time: 1h16m15s(total), 1h16m2s(copy); streamer: bin-log.005490:970667197; Lag: 1.28s, State: migrating; ETA: due
2021-05-18 18:18:21 ERROR Timeout while waiting for events up to lock
2021-05-18 18:18:21 ERROR 2021-05-18 18:18:21 ERROR Timeout while waiting for events up to lock
2021-05-18 18:18:21 INFO Looking for magic cut-over table
2021-05-18 18:18:21 INFO Will now proceed to drop magic table and unlock tables
2021-05-18 18:18:21 INFO Dropping magic cut-over table
2021-05-18 18:18:21 INFO Dropping magic cut-over table
2021-05-18 18:18:21 INFO Dropping table `db`.`_table_del`
2021-05-18 18:18:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-05-18 18:18:24 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 1/1000; Time: 1h16m20s(total), 1h16m2s(copy); streamer: bin-log.005490:989474686; Lag: 3.18s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m25s(total), 1h16m2s(copy); streamer: bin-log.005490:1014763533; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m30s(total), 1h16m2s(copy); streamer: bin-log.005490:1026013234; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m35s(total), 1h16m2s(copy); streamer: bin-log.005490:1036196578; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m40s(total), 1h16m2s(copy); streamer: bin-log.005490:1047207998; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m45s(total), 1h16m2s(copy); streamer: bin-log.005490:1057140917; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m50s(total), 1h16m2s(copy); streamer: bin-log.005490:1062986858; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m55s(total), 1h16m2s(copy); streamer: bin-log.005490:1064318221; Lag: 4.48s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m0s(total), 1h16m2s(copy); streamer: bin-log.005490:1073646385; Lag: 4.78s, State: migrating; ETA: due
[2021/05/18 18:19:04] [info] binlogsyncer.go:723 rotate to (bin-log.005491, 4)
[2021/05/18 18:19:04] [info] binlogsyncer.go:723 rotate to (bin-log.005491, 4)
2021-05-18 18:19:04 INFO rotate to next log from bin-log.005491:1073742500 to bin-log.005491
2021-05-18 18:19:04 INFO rotate to next log from bin-log.005491:0 to bin-log.005491
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m5s(total), 1h16m2s(copy); streamer: bin-log.005491:14353974; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m10s(total), 1h16m2s(copy); streamer: bin-log.005491:26928263; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m30s(total), 1h16m2s(copy); streamer: bin-log.005491:64743000; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h18m0s(total), 1h16m2s(copy); streamer: bin-log.005491:118334528; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h18m30s(total), 1h16m2s(copy); streamer: bin-log.005491:176850550; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h19m0s(total), 1h16m2s(copy); streamer: bin-log.005491:236619046; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h19m30s(total), 1h16m2s(copy); streamer: bin-log.005491:292654898; Lag: 0.08s, State: migrating; ETA: due
# Migrating `db`.`table`; Ghost table is `db`.`_table_gho`
# Migrating master:3306; inspecting slave:3306; executing on slave
# Migration started at Tue May 18 17:02:04 +0200 2021
# chunk-size: 1000; max-lag-millis: 10000ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.db.table.sock
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h20m0s(total), 1h16m2s(copy); streamer: bin-log.005491:359697150; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h20m30s(total), 1h16m2s(copy); streamer: bin-log.005491:427123115; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h21m0s(total), 1h16m2s(copy); streamer: bin-log.005491:498372220; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h21m30s(total), 1h16m2s(copy); streamer: bin-log.005491:564757429; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h22m0s(total), 1h16m2s(copy); streamer: bin-log.005491:627971561; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h22m30s(total), 1h16m2s(copy); streamer: bin-log.005491:700060664; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h23m0s(total), 1h16m2s(copy); streamer: bin-log.005491:762556638; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h23m30s(total), 1h16m2s(copy); streamer: bin-log.005491:830434663; Lag: 1.18s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h24m0s(total), 1h16m2s(copy); streamer: bin-log.005491:898773371; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h24m30s(total), 1h16m2s(copy); streamer: bin-log.005491:972940697; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h25m0s(total), 1h16m2s(copy); streamer: bin-log.005491:1033621038; Lag: 0.08s, State: migrating; ETA: due

And it goes on like that even when lag is low and that cut-over could happen. I ctrl-c-ed it after 2h50m.

Any clue on what may be happening and how to fix it ?

Thanks

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions