Impact of the bug
CRAB
Describe the bug
A block gets migrated from global to phys03 without its files.
How to reproduce it
Rare problem; it cannot be reproduced on demand.
Expected behavior
The block in phys03 should be an exact replica of the one in global
Additional context and error message
I have seen this a few times but did not report it immediately. This time I have taken steps to document it properly.
- block name
/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258
- in phys03 it is present without any files
belforte@lxplus802/TC3> dasgoclient --query 'block block=/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258 instance=prod/phys03'
/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258
belforte@lxplus802/TC3> dasgoclient --query 'file block=/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258 instance=prod/phys03'
belforte@lxplus802/TC3>
- while in global there are 184 files
belforte@lxplus802/TC3> dasgoclient --query 'block block=/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258 instance=prod/global'
/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258
belforte@lxplus802/TC3> dasgoclient --query 'file block=/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258 instance=prod/global'|wc -l
184
belforte@lxplus802/TC3>
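The manual comparison above can be scripted. The following is a minimal sketch, assuming `dasgoclient` is on the PATH and prints one file name per line for the same `file block=... instance=...` query used above. The helper names `das_files` and `missing_at_destination` are hypothetical and not part of any CRAB or DBS tool.

```python
import subprocess
from typing import Iterable, List


def das_files(block: str, instance: str) -> List[str]:
    """List the files of a block in one DBS instance (e.g. prod/global).

    Runs the same dasgoclient query as in the transcript above and assumes
    one file name per output line.
    """
    query = f"file block={block} instance={instance}"
    result = subprocess.run(
        ["dasgoclient", "--query", query],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def missing_at_destination(
    source_files: Iterable[str], dest_files: Iterable[str]
) -> List[str]:
    """Return the files present at the source but absent at the destination."""
    return sorted(set(source_files) - set(dest_files))


# Example use (needs dasgoclient and a valid grid environment):
#   block = "/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258"
#   missing = missing_at_destination(
#       das_files(block, "prod/global"), das_files(block, "prod/phys03")
#   )
#   print(len(missing), "files missing in phys03")
```

For the block in this report, where phys03 returns no files, every file listed in global would be reported as missing.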
When CRAB was trying to migrate this block, migration failed with (from migration server logs):
[2026-04-22 08:09:51.344719155 +0000 UTC m=+2904.031899684] migrate.go:1119: insert block dump record failed with DBSError Code:128 Description:Not defined Function:dbs.bulkblocks.checkBlockExist Message:Block /JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258 already exists Error: nil
[2026-04-22 08:09:51.393973243 +0000 UTC m=+2904.081153787] migrate.go:1378: update migration request 6499526 to status 3
Logs from April 22 are still on cephs, so here's more info.
- migration server pods: example of one attempt to migrate
(The CRAB Publisher makes a list of parent files that are not present in phys03, then asks DBS to migrate the corresponding blocks. If a block is already at the destination, the request is supposed to get migration status 4 and the Publisher is happy. Here it gets status 9 instead, so it deletes that migration ID and tries again, until the CRAB operator detects this and blacklists the task whose publication attempts hit the problem.)
belforte@vocms0755/dbs-logs> grep "6499526 to" dbs2go-phys03-migratio*.log-20260422
dbs2go-phys03-migration-644c89d6cb-rvbn9.log-20260422:[2026-04-22 08:10:23.257140175 +0000 UTC m=+5556.053575592] migrate.go:1378: update migration request 6499526 to status 1
dbs2go-phys03-migration-644c89d6cb-rvbn9.log-20260422:[2026-04-22 08:10:30.368651523 +0000 UTC m=+5563.165086940] migrate.go:1378: update migration request 6499526 to status 3
dbs2go-phys03-migration-644c89d6cb-rvbn9.log-20260422:[2026-04-22 08:11:23.654446037 +0000 UTC m=+5616.450881454] migrate.go:1378: update migration request 6499526 to status 1
dbs2go-phys03-migration-644c89d6cb-rvbn9.log-20260422:[2026-04-22 08:11:24.921014543 +0000 UTC m=+5617.717449960] migrate.go:1378: update migration request 6499526 to status 3
dbs2go-phys03-migration-694ff69b48-tmsd4.log-20260422:[2026-04-22 08:09:48.920184766 +0000 UTC m=+2901.607365308] migrate.go:1378: update migration request 6499526 to status 1
dbs2go-phys03-migration-694ff69b48-tmsd4.log-20260422:[2026-04-22 08:09:51.393973243 +0000 UTC m=+2904.081153787] migrate.go:1378: update migration request 6499526 to status 3
dbs2go-phys03-migration-694ff69b48-tmsd4.log-20260422:[2026-04-22 08:10:49.886143633 +0000 UTC m=+2962.573324164] migrate.go:1378: update migration request 6499526 to status 1
dbs2go-phys03-migration-694ff69b48-tmsd4.log-20260422:[2026-04-22 08:10:51.602646326 +0000 UTC m=+2964.289826859] migrate.go:1378: update migration request 6499526 to status 3
dbs2go-phys03-migration-694ff69b48-tmsd4.log-20260422:[2026-04-22 08:11:49.898859751 +0000 UTC m=+3022.586040282] migrate.go:1378: update migration request 6499526 to status 9
belforte@vocms0755/dbs-logs>
Looking for this block name in all migration log files for April 22 finds many repetitions of the same pattern:
belforte@vocms0755/dbs-logs> grep "/JetMET0/Run2025G-PromptReco-v1/AOD#2149f597-8e88-4195-b9b9-5633d8ba6258" dbs2go-phys03-mig*.log-20260422 > /tmp/ml
belforte@vocms0755/dbs-logs> grep ID /tmp/ml|cut -d{ -f2|cut -d' ' -f 1|sort|uniq
MIGRATION_REQUEST_ID:6496490
MIGRATION_REQUEST_ID:6499403
MIGRATION_REQUEST_ID:6499461
MIGRATION_REQUEST_ID:6499508
MIGRATION_REQUEST_ID:6499526
MIGRATION_REQUEST_ID:6499616
MIGRATION_REQUEST_ID:6499622
MIGRATION_REQUEST_ID:6499704
MIGRATION_REQUEST_ID:6499716
MIGRATION_REQUEST_ID:6499727
MIGRATION_REQUEST_ID:6499947
MIGRATION_REQUEST_ID:6499987
MIGRATION_REQUEST_ID:6500006
MIGRATION_REQUEST_ID:6500022
MIGRATION_REQUEST_ID:6500028
MIGRATION_REQUEST_ID:6500037
MIGRATION_REQUEST_ID:6500053
MIGRATION_REQUEST_ID:6500059
MIGRATION_REQUEST_ID:6500082
MIGRATION_REQUEST_ID:6500090
MIGRATION_REQUEST_ID:6500114
MIGRATION_REQUEST_ID:6500129
MIGRATION_REQUEST_ID:6500147
MIGRATION_REQUEST_ID:6500163
MIGRATION_REQUEST_ID:6500177
MIGRATION_REQUEST_ID:6500220
MIGRATION_REQUEST_ID:6500257
MIGRATION_REQUEST_ID:6500280
MIGRATION_REQUEST_ID:6500308
belforte@vocms0755/dbs-logs>
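To follow any of these request IDs across pods, the `update migration request ... to status ...` lines can be merged into one time-ordered history. This is a minimal sketch that assumes the log line layout shown in the grep output for request 6499526. The names `Transition` and `status_timeline` are hypothetical, and the regular expression is only fitted to the lines quoted in this issue.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches e.g.
# "<file>:[2026-04-22 08:09:48.920184766 +0000 UTC m=+...] migrate.go:1378:
#  update migration request 6499526 to status 1"
LINE_RE = re.compile(
    r"\[(?P<sec>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.(?P<frac>\d+))? [^\]]*\]"
    r".*?update migration request (?P<req>\d+) to status (?P<status>\d+)"
)


class Transition(NamedTuple):
    timestamp: str   # seconds plus fraction padded to 9 digits, sortable as text
    source: str      # log file name prefix added by grep, if any
    request_id: int
    status: int


def status_timeline(lines: Iterable[str], request_id: int) -> List[Transition]:
    """Collect the status updates of one migration request, ordered by time."""
    found = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or int(match["req"]) != request_id:
            continue
        # Pad the fraction so timestamps with trimmed zeros still sort correctly.
        frac = (match["frac"] or "").ljust(9, "0")
        source = line[: match.start()].rstrip(":")
        found.append(
            Transition(f"{match['sec']}.{frac}", source, request_id, int(match["status"]))
        )
    return sorted(found)
```

Feeding it the nine lines quoted above for request 6499526 gives the merged status sequence 1, 3, 1, 3, 1, 3, 1, 3, 9, alternating between the two pods.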
The DBS migration pods were continuously restarting that day, due to a memory issue with too many lumis in a different migration. I do not know if that could be the origin of the problem.
I have been unable to find when the block was inserted in phys03.