Running Inferelator on NYU HPC with KVS #64
Kostya,
This break happens on the cluster: the print statements all come out
together at the end of the run (the issue is not specific to KVS, but has
something to do with how the cluster works!)
Dayanne
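One common reason for progress lines arriving all at once is stream buffering: when stdout is not attached to a terminal, Python usually block-buffers it, so lines only show up when the buffer fills or the process exits. Whether that is what happens on this cluster is an assumption, not something confirmed in this thread. A minimal sketch of flushing after every progress line (the helper name report_progress is hypothetical and not part of inferelator_ng):

```python
import sys


def report_progress(gene_index, stream=sys.stdout):
    """Write one progress line and flush it right away.

    Hypothetical helper: the explicit flush() asks Python to hand the
    line to the operating system immediately instead of holding it in
    an internal buffer until the buffer fills or the process exits.
    """
    stream.write("Progress: computing BBSR for gene {}\n".format(gene_index))
    stream.flush()


if __name__ == "__main__":
    for gene in range(0, 300, 100):
        report_progress(gene)
```

Without touching the code, running the interpreter as python -u or setting PYTHONUNBUFFERED=1 has a similar effect on Python's own buffering. Neither helps if the delay comes from somewhere else, such as the launcher holding the output.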
--
…On Fri, May 25, 2018, 10:35 AM kostyat ***@***.***> wrote:
So I was able to set up the code to run on the NYU HPC with KVS. It passed
the unit tests (using bash run_unittests.sh), but then when I set up an
interactive session and ran
time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n 1 python bsubtilis_bbsr_workflow_runner.py'
I first got the following output:
2018-05-24 16:40:45,039 INFO kvs : Setting queue size to 4000
2018-05-24 16:40:45,040 INFO kvs : Server running at 172.16.2.XXX:35887.
2018-05-24 16:40:46,087 INFO kvs : Accepted connect from ('172.16.2.XXX', 49532)
/home/username/inferelator_ng/inferelator_ng/bbsr_python.py:119: RuntimeWarning: invalid value encountered in multiply
bics_sum = np.sum(np.multiply(combos.transpose(),bics[:, np.newaxis]).transpose(),1)
Then nothing for 30 minutes, then after a 30 minute break suddenly this:
2018-05-24 17:14:06,440 INFO kvs : Closing connection from ('172.16.2.XXX', 49532)
Creating design and response matrix ...
Setting up TFA specific response matrix ...
Computing Transcription Factor Activity ...
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Calculating betas using BBSR
Progress: computing BBSR for gene 0
Progress: computing BBSR for gene 100
Progress: computing BBSR for gene 200
Progress: computing BBSR for gene 300
Progress: computing BBSR for gene 400
Progress: computing BBSR for gene 500
Progress: computing BBSR for gene 600
Progress: computing BBSR for gene 700
Progress: computing BBSR for gene 800
Progress: computing BBSR for gene 900
Progress: computing BBSR for gene 1000
Progress: computing BBSR for gene 1100
Progress: computing BBSR for gene 1200
Progress: computing BBSR for gene 1300
Progress: computing BBSR for gene 1400
Progress: computing BBSR for gene 1500
Progress: computing BBSR for gene 1600
Progress: computing BBSR for gene 1700
Progress: computing BBSR for gene 1800
Progress: computing BBSR for gene 1900
Progress: computing BBSR for gene 2000
Progress: computing BBSR for gene 2100
Progress: computing BBSR for gene 2200
Progress: computing BBSR for gene 2300
Progress: computing BBSR for gene 2400
Progress: computing BBSR for gene 2500
Progress: computing BBSR for gene 2600
Progress: computing BBSR for gene 2700
Progress: computing BBSR for gene 2800
Progress: computing BBSR for gene 2900
Progress: computing BBSR for gene 3000
Progress: computing BBSR for gene 3100
Progress: computing BBSR for gene 3200
Progress: computing BBSR for gene 3300
Progress: computing BBSR for gene 3400
Progress: computing BBSR for gene 3500
Progress: computing BBSR for gene 3600
Progress: computing BBSR for gene 3700
Progress: computing BBSR for gene 3800
Progress: computing BBSR for gene 3900
Progress: computing BBSR for gene 4000
Progress: computing BBSR for gene 4100
Progress: computing BBSR for gene 4200
('final s', 4218)
Bootstrap 2 of 2
Calculating MI, Background MI, and CLR Matrix
Calculating betas using BBSR
Progress: computing BBSR for gene 0
Progress: computing BBSR for gene 100
Progress: computing BBSR for gene 200
Progress: computing BBSR for gene 300
Progress: computing BBSR for gene 400
Progress: computing BBSR for gene 500
Progress: computing BBSR for gene 600
Progress: computing BBSR for gene 700
Progress: computing BBSR for gene 800
Progress: computing BBSR for gene 900
Progress: computing BBSR for gene 1000
Progress: computing BBSR for gene 1100
Progress: computing BBSR for gene 1200
Progress: computing BBSR for gene 1300
Progress: computing BBSR for gene 1400
Progress: computing BBSR for gene 1500
Progress: computing BBSR for gene 1600
Progress: computing BBSR for gene 1700
Progress: computing BBSR for gene 1800
Progress: computing BBSR for gene 1900
Progress: computing BBSR for gene 2000
Progress: computing BBSR for gene 2100
Progress: computing BBSR for gene 2200
Progress: computing BBSR for gene 2300
Progress: computing BBSR for gene 2400
Progress: computing BBSR for gene 2500
Progress: computing BBSR for gene 2600
Progress: computing BBSR for gene 2700
Progress: computing BBSR for gene 2800
Progress: computing BBSR for gene 2900
Progress: computing BBSR for gene 3000
Progress: computing BBSR for gene 3100
Progress: computing BBSR for gene 3200
Progress: computing BBSR for gene 3300
Progress: computing BBSR for gene 3400
Progress: computing BBSR for gene 3500
Progress: computing BBSR for gene 3600
Progress: computing BBSR for gene 3700
Progress: computing BBSR for gene 3800
Progress: computing BBSR for gene 3900
Progress: computing BBSR for gene 4000
Progress: computing BBSR for gene 4100
Progress: computing BBSR for gene 4200
('final s', 4218)
2018-05-24 17:14:06,510 INFO kvs : Server shutting down
real 33m21.640s
user 0m0.176s
sys 0m0.141s
The precision-recall curve in the output looked correct too.
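To put a number on the silent period, the timestamps that KVS writes can be subtracted directly. This is only a sketch: it assumes every KVS log line starts with a 23-character timestamp in the format shown above, and the helper name seconds_between is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used by the KVS log lines quoted in this thread,
# e.g. "2018-05-24 16:40:46,087" (comma before the milliseconds).
KVS_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(first_line, second_line):
    """Return the seconds elapsed between two timestamped log lines."""
    def stamp(line):
        # The timestamp occupies the first 23 characters of the line.
        return datetime.strptime(line[:23], KVS_TIME_FORMAT)
    return (stamp(second_line) - stamp(first_line)).total_seconds()


# Last timestamped line before the pause and first one after it.
last_before_pause = (
    "2018-05-24 16:40:46,087 INFO kvs : Accepted connect from "
    "('172.16.2.XXX', 49532)"
)
first_after_pause = (
    "2018-05-24 17:14:06,440 INFO kvs : Closing connection from "
    "('172.16.2.XXX', 49532)"
)

gap = seconds_between(last_before_pause, first_after_pause)
print("silent for {:.1f} s (about {:.1f} min)".format(gap, gap / 60))
```

The gap between those two lines accounts for nearly all of the 33m21s wall-clock time reported by time, so whatever happens between the client connecting and disconnecting is where the run spends its time.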
Does anyone have any idea why this 30 minute pause might be happening? Is
this related to the problem that's addressed by pull request #63?
@nickdeveaux @dayanne-castro
|
@dayanne-castro, I see, thank you. But then, why does it say that user and system time were both under 1 minute? To me that looks like only 0.176 seconds of computational time were spent, and the rest was spent waiting. Also, I thought the B. subtilis data takes about 5 minutes to run even on 1 core on a laptop. Also, do you mean how all clusters work, or how the NYU HPC cluster works? I don't recall it being that way back when I used PBS on Mercer. |
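For reference, real in the time output is wall-clock time, while user and sys are CPU time charged to the timed command (and child processes it waited for). A process that mostly waits shows the same pattern as above: large real, tiny user and sys. A small Python sketch of the distinction, where measure is a hypothetical helper name:

```python
import time


def measure(fn):
    """Run fn() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process only
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


# Sleeping is pure waiting: wall-clock time passes, almost no CPU is used.
wall, cpu = measure(lambda: time.sleep(0.2))
print("wall {:.3f} s, cpu {:.3f} s".format(wall, cpu))
```

This only shows that the timed command itself was idle. It does not say where the workflow's computation ran; as an assumption, work started through srun may be accounted to Slurm's own processes rather than to the command being timed.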
@dayanne-castro, I wanted to test that out by increasing the number of cores. So instead of 4 I requested 8 cores (in an interactive session). I checked that I did that correctly by running
os.environ['SLURM_CPUS_ON_NODE']
in Python, and it gave the right number (8). But the amount of time that it took for it to print something was still 30 minutes. This was the end of the output:
real 32m22.296s
user 0m0.140s
sys 0m0.160s
So to me it looks like those 30 minutes are just waiting for something, and not actually doing any computation (because otherwise I would expect the amount of time it takes to go down after I switched from 4 CPUs to 8 CPUs). |
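The SLURM_CPUS_ON_NODE check mentioned in this thread reports what the allocation holds, which is not necessarily what a given process is allowed to use. A hedged sketch that reports both, where cpu_report is a hypothetical helper and os.sched_getaffinity is only available on Linux:

```python
import os


def cpu_report(environ=os.environ):
    """Compare the Slurm allocation with the CPUs this process may use.

    Returns a dict with:
      slurm_cpus_on_node: value of SLURM_CPUS_ON_NODE as int, or None
                          when the variable is not set.
      usable_cpus:        number of CPUs in this process's affinity
                          mask, or None where the call is unavailable.
    """
    raw = environ.get("SLURM_CPUS_ON_NODE")
    report = {"slurm_cpus_on_node": int(raw) if raw is not None else None}
    if hasattr(os, "sched_getaffinity"):
        report["usable_cpus"] = len(os.sched_getaffinity(0))
    else:
        report["usable_cpus"] = None
    return report


if __name__ == "__main__":
    print(cpu_report())
```

Calling this from inside the workflow script, rather than from a separate Python prompt, would show what the launched process actually sees. If the two numbers differ, the process is running with fewer CPUs than the session requested.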
How are you running it? Looks like your KVS is running on a single
core... Happy to meet in person.
…On Fri, May 25, 2018 at 3:30 PM, kostyat ***@***.***> wrote:
@dayanne-castro, I wanted to test
that out by increasing the number of cores. So instead of 4 I requested 8
cores (in an interactive session). I checked that I did that correctly by
running
os.environ['SLURM_CPUS_ON_NODE']
in Python, and it gave the right number (8). But the amount of time that
it took for it to print something was still 30 minutes. This was the end of
the output:
real 32m22.296s
user 0m0.140s
sys 0m0.160s
So to me it looks like those 30 minutes are just waiting for something,
and not actually doing any computation (because otherwise I would expect
the amount of time it takes to go down after I switched from 4 CPUs to 8
CPUs).
|
sure, just let me know next time you're in CDS. i might be in tomorrow and definitely on monday and after. I first start the interactive session like this:
Then I load the modules and start the python virtual environment (as recommended to me by Shenglong in order to get pybedtools working):
Then I execute the KVS/python command:
|
wait, i see now that i'm running srun redundantly... i think that might be the issue... oops |
yeah... changing the last line to
fixed the delay issue. thanks! |
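The exact replacement command is not preserved in this thread, so the sketch below does not try to reconstruct it. It only illustrates a guard against the situation described: passing an srun command to --execcmd while already inside an srun-launched session. It assumes that Slurm sets SLURM_STEP_ID inside a job step (which may differ between Slurm versions and sites), and nested_srun_warning is a hypothetical helper, not part of kvsstcp or inferelator_ng.

```python
import os
import shlex


def nested_srun_warning(execcmd, environ=os.environ):
    """Return a warning string if execcmd would nest srun, else None.

    Assumption: SLURM_STEP_ID being present in the environment means
    this process already runs inside a job step started by srun.
    """
    inside_step = "SLURM_STEP_ID" in environ
    launches_srun = "srun" in shlex.split(execcmd)
    if inside_step and launches_srun:
        return (
            "execcmd starts srun from inside an existing job step; "
            "the inner srun may wait for resources the session holds"
        )
    return None


if __name__ == "__main__":
    cmd = "srun -n 1 python bsubtilis_bbsr_workflow_runner.py"
    print(nested_srun_warning(cmd))
```

A check like this could run before the command is launched, so that a redundant srun is reported immediately instead of showing up as a long silent wait.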
So I was able to set up the code to run on the NYU HPC with KVS. It passed the unit tests (using bash run_unittests.sh), but then when I set up an interactive session and ran
time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n 1 python bsubtilis_bbsr_workflow_runner.py'
I first got the following output:
Then nothing for 30 minutes, then after a 30 minute break suddenly this:
The precision-recall curve in the output looked correct too.
Does anyone have any idea why this 30 minute pause might be happening? Is this related to the problem that's addressed by pull request #63? By the way, this happens consistently: I ran it 3 times and got the same ~30 minute pause (plus or minus a minute or two) every time.
@nickdeveaux @dayanne-castro