Running Inferelator on NYU HPC with KVS #64

Closed
kostyat opened this issue May 25, 2018 · 7 comments

Comments

@kostyat
Contributor

kostyat commented May 25, 2018

I was able to set up the code to run on the NYU HPC with KVS. It passed the unit tests (using bash run_unittests.sh), but when I set up an interactive session and ran

time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n 1 python bsubtilis_bbsr_workflow_runner.py'

I first got the following output:

2018-05-24 16:40:45,039 INFO     kvs            : Setting queue size to 4000
2018-05-24 16:40:45,040 INFO     kvs            : Server running at 172.16.2.XXX:35887.
2018-05-24 16:40:46,087 INFO     kvs            : Accepted connect from ('172.16.2.XXX', 49532)
/home/username/inferelator_ng/inferelator_ng/bbsr_python.py:119: RuntimeWarning: invalid value encountered in multiply
  bics_sum = np.sum(np.multiply(combos.transpose(),bics[:, np.newaxis]).transpose(),1)

Then nothing for about 30 minutes, after which all of this suddenly appeared at once:

2018-05-24 17:14:06,440 INFO     kvs            : Closing connection from ('172.16.2.XXX', 49532)
Creating design and response matrix ... 
Setting up TFA specific response matrix ... 
Computing Transcription Factor Activity ... 
Bootstrap 1 of 2
Calculating MI, Background MI, and CLR Matrix
Calculating betas using BBSR
Progress: computing BBSR for gene 0
Progress: computing BBSR for gene 100
Progress: computing BBSR for gene 200
Progress: computing BBSR for gene 300
Progress: computing BBSR for gene 400
Progress: computing BBSR for gene 500
Progress: computing BBSR for gene 600
Progress: computing BBSR for gene 700
Progress: computing BBSR for gene 800
Progress: computing BBSR for gene 900
Progress: computing BBSR for gene 1000
Progress: computing BBSR for gene 1100
Progress: computing BBSR for gene 1200
Progress: computing BBSR for gene 1300
Progress: computing BBSR for gene 1400
Progress: computing BBSR for gene 1500
Progress: computing BBSR for gene 1600
Progress: computing BBSR for gene 1700
Progress: computing BBSR for gene 1800
Progress: computing BBSR for gene 1900
Progress: computing BBSR for gene 2000
Progress: computing BBSR for gene 2100
Progress: computing BBSR for gene 2200
Progress: computing BBSR for gene 2300
Progress: computing BBSR for gene 2400
Progress: computing BBSR for gene 2500
Progress: computing BBSR for gene 2600
Progress: computing BBSR for gene 2700
Progress: computing BBSR for gene 2800
Progress: computing BBSR for gene 2900
Progress: computing BBSR for gene 3000
Progress: computing BBSR for gene 3100
Progress: computing BBSR for gene 3200
Progress: computing BBSR for gene 3300
Progress: computing BBSR for gene 3400
Progress: computing BBSR for gene 3500
Progress: computing BBSR for gene 3600
Progress: computing BBSR for gene 3700
Progress: computing BBSR for gene 3800
Progress: computing BBSR for gene 3900
Progress: computing BBSR for gene 4000
Progress: computing BBSR for gene 4100
Progress: computing BBSR for gene 4200
('final s', 4218)
Bootstrap 2 of 2
Calculating MI, Background MI, and CLR Matrix
Calculating betas using BBSR
Progress: computing BBSR for gene 0
Progress: computing BBSR for gene 100
Progress: computing BBSR for gene 200
Progress: computing BBSR for gene 300
Progress: computing BBSR for gene 400
Progress: computing BBSR for gene 500
Progress: computing BBSR for gene 600
Progress: computing BBSR for gene 700
Progress: computing BBSR for gene 800
Progress: computing BBSR for gene 900
Progress: computing BBSR for gene 1000
Progress: computing BBSR for gene 1100
Progress: computing BBSR for gene 1200
Progress: computing BBSR for gene 1300
Progress: computing BBSR for gene 1400
Progress: computing BBSR for gene 1500
Progress: computing BBSR for gene 1600
Progress: computing BBSR for gene 1700
Progress: computing BBSR for gene 1800
Progress: computing BBSR for gene 1900
Progress: computing BBSR for gene 2000
Progress: computing BBSR for gene 2100
Progress: computing BBSR for gene 2200
Progress: computing BBSR for gene 2300
Progress: computing BBSR for gene 2400
Progress: computing BBSR for gene 2500
Progress: computing BBSR for gene 2600
Progress: computing BBSR for gene 2700
Progress: computing BBSR for gene 2800
Progress: computing BBSR for gene 2900
Progress: computing BBSR for gene 3000
Progress: computing BBSR for gene 3100
Progress: computing BBSR for gene 3200
Progress: computing BBSR for gene 3300
Progress: computing BBSR for gene 3400
Progress: computing BBSR for gene 3500
Progress: computing BBSR for gene 3600
Progress: computing BBSR for gene 3700
Progress: computing BBSR for gene 3800
Progress: computing BBSR for gene 3900
Progress: computing BBSR for gene 4000
Progress: computing BBSR for gene 4100
Progress: computing BBSR for gene 4200
('final s', 4218)
2018-05-24 17:14:06,510 INFO     kvs            : Server shutting down

real	33m21.640s
user	0m0.176s
sys	0m0.141s

The precision-recall curve in the output looked correct too.

Does anyone have any idea why this 30-minute pause might be happening? Is it related to the problem addressed by pull request #63? By the way, this happens consistently: I ran it 3 times and got the same ~30-minute pause (give or take a minute or two) every time.

@nickdeveaux @dayanne-castro

@dayanne-castro
Collaborator

dayanne-castro commented May 25, 2018 via email

@kostyat
Contributor Author

kostyat commented May 25, 2018

@dayanne-castro, I see, thank you. But then why does it report user and system times that are both well under a second? To me that looks like only 0.176 seconds of computational time was spent, and the rest was spent waiting. Also, I thought the B. subtilis data takes about 5 minutes to run even on 1 core on a laptop. Also, do you mean how all clusters work or how the NYU HPC cluster works? I don't recall it being that way back when I used PBS on Mercer.
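The real-versus-user/sys comparison above can be reproduced from Python as a rough sanity check. This is only a sketch: it assumes Python 3 on a POSIX system (the thread itself uses Python 2.7), and `time_command` is a hypothetical helper name, not part of inferelator_ng or kvsstcp.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (wall, child_user, child_sys) in seconds."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_* only covers child processes that this process waited for.
    return (
        wall,
        after.children_user - before.children_user,
        after.children_system - before.children_system,
    )


if __name__ == "__main__":
    # A child that only sleeps: wall time grows, CPU time stays near zero.
    wall, user, system = time_command(
        [sys.executable, "-c", "import time; time.sleep(1)"]
    )
    print("real %.2fs  user %.2fs  sys %.2fs" % (wall, user, system))
```

A large gap between wall time and user + sys suggests the timed process tree was mostly waiting. One caveat: only children that the timed process waits on are counted, so work started through a job launcher may run outside that process tree, and a low CPU figure alone does not prove that nothing was computed.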

@kostyat
Contributor Author

kostyat commented May 25, 2018

@dayanne-castro , I wanted to test that out by increasing the number of cores. So instead of 4 I requested 8 cores (in an interactive session). I checked that I did that correctly by running

import os
os.environ['SLURM_CPUS_ON_NODE']

in Python, and it gave the right number (8). But it still took about 30 minutes before anything was printed. This was the end of the output:


real	32m22.296s
user	0m0.140s
sys	0m0.160s

So to me it looks like those 30 minutes are spent just waiting for something, not doing any actual computation (otherwise I would expect the run time to go down after I switched from 4 CPUs to 8 CPUs).
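For reference, the core-count check described in this comment could be written a bit more defensively, so that it also works outside a Slurm session. This is a sketch only: `allocated_cpus` and its `default` parameter are hypothetical names; the only thing taken from the thread is the `SLURM_CPUS_ON_NODE` environment variable.

```python
import os


def allocated_cpus(env=None, default=1):
    """Return the CPU count Slurm reports for this node, or `default`."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_CPUS_ON_NODE")
    if raw is None:
        # Not running under Slurm (or the variable was not exported).
        return default
    try:
        return int(raw)
    except ValueError:
        # Unexpected format: fall back rather than crash.
        return default


if __name__ == "__main__":
    print("CPUs available on this node:", allocated_cpus())
```

Passing the environment as an argument keeps the function easy to test without an actual allocation.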

@dayanne-castro
Collaborator

dayanne-castro commented May 25, 2018 via email

@kostyat
Contributor Author

kostyat commented May 25, 2018

Sure, just let me know next time you're in CDS. I might be in tomorrow, and definitely on Monday and after.

I first start the interactive session like this:

srun -c4 -t2:00:00 --mem=4000 --pty /bin/bash

Then I load the modules and start the python virtual environment (as recommended to me by Shenglong in order to get pybedtools working):

export PYTHONPATH=$PYTHONPATH:$(pwd)/kvsstcp
module load r/intel/3.4.2 python/intel/2.7.12 bedtools/intel/2.26.0
source /home/kmt331/inferelator_ng/py2.7/bin/activate

Then I execute the KVS/python command:

time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'srun -n 1 python bsubtilis_bbsr_workflow_runner.py'

@kostyat
Contributor Author

kostyat commented May 25, 2018

Wait, I see now that I'm running srun redundantly (the interactive session was already started with srun)... I think that might be the issue... oops

@kostyat
Contributor Author

kostyat commented May 25, 2018

Yeah... changing the last line to

time python ~/inferelator_ng/kvsstcp/kvsstcp.py --execcmd 'python bsubtilis_bbsr_workflow_runner.py'

fixed the delay issue. Thanks!
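A small guard along these lines might help catch the same mistake earlier. It is a heuristic sketch, not something from inferelator_ng or kvsstcp: `is_nested_srun` is a hypothetical helper, and it assumes that `SLURM_JOB_ID` being set means the shell is already inside a Slurm job. The thread does not establish why the nested call stalled, and calling srun inside a job is legitimate in other setups (for example batch scripts), so treat a positive result as a prompt to double-check rather than as an error.

```python
import os
import shlex


def is_nested_srun(execcmd, env=None):
    """True if execcmd starts with srun while a Slurm job is already active."""
    env = os.environ if env is None else env
    words = shlex.split(execcmd)
    # Compare the basename so that /usr/bin/srun is recognized too.
    launches_srun = bool(words) and os.path.basename(words[0]) == "srun"
    return launches_srun and "SLURM_JOB_ID" in env


if __name__ == "__main__":
    cmd = "srun -n 1 python bsubtilis_bbsr_workflow_runner.py"
    if is_nested_srun(cmd):
        print("Warning: --execcmd starts with srun inside an existing Slurm job")
```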

kostyat closed this as completed May 25, 2018