-
-
Notifications
You must be signed in to change notification settings - Fork 2
Add optional alignment step to NMT jobs, temporary implementation of eflomal #169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #169 +/- ##
==========================================
+ Coverage 88.55% 88.62% +0.07%
==========================================
Files 276 277 +1
Lines 16498 16630 +132
==========================================
+ Hits 14609 14739 +130
- Misses 1889 1891 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I changed the base image for the main Dockerfile to be able to install eflomal. This is something we had done in silnlp (in that case we went from the CUDA base to the basic Ubuntu base) because pytorch comes with all of the necessary CUDA drivers anyway. The tests are passing, but I wanted to double check to see if there would be any other reason why we shouldn't do this for machine.py? Something I hadn't considered until the builds failed is that we haven't yet found a way to get eflomal running on non-Linux OSs. Is that going to be an issue? |
We want it to not crash on windows machines. That should be enough for a "temporary" solution. |
If we can truly change the docker (and the docker CPU only) image to both be from the python base image, they should mirror each other, except that the CPU only image excludes the GPU extras (and whatever else I am missing). |
Can this be gated to exclude windows builds? What do we know about MAC's? |
... using Eflomal (linux only) and returns the alignments as well as the tokenized source and target with the pretranslations. |
I am going to assume that this all works. Is there a unit test we can run to make sure that the basic functionality is working (that is also, of course, only run on Linux machines)? |
Is this a different error you were fixing? What is the reason for this change? |
So, you are just overwriting the file? That should be ok. I just want to make sure that the (new) file is uploaded to the S3 bucket when it is done. |
Previously, johnml1135 (John Lambert) wrote…
I understand now. Are any corresponding changes needed in machine? |
We should test that we actually can call eflomal (but only in Linux). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 13 files at r1, all commit messages.
Reviewable status: 12 of 13 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @isaac091)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 12 of 13 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
pyproject.toml
line 66 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Can this be gated to exclude windows builds? What do we know about MAC's?
I don't think we've had anyone try to use eflomal on macOS. It looks like there is a poetry option to constrain eflomal to linux builds.
machine/jobs/build_nmt_engine.py
line 96 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
... using Eflomal (linux only) and returns the alignments as well as the tokenized source and target with the pretranslations.
Done.
machine/jobs/eflomal_aligner.py
line 4 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I am going to assume that this all works. Is there a unit test we can run to make sure that the basic functionality is working (that is also, of course, only run on Linux machines)?
See below comment. I also tested it on its own w/ a whole Bible.
machine/jobs/smt_engine_build_job.py
line 112 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I understand now. Are any corresponding changes needed in machine?
I'm not entirely sure but I would guess not because the changes are only in jobs-related code.
machine/jobs/translation_engine_build_job.py
line 121 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So, you are just overwriting the file? That should be ok. I just want to make sure that the (new) file is uploaded to the S3 bucket when it is done.
Yes, I'm just overwriting the file. Is that enough to make sure it gets uploaded?
tests/jobs/test_nmt_engine_build_job.py
line 106 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
We should test that we actually can call eflomal (but only in Linux).
I had run these tests with "align_pretranslations" in the test environment config and it does the alignment for the single sentence. Should I add that step to the test officially, or should it have its own unit test? I'll make sure to add a platform check.
Previously, isaac091 (Isaac Schifferer) wrote…
How about we just limit it to linux? If we need MacOS, we can add it later. |
Previously, isaac091 (Isaac Schifferer) wrote…
ok. |
Previously, isaac091 (Isaac Schifferer) wrote…
It looks like it does - we can confirm in integration tests. |
Previously, isaac091 (Isaac Schifferer) wrote…
Either one - it should be tested in CI/CD (but only in linux). |
Previously, isaac091 (Isaac Schifferer) wrote…
Did you upload it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 13 files at r1.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @isaac091)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @isaac091 and @johnml1135)
pyproject.toml
line 66 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
How about we just limit it to linux? If we need MacOS, we can add it later.
This dependency should be part of the jobs
extra.
machine/jobs/build_nmt_engine.py
line 96 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Did you upload it?
I feel like this should be in the build options.
machine/jobs/translation_engine_build_job.py
line 121 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
It looks like it does - we can confirm in integration tests.
I understand that this makes the code cleaner, but it isn't very efficient. We end up reading/writing three potentially large files from the S3 bucket twice. Can we find a way to refactor this to avoid the extra reading/writing?
machine/jobs/nmt_engine_build_job.py
line 131 at r1 (raw file):
writer: DictToJsonWriter, ) -> None: source_segments = [pi["pretranslation"] for pi in batch]
Why was this changed to pretranslation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)
pyproject.toml
line 66 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This dependency should be part of the
jobs
extra.
Ok, that makes sense. Could I then just make eflomal optional so that it only installs when jobs
is added specifically, or will the Windows/macOS builds also add jobs
, i.e. will they use the cpu_only
dockerfile? Or I can do eflomal = { markers = "sys_platform == 'linux'", version = "^2.0.0" }
.
Maybe this isn't directly relevant, but I'm a little confused as to why we have extras defined at all if none of the dependencies in them are marked as optional, because as it currently stands, all of the dependencies are installed regardless of whether any extras are specified.
machine/jobs/build_nmt_engine.py
line 96 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I feel like this should be in the build options.
@johnml1135 Sorry, this was confusing, it'll be in the next commit.
@ddaspit So would I put a default in settings.yaml
? (at huggingface.align_pretranslations
?) Or just check if it's in the config first, like I'm doing now?
machine/jobs/nmt_engine_build_job.py
line 131 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Why was this changed to
pretranslation
?
Just to match with everything else, I can change it back if you don't think it's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made some progress on making eflomal only run on linux, but I'm not there yet. For macOS/Windows, the builds are failing at the "Lint with pyright" step where I have my (conditional) import of eflomal in eflomal_aligner.py
. I don't see any reason why the conditional would evaluate to True because eflomal is not being installed, so I'm thinking that because it's a linter, it's looking at all the import statements regardless of where they're at. Is there a workaround to this?
On the linux side, the build succeeds only when I check for eflomal based on EFLOMAL_PATH, so I think EFLOMAL_PATH is pointing to the wrong place and it therefore does not try to do alignment in that case. On Monday I'll dig through the build logs to try to find the right path, unless either of you know where to find it off the top of your head :)
Reviewable status: 6 of 13 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved!
Reviewable status: 5 of 14 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)
tests/jobs/test_nmt_engine_build_job.py
line 106 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Either one - it should be tested in CI/CD (but only in linux).
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 14 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/translation_engine_build_job.py
line 121 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I understand that this makes the code cleaner, but it isn't very efficient. We end up reading/writing three potentially large files from the S3 bucket twice. Can we find a way to refactor this to avoid the extra reading/writing?
I got rid of the redundant read in _align
. Alternatively, it should be simple enough to rewrite the inference step so that the translation and alignment is done all at once and the output file is only written once overall, if that's what you were thinking of instead? Of course, it will be less clean code like you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 7 files at r2, 2 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @isaac091 and @johnml1135)
pyproject.toml
line 66 at r1 (raw file):
Previously, isaac091 (Isaac Schifferer) wrote…
Ok, that makes sense. Could I then just make eflomal optional so that it only installs when
jobs
is added specifically, or will the Windows/macOS builds also addjobs
, i.e. will they use thecpu_only
dockerfile? Or I can doeflomal = { markers = "sys_platform == 'linux'", version = "^2.0.0" }
.Maybe this isn't directly relevant, but I'm a little confused as to why we have extras defined at all if none of the dependencies in them are marked as optional, because as it currently stands, all of the dependencies are installed regardless of whether any extras are specified.
The extras are used when installing Machine as a pip package. When using poetry install
or poetry export
all non-optional dependencies are installed without regard to whether a dependency is part of an extra or not. We can make all of the dependencies specified in extras optional, but we will need to make sure to call poetry install --all-extras
on dev machines.
machine/jobs/build_nmt_engine.py
line 96 at r1 (raw file):
Previously, isaac091 (Isaac Schifferer) wrote…
@johnml1135 Sorry, this was confusing, it'll be in the next commit.
@ddaspit So would I put a default in
settings.yaml
? (athuggingface.align_pretranslations
?) Or just check if it's in the config first, like I'm doing now?
Yes, there should be a default in the settings.yaml
.
machine/jobs/nmt_engine_build_job.py
line 131 at r1 (raw file):
Previously, isaac091 (Isaac Schifferer) wrote…
Just to match with everything else, I can change it back if you don't think it's necessary.
Yes, you can change it back.
machine/jobs/translation_engine_build_job.py
line 121 at r1 (raw file):
Previously, isaac091 (Isaac Schifferer) wrote…
I got rid of the redundant read in
_align
. Alternatively, it should be simple enough to rewrite the inference step so that the translation and alignment is done all at once and the output file is only written once overall, if that's what you were thinking of instead? Of course, it will be less clean code like you mentioned.
Yes, I think we should write the translation and alignment all at once.
@ddaspit - can you bring this over the finish line in terms of reveiwing the code for me? |
@johnml1135 Yes, I will make sure the code review gets done. |
As far as tracking eflomal's progress, the eflomal executable can print out its progress, but with
subprocess.run
, I could only get the outputs all at once in the return value. I'm not sure if there's a way to pipe the stdout of the subprocess to the main stdout in real time, but it may not be worth the effort because the alignment is pretty quick (~30 seconds to align 1000 verses locally).This change is