Skip to content

Add optional alignment step to NMT jobs, temporary implementation of eflomal #169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

isaac091
Copy link
Collaborator

@isaac091 isaac091 commented Apr 7, 2025

As far as tracking eflomal's progress, the eflomal executable can print out its progress, but with subprocess.run, I could only get the outputs all at once in the return value. I'm not sure if there's a way to pipe the stdout of the subprocess to the main stdout in real time, but it may not be worth the effort because the alignment is pretty quick (~30 seconds to align 1000 verses locally).


This change is Reviewable

@isaac091 isaac091 requested review from ddaspit and johnml1135 April 7, 2025 19:29
@codecov-commenter
Copy link

codecov-commenter commented Apr 10, 2025

Codecov Report

Attention: Patch coverage is 91.03448% with 13 lines in your changes missing coverage. Please review.

Project coverage is 88.62%. Comparing base (a3d7276) to head (881a795).

Files with missing lines Patch % Lines
machine/jobs/nmt_engine_build_job.py 50.00% 4 Missing ⚠️
machine/jobs/translation_file_service.py 63.63% 4 Missing ⚠️
machine/jobs/translation_engine_build_job.py 90.00% 3 Missing ⚠️
machine/jobs/eflomal_aligner.py 97.59% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
+ Coverage   88.55%   88.62%   +0.07%     
==========================================
  Files         276      277       +1     
  Lines       16498    16630     +132     
==========================================
+ Hits        14609    14739     +130     
- Misses       1889     1891       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@isaac091
Copy link
Collaborator Author

I changed the base image for the main Dockerfile to be able to install eflomal. This is something we had done in silnlp (in that case we went from the CUDA base to the basic Ubuntu base) because pytorch comes with all of the necessary CUDA drivers anyway. The tests are passing, but I wanted to double check to see if there would be any other reason why we shouldn't do this for machine.py?

Something I hadn't considered until the builds failed is that we haven't yet found a way to get eflomal running on non-Linux OSs. Is that going to be an issue?

@johnml1135
Copy link
Collaborator

We want it to not crash on windows machines. That should be enough for a "temporary" solution.

@johnml1135
Copy link
Collaborator

dockerfile line 28 at r1 (raw file):



FROM python:$PYTHON_VERSION

If we can truly change the docker (and the docker CPU only) image to both be from the python base image, they should mirror each other, except that the CPU only image excludes the GPU extras (and whatever else I am missing).

@johnml1135
Copy link
Collaborator

pyproject.toml line 66 at r1 (raw file):

sentencepiece = "^0.2.0"
sil-thot = "^3.4.6"
eflomal = "^2.0.0"

Can this be gated to exclude windows builds? What do we know about MAC's?

@johnml1135
Copy link
Collaborator

machine/jobs/build_nmt_engine.py line 96 at r1 (raw file):

    parser.add_argument("--save-model", default=None, type=str, help="Save the model using the specified base name")
    parser.add_argument(
        "--align-pretranslations", default=False, action="store_true", help="Aligns source and target pretranslations"

... using Eflomal (linux only) and returns the alignments as well as the tokenized source and target with the pretranslations.

@johnml1135
Copy link
Collaborator

machine/jobs/eflomal_aligner.py line 4 at r1 (raw file):

# The vast majority of this code is taken from the silnlp repository.

import os

I am going to assume that this all works. Is there a unit test we can run to make sure that the basic functionality is working (that is also, of course, only run on Linux machines)?

@johnml1135
Copy link
Collaborator

machine/jobs/smt_engine_build_job.py line 112 at r1 (raw file):

    source_segments = [pi["pretranslation"] for pi in batch]
    for i, result in enumerate(engine.translate_batch(source_segments)):
        batch[i]["pretranslation"] = result.translation

Is this a different error you were fixing? What is the reason for this change?

@johnml1135
Copy link
Collaborator

machine/jobs/translation_engine_build_job.py line 121 at r1 (raw file):

                check_canceled()

            writer = stack.enter_context(self._translation_file_service.open_target_pretranslation_writer())

So, you are just overwriting the file? That should be ok. I just want to make sure that the (new) file is uploaded to the S3 bucket when it is done.

@johnml1135
Copy link
Collaborator

machine/jobs/smt_engine_build_job.py line 112 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Is this a different error you were fixing? What is the reason for this change?

I understand now. Are any corresponding changes needed in machine?

@johnml1135
Copy link
Collaborator

tests/jobs/test_nmt_engine_build_job.py line 106 at r1 (raw file):

        decoy.when(self.translation_file_service.exists_source_corpus()).then_return(True)
        decoy.when(self.translation_file_service.exists_target_corpus()).then_return(True)
        decoy.when(self.translation_file_service.get_source_pretranslations()).then_do(

We should test that we actually can call eflomal (but only in Linux).

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 13 files at r1, all commit messages.
Reviewable status: 12 of 13 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @isaac091)

Copy link
Collaborator Author

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 13 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)


pyproject.toml line 66 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Can this be gated to exclude windows builds? What do we know about MAC's?

I don't think we've had anyone try to use eflomal on macOS. It looks like there is a poetry option to constrain eflomal to linux builds.


machine/jobs/build_nmt_engine.py line 96 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

... using Eflomal (linux only) and returns the alignments as well as the tokenized source and target with the pretranslations.

Done.


machine/jobs/eflomal_aligner.py line 4 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I am going to assume that this all works. Is there a unit test we can run to make sure that the basic functionality is working (that is also, of course, only run on Linux machines)?

See below comment. I also tested it on its own w/ a whole Bible.


machine/jobs/smt_engine_build_job.py line 112 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I understand now. Are any corresponding changes needed in machine?

I'm not entirely sure but I would guess not because the changes are only in jobs-related code.


machine/jobs/translation_engine_build_job.py line 121 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

So, you are just overwriting the file? That should be ok. I just want to make sure that the (new) file is uploaded to the S3 bucket when it is done.

Yes, I'm just overwriting the file. Is that enough to make sure it gets uploaded?


tests/jobs/test_nmt_engine_build_job.py line 106 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

We should test that we actually can call eflomal (but only in Linux).

I had run these tests with "align_pretranslations" in the test environment config and it does the alignment for the single sentence. Should I add that step to the test officially, or should it have its own unit test? I'll make sure to add a platform check.

@johnml1135
Copy link
Collaborator

pyproject.toml line 66 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

I don't think we've had anyone try to use eflomal on macOS. It looks like there is a poetry option to constrain eflomal to linux builds.

How about we just limit it to linux? If we need MacOS, we can add it later.

@johnml1135
Copy link
Collaborator

machine/jobs/smt_engine_build_job.py line 112 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

I'm not entirely sure but I would guess not because the changes are only in jobs-related code.

ok.

@johnml1135
Copy link
Collaborator

machine/jobs/translation_engine_build_job.py line 121 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

Yes, I'm just overwriting the file. Is that enough to make sure it gets uploaded?

It looks like it does - we can confirm in integration tests.

@johnml1135
Copy link
Collaborator

tests/jobs/test_nmt_engine_build_job.py line 106 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

I had run these tests with "align_pretranslations" in the test environment config and it does the alignment for the single sentence. Should I add that step to the test officially, or should it have its own unit test? I'll make sure to add a platform check.

Either one - it should be tested in CI/CD (but only in linux).

@johnml1135
Copy link
Collaborator

machine/jobs/build_nmt_engine.py line 96 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

Done.

Did you upload it?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 13 files at r1.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @isaac091)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @isaac091 and @johnml1135)


pyproject.toml line 66 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How about we just limit it to linux? If we need MacOS, we can add it later.

This dependency should be part of the jobs extra.


machine/jobs/build_nmt_engine.py line 96 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Did you upload it?

I feel like this should be in the build options.


machine/jobs/translation_engine_build_job.py line 121 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It looks like it does - we can confirm in integration tests.

I understand that this makes the code cleaner, but it isn't very efficient. We end up reading/writing three potentially large files from the S3 bucket twice. Can we find a way to refactor this to avoid the extra reading/writing?


machine/jobs/nmt_engine_build_job.py line 131 at r1 (raw file):

    writer: DictToJsonWriter,
) -> None:
    source_segments = [pi["pretranslation"] for pi in batch]

Why was this changed to pretranslation?

Copy link
Collaborator Author

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)


pyproject.toml line 66 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This dependency should be part of the jobs extra.

Ok, that makes sense. Could I then just make eflomal optional so that it only installs when jobs is added specifically, or will the Windows/macOS builds also add jobs, i.e. will they use the cpu_only dockerfile? Or I can do eflomal = { markers = "sys_platform == 'linux'", version = "^2.0.0" }.

Maybe this isn't directly relevant, but I'm a little confused as to why we have extras defined at all if none of the dependencies in them are marked as optional, because as it currently stands, all of the dependencies are installed regardless of whether any extras are specified.


machine/jobs/build_nmt_engine.py line 96 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I feel like this should be in the build options.

@johnml1135 Sorry, this was confusing, it'll be in the next commit.

@ddaspit So would I put a default in settings.yaml? (at huggingface.align_pretranslations?) Or just check if it's in the config first, like I'm doing now?


machine/jobs/nmt_engine_build_job.py line 131 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why was this changed to pretranslation?

Just to match with everything else, I can change it back if you don't think it's necessary.

Copy link
Collaborator Author

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some progress on making eflomal only run on linux, but I'm not there yet. For macOS/Windows, the builds are failing at the "Lint with pyright" step where I have my (conditional) import of eflomal in eflomal_aligner.py. I don't see any reason why the conditional would evaluate to True because eflomal is not being installed, so I'm thinking that because it's a linter, it's looking at all the import statements regardless of where they're at. Is there a workaround to this?

On the linux side, the build succeeds only when I check for eflomal based on EFLOMAL_PATH, so I think EFLOMAL_PATH is pointing to the wrong place and it therefore does not try to do alignment in that case. On Monday I'll dig through the build logs to try to find the right path, unless either of you know where to find it off the top of your head :)

Reviewable status: 6 of 13 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Collaborator Author

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved!

Reviewable status: 5 of 14 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)


tests/jobs/test_nmt_engine_build_job.py line 106 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Either one - it should be tested in CI/CD (but only in linux).

Done.

Copy link
Collaborator Author

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 14 files reviewed, 6 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/translation_engine_build_job.py line 121 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I understand that this makes the code cleaner, but it isn't very efficient. We end up reading/writing three potentially large files from the S3 bucket twice. Can we find a way to refactor this to avoid the extra reading/writing?

I got rid of the redundant read in _align. Alternatively, it should be simple enough to rewrite the inference step so that the translation and alignment is done all at once and the output file is only written once overall, if that's what you were thinking of instead? Of course, it will be less clean code like you mentioned.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 7 files at r2, 2 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @isaac091 and @johnml1135)


pyproject.toml line 66 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

Ok, that makes sense. Could I then just make eflomal optional so that it only installs when jobs is added specifically, or will the Windows/macOS builds also add jobs, i.e. will they use the cpu_only dockerfile? Or I can do eflomal = { markers = "sys_platform == 'linux'", version = "^2.0.0" }.

Maybe this isn't directly relevant, but I'm a little confused as to why we have extras defined at all if none of the dependencies in them are marked as optional, because as it currently stands, all of the dependencies are installed regardless of whether any extras are specified.

The extras are used when installing Machine as a pip package. When using poetry install or poetry export all non-optional dependencies are installed without regard to whether a dependency is part of an extra or not. We can make all of the dependencies specified in extras optional, but we will need to make sure to call poetry install --all-extras on dev machines.


machine/jobs/build_nmt_engine.py line 96 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

@johnml1135 Sorry, this was confusing, it'll be in the next commit.

@ddaspit So would I put a default in settings.yaml? (at huggingface.align_pretranslations?) Or just check if it's in the config first, like I'm doing now?

Yes, there should be a default in the settings.yaml.


machine/jobs/nmt_engine_build_job.py line 131 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

Just to match with everything else, I can change it back if you don't think it's necessary.

Yes, you can change it back.


machine/jobs/translation_engine_build_job.py line 121 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

I got rid of the redundant read in _align. Alternatively, it should be simple enough to rewrite the inference step so that the translation and alignment is done all at once and the output file is only written once overall, if that's what you were thinking of instead? Of course, it will be less clean code like you mentioned.

Yes, I think we should write the translation and alignment all at once.

@johnml1135
Copy link
Collaborator

@ddaspit - can you bring this over the finish line in terms of reveiwing the code for me?

@ddaspit
Copy link
Contributor

ddaspit commented Apr 22, 2025

@johnml1135 Yes, I will make sure the code review gets done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants