Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add media download service (part 1) #104

Merged
merged 14 commits into from
Oct 9, 2024
Merged

Conversation

philmcmahon
Copy link
Contributor

@philmcmahon philmcmahon commented Oct 3, 2024

What does this change?

This pr introduces a new app to the transcription service - 'media download'. This service reads off a (new) SQS queue, parses the messages into a new MediaDownloadJob type, and then downloads the relevant url, then uploads the resulting file to the 'source media bucket' which is used by the transcription service, and sends an SQS message to the transcription worker to tell it to transcribe the file

To facilitate testing this service, I've also added a new endpoint to the transcription API, which pushes messages onto the queue based off the request body. You can test this endpoint by making a POST request to localhost:9103/api/transcribe-url with the following body:

{"url":"https://www.youtube.com/watch?v=upMOIAwWXmM","languageCode":"en","translationRequested":false}

Note - the infrastructure for the media download service doesn't exist yet, the project can only be run locally at the moment.

How to test

You'll need to manually install yt-dlp (https://github.com/yt-dlp/yt-dlp?tab=readme-ov-file#installation) - probably pip install yt-dlp is the easiest if you've got a suitable pythong environment set up. I haven't scripted this step because eventually we'll run yt-dlp using a docker container rather than isntalling it directly onto dev machines. This is how the existing whisper process works

If you run the transcription API, send the above request. If you get a 200 back, then run npm run media-download::start and you should see it read sqs/download the video/upload file to s3/send a transcription sqs message

Still todo

  • UI for providing youtube url
  • Infra for media download service
  • Update output handler and export UI to support saving the original video to google drive

@philmcmahon philmcmahon requested a review from a team as a code owner October 3, 2024 08:01
@philmcmahon philmcmahon changed the title Pm add local media download Add media download service Oct 3, 2024
@philmcmahon philmcmahon changed the title Add media download service Add media download service (part 1) Oct 3, 2024
visibilityTimeout: Duration.seconds(30),
contentBasedDeduplication: true,
deadLetterQueue: {
queue: transcriptionDeadLetterQueue,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want transcription and download to share a dead letter queue? It seems like debugging might be easier if they're separate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eek good spot! This was a copy paste failure, I agree they should be separate

JSON.stringify(job),
s3Key,
);
if (!isSqsFailure(messageResult) && translationRequested) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to log an error if messageResult is an sqs failure

id: string,
) => {
const output =
await $`yt-dlp --write-info-json --no-clean-info-json --newline -o "${destinationDirectoryPath}/${id}" ${url}`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has yt-dlp been pretty consistent for youtube? I was noticing some flakiness using yt-dlp to download videos today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, I've had some success with it. I'm hopeful that even if we end up having to use a different tool most of this code will still be useful

scripts/setup.sh Outdated
@@ -45,7 +52,8 @@ DYNAMODB_ARN=$(aws --endpoint-url=http://localhost:4566 dynamodb create-table \

echo "Created table, arn: ${DYNAMODB_ARN}"


# media-downloader dependencies
brew install ffmpeg phantomjs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about installing yt-dlp itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've updated the pr description to address this, and now removed this brew command from setup.sh.

The way whisper works at the moment is that the worker runs a docker container with whisper installed rather than actually executing whisper directly. I think we could do the same thing for media-download.

At the moment there's no guidance on installing yt-dlp in this PR, pending the next PR with various dockerfiles and other infrastructure bits in it. I could add a Pipfile (pipenv is my fav new python thing since you showed it to me) but given that it will just be for a short while before we swap out running yt-dlp directly for docker maybe it's fine as is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, happy to wait for yt-dlp on docker

);
if (job) {
const metadata = await downloadMedia(job.url, '/tmp', job.id);
const key = await uploadToS3(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove the file from the file system once its uploaded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I'm hoping we can run media-download on an ECS fargate task (like lurch ingest jobs) so the whole container will be deleted after the download/upload has finished

Comment on lines +99 to +104
return await sendMessage(
client,
queueUrl,
JSON.stringify({ ...job, translate: true }),
s3Key,
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this bit - we put two messages in the queue if the user has requested a translation? Doesn't the worker create the translation as its transcribing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way translation works at the moment is that we send two messages and the translation runs on one worker, the normal transcription on the other. My idea there was to speed up the transcription/translation - if we do it all on the same worker then it could take twice as long

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok thanks, I hadn't realized this.

@philmcmahon philmcmahon force-pushed the pm-add-local-media-download branch from 82d39d6 to 99fc2db Compare October 9, 2024 14:32
@philmcmahon
Copy link
Contributor Author

@zekehuntergreen thanks v much for the review. I ran into an issue related to zx deprecating commonjs support and decided (again I think!) to swap it out and use the basic node subprocess command instead - that change is here 322acaf

stderr: stderr.join(''),
code: code || undefined,
};
logger.info('Ignoring stdout to avoid logging sensitive data');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important not to log stdout when running yt-dlp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've updated this ecff3ce

Copy link
Contributor

@zekehuntergreen zekehuntergreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

download-percentage

@philmcmahon philmcmahon force-pushed the pm-add-local-media-download branch from 263b2f8 to 4ab40fa Compare October 9, 2024 16:36
@philmcmahon philmcmahon force-pushed the pm-add-local-media-download branch from 4ab40fa to cc1248d Compare October 9, 2024 16:43
@philmcmahon philmcmahon merged commit 8a79e0b into main Oct 9, 2024
3 checks passed
@philmcmahon philmcmahon deleted the pm-add-local-media-download branch October 9, 2024 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants