Add media download service (part 1) #104

philmcmahon · 2024-10-03T08:01:09Z

What does this change?

This PR introduces a new app to the transcription service: 'media download'. The service reads from a (new) SQS queue and parses each message into a new MediaDownloadJob type. It then downloads the relevant URL, uploads the resulting file to the 'source media bucket' used by the transcription service, and sends an SQS message to the transcription worker telling it to transcribe the file.
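
For illustration, here is a rough sketch of what parsing a queue message into a MediaDownloadJob could look like. Only `id` and `url` are visible in the diff snippets in this PR; the `languageCode` and `translationRequested` fields and the `parseMediaDownloadJob` helper name are assumptions, not the actual implementation.

```typescript
// Hypothetical shape: only `id` and `url` appear in this PR's diff snippets.
// The other two fields mirror the test request body and are assumptions.
type MediaDownloadJob = {
	id: string;
	url: string;
	languageCode: string;
	translationRequested: boolean;
};

// Parse an SQS message body. Returns undefined for anything malformed so the
// caller can skip (or dead-letter) the message instead of throwing.
const parseMediaDownloadJob = (body: string): MediaDownloadJob | undefined => {
	let parsed: unknown;
	try {
		parsed = JSON.parse(body);
	} catch {
		return undefined;
	}
	if (typeof parsed !== 'object' || parsed === null) {
		return undefined;
	}
	const candidate = parsed as Record<string, unknown>;
	if (
		typeof candidate.id !== 'string' ||
		typeof candidate.url !== 'string' ||
		typeof candidate.languageCode !== 'string' ||
		typeof candidate.translationRequested !== 'boolean'
	) {
		return undefined;
	}
	return {
		id: candidate.id,
		url: candidate.url,
		languageCode: candidate.languageCode,
		translationRequested: candidate.translationRequested,
	};
};
```

Returning undefined rather than throwing matches the `if (job)` check visible in packages/media-download/src/index.ts, though the real parsing code may well differ.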

To make the service easier to test, I've also added a new endpoint to the transcription API that pushes messages onto the queue based on the request body. You can test it by making a POST request to localhost:9103/api/transcribe-url with the following body:

{"url":"https://www.youtube.com/watch?v=upMOIAwWXmM","languageCode":"en","translationRequested":false}
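
A minimal sketch of sending that request from TypeScript. The helper name `buildTranscribeUrlRequest` is hypothetical, and the sketch assumes plain http and no auth headers when running locally, neither of which is stated above.

```typescript
// Port and path come from the PR description. The http scheme is an assumption.
const TRANSCRIBE_URL_ENDPOINT = 'http://localhost:9103/api/transcribe-url';

type TranscribeUrlRequest = {
	url: string;
	languageCode: string;
	translationRequested: boolean;
};

// Hypothetical helper: builds the endpoint and fetch options for the request.
const buildTranscribeUrlRequest = (body: TranscribeUrlRequest) => ({
	endpoint: TRANSCRIBE_URL_ENDPOINT,
	init: {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify(body),
	},
});

// Usage (needs the transcription API running locally):
// const { endpoint, init } = buildTranscribeUrlRequest({
// 	url: 'https://www.youtube.com/watch?v=upMOIAwWXmM',
// 	languageCode: 'en',
// 	translationRequested: false,
// });
// const response = await fetch(endpoint, init);
// console.log(response.status);
```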

Note: the infrastructure for the media download service doesn't exist yet, so the project can only be run locally at the moment.

How to test

You'll need to manually install yt-dlp (https://github.com/yt-dlp/yt-dlp?tab=readme-ov-file#installation). pip install yt-dlp is probably the easiest option if you've got a suitable Python environment set up. I haven't scripted this step because eventually we'll run yt-dlp in a docker container rather than installing it directly onto dev machines, which is how the existing whisper process works.

Run the transcription API and send the above request. If you get a 200 back, run npm run media-download::start and you should see it read from SQS, download the video, upload the file to S3 and send a transcription SQS message.

Still todo

UI for providing youtube url
Infra for media download service
Update output handler and export UI to support saving the original video to google drive

zekehuntergreen · 2024-10-07T16:24:43Z

packages/cdk/lib/transcription-service.ts

+				visibilityTimeout: Duration.seconds(30),
+				contentBasedDeduplication: true,
+				deadLetterQueue: {
+					queue: transcriptionDeadLetterQueue,


Do we want transcription and download to share a dead letter queue? It seems like debugging might be easier if they're separate.

eek good spot! This was a copy paste failure, I agree they should be separate

zekehuntergreen · 2024-10-07T16:29:11Z

packages/backend-common/src/sqs.ts

+		JSON.stringify(job),
+		s3Key,
+	);
+	if (!isSqsFailure(messageResult) && translationRequested) {


might want to log an error if messageResult is an sqs failure
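
One possible shape for that, as a sketch only: the result types below are hypothetical stand-ins (the real ones live in packages/backend-common/src/sqs.ts and are not shown in this PR), and `reportSendResult` is an invented helper name.

```typescript
// Hypothetical result types, standing in for the real ones in sqs.ts.
type SqsFailure = { status: 'failure'; errorMsg: string };
type SqsSuccess = { status: 'success'; messageId: string };
type SendResult = SqsFailure | SqsSuccess;

// Type guard with the same name as the one used in the diff above.
// The body is an assumption about how it distinguishes the two cases.
const isSqsFailure = (result: SendResult): result is SqsFailure =>
	result.status === 'failure';

// Log failures and report whether the send succeeded, so the caller can
// decide whether to go on and send the follow-up translation message.
const reportSendResult = (
	result: SendResult,
	logError: (message: string) => void,
): boolean => {
	if (isSqsFailure(result)) {
		logError(`Failed to send SQS message: ${result.errorMsg}`);
		return false;
	}
	return true;
};
```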

zekehuntergreen · 2024-10-07T16:34:36Z

packages/media-download/src/yt-dlp.ts

+	id: string,
+) => {
+	const output =
+		await $`yt-dlp --write-info-json --no-clean-info-json --newline -o "${destinationDirectoryPath}/${id}" ${url}`;


has yt-dlp been pretty consistent for youtube? I was noticing some flakiness using yt-dlp to download videos today.

as discussed, I've had some success with it. I'm hopeful that even if we end up having to use a different tool most of this code will still be useful

zekehuntergreen · 2024-10-07T16:36:54Z

scripts/setup.sh

@@ -45,7 +52,8 @@ DYNAMODB_ARN=$(aws --endpoint-url=http://localhost:4566 dynamodb create-table \

 echo "Created table, arn: ${DYNAMODB_ARN}"

-
+# media-downloader dependencies
+brew install ffmpeg phantomjs


what about installing yt-dlp itself?

Good point. I've updated the pr description to address this, and now removed this brew command from setup.sh.

The way whisper works at the moment is that the worker runs a docker container with whisper installed rather than actually executing whisper directly. I think we could do the same thing for media-download.

At the moment there's no guidance on installing yt-dlp in this PR, pending the next PR with various dockerfiles and other infrastructure bits in it. I could add a Pipfile (pipenv is my fav new python thing since you showed it to me) but given that it will just be for a short while before we swap out running yt-dlp directly for docker maybe it's fine as is

Ok, happy to wait for yt-dlp on docker

zekehuntergreen · 2024-10-07T16:42:28Z

packages/media-download/src/index.ts

+	);
+	if (job) {
+		const metadata = await downloadMedia(job.url, '/tmp', job.id);
+		const key = await uploadToS3(


should we remove the file from the file system once it's uploaded?

At the moment I'm hoping we can run media-download on an ECS fargate task (like lurch ingest jobs) so the whole container will be deleted after the download/upload has finished

zekehuntergreen · 2024-10-07T16:49:01Z

packages/backend-common/src/sqs.ts

+		return await sendMessage(
+			client,
+			queueUrl,
+			JSON.stringify({ ...job, translate: true }),
+			s3Key,
+		);


I'm not sure I understand this bit - we put two messages in the queue if the user has requested a translation? Doesn't the worker create the translation as it's transcribing?

The way translation works at the moment is that we send two messages and the translation runs on one worker, the normal transcription on the other. My idea there was to speed up the transcription/translation - if we do it all on the same worker then it could take twice as long

Ah ok thanks, I hadn't realized this.
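
To make the two-message approach concrete, here is a small sketch that only builds the message bodies. It leaves out the actual sending and the `isSqsFailure` check shown in the diff, and `buildTranscriptionMessageBodies` is a hypothetical name.

```typescript
// Build the SQS message bodies for a job: one for the normal transcription,
// plus a second copy with `translate: true` when a translation was requested,
// so that each can be picked up by a separate worker.
const buildTranscriptionMessageBodies = <T extends object>(
	job: T,
	translationRequested: boolean,
): string[] => {
	const bodies = [JSON.stringify(job)];
	if (translationRequested) {
		bodies.push(JSON.stringify({ ...job, translate: true }));
	}
	return bodies;
};
```

With `translationRequested` set to false this yields a single body, so the non-translation path is unchanged.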

philmcmahon · 2024-10-09T15:14:43Z

@zekehuntergreen thanks v much for the review. I ran into an issue related to zx deprecating commonjs support and decided (again I think!) to swap it out and use the basic node subprocess command instead - that change is here 322acaf
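
For anyone following along, a rough sketch of running yt-dlp through node's child_process instead of zx. This is not the code from 322acaf: `buildYtDlpArgs` and `runCommand` are hypothetical names, and only the yt-dlp flags are taken from the earlier diff.

```typescript
import { spawn } from 'node:child_process';

// Pass arguments as an array so the url is never interpolated into a shell
// string. The flags are the ones used in the zx version earlier in this PR.
const buildYtDlpArgs = (
	url: string,
	destinationDirectoryPath: string,
	id: string,
): string[] => [
	'--write-info-json',
	'--no-clean-info-json',
	'--newline',
	'-o',
	`${destinationDirectoryPath}/${id}`,
	url,
];

type CommandResult = { stdout: string; stderr: string; code?: number };

// Hypothetical helper: run a command, collect its output, and reject on a
// spawn error or a non-zero exit code.
const runCommand = (command: string, args: string[]): Promise<CommandResult> =>
	new Promise((resolve, reject) => {
		const child = spawn(command, args);
		const stdout: string[] = [];
		const stderr: string[] = [];
		child.stdout.on('data', (chunk: Buffer) => stdout.push(chunk.toString()));
		child.stderr.on('data', (chunk: Buffer) => stderr.push(chunk.toString()));
		child.on('error', reject);
		child.on('close', (code) => {
			const result: CommandResult = {
				stdout: stdout.join(''),
				stderr: stderr.join(''),
				code: code ?? undefined,
			};
			if (code === 0) {
				resolve(result);
			} else {
				reject(new Error(`${command} exited with code ${String(code)}`));
			}
		});
	});

// Usage (needs yt-dlp on the PATH):
// await runCommand('yt-dlp', buildYtDlpArgs(job.url, '/tmp', job.id));
```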

zekehuntergreen · 2024-10-09T15:46:19Z

packages/backend-common/src/process.ts

+				stderr: stderr.join(''),
+				code: code || undefined,
+			};
+			logger.info('Ignoring stdout to avoid logging sensitive data');


Is it important not to log stdout when running yt-dlp?

Good point, I've updated this ecff3ce

github-actions · 2024-10-09T16:38:18Z

Deploy build 696 of `investigations::transcription-service` to CODE

All deployment options

From guardian/actions-riff-raff.

github-actions · 2024-10-09T16:38:20Z

Deploy build 576 of `investigations::transcription-service-repository` to CODE

All deployment options

From guardian/actions-riff-raff.

philmcmahon requested a review from a team as a code owner October 3, 2024 08:01

philmcmahon changed the title ~~Pm add local media download~~ Add media download service Oct 3, 2024

philmcmahon changed the title ~~Add media download service~~ Add media download service (part 1) Oct 3, 2024

zekehuntergreen reviewed Oct 7, 2024

philmcmahon added 7 commits October 9, 2024 15:27

Introduce media download service (and associated types) - which fetched a job from sqs, uses yt-dlp to download and uploads the result to s3

34117b9

Add new sqs queue for media download service

a524c85

Add media download api endpoint

53c0794

Simplify transcript sqs logic for translations

c20592b

After uploading the file to s3, send an SQS message requesting a transcription

1331600

media-download not media-downloader

c663acf

Use a separate dead letter queue for the media download queue

99fc2db

philmcmahon force-pushed the pm-add-local-media-download branch from 82d39d6 to 99fc2db Compare October 9, 2024 14:32

philmcmahon added 4 commits October 9, 2024 15:47

Remove phantomjs/ffmpeg dependencies

3d39603

Log an error when failing to send sqs message.

1a8ecf4

Use runSpawnCommand rather than zx for media download service

322acaf

Bump gu cdk and aws cdk versions

86ac652

zekehuntergreen reviewed Oct 9, 2024

zekehuntergreen approved these changes Oct 9, 2024

philmcmahon added 2 commits October 9, 2024 17:22

Only hide stdout sometimes

ecff3ce

Update package-lock

a7f1e8e

philmcmahon force-pushed the pm-add-local-media-download branch from 263b2f8 to 4ab40fa Compare October 9, 2024 16:36

Explicitly depend on @guardian/eslint-config

cc1248d

philmcmahon force-pushed the pm-add-local-media-download branch from 4ab40fa to cc1248d Compare October 9, 2024 16:43

philmcmahon merged commit 8a79e0b into main Oct 9, 2024
3 checks passed

philmcmahon deleted the pm-add-local-media-download branch October 9, 2024 16:46

philmcmahon mentioned this pull request Oct 22, 2024

Support transcribing a url #107

Merged
