Support transcribing a url #107

Merged: 31 commits from pm-media-download-infra into main on Oct 29, 2024

Conversation

@philmcmahon (Contributor) commented Oct 22, 2024

What does this change?

Following on from #104, this PR provides the user interface and infrastructure needed to support transcription of media URLs submitted via the UI. It also includes an update to the media-download service to support running with a proxy server.

Depends on https://github.com/guardian/investigations-platform/pull/523

Infrastructure

The media-download service itself is built into a Docker container; see the changes to the CI GitHub Actions workflow for how the image is built. Currently that container is published only to ECR (AWS).

After a user submits a URL, a request is made to the transcription API. The API puts a message on a (new) media-download SQS task queue, then returns a success code to the client.

We use an EventBridge pipe to connect the SQS queue to the media-download step function. The pipe does the job of reading the message, passing it as an input property to the step function, and then deleting the message. Pipes can do data validation/manipulation, but at the moment it's just a dumb pipe: read, trigger, delete. If an invalid message is sent, it is up to the media-download service to deal with it.
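For readers unfamiliar with EventBridge Pipes, the wiring described above looks roughly like the following. This is a minimal sketch using the L1 `CfnPipe` construct from aws-cdk-lib, not the actual GuCDK code in this PR; the construct ids, `pipeRole`, `taskQueue`, and `mediaDownloadStateMachine` names are illustrative.

```typescript
import { aws_pipes as pipes } from 'aws-cdk-lib';

// Sketch only: a "dumb" pipe that reads from the SQS task queue and
// triggers the step function with the message as input. The role must
// allow sqs:ReceiveMessage/DeleteMessage on the source and
// states:StartExecution on the target.
new pipes.CfnPipe(this, 'MediaDownloadPipe', {
	roleArn: pipeRole.roleArn,
	source: taskQueue.queueArn,
	target: mediaDownloadStateMachine.stateMachineArn,
	targetParameters: {
		stepFunctionStateMachineParameters: {
			// async invocation: the pipe does not wait for the download to finish
			invocationType: 'FIRE_AND_FORGET',
		},
	},
});
```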

The media-download service itself is an ECS task wrapped in a step function. The main change, other than the proxy support, is that the service no longer needs to deal with the incoming SQS queue: it just gets the contents of the message as an environment variable (thanks to the EventBridge pipe).
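Since the pipe delivers the message contents via an environment variable, the service can validate and parse it once at startup instead of polling SQS. A hedged sketch; the variable name and message shape below are assumptions, not the service's actual contract.

```typescript
// Hypothetical sketch: env var name and message shape are assumptions.
interface MediaDownloadMessage {
	id: string;
	url: string;
}

export const getMessageFromEnv = (
	raw: string | undefined,
): MediaDownloadMessage => {
	if (!raw) {
		throw new Error('MESSAGE_BODY environment variable not set');
	}
	const parsed = JSON.parse(raw) as Partial<MediaDownloadMessage>;
	if (typeof parsed.id !== 'string' || typeof parsed.url !== 'string') {
		throw new Error('Invalid media download message');
	}
	return { id: parsed.id, url: parsed.url };
};
```

At startup the task would call something like `getMessageFromEnv(process.env.MESSAGE_BODY)` and fail fast on an invalid message, which is how a "dumb" pipe pushes validation responsibility onto the service.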

In order to add an ECS Task to the transcription service without having to check in the CDK context file, I had to make some changes to GuCdk which are not yet merged - this PR is currently against my branch on the CDK project. The associated PRs are here:

Discussion on how to ship logs from this task: guardian/cloudwatch-logs-management#360

Other changes

The output handler has been updated to provide a download URL for the media. In a future piece of work we'll add export to Google Drive.

Finally, the client-side UploadForm has had a small makeover: the second page you reach after submitting the form has been pulled out into smaller component files. This change is probably best tested by trying out the forms locally or on CODE.

How to test

Currently live on CODE https://transcribe.code.dev-gutools.co.uk/

How can we measure success?

The intention here is for journalists to be able to keep a record of media they are using for journalistic purposes.

@philmcmahon philmcmahon requested a review from a team as a code owner October 22, 2024 16:37
@philmcmahon philmcmahon changed the title Add media url UI and infrastructure Support transcribing a url Oct 22, 2024
@zekehuntergreen (Contributor) left a comment:
This is looking good - just a few comments.
Also FYI, testing a few URLs on CODE I haven't received a failure or success email yet.

'-o',
'StrictHostKeyChecking=no',
'-D',
'1337',
Contributor:
might be worth putting this port in a variable?

Contributor (author):
Addressed here; decided to make the proxy function return a URL. 0385e64

@@ -11,21 +11,31 @@ export const runSpawnCommand = (
cmd: string,
args: ReadonlyArray<string>,
logStdout: boolean,
logImmediately: boolean = false,
Contributor:
Might be worth enforcing at the type level that we never log immediately if processName == 'transcribe'. If it's a transcription of sensitive source material, we don't want the transcript in the logs. We could make processName a union of string literals.

Contributor (author):
Great shout 007b215
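For context, the reviewer's suggestion can be expressed with a union of string literals plus a conditional type, so that passing `logImmediately: true` for the transcribe process fails to compile. This is a sketch of the idea, not the actual fix in 007b215; the process names are assumptions.

```typescript
// Sketch: processName as a union of string literals, with the type
// system forbidding logImmediately=true when the process is 'transcribe'.
type ProcessName = 'transcribe' | 'media-download' | 'proxy';

type LogImmediately<P extends ProcessName> = P extends 'transcribe'
	? false // transcripts of sensitive material must never be logged live
	: boolean;

const runSpawnCommand = <P extends ProcessName>(
	processName: P,
	cmd: string,
	args: ReadonlyArray<string>,
	logStdout: boolean,
	logImmediately: LogImmediately<P>,
): void => {
	// ...spawn the process; logImmediately can never be true for 'transcribe'
};

// runSpawnCommand('transcribe', 'whisper', [], false, true); // type error
runSpawnCommand('proxy', 'ssh', ['-D', '1337'], false, true); // fine
```

The compile-time guarantee relies on TypeScript inferring the literal type of the `processName` argument, so the forbidden combination is rejected before the code ever runs.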

{Object.entries(uploads).map(([key, value]) => (
<li className="flex items-center">
<span className={'mr-1'}>{iconForStatus(value)}</span>
{key} {value === RequestStatus.Invalid && ' (invalid url)'}
Contributor:
it would be good to add a button to reset the UI in this case

mediaSource,
}: {
reset: () => void;
mediaSource: 'file' | 'url';
Contributor:
could define this as a type since it's in a few places

const urlsWithStatus = Object.fromEntries(
urls.map((url) => [
url,
checkUrlValid(url) ? RequestStatus.InProgress : RequestStatus.Invalid,
Contributor:
This feels more like form validation than async error reporting. Is it tricky to get the form to point out invalid URLs before it's submitted?

Contributor (author):
Agreed, this would be better done prior to submission. It could be done with the nice validation support in Flowbite: https://flowbite-react.com/docs/components/forms#form-validation

The reason I haven't done it is that at the moment I'm using a single text area for all the URLs, so it would be tricky to indicate which one is invalid via the form element. I think a better solution would be one text field per URL, plus a button to add an extra field, but that's a fair bit of extra work, so I was going to launch with this as is.

"extends": "@guardian/tsconfig/tsconfig.json",
"compilerOptions": {
"module": "CommonJS",
"noUnusedLocals": false
Contributor:
do we want unused locals?

Contributor (author):
Ah thanks, no we don't. I allowed them because, while debugging the CDK by commenting stuff out, it was annoying to have to delete all the `const blah =` assignments that were no longer used as a result.

client: s3Client,
params: {
Bucket: bucket,
Key: `downloaded-media/${metadata.title}.${metadata.extension}`,
Contributor:
Seems possible that two different files will share a title. Could this result in files being accidentally overwritten or permissions leaking to the wrong user?

Contributor (author):
Apologies, this code shouldn't have been checked in; the media-download version just uses the id.
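Keying the upload by a unique id rather than the user-supplied title sidesteps both the overwrite and the permissions concern raised above, since two downloads can never collide on a key. A sketch under that assumption; `makeDownloadKey` is a hypothetical helper, not code from the PR.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical sketch: build the S3 key from a unique id, not the
// media title, so files sharing a title can never overwrite each other.
export const makeDownloadKey = (
	extension: string,
	id: string = randomUUID(),
): string => `downloaded-media/${id}.${extension}`;
```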

Contributor:
I might be missing something but I'm having a hard time working out what the difference is between the media-downloader and media-download packages. Should they be separate or did you mean to rename media-download? It looks like there's some overlap in functionality e.g. implementations of uploadToS3 and execution of yt-dlp.

Contributor (author):
Oh dear! media-downloader should definitely have been removed entirely.

@philmcmahon (author):
This is looking good - just a few comments. Also fyi testing a few urls in CODE I haven't received a failure or success email yet.

Ah thanks, it was broken on CODE because the step function was pointing at the wrong container tag; it should be working now.

@philmcmahon philmcmahon force-pushed the pm-media-download-infra branch from 0d7badb to 8f4c951 Compare October 29, 2024 12:26
@philmcmahon philmcmahon force-pushed the pm-media-download-infra branch from 8f4c951 to 8875b59 Compare October 29, 2024 12:26
@zekehuntergreen (Contributor) left a comment:
Looks good 🎉

@philmcmahon philmcmahon merged commit 2d19531 into main Oct 29, 2024
5 checks passed
@philmcmahon philmcmahon deleted the pm-media-download-infra branch October 29, 2024 14:22