Support transcribing a url #107

Merged: 31 commits from pm-media-download-infra into main on Oct 29, 2024

Conversation

@philmcmahon (Contributor) commented Oct 22, 2024

What does this change?

Following on from #104, this PR provides the user interface and infrastructure needed to support transcription of media URLs submitted via the UI. It also includes an update to the media-download service to support running with a proxy server.

Depends on https://github.com/guardian/investigations-platform/pull/523

Infrastructure

The media-download service itself is built into a Docker container; see the changes to the CI GitHub Actions workflow for how the image is built. Currently that container is published only to ECR (AWS).

After a user submits a URL, a request is made to the transcription API. The API puts a message on a (new) media-download SQS task queue, then returns a success code to the client.

We use an EventBridge pipe to connect the SQS queue to the media-download step function. The pipe does the job of reading the message, passing it as an input property to the step function, and then deleting the message. Pipes can do data validation/manipulation, but at the moment it's just a dumb pipe: read, trigger, delete. If an invalid message is sent, it is up to the media-download service to deal with it.
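For readers unfamiliar with EventBridge Pipes, the wiring described above looks roughly like the following. This is a minimal sketch using the L1 `CfnPipe` construct from aws-cdk-lib, not the actual GuCDK code in this PR; the construct ids, `pipeRole`, `taskQueue`, and `mediaDownloadStateMachine` names are illustrative.

```typescript
import { aws_pipes as pipes } from 'aws-cdk-lib';

// Sketch only: a "dumb" pipe that reads from the SQS task queue and
// triggers the step function with the message as input. The role must
// allow sqs:ReceiveMessage/DeleteMessage on the source and
// states:StartExecution on the target.
new pipes.CfnPipe(this, 'MediaDownloadPipe', {
	roleArn: pipeRole.roleArn,
	source: taskQueue.queueArn,
	target: mediaDownloadStateMachine.stateMachineArn,
	targetParameters: {
		stepFunctionStateMachineParameters: {
			// async invocation: the pipe does not wait for the download to finish
			invocationType: 'FIRE_AND_FORGET',
		},
	},
});
```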

The media-download service itself is an ECS task wrapped in a step function. The main change, other than the proxy support, is that the service no longer needs to deal with the incoming SQS queue: it just gets the contents of the message as an environment variable (thanks to the EventBridge pipe).
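Since the pipe delivers the message contents via an environment variable, the service can validate and parse it once at startup instead of polling SQS. A hedged sketch; the variable name and message shape below are assumptions, not the service's actual contract.

```typescript
// Hypothetical sketch: env var name and message shape are assumptions.
interface MediaDownloadMessage {
	id: string;
	url: string;
}

export const getMessageFromEnv = (
	raw: string | undefined,
): MediaDownloadMessage => {
	if (!raw) {
		throw new Error('MESSAGE_BODY environment variable not set');
	}
	const parsed = JSON.parse(raw) as Partial<MediaDownloadMessage>;
	if (typeof parsed.id !== 'string' || typeof parsed.url !== 'string') {
		throw new Error('Invalid media download message');
	}
	return { id: parsed.id, url: parsed.url };
};
```

At startup the task would call something like `getMessageFromEnv(process.env.MESSAGE_BODY)` and fail fast on an invalid message, which is how a "dumb" pipe pushes validation responsibility onto the service.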

In order to add an ECS Task to the transcription service without having to check in the CDK context file, I had to make some changes to GuCdk which are not yet merged - this PR is currently against my branch on the CDK project. The associated PRs are here:

Discussion on how to ship logs from this task: guardian/cloudwatch-logs-management#360

Other changes

The output handler has been updated to provide a download URL for the media. In a future piece of work we'll add export to Google Drive.

Finally, the client-side UploadForm has had a small makeover: the second page you reach after submitting the form has been pulled out into smaller component files. This change is probably best tested by trying out the forms locally or on CODE.

How to test

Currently live on CODE https://transcribe.code.dev-gutools.co.uk/

How can we measure success?

The intention here is for journalists to be able to keep a record of media they are using for journalistic purposes.

@philmcmahon philmcmahon requested a review from a team as a code owner October 22, 2024 16:37
@philmcmahon philmcmahon changed the title Add media url UI and infrastructure Support transcribing a url Oct 22, 2024
@zekehuntergreen (Contributor) left a comment:
This is looking good - just a few comments.
Also FYI, testing a few URLs on CODE I haven't received a failure or success email yet.

'-o',
'StrictHostKeyChecking=no',
'-D',
'1337',
Contributor:
might be worth putting this port in a variable?

Contributor (author):
Addressed here; decided to make the proxy function return a URL. 0385e64

@@ -11,21 +11,31 @@ export const runSpawnCommand = (
cmd: string,
args: ReadonlyArray<string>,
logStdout: boolean,
logImmediately: boolean = false,
Contributor:
Might be worth enforcing at the type level that we never log immediately if processName == 'transcribe'. If it's a transcription of sensitive source material, we don't want the transcript in the logs. We could make processName a union of string literals.

Contributor (author):
Great shout 007b215
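For context, the reviewer's suggestion can be expressed with a union of string literals plus a conditional type, so that passing `logImmediately: true` for the transcribe process fails to compile. This is a sketch of the idea, not the actual fix in 007b215; the process names are assumptions.

```typescript
// Sketch: processName as a union of string literals, with the type
// system forbidding logImmediately=true when the process is 'transcribe'.
type ProcessName = 'transcribe' | 'media-download' | 'proxy';

type LogImmediately<P extends ProcessName> = P extends 'transcribe'
	? false // transcripts of sensitive material must never be logged live
	: boolean;

const runSpawnCommand = <P extends ProcessName>(
	processName: P,
	cmd: string,
	args: ReadonlyArray<string>,
	logStdout: boolean,
	logImmediately: LogImmediately<P>,
): void => {
	// ...spawn the process; logImmediately can never be true for 'transcribe'
};

// runSpawnCommand('transcribe', 'whisper', [], false, true); // type error
runSpawnCommand('proxy', 'ssh', ['-D', '1337'], false, true); // fine
```

The compile-time guarantee relies on TypeScript inferring the literal type of the `processName` argument, so the forbidden combination is rejected before the code ever runs.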

{Object.entries(uploads).map(([key, value]) => (
<li className="flex items-center">
<span className={'mr-1'}>{iconForStatus(value)}</span>
{key} {value === RequestStatus.Invalid && ' (invalid url)'}
Contributor:
it would be good to add a button to reset the UI in this case

mediaSource,
}: {
reset: () => void;
mediaSource: 'file' | 'url';
Contributor:
could define this as a type since it's in a few places

const urlsWithStatus = Object.fromEntries(
urls.map((url) => [
url,
checkUrlValid(url) ? RequestStatus.InProgress : RequestStatus.Invalid,
Contributor:
This feels more like form validation than async error reporting. Is it tricky to get the form to point out invalid URLs before it's submitted?

Contributor (author):
Agreed, this would be better done prior to submission. It could be done with the nice validation support in Flowbite: https://flowbite-react.com/docs/components/forms#form-validation

The reason I haven't done it is that at the moment I'm using a single text area for all the URLs, so it would be tricky to indicate which one is invalid via the form element. I think a better solution would be one text field per URL, plus a button to add an extra field, but that's a fair bit of extra work, so I was going to launch with this as is.

"extends": "@guardian/tsconfig/tsconfig.json",
"compilerOptions": {
"module": "CommonJS",
"noUnusedLocals": false
Contributor:
do we want unused locals?

Contributor (author):
Ah thanks, no we don't. I allowed them because, while debugging the CDK by commenting stuff out, it was annoying to have to delete all the `const blah =` assignments that were no longer used as a result.

client: s3Client,
params: {
Bucket: bucket,
Key: `downloaded-media/${metadata.title}.${metadata.extension}`,
Contributor:
Seems possible that two different files will share a title. Could this result in files being accidentally overwritten or permissions leaking to the wrong user?

Contributor (author):
Apologies, this code shouldn't have been checked in; the media-download version just uses the id.
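Keying the upload by a unique id rather than the user-supplied title sidesteps both the overwrite and the permissions concern raised above, since two downloads can never collide on a key. A sketch under that assumption; `makeDownloadKey` is a hypothetical helper, not code from the PR.

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical sketch: build the S3 key from a unique id, not the
// media title, so files sharing a title can never overwrite each other.
export const makeDownloadKey = (
	extension: string,
	id: string = randomUUID(),
): string => `downloaded-media/${id}.${extension}`;
```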

Contributor:
I might be missing something but I'm having a hard time working out what the difference is between the media-downloader and media-download packages. Should they be separate or did you mean to rename media-download? It looks like there's some overlap in functionality e.g. implementations of uploadToS3 and execution of yt-dlp.

Contributor (author):
Oh dear! media-downloader should definitely have been removed entirely.

@philmcmahon (author):
This is looking good - just a few comments. Also fyi testing a few urls in CODE I haven't received a failure or success email yet.

Ah thanks, it was broken on CODE because the step function was pointing at the wrong container tag; it should be working now.

@philmcmahon philmcmahon force-pushed the pm-media-download-infra branch from 0d7badb to 8f4c951 Compare October 29, 2024 12:26
@philmcmahon philmcmahon force-pushed the pm-media-download-infra branch from 8f4c951 to 8875b59 Compare October 29, 2024 12:26
@zekehuntergreen (Contributor) left a comment:
Looks good 🎉

@philmcmahon philmcmahon merged commit 2d19531 into main Oct 29, 2024
5 checks passed
@philmcmahon philmcmahon deleted the pm-media-download-infra branch October 29, 2024 14:22