Add media download service (part 1) #104
Conversation
visibilityTimeout: Duration.seconds(30),
contentBasedDeduplication: true,
deadLetterQueue: {
  queue: transcriptionDeadLetterQueue,
Do we want transcription and download to share a dead letter queue? It seems like debugging might be easier if they're separate.
Eek, good spot! This was a copy-paste failure; I agree they should be separate.
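The fix described above could look something like this - a hedged CDK sketch, where the construct ids, the `MediaDownload` names, and `maxReceiveCount` are assumptions rather than the repo's actual code:

```typescript
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Illustrative only: give the media download queue its own DLQ instead of
// reusing transcriptionDeadLetterQueue, so failed downloads can be debugged
// (and redriven) independently of failed transcriptions.
export class MediaDownloadQueues extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const mediaDownloadDeadLetterQueue = new Queue(
      this,
      'MediaDownloadDeadLetterQueue',
      {
        fifo: true, // a FIFO source queue requires a FIFO DLQ
      },
    );

    new Queue(this, 'MediaDownloadQueue', {
      fifo: true,
      visibilityTimeout: Duration.seconds(30),
      contentBasedDeduplication: true,
      deadLetterQueue: {
        queue: mediaDownloadDeadLetterQueue, // not the transcription DLQ
        maxReceiveCount: 3,
      },
    });
  }
}
```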
JSON.stringify(job),
s3Key,
);
if (!isSqsFailure(messageResult) && translationRequested) {
might want to log an error if messageResult is an sqs failure
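Acting on that suggestion might look like the sketch below, where the `SqsResult` shape, the `errorMsg` field, and the `handleResult` helper are all hypothetical stand-ins for the repo's real types:

```typescript
// Hypothetical types standing in for the project's real SQS result types.
type SqsSuccess = { $metadata: object };
type SqsFailure = { failure: true; errorMsg: string };
type SqsResult = SqsSuccess | SqsFailure;

const isSqsFailure = (result: SqsResult): result is SqsFailure =>
  'failure' in result;

// Log (rather than silently swallow) a failed send before deciding
// whether to enqueue the follow-up translation message.
const handleResult = (
  messageResult: SqsResult,
  translationRequested: boolean,
): 'failed' | 'translate' | 'done' => {
  if (isSqsFailure(messageResult)) {
    console.error(
      `Failed to send transcription message: ${messageResult.errorMsg}`,
    );
    return 'failed';
  }
  return translationRequested ? 'translate' : 'done';
};
```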
id: string,
) => {
  const output =
    await $`yt-dlp --write-info-json --no-clean-info-json --newline -o "${destinationDirectoryPath}/${id}" ${url}`;
has yt-dlp been pretty consistent for youtube? I was noticing some flakiness using yt-dlp to download videos today.
As discussed, I've had some success with it. I'm hopeful that even if we end up having to use a different tool, most of this code will still be useful.
scripts/setup.sh
Outdated
@@ -45,7 +52,8 @@ DYNAMODB_ARN=$(aws --endpoint-url=http://localhost:4566 dynamodb create-table \

echo "Created table, arn: ${DYNAMODB_ARN}"

# media-downloader dependencies
brew install ffmpeg phantomjs
what about installing yt-dlp itself?
Good point. I've updated the PR description to address this, and now removed this brew command from setup.sh.
The way whisper works at the moment is that the worker runs a docker container with whisper installed, rather than actually executing whisper directly. I think we could do the same thing for media-download.
At the moment there's no guidance on installing yt-dlp in this PR, pending the next PR with various dockerfiles and other infrastructure bits in it. I could add a Pipfile (pipenv is my fav new python thing since you showed it to me), but given that it will just be for a short while before we swap out running yt-dlp directly for docker, maybe it's fine as is.
Ok, happy to wait for yt-dlp on docker
);
if (job) {
  const metadata = await downloadMedia(job.url, '/tmp', job.id);
  const key = await uploadToS3(
should we remove the file from the file system once it's uploaded?
At the moment I'm hoping we can run media-download on an ECS fargate task (like lurch ingest jobs) so the whole container will be deleted after the download/upload has finished
return await sendMessage(
  client,
  queueUrl,
  JSON.stringify({ ...job, translate: true }),
  s3Key,
);
I'm not sure I understand this bit - we put two messages in the queue if the user has requested a translation? Doesn't the worker create the translation as it's transcribing?
The way translation works at the moment is that we send two messages: the translation runs on one worker, the normal transcription on the other. My idea there was to speed up the transcription/translation - if we do it all on the same worker then it could take twice as long.
Ah ok thanks, I hadn't realized this.
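The fan-out described in this exchange can be sketched as follows - `enqueueJobs`, the injected `send` callback, and the `MediaJob` shape are illustrative, not the repo's actual helpers:

```typescript
interface MediaJob {
  id: string;
  url: string;
  translate?: boolean;
}

// Send one message for the plain transcription and, when requested, a
// second `translate: true` copy, so each job runs on its own worker in
// parallel rather than doubling the time on a single worker.
const enqueueJobs = async (
  send: (body: string) => Promise<void>,
  job: MediaJob,
  translationRequested: boolean,
): Promise<string[]> => {
  const bodies = [JSON.stringify(job)];
  if (translationRequested) {
    bodies.push(JSON.stringify({ ...job, translate: true }));
  }
  await Promise.all(bodies.map(send));
  return bodies;
};
```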
Force-pushed "…ed a job from sqs, uses yt-dlp to download and uploads the result to s3" from 82d39d6 to 99fc2db
@zekehuntergreen thanks v much for the review. I ran into an issue related to zx deprecating commonjs support and decided (again, I think!) to swap it out and use the basic node subprocess command instead - that change is here: 322acaf
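For reference, swapping the zx $`yt-dlp …` call for node's built-in subprocess API might look roughly like this - `runYtDlp` is a hypothetical name, with the flags copied from the snippet earlier in the thread:

```typescript
import { spawn } from 'node:child_process';

// Run yt-dlp via child_process instead of zx, collecting stderr and the
// exit code. Resolves on process close; rejects if yt-dlp can't be spawned.
const runYtDlp = (
  url: string,
  destinationDirectoryPath: string,
  id: string,
): Promise<{ code: number | null; stderr: string }> =>
  new Promise((resolve, reject) => {
    const child = spawn('yt-dlp', [
      '--write-info-json',
      '--no-clean-info-json',
      '--newline',
      '-o',
      `${destinationDirectoryPath}/${id}`,
      url,
    ]);
    const stderr: string[] = [];
    child.stderr.on('data', (chunk) => stderr.push(chunk.toString()));
    child.on('error', reject);
    child.on('close', (code) => resolve({ code, stderr: stderr.join('') }));
  });
```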
stderr: stderr.join(''),
code: code || undefined,
};
logger.info('Ignoring stdout to avoid logging sensitive data');
Is it important not to log stdout when running yt-dlp?
Good point, I've updated this ecff3ce
Force-pushed from 263b2f8 to 4ab40fa
Force-pushed from 4ab40fa to cc1248d
What does this change?
This PR introduces a new app to the transcription service - 'media download'. This service reads off a (new) SQS queue, parses the messages into a new
MediaDownloadJob
type, downloads the relevant url, uploads the resulting file to the 'source media bucket' used by the transcription service, and sends an SQS message to the transcription worker to tell it to transcribe the file.

To facilitate testing this service, I've also added a new endpoint to the transcription API, which pushes messages onto the queue based off the request body. You can test this endpoint by making a POST request to localhost:9103/api/transcribe-url with the following body:
Note - the infrastructure for the media download service doesn't exist yet; the project can only be run locally at the moment.
How to test
You'll need to manually install yt-dlp (https://github.com/yt-dlp/yt-dlp?tab=readme-ov-file#installation) - probably
pip install yt-dlp
is the easiest if you've got a suitable python environment set up. I haven't scripted this step because eventually we'll run yt-dlp using a docker container rather than installing it directly onto dev machines. This is how the existing whisper process works.

Run the transcription API and send the above request. If you get a 200 back, then run
npm run media-download::start
and you should see it read from SQS, download the video, upload the file to S3, and send a transcription SQS message.
Still todo