This application connects to The Movie Database (TMDB) API and extracts movie and cast details as flat files, intended to be loaded into a separate database. It can also connect to the target database to look up certain datasets, in particular which titles and cast members have already been extracted.
You will need to acquire a free TMDB API key to use this application. Details can be found at the location below:
https://developer.themoviedb.org/docs/authentication-application
- Requirements
- Installation
- Passwords
- App Configuration
- Usage
- API Listener Mode
- Docker Deployment with S3
- About the Author
python > 3.8
pandas >=1.5.3, <2.0.0
sqlalchemy >=2.0.4, <3.0.0
flask >=2.2.5, <3.0.0
waitress >=2.1.2, <3.0.0
psycopg2-binary >=2.9.5, <3.0.0
configparser >=5.3.0, <6.0.0
dateparser >=1.1.8, <2.0.0
keyring >=23.13.1, <24.0.0
nexus-utilities >=0.6.0, <1.0.0 # My custom utilities package
pip3 install "pandas>=1.5.3,<2.0.0"
pip3 install "sqlalchemy>=2.0.4,<3.0.0"
pip3 install "flask>=2.2.5,<3.0.0"
pip3 install "waitress>=2.1.2,<3.0.0"
pip3 install "psycopg2-binary>=2.9.5,<3.0.0"
pip3 install "configparser>=5.3.0,<6.0.0"
pip3 install "dateparser>=1.1.8,<2.0.0"
pip3 install "keyring>=23.13.1,<24.0.0"
pip3 install "nexus-utilities>=0.6.0,<1.0.0" # My custom utilities package
OR
# From the package root directory
pip3 install -r requirements.txt
From the package root:
pip3 install -e .
Note that this is needed so that relative imports function properly.
The modules for retrieving secured information are located in the nexus-utilities package. The desired method should be specified in the connections_config.ini file via 'password_method'. All methods accept two required strings, 'password_method' and 'password_key', plus a number of optional arguments, and return a string, 'secret_value'. See the documentation for nexus-utilities at https://github.com/james-larsen/nexus-utilities for more details.
If you do decide to use the keyring library, you will need to add entries using the "user_name" and "secret_key" from the connections_config.ini file for both the TMDB API key and the target DB password (optional):
keyring.set_password("{user_name}", "{secret_key}", "my_api_key")
keyring.set_password("{user_name}", "{secret_key}", "my_db_password")
Alternatively, if the below OS Environment variables are available, they will be used instead:
- NEXUS_TMDB_API_KEY
- NEXUS_TMDB_TARGET_DB_PASSWORD
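The lookup order described above can be sketched as follows. This is a minimal illustration, not the nexus-utilities implementation: the helper name `resolve_secret` and the sample values are hypothetical, and it assumes `keyring.get_password` is called with the same argument order as the `set_password` calls shown earlier.

```python
import os


def resolve_secret(env_var, user_name, secret_key, environ=os.environ):
    """Return a secret, preferring the OS environment over keyring.

    Hypothetical helper for illustration only.
    """
    value = environ.get(env_var)
    if value:
        return value
    # Imported lazily so the environment-variable path works
    # even when keyring is not installed.
    import keyring
    return keyring.get_password(user_name, secret_key)


# Example with an explicit mapping standing in for the real environment.
api_key = resolve_secret(
    "NEXUS_TMDB_API_KEY",
    "my_user",
    "my_secret_key",
    environ={"NEXUS_TMDB_API_KEY": "example-key"},
)
print(api_key)
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.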
The application is controlled by a number of configuration files, read using the ConfigParser library:
./config/app_config.ini
Controls the general behavior of the application (example values provided)
[app_settings]
# Root location to place extracted flat files
load_file_path = path/to/deliver/files
# Root location to place extracted images
images_path = path/to/images
# Restrict results to a specific language (2-letter ISO code, e.g. "en")
original_language = en
# Whether to include adult content in results. Accepts "Include", "Exclude" or "Only"
adult_content_flag = Exclude
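As a rough sketch of how these settings could be read with ConfigParser, the snippet below parses the example values and checks the adult content flag. The function name `load_app_settings` is hypothetical, and the case-insensitive handling of the flag is an assumption rather than documented behavior.

```python
import configparser

SAMPLE = """
[app_settings]
load_file_path = path/to/deliver/files
images_path = path/to/images
original_language = en
adult_content_flag = Exclude
"""

VALID_ADULT_FLAGS = {"include", "exclude", "only"}


def load_app_settings(ini_text):
    """Parse the [app_settings] section and validate adult_content_flag."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings = dict(parser["app_settings"])
    # Assumption: the flag is compared case-insensitively.
    flag = settings["adult_content_flag"].strip().lower()
    if flag not in VALID_ADULT_FLAGS:
        raise ValueError(f"Unexpected adult_content_flag: {flag!r}")
    settings["adult_content_flag"] = flag
    return settings


settings = load_app_settings(SAMPLE)
print(settings["original_language"], settings["adult_content_flag"])
```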
./config/connections_config.ini
Holds connection details for the TMDB API and target database. Builds a SQLAlchemy connection string with the following pattern:
'{connect_type}://{user_name}:{password}@{server_address}:{server_port}/{server_name}'
[target_connection]
# SQLAlchemy connection type
connect_type = postgresql+psycopg2
# Environment for connection (dev / qa / prod). Informational only
environment = dev
server_address = localhost
server_port = 5432
server_name =
# Method for retrieving secrets. Accepts "keyring", "secretsmanager" or "ssm"
password_method = ssm
# Access key for secrets retrieval method
password_access_key =
# Secret key for secrets retrieval method
password_secret_key =
# Endpoint URL for secrets retrieval method
password_endpoint_url = https://ssm.us-west-1.amazonaws.com
# Region name for secrets retrieval method
password_region_name = us-west-1
# Path for secrets retrieval method
password_password_path = my_program/passwords/dev
user_name =
# Reference value for retrieving the correct password using keyring library
secret_key =
[tmdb_api_connection]
# Method for retrieving secrets. Accepts "keyring", "secretsmanager" or "ssm"
password_method = ssm
# Access key for secrets retrieval method
password_access_key =
# Secret key for secrets retrieval method
password_secret_key =
# Endpoint URL for secrets retrieval method
password_endpoint_url = https://ssm.us-west-1.amazonaws.com
# Region name for secrets retrieval method
password_region_name = us-west-1
# Path for secrets retrieval method
password_password_path = my_program/passwords/dev
user_name =
secret_key =
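To make the connection string pattern concrete, here is a minimal sketch that fills it in from a [target_connection] section. The helper `build_connection_url` and the sample values ("movies", "loader", the password) are hypothetical. URL-encoding the password with `quote_plus` is an added precaution of this sketch, not something the application is known to do.

```python
import configparser
from urllib.parse import quote_plus

SAMPLE = """
[target_connection]
connect_type = postgresql+psycopg2
server_address = localhost
server_port = 5432
server_name = movies
user_name = loader
"""


def build_connection_url(section, password):
    """Fill in the documented pattern from a config section."""
    # quote_plus guards against characters such as "@" or "/" in the password.
    return (
        f"{section['connect_type']}://{section['user_name']}:{quote_plus(password)}"
        f"@{section['server_address']}:{section['server_port']}/{section['server_name']}"
    )


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
url = build_connection_url(parser["target_connection"], "p@ss/word")
print(url)
```

The resulting string is the kind of value that would typically be passed to `sqlalchemy.create_engine`.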
./config/output_files_config.ini
Controls which data files are generated by the application
[output_files]
# Title details
titles = True
# Title / genre combinations
title_genres = True
# Genre details
genres = True
# Title / spoken language combinations
title_spoken_languages = True
# Spoken language details
spoken_languages = True
# Title / production country combinations
title_production_countries = True
# Production country details
production_countries = True
# Title / production company combinations
title_production_companies = True
# Production company details
production_companies = True
# Title / collection combinations
title_collections = True
# Collection details
collections = True
# Title / keyword combinations
title_keywords = True
# Keyword details
keywords = True
# Person details
persons = True
# Person AKA details
person_aka = True
# Title / person combinations
title_cast = True
# Title / image combinations
title_images = True
# Titles no longer found
title_removed = True
# Persons no longer found
person_removed = True
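One way to interpret these True/False switches is with ConfigParser's boolean parsing. The sketch below lists the enabled outputs from a shortened sample; the function name `enabled_outputs` is hypothetical and the application may read the file differently.

```python
import configparser

SAMPLE = """
[output_files]
titles = True
title_genres = False
genres = True
"""


def enabled_outputs(ini_text):
    """Return the names of output files whose switch is set to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["output_files"]
    return [name for name in section if section.getboolean(name)]


print(enabled_outputs(SAMPLE))
```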
A number of SQL files are located in ./config/sql_queries/. These are used for connecting to the target DB and retrieving certain information, such as already loaded titles, or title cast without person details. This is completely optional: if proper details for the target DB or the SQL files are not provided, the functionality will not be used.
Note: The field names don't matter, but the number, order, and format of the fields do. If your database has no concept of "adult content", hardcode "F" where necessary.
./config/sql_queries/favorite_persons_sql.sql
List of favorite persons
Expected Select Output:
- person_id (int)
- adult_flag (str: "T" or "F")
./config/sql_queries/loaded_persons_sql.sql
List of already loaded persons
Expected Select Output:
- person_id (int)
- adult_flag (str: "T" or "F")
./config/sql_queries/loaded_title_cast_sql.sql
List of already loaded titles with cast linkages
Expected Select Output:
- tmdb_id (int)
./config/sql_queries/loaded_title_images_sql.sql
List of already loaded titles with downloaded images
Expected Select Output:
- tmdb_id (int)
./config/sql_queries/loaded_titles_sql.sql
List of already loaded titles
Expected Select Output:
- tmdb_id (int)
- adult_flag (str: "T" or "F")
./config/sql_queries/persons_missing_sql.sql
List of persons that appear in title cast, but do not have person details. I suggest using a "min" function on the "adult_flag" to include persons who appear in both adult and non-adult titles (unless you intend to use adult "Only", in which case use a "max" function)
Expected Select Output:
- person_id (int)
- adult_flag (str: "T" or "F")
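The effect of the suggested "min" aggregation can be seen in a small in-memory example. The table name `title_cast` and its columns are hypothetical stand-ins for your own schema, and a real query would also need to exclude persons who already have details loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE title_cast (tmdb_id INTEGER, person_id INTEGER, adult_flag TEXT);
    INSERT INTO title_cast VALUES
        (10, 1, 'T'),  -- person 1 appears in an adult title
        (11, 1, 'F'),  -- and also in a non-adult title
        (12, 2, 'T');  -- person 2 appears only in adult titles
""")

# 'F' sorts before 'T', so MIN yields 'F' for anyone with a non-adult credit.
rows = conn.execute("""
    SELECT person_id, MIN(adult_flag) AS adult_flag
    FROM title_cast
    GROUP BY person_id
    ORDER BY person_id
""").fetchall()
print(rows)
```

Swapping MIN for MAX would instead flag person 1 as "T", which matches the adult "Only" case mentioned above.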
./config/sql_queries/search_terms_sql.sql
List of string search terms to attempt to find titles with. Useful if you have another table containing title names you'd like to attempt to pull TMDB title data for
Expected Select Output:
- search_term (str)
./config/sql_queries/titles_missing_cast_sql.sql
List of loaded titles without any cast linkages
Expected Select Output:
- tmdb_id (int)
- adult_flag (str: "T" or "F")
./config/sql_queries/titles_missing_keywords_sql.sql
List of loaded titles without any keyword linkages
Expected Select Output:
- tmdb_id (int)
- adult_flag (str: "T" or "F")
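Because the number, order, and format of the fields matter, it can help to check a query's output before relying on it. The following is a hypothetical validator for the id-based queries above (it does not cover search_terms_sql.sql, which returns strings); it is not part of the application.

```python
def validate_rows(rows, expects_adult_flag=True):
    """Check (id,) or (id, adult_flag) rows against the expected format."""
    width = 2 if expects_adult_flag else 1
    for row in rows:
        if len(row) != width:
            raise ValueError(f"Expected {width} field(s), got {len(row)}: {row!r}")
        if not isinstance(row[0], int):
            raise ValueError(f"First field must be an int: {row!r}")
        if expects_adult_flag and row[1] not in ("T", "F"):
            raise ValueError(f"adult_flag must be 'T' or 'F': {row!r}")
    return True


# Shape expected from loaded_titles_sql.sql
print(validate_rows([(603, "F"), (604, "T")]))
# Shape expected from loaded_title_cast_sql.sql
print(validate_rows([(603,), (604,)], expects_adult_flag=False))
```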
Configure the following files (details above):
- ./config/app_config.ini
- ./config/connections_config.ini
- ./config/output_files_config.ini
- ./config/sql_queries/*.sql
The application uses a number of function arguments with optional flags. The flags will be described first, then the arguments along with which flags they are compatible with. Note that all arguments and flags have --long and -short representations.
- --api_host (-host): Host to launch API listener (default 'localhost')
- --port (-p): Port to listen for API requests (default '5002')
- --verbose_flag (-v): Whether to allow terminal output when in API listener mode
- --original_language (-lang): Primary language spoken in the title. Accepts 2-letter, lowercase ISO standard codes (e.g. "en")
- --min_runtime (-rt): Minimum title runtime. Use with caution, as many titles erroneously use 1 minute as a placeholder
- --adult_content_flag (-adult): Whether to include adult content in results. Accepts "include" ("i"), "exclude" ("e") or "only" ("o")
- --skip_loaded_titles (-skip): Add the "-skip" flag to avoid pulling titles already pulled previously. Relies on the target DB connection capabilities discussed above
- --search_terms (-search): Accepts multiple string arguments of keywords to search by
- --tmdb_ids (-tid): Accepts multiple tmdb_id integer values
- --person_ids (-pid): Accepts multiple person_id integer values
- --row_limit (-rl): Row limit to apply to requests. Note that the number of results returned won't always match this exactly, depending on the kind of data being retrieved. Used primarily to limit the number of requests sent to the TMDB API
- --time_window (-tw): Accepts "day" ("d") or "week" ("w")
- api_listener (api): Launch the app into API service mode
- --api_host
- --port
- --verbose_flag
- rebuild_settings (rs): Rebuild app settings. Useful if the app is running in API service mode, and a setting has been updated and needs to be reflected for future API requests
- display_missing_counts (dmc): Display the current number of missing cast, keywords and persons
- get_movies_updated_yesterday (gmuy): Retrieves movies changed yesterday
- --original_language
- --min_runtime
- --adult_content_flag
- get_movies_by_favorite_actor (gmbfa): Retrieve titles with specified persons as cast members
- --person_ids - If omitted, will utilize the results of the query "favorite_persons_sql.sql"
- --adult_content_flag
- --skip_loaded_titles
- --row_limit
- get_movies_by_search_terms (gmbst): Retrieve titles matching provided search terms
- --search_terms - If omitted, will utilize the results of the query "search_terms_sql.sql"
- --original_language
- --adult_content_flag
- --skip_loaded_titles
- --row_limit
- get_trending_movies (gtm): Retrieve trending titles over the last day or week. Returns a maximum of 20,000 titles
- --time_window
- --original_language
- --skip_loaded_titles
- --row_limit
- get_missing_title_keywords (gmtk): Retrieve keywords linkages for titles without any
- --tmdb_ids
- --adult_content_flag
- --row_limit
- get_missing_persons (gmp): Retrieve missing persons referenced in the title cast data
- --person_ids
- --adult_content_flag
- --row_limit
- get_missing_title_cast (gmtc): Retrieve cast linkages for titles without any
- --tmdb_ids
- --adult_content_flag
- --row_limit
- get_all_movies (gam): Downloads the daily full movie list with a subset of fields
- get_all_persons (gap): Downloads the daily full person list with a subset of fields
- reconcile_movies_against_full_list (rmafl): Checks the currently loaded list of titles against today's full list to identify removed titles. Note that the daily lists do not contain adult content, so if your data does, ensure "loaded_titles_sql" is configured properly to not trigger false positives as removed titles
- reconcile_persons_against_full_list (rpafl): Checks the currently loaded list of persons against today's full list to identify removed persons. Note that the daily lists do not contain adult content, so if your data does, ensure "loaded_persons_sql" is configured properly to not trigger false positives as removed persons
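The reconciliation idea in the last two entries amounts to a set difference. The sketch below is a hypothetical illustration, not the application's code. It filters out adult rows in Python to show why they would otherwise be reported as removed; in practice that filtering belongs in "loaded_titles_sql" or "loaded_persons_sql" as noted above.

```python
def find_removed_ids(loaded_rows, full_list_ids):
    """Return loaded non-adult ids that are absent from today's full list.

    loaded_rows: iterable of (id, adult_flag) tuples from the target DB.
    full_list_ids: iterable of ids from the daily full list.
    """
    full = set(full_list_ids)
    return sorted(
        item_id
        for item_id, adult_flag in loaded_rows
        # The daily lists contain no adult content, so adult rows are skipped.
        if adult_flag == "F" and item_id not in full
    )


loaded = [(1, "F"), (2, "F"), (3, "T")]
print(find_removed_ids(loaded, full_list_ids=[1]))
```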
python3 ../src/main.py dmc
python3 ../src/main.py get_movies_updated_yesterday -lang "en" -adult "exclude"
python3 ../src/main.py get_movies_by_favorite_actor -pid 1158 3223 -skip
python3 ../src/main.py gmbst -search "fast 9" "a christmas story" -lang "en" -adult "exclude" -skip
python3 ../src/main.py get_trending_movies -tw "week" -lang "ja" -rl 6000 --skip_loaded_titles
python3 ../src/main.py get_missing_title_keywords --adult_content_flag "include"
python3 ../src/main.py get_missing_persons -adult "o"
python3 ../src/main.py get_missing_title_cast -adult "e" --row_limit 500
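For readers curious how commands like these could be parsed, here is a minimal argparse sketch covering a few of the flags. It is an illustration only and is not taken from main.py; the alias handling for the adult content flag is an assumption based on the accepted values listed above.

```python
import argparse

ADULT_ALIASES = {"i": "include", "e": "exclude", "o": "only"}


def normalize_adult_flag(value):
    """Map short forms ("i", "e", "o") to their long equivalents."""
    value = ADULT_ALIASES.get(value.lower(), value.lower())
    if value not in ADULT_ALIASES.values():
        raise argparse.ArgumentTypeError(f"invalid adult content flag: {value!r}")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative parser only")
    parser.add_argument("function")
    parser.add_argument("--original_language", "-lang")
    parser.add_argument("--adult_content_flag", "-adult", type=normalize_adult_flag)
    parser.add_argument("--skip_loaded_titles", "-skip", action="store_true")
    parser.add_argument("--person_ids", "-pid", nargs="*", type=int)
    parser.add_argument("--row_limit", "-rl", type=int)
    return parser


args = build_parser().parse_args(
    ["get_movies_by_favorite_actor", "-pid", "1158", "3223", "-skip", "-adult", "e"]
)
print(args.function, args.person_ids, args.skip_loaded_titles, args.adult_content_flag)
```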
When deployed into a container using the attached Dockerfiles, the application will launch into API listener mode on localhost, port 5002 by default. This can be adjusted using one of the methods below:
Via command line:
python main.py --api_listener -host "0.0.0.0" -p "5002"
Via environment variables:
- NEXUS_TMDB_API_HOST
- NEXUS_TMDB_API_PORT
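A simple way to picture the environment-variable route is a lookup with the documented defaults as fallbacks. The function name `listener_address` is hypothetical, and how the application prioritizes these variables against command line values is not stated here, so treat this as a sketch.

```python
import os


def listener_address(environ=os.environ):
    """Return (host, port), falling back to the documented defaults."""
    host = environ.get("NEXUS_TMDB_API_HOST", "localhost")
    port = int(environ.get("NEXUS_TMDB_API_PORT", "5002"))
    return host, port


print(listener_address({}))
print(listener_address({"NEXUS_TMDB_API_HOST": "0.0.0.0", "NEXUS_TMDB_API_PORT": "8080"}))
```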
Once started, it can be accessed via the POST endpoint at 'http://{TARGET_HOST}:{PORT}/request'. You should pass a dictionary of arguments. Below is a list of accepted functions and arguments, which correspond to the detailed information above. The one most relevant to API mode is "rebuild_settings (rs)", which forces the application to reread its configuration files without having to be stopped.
Functions
- display_missing_counts (dmc)
- get_movies_updated_yesterday (gmuy)
- get_movies_by_favorite_actor (gmbfa)
- get_movies_by_search_terms (gmbst)
- get_trending_movies (gtm)
- get_missing_title_keywords (gmtk)
- get_missing_persons (gmp)
- get_missing_title_cast (gmtc)
- get_title_images_by_persons (gtibp)
- get_all_movies (gam)
- get_all_persons (gap)
- reconcile_movies_against_full_list (rmafl)
- reconcile_persons_against_full_list (rpafl)
- rebuild_settings (rs)
Arguments
- original_language
- min_runtime
- adult_content_flag
- skip_loaded_titles (Pass as "True/False")
- download_backdrops (Pass as "True/False")
- download_posters (Pass as "True/False")
- download_logos (Pass as "True/False")
- search_terms (Pass as ["List"])
- tmdb_ids (Pass as [List])
- person_ids (Pass as [List])
- row_limit
- time_window
Examples:
import requests
# Show the current missing cast, keyword and person counts
params = {
'function': 'display_missing_counts'
}
response = requests.post('http://localhost:5002/request', params=params)
params = {
'function': 'get_movies_updated_yesterday',
'adult_content_flag': 'Exclude',
'original_language': 'en'
}
response = requests.post('http://localhost:5002/request', params=params)
params = {
'function': 'get_movies_by_search_terms',
'search_terms': ['Christmas', 'New Years'],
'skip_loaded_titles': 'True',
'row_limit': '1000'
}
response = requests.post('http://localhost:5002/request', params=params)
params = {
'function': 'get_movies_by_favorite_actor',
'person_ids': [1158, 738],
'skip_loaded_titles': 'False',
'row_limit': '200'
}
response = requests.post('http://localhost:5002/request', params=params)
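Since the examples pass the dictionary through `params`, the arguments travel in the URL query string, and `requests` encodes list values as repeated keys. The standard-library sketch below shows the URL that results, which can be handy for debugging or for calling the endpoint from other tools. The helper `build_request_url` is hypothetical, and how the listener parses repeated keys is not confirmed here.

```python
from urllib.parse import urlencode


def build_request_url(function, host="localhost", port=5002, **arguments):
    """Build the request URL for the listener's /request endpoint."""
    params = {"function": function}
    for name, value in arguments.items():
        # Booleans are passed as the strings "True"/"False", per the argument list above.
        params[name] = str(value) if isinstance(value, bool) else value
    # doseq=True repeats the key once per list element.
    return f"http://{host}:{port}/request?{urlencode(params, doseq=True)}"


url = build_request_url(
    "get_movies_by_favorite_actor",
    person_ids=[1158, 738],
    skip_loaded_titles=False,
    row_limit=200,
)
print(url)
```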
A base Docker image has been created based on the "tiangolo/uwsgi-nginx" Linux image, with all necessary files and libraries deployed. It can be found at "jameslarsen42/nexus-tmdb-data-retriever-base". Alternatively, a Dockerfile has been included in this package if you wish to build it yourself.
Once you have the base image, you should customize the contents of the ./docker/s3/ folder and build them into a new image.
The Docker container uses s3fs-fuse to mount an S3 bucket to specific locations used by the application. An S3 bucket should be created with a certain sub-folder structure. A template can be found at ./templates/S3 Folder Structure/. Note that the "app_config.ini" in this folder has already been set up to point to the correct locations for Uploads, but other settings should be customized before uploading to S3. Similarly, make sure to customize "connections_config.ini" for your TMDB API key and target database.
The below environment variables are required to be defined when launching the container in order for the S3 mounts to work:
- NEXUS_TMDB_AWS_ACCESS_KEY_ID
- NEXUS_TMDB_AWS_SECRET_ACCESS_KEY
- NEXUS_TMDB_S3_SERVER_PATH
- NEXUS_TMDB_S3_BUCKET_NAME
You can specify variables directly if you like, but the simplest method is below, after customizing your "docker.env" file. Note that "--cap-add SYS_ADMIN --device /dev/fuse" is necessary for the S3 mounts to work properly. You can also pass the "--bash" flag to launch the container to a bash terminal. Otherwise, it will enter API listener mode with the default settings.
docker run --env-file file/path/to/docker.env --cap-add SYS_ADMIN --device /dev/fuse -it nexus-tmdb-data-retriever-s3:latest
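Before launching the container, it may be worth confirming that the four required variables are actually set. The check below is a hypothetical pre-flight helper, not something shipped with the image.

```python
import os

REQUIRED_VARS = (
    "NEXUS_TMDB_AWS_ACCESS_KEY_ID",
    "NEXUS_TMDB_AWS_SECRET_ACCESS_KEY",
    "NEXUS_TMDB_S3_SERVER_PATH",
    "NEXUS_TMDB_S3_BUCKET_NAME",
)


def missing_s3_variables(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with a partial environment: only the bucket name is defined.
missing = missing_s3_variables({"NEXUS_TMDB_S3_BUCKET_NAME": "my-bucket"})
print(missing)
```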
My name is James Larsen, and I have been working professionally as a Business Analyst, Database Architect and Data Engineer since 2007. While I specialize in Data Modeling and SQL, I am working to improve my knowledge in different data engineering technologies, particularly Python.