A JSON data-driven App and a growing number of MySQL schemas that automate the import of files, build custom "file data ingestion methods" into MySQL or MariaDB databases, and generate a well-documented data lineage audit trail.
Every file data record is traceable back to the computer, path, file, load process, parse process and import process from which the data originated.
Each "file data ingestion method" has a separate repository with a database schema designed for files-to-MySQL.
main:process_files Import Load Summary from ingesting the log files included in this repository.
config.json has Processes and Observers configured to share the seven (7) log format folders in the repository /data/ folder.
- Some Processes load files from folders into staging LOAD TABLES (`data_file_loader.py`), some execute MySQL stored procedures (`database_module.py`), and some perform Data Enhancements (`data_enrichment_geoip.py` and `data_enrichment_useragent.py`).
- All Process Datasets have an `attributes` property. The `attributes` property can have any number of properties the Process Module requires.
The `data_file_loader.py` module has attribute properties for `log_format`, `load_table`, `path`, `recursive`, `server` and `server_port` values.
The `database_module.py` module has attribute properties `module_name` and `module_parm1` values.
To process different log format files in different directories, the config.json Processes collection is populated with different combinations of Process Datasets.
Many config.json Process Datasets use `database_module.py` or `data_file_loader.py` as the `module_name` property. These two modules are often reused with different `attributes` property values.
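As a sketch of how the same two modules can be reused with different `attributes` values, a config.json Processes collection might look like the following (the key names beyond those listed above, and all values, are illustrative, not taken from the actual config):

```json
{
  "Processes": [
    {
      "id": 1,
      "status": "Active",
      "module_name": "data_file_loader.py",
      "attributes": {
        "log_format": "apache_combined",
        "load_table": "load_access_log",
        "path": "/data/apache_combined/",
        "recursive": true,
        "server": "www.example.com",
        "server_port": 443
      }
    },
    {
      "id": 2,
      "status": "Active",
      "module_name": "database_module.py",
      "attributes": {
        "module_name": "parse_access_log",
        "module_parm1": "apache_combined"
      }
    }
  ]
}
```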
- All Process Modules have a `process` method and a `ProcessProperties` subclass (`properties_process.py`).
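A minimal sketch of that Process Module contract follows. Only `process` and `ProcessProperties` come from the description above; every other name is an assumption for illustration:

```python
# Sketch of the Process Module contract: each module exposes a
# ProcessProperties subclass and a process() entry point.
# Names other than "process" and "ProcessProperties" are illustrative.

class ProcessProperties:
    """Base class holding a Process Dataset's attributes property."""
    def __init__(self, attributes: dict):
        self.attributes = attributes


class LoaderProperties(ProcessProperties):
    """Hypothetical subclass exposing data_file_loader.py-style attributes."""
    def __init__(self, attributes: dict):
        super().__init__(attributes)
        self.log_format = attributes.get("log_format")
        self.load_table = attributes.get("load_table")
        self.path = attributes.get("path")
        self.recursive = attributes.get("recursive", False)


def process(props: LoaderProperties) -> str:
    """Illustrative process() method: report what would be loaded."""
    return f"load {props.path} ({props.log_format}) into {props.load_table}"
```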
- `main:process_files` can be passed a collection filter parameter: a list of Processes (processID values) to execute, for any number of reasons. This makes the App more integrable and adaptable. If no parameter is passed, `main:process_files` executes the config.json Processes with status = 'Active'.
- All Observers watch for the arrival of new files using Observer dataset properties: `path`, `recursive` and `interval`.
- Each Observer dataset also has a `process_list` property holding a Python list of processid values, a subset of the `id` properties in the config.json Processes collection.
- The `process_list` property and the watchdog `event.src_path` property are passed to `main:process_files`, which overrides the configured Process executions.
- Multiple folders and formats can be processed by running different Observers with properties for different log formats and paths.
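Putting the Observer pieces together, the override hand-off can be sketched like this (the dataset keys match the description above; the function name and return shape are assumptions):

```python
def on_new_file(observer_dataset: dict, event_src_path: str) -> tuple[list[int], str]:
    """Sketch: when watchdog reports a new file under the Observer's path,
    the Observer's process_list and the event's src_path are handed to
    main:process_files, overriding the configured Process executions."""
    process_list = observer_dataset["process_list"]  # subset of Process ids
    return process_list, event_src_path              # args for main:process_files


# Hypothetical Observer dataset for one watched folder.
observer = {"path": "/data/nginx_combined/", "recursive": True,
            "interval": 5, "process_list": [4, 5]}
args = on_new_file(observer, "/data/nginx_combined/access.log")
```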
All processing stages (child processes) are encapsulated within one main:process_files (parent process) that captures process metrics, notifications and errors into database import tables.
Multiple access and error logs and formats can be loaded, parsed and imported along with User Agent parsing and IP Address Geolocation retrieval processes within a single main:process_files execution.
main:process_files executions (each with its own config.json file) can be configured to only load logs on a server (a single process), leaving the other processes to be executed by another main:process_files execution (another config.json file) on a centralized computer.
Python handles polling of log file folders and executing database LOAD DATA, Procedures, Functions and SQL Statements.
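As an illustration of the LOAD DATA step only (the table name, path and statement options are placeholders; the application's actual statement will set field and line handling per log format), a helper that builds the statement might look like:

```python
def build_load_data_sql(file_path: str, load_table: str) -> str:
    """Build a MySQL LOAD DATA LOCAL INFILE statement for a staging
    LOAD TABLE. Only a line terminator is shown here; real log loads
    typically add FIELDS/LINES clauses matching the log format."""
    return (
        f"LOAD DATA LOCAL INFILE '{file_path}' "
        f"INTO TABLE {load_table} "
        "LINES TERMINATED BY '\\n'"
    )
```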
Using the MaxMind GeoIP2 Python API, the application retrieves IP country, subdivision, city, autonomous system organization, network and coordinates information, stored and normalized into 6 database schema tables.
The application requires two GeoLite databases: City and ASN. GeoLite databases are subsets of the commercial databases with reduced coverage and accuracy. The application was tested with these databases:
- GeoLite2 databases from MaxMind, available under the GeoLite End User License Agreement (EULA), into which MaxMind continues to incorporate Creative Commons terms.
- DB-IP Lite databases from DB-IP, available under the Creative Commons Attribution 4.0 International License.
Using the user-agents package, the application retrieves browser, device and operating system information, stored and normalized into 11 database schema tables.
Single quotes around 'PyMySQL[rsa]' are required when installing on macOS.
| Python Package | Installation Command | GitHub Repository |
|---|---|---|
| PyMySQL | python -m pip install PyMySQL | PyMySQL/PyMySQL |
| user-agents | python -m pip install pyyaml ua-parser user-agents | selwin/python-user-agents |
| watchdog | python -m pip install watchdog | gorakhargosh/watchdog |
| python-dotenv | python -m pip install python-dotenv | theskumar/python-dotenv |
| geoip2 | python -m pip install geoip2 | maxmind/GeoIP2-python |
| tabulate | python -m pip install tabulate | astanin/python-tabulate |
ALTER TABLE http_logs.access_log ADD CONSTRAINT access_log_importfileid FOREIGN KEY (importfileid) REFERENCES system_files.import_file (id) ON DELETE RESTRICT ON UPDATE CASCADE;
mysql-http-logs includes all database DDL and build scripts for the database schema used in this repository.
The application determines which files have been processed using the import_file TABLE.
Each imported file has a record with name, path, size, created and modified attributes inserted during main:process_files.
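Gathering those per-file attributes can be sketched with the standard library as below (the dict keys mirror the attributes named above; the application's actual import_file column names and insert statement may differ):

```python
import os
from datetime import datetime, timezone

def file_record(path: str) -> dict:
    """Collect the attributes recorded per imported file: name, path,
    size, created, modified. Uses st_ctime/st_mtime from the filesystem;
    semantics of st_ctime vary by platform."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "path": os.path.dirname(path),
        "size": st.st_size,
        "created": datetime.fromtimestamp(st.st_ctime, tz=timezone.utc),
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    }
```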
The application runs with no need for user interaction. File deletion is not required by the application if files are desired for later reference.
From the documentation I have read, the NGINX standard access log format is the same as the Apache combined format; I have not verified this yet.
The NGINX log files in /data/nginx_combined/ and /data/nginx_error/ are from a new NGINX server.
Apache log formats have been thoroughly researched and tested.
Each log format has a Stored Procedure. More information will be added over the next few days.
The Apache and NGINX code demonstrates how to incorporate new log formats without modifying the code of current processes.
In my development queue: mysql-to-echarts, a visualization tool for the database schema. The Web interface consists of the Express web application framework with W2UI drill-down data grids for Data Point Details.



