
WillTheFarmer/files-to-mysql

Python handles File processing & Database handles Data processing

Application runs on Windows, Linux & MacOS - Database runs on MySQL & MariaDB

A JSON data-driven App with a growing number of MySQL schemas that automates the import of files, builds custom "file data ingestion methods" into MySQL or MariaDB databases, and generates a well-documented data lineage audit trail.

Every file data record is traceable back to the computer, path, file, load process, parse process and import process from which the data originated.

Each "file data ingestion method" has a separate repository with a database schema designed for files-to-mysql:

mysql-http-logs

mysql-system-logs

mysql-system-files

The current Apache and NGINX HTTP log files method is an example of a custom "file data ingestion method".

An Entity Relationship Diagram and a main:process_files Import Load Summary from ingesting log files are included in the repository.

Two Collections and a Factory Method

config.json has Processes and Observers configured to share the seven (7) log format folders in the repository /data/ folder.

Process properties - application Processes

Process Properties

Observer properties - application watchdog Observers

Observers Properties

  1. Some Processes load files from folders into staging LOAD TABLES (data_file_loader.py), some execute MySQL stored procedures (database_module.py) and some perform Data Enhancements (data_enrichment_geoip.py and data_enrichment_useragent.py).

  2. All Process Datasets have an attributes property, which can hold any number of properties the Process Module requires.

The data_file_loader.py module has attribute properties for log_format, load_table, path, recursive, server and server_port values.

The database_module.py module has attribute properties for module_name and module_parm1 values.

To process different log format files in different directories, the config.json Process collection is populated with different combinations of Process Datasets.

Many config.json Process datasets use database_module.py or data_file_loader.py as the module_name property. These two modules are often reused with different attributes property values.
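A Process Dataset carrying the data_file_loader.py attributes described above might look like the following sketch. The field values here are illustrative assumptions, not copied from the repository's config.json:

```json
{
  "id": 3,
  "status": "Active",
  "module_name": "data_file_loader.py",
  "attributes": {
    "log_format": "apache_combined",
    "load_table": "load_access_combined",
    "path": "./data/apache_combined/",
    "recursive": false,
    "server": "www.example.com",
    "server_port": 443
  }
}
```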

The data-driven properties allow flexibility and expandability:

  1. All Process Modules have a process method and a ProcessProperties subclass (properties_process.py).

  2. main:process_files can be passed a collection filter parameter - a Process list (processID) to execute for any number of reasons. This makes the App more integrable and adaptable.

If no parameter is passed, main:process_files executes the config.json Processes with status = 'Active'.
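The selection behavior above can be sketched as follows. The function name select_processes and the sample collection are assumptions for illustration; only the id, status and process-list semantics come from the text:

```python
import json

# Hypothetical subset of a config.json Processes collection.
CONFIG = json.loads("""
{
  "processes": [
    {"id": 1, "status": "Active",   "module_name": "data_file_loader.py"},
    {"id": 2, "status": "Inactive", "module_name": "database_module.py"},
    {"id": 3, "status": "Active",   "module_name": "database_module.py"}
  ]
}
""")

def select_processes(processes, process_list=None):
    """Mimic main:process_files selection: run an explicit Process list
    when one is passed, otherwise every Process with status = 'Active'."""
    if process_list:
        return [p for p in processes if p["id"] in process_list]
    return [p for p in processes if p["status"] == "Active"]

# No filter: only status = 'Active' Processes run.
print([p["id"] for p in select_processes(CONFIG["processes"])])       # [1, 3]
# An explicit filter overrides the status check.
print([p["id"] for p in select_processes(CONFIG["processes"], [2])])  # [2]
```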

  1. All Observers watch for the arrival of new files using the Observer dataset properties path, recursive and interval.

Each Observer dataset also has a process_list property. The process_list holds a Python List of [processid] values - a subset of the id properties from the config.json Processes collection.
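An Observer dataset combining the path, recursive, interval and process_list properties might look like this sketch. The values are illustrative assumptions, not copied from the repository's config.json:

```json
{
  "id": 1,
  "status": "Active",
  "path": "./data/apache_combined/",
  "recursive": false,
  "interval": 10,
  "process_list": [1, 3]
}
```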

List of Observers Watching: Observers Watching

The process_list property and the watchdog event.src_path property are passed to main:process_files, which overrides the configured Process executions.

Multiple folders and formats can be processed running different Observers with properties for different log formats and paths.

All processing stages (child processes) are encapsulated within one main:process_files (parent process) that captures process metrics, notifications and errors into database import tables.

Every log data record is traceable back to the computer, path, file, load process, parse process and import process from which the data originated.

Multiple access and error logs and formats can be loaded, parsed and imported, along with User Agent parsing and IP Address Geolocation retrieval processes, within a single main:process_files execution.

main:process_files executions (each with its own config.json file) can be configured to only load logs on a server (a single process), leaving the other processes to be executed by another main:process_files execution (config.json file) on a centralized computer.

Python handles polling the log file folders and executing database LOAD DATA statements, Procedures, Functions and SQL Statements.
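A load step of the kind described above might build a LOAD DATA statement from a Process Dataset's path and load_table attributes. This is a sketch under assumptions: the field delimiters and the build_load_statement helper are hypothetical, and the real staging table layouts are defined by the mysql-http-logs DDL scripts:

```python
def build_load_statement(path: str, load_table: str) -> str:
    """Build a LOAD DATA LOCAL INFILE statement for a staging LOAD TABLE.
    Delimiters shown here are illustrative, not the repository's actual ones."""
    return (
        f"LOAD DATA LOCAL INFILE '{path}' "
        f"INTO TABLE {load_table} "
        "FIELDS TERMINATED BY ' ' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )

sql = build_load_statement("./data/apache_combined/access.log",
                           "load_access_combined")
print(sql)

# Executing it through PyMySQL would follow the usual cursor pattern:
#   with connection.cursor() as cursor:
#       cursor.execute(sql)
#   connection.commit()
```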

Data Enrichments

IP Geolocation data integration

Using the MaxMind GeoIP2 Python API provides IP country, subdivision, city, system organization, network and coordinate information, stored and normalized into 6 database schema tables.

The application requires two GeoLite databases - City & ASN. GeoLite databases are subsets of the commercial databases with reduced coverage and accuracy. The application was tested with these databases:

  1. GeoLite2 databases at MaxMind, available under the MaxMind GeoLite End User License Agreement (EULA), which incorporates Creative Commons terms.

  2. DB-IP Lite databases at DB-IP, available under the Creative Commons Attribution 4.0 International License.

User-Agent data integration

Using the user-agents package provides browser, device and operating system information, stored and normalized into 11 database schema tables.

Required Python Packages

Single quotes around the 'PyMySQL[rsa]' package name are required on macOS.

| Python Package | Installation Command | GitHub Repository |
| --- | --- | --- |
| PyMySQL | python -m pip install PyMySQL | PyMySQL/PyMySQL |
| user-agents | python -m pip install pyyaml ua-parser user-agents | selwin/python-user-agents |
| watchdog | python -m pip install watchdog | gorakhargosh/watchdog |
| python-dotenv | python -m pip install python-dotenv | theskumar/python-dotenv |
| geoip2 | python -m pip install geoip2 | maxmind/GeoIP2-python |
| tabulate | python -m pip install tabulate | astanin/python-tabulate |

Database schema for the file import process | importfileid is a FOREIGN KEY

Entity Relationship Diagram

ALTER TABLE http_logs.access_log ADD CONSTRAINT access_log_importfileid FOREIGN KEY (importfileid) REFERENCES system_files.import_file (id) ON DELETE RESTRICT ON UPDATE CASCADE;

MySQL database schema DDL and build scripts

mysql-http-logs includes all database DDL and build scripts for the database schema used in this repository.

The application determines which files have been processed using the import_file TABLE. Each imported file has a record with name, path, size, created and modified attributes inserted during main:process_files.
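The already-processed check could be sketched like this. The file_signature and files_to_process helpers are hypothetical illustrations of the idea; the real application compares against import_file records in the database rather than an in-memory set:

```python
import tempfile
from pathlib import Path

def file_signature(path: Path) -> tuple:
    """Hypothetical signature mirroring the import_file record attributes:
    name, path, size and modified time."""
    stat = path.stat()
    return (path.name, str(path.parent), stat.st_size, stat.st_mtime)

def files_to_process(folder: Path, already_imported: set) -> list:
    """Return only files whose signature has not been recorded yet,
    so already-imported files can remain on disk for later reference."""
    return [p for p in sorted(folder.iterdir())
            if p.is_file() and file_signature(p) not in already_imported]

# Demo on a throwaway folder.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.log").write_text("entry\n")
(tmp / "b.log").write_text("entry\n")
already = {file_signature(tmp / "a.log")}   # pretend a.log was imported earlier
print([p.name for p in files_to_process(tmp, already)])   # ['b.log']
```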

The application runs with no need for user interaction. File deletion is not required by the application if files are desired for later reference.

new NGINX log formats and data - not tested yet

From the documentation, the NGINX standard access log format appears to be the same as Apache combined. I have not verified this yet.

The repository NGINX files, in /data/nginx_combined/ and /data/nginx_error/, are standard access and error format logs from a new NGINX server.

Apache log formats have been thoroughly researched and tested.

new NGINX MySQL procedural code - not tested yet

Each log format has a Stored Procedure. More information will be added over the next few days.

The Apache and NGINX code demonstrates how to incorporate new log formats without modifying the code of current processes.

Visual Interface App

In my development queue: mysql-to-echarts is a visualization tool for the database schema. The Web interface consists of the Express web application framework with W2UI drill-down data grids for Data Point Details.

Other Documents

Installation Instructions

Apache Format Information

Support Information

Change Log

About

files-to-MySQL is a Python ETL framework that ingests data from files and creates an audit trail from each record back to the file it came from in a fully-normalized relational database. DDL scripts create MySQL or MariaDB RDBMS schemas. It automates file processing from multiple clients to a central MySQL server. Small codebase & simple setup.
