
WillTheFarmer/files-to-mysql

Python handles File processing & Database handles Data processing

Application runs on Windows, Linux & MacOS - Database runs on MySQL & MariaDB

A JSON data-driven App with a growing number of MySQL schemas that automates the import of files, builds custom "file data ingestion methods" into MySQL or MariaDB databases, and generates a well-documented data lineage audit trail.

Every file data record is traceable back to the computer, path, file, load process, parse process and import process from which the data originated.

Each "file data ingestion method" has a separate repository with a database schema designed for files-to-mysql:

mysql-http-logs

mysql-system-logs

mysql-system-files

The current Apache and NGINX HTTP log files method is an example of a custom "file data ingestion method".

An Entity Relationship Diagram and a main:process_files Import Load Summary from ingesting log files are included in the repository.

Two Collections and a Factory Method

config.json has Processes and Observers configured to share the seven (7) log format folders in the repository /data/ folder.

Process properties - application Processes

Process Properties

Observer properties - application watchdog Observers

Observers Properties

  1. Some Processes load files from folders into staging LOAD TABLES (data_file_loader.py), some execute MySQL stored procedures (database_module.py) and some perform Data Enhancements (data_enrichment_geoip.py and data_enrichment_useragent.py).

  2. All Process Datasets have an attributes property, which can hold any number of properties the Process Module requires.

The data_file_loader.py module has attribute properties for log_format, load_table, path, recursive, server and server_port values.

The database_module.py module has attribute properties for module_name and module_parm1 values.

To process different log format files in different directories, the config.json Process collection is populated with different combinations of Process Datasets.

Many config.json Process datasets use database_module.py or data_file_loader.py as the module_name property. These two modules are often reused with different attributes property values.
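A Process Dataset carrying the data_file_loader.py attributes described above might look like the following sketch. The field values here are illustrative assumptions, not copied from the repository's config.json:

```json
{
  "id": 3,
  "status": "Active",
  "module_name": "data_file_loader.py",
  "attributes": {
    "log_format": "apache_combined",
    "load_table": "load_access_combined",
    "path": "./data/apache_combined/",
    "recursive": false,
    "server": "www.example.com",
    "server_port": 443
  }
}
```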

The data-driven properties allow flexibility and expandability:

  1. All Process Modules have a process method and a ProcessProperties subclass (properties_process.py).

  2. main:process_files can be passed a collection filter parameter - a Process list (processID) to execute for any number of reasons. This makes the App more integrable and adaptable.

If no parameter is passed, main:process_files executes the config.json Processes with status = 'Active'.
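The selection behavior above can be sketched as follows. The function name select_processes and the sample collection are assumptions for illustration; only the id, status and process-list semantics come from the text:

```python
import json

# Hypothetical subset of a config.json Processes collection.
CONFIG = json.loads("""
{
  "processes": [
    {"id": 1, "status": "Active",   "module_name": "data_file_loader.py"},
    {"id": 2, "status": "Inactive", "module_name": "database_module.py"},
    {"id": 3, "status": "Active",   "module_name": "database_module.py"}
  ]
}
""")

def select_processes(processes, process_list=None):
    """Mimic main:process_files selection: run an explicit Process list
    when one is passed, otherwise every Process with status = 'Active'."""
    if process_list:
        return [p for p in processes if p["id"] in process_list]
    return [p for p in processes if p["status"] == "Active"]

# No filter: only status = 'Active' Processes run.
print([p["id"] for p in select_processes(CONFIG["processes"])])       # [1, 3]
# An explicit filter overrides the status check.
print([p["id"] for p in select_processes(CONFIG["processes"], [2])])  # [2]
```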

  1. All Observers watch for the arrival of new files using the Observer dataset properties path, recursive and interval.

Each Observer dataset also has a process_list property. The process_list holds a Python List of [processid] values - a subset of the id properties from the config.json Processes collection.
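An Observer dataset combining the path, recursive, interval and process_list properties might look like this sketch. The values are illustrative assumptions, not copied from the repository's config.json:

```json
{
  "id": 1,
  "status": "Active",
  "path": "./data/apache_combined/",
  "recursive": false,
  "interval": 10,
  "process_list": [1, 3]
}
```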

List of Observers Watching: Observers Watching

The process_list property and the watchdog event.src_path property are passed to main:process_files, which overrides the configured Process executions.

Multiple folders and formats can be processed running different Observers with properties for different log formats and paths.

All processing stages (child processes) are encapsulated within one main:process_files (parent process) that captures process metrics, notifications and errors into database import tables.

Every log data record is traceable back to the computer, path, file, load process, parse process and import process from which the data originated.

Multiple access and error logs and formats can be loaded, parsed and imported, along with User Agent parsing and IP Address Geolocation retrieval processes, within a single main:process_files execution.

main:process_files executions (each with its own config.json file) can be configured to only load logs on a server (a single process), leaving the other processes to be executed by another main:process_files execution (config.json file) on a centralized computer.

Python handles polling the log file folders and executing database LOAD DATA statements, Procedures, Functions and SQL Statements.
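A load step of the kind described above might build a LOAD DATA statement from a Process Dataset's path and load_table attributes. This is a sketch under assumptions: the field delimiters and the build_load_statement helper are hypothetical, and the real staging table layouts are defined by the mysql-http-logs DDL scripts:

```python
def build_load_statement(path: str, load_table: str) -> str:
    """Build a LOAD DATA LOCAL INFILE statement for a staging LOAD TABLE.
    Delimiters shown here are illustrative, not the repository's actual ones."""
    return (
        f"LOAD DATA LOCAL INFILE '{path}' "
        f"INTO TABLE {load_table} "
        "FIELDS TERMINATED BY ' ' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )

sql = build_load_statement("./data/apache_combined/access.log",
                           "load_access_combined")
print(sql)

# Executing it through PyMySQL would follow the usual cursor pattern:
#   with connection.cursor() as cursor:
#       cursor.execute(sql)
#   connection.commit()
```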

Data Enrichments

IP Geolocation data integration

Using the MaxMind GeoIP2 Python API provides IP country, subdivision, city, system organization, network and coordinate information, stored and normalized into 6 database schema tables.

The application requires two GeoLite databases - City & ASN. GeoLite databases are subsets of the commercial databases with reduced coverage and accuracy. The application was tested with these databases:

  1. GeoLite2 databases at MaxMind, available under the MaxMind GeoLite End User License Agreement (EULA), which incorporates Creative Commons terms.

  2. DB-IP Lite databases at DB-IP, available under the Creative Commons Attribution 4.0 International License.

User-Agent data integration

Using the user-agents package provides browser, device and operating system information, stored and normalized into 11 database schema tables.

Required Python Packages

Single quotes around the 'PyMySQL[rsa]' package name are required on macOS.

| Python Package | Installation Command | GitHub Repository |
| --- | --- | --- |
| PyMySQL | python -m pip install PyMySQL | PyMySQL/PyMySQL |
| user-agents | python -m pip install pyyaml ua-parser user-agents | selwin/python-user-agents |
| watchdog | python -m pip install watchdog | gorakhargosh/watchdog |
| python-dotenv | python -m pip install python-dotenv | theskumar/python-dotenv |
| geoip2 | python -m pip install geoip2 | maxmind/GeoIP2-python |
| tabulate | python -m pip install tabulate | astanin/python-tabulate |

Database schema for the file import process | importfileid is a FOREIGN KEY

Entity Relationship Diagram

ALTER TABLE http_logs.access_log ADD CONSTRAINT access_log_importfileid FOREIGN KEY (importfileid) REFERENCES system_files.import_file (id) ON DELETE RESTRICT ON UPDATE CASCADE;

MySQL database schema DDL and build scripts

mysql-http-logs includes all database DDL and build scripts for the database schema used in this repository.

The application determines which files have been processed using the import_file TABLE. Each imported file has a record with name, path, size, created and modified attributes inserted during main:process_files.
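The already-processed check could be sketched like this. The file_signature and files_to_process helpers are hypothetical illustrations of the idea; the real application compares against import_file records in the database rather than an in-memory set:

```python
import tempfile
from pathlib import Path

def file_signature(path: Path) -> tuple:
    """Hypothetical signature mirroring the import_file record attributes:
    name, path, size and modified time."""
    stat = path.stat()
    return (path.name, str(path.parent), stat.st_size, stat.st_mtime)

def files_to_process(folder: Path, already_imported: set) -> list:
    """Return only files whose signature has not been recorded yet,
    so already-imported files can remain on disk for later reference."""
    return [p for p in sorted(folder.iterdir())
            if p.is_file() and file_signature(p) not in already_imported]

# Demo on a throwaway folder.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.log").write_text("entry\n")
(tmp / "b.log").write_text("entry\n")
already = {file_signature(tmp / "a.log")}   # pretend a.log was imported earlier
print([p.name for p in files_to_process(tmp, already)])   # ['b.log']
```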

The application runs with no need for user interaction. File deletion is not required by the application if files are desired for later reference.

new NGINX log formats and data - not tested yet

From the documentation, the NGINX standard access log format appears to be the same as Apache combined. I have not verified this yet.

The repository NGINX files, in /data/nginx_combined/ and /data/nginx_error/, are standard access and error format logs from a new NGINX server.

Apache log formats have been thoroughly researched and tested.

new NGINX MySQL procedural code - not tested yet

Each log format has a Stored Procedure. More information will be added over the next few days.

The Apache and NGINX code demonstrates how to incorporate new log formats without modifying the code of current processes.

Visual Interface App

In my development queue: mysql-to-echarts is a visualization tool for the database schema. The Web interface consists of the Express web application framework with W2UI drill-down data grids for Data Point Details.

Other Documents

Installation Instructions

Apache Format Information

Support Information

Change Log

About

files-to-MySQL is a Python ETL framework that ingests data from files and creates an audit trail from each record back to the file it came from in a fully-normalized relational database. DDL scripts create MySQL or MariaDB RDBMS schemas. It automates file processing from multiple clients to a central MySQL server. Small codebase & simple setup.
