WilsonH918/README.md

Hi 👋, I'm Wilson

A passionate Data Architect/Data Engineer

As a Data Architect/Engineer with a background in data engineering and data science, I design and deliver scalable lakehouse architectures, high-quality data pipelines, and reliable analytics ecosystems. I work across Python, SQL, and SAS, and have hands-on experience with Azure, Fabric, GCP, AWS, Snowflake, and Oracle ERP integrations. My recent work includes architecting Databricks Unity Catalog migrations, improving end-to-end pipeline performance, and implementing governance frameworks that ensure consistency and compliance. I also build big-data solutions with Spark and Hadoop and use DevOps practices to support CI/CD workflows.

I am certified as a Microsoft Azure Data Engineer and a Google Cloud Professional Data Engineer. My GitHub projects showcase my skills in data engineering, machine learning, custom Model Context Protocol (MCP) servers, and Retrieval-Augmented Generation (RAG) with vector databases.

Feel free to explore my GitHub profile for more on my projects and contributions to the open-source community.

Connect with me:

wilson-hsieh/

Some of my projects:

| Project Link | Tools | Project Description |
| --- | --- | --- |
| EnergyStocks Historical Price DataPipeline | PySpark, SQL, AWS (Lambda, EC2, S3), Snowflake (CDC), Power BI | A data pipeline that retrieves historical stock price data for S&P 500 listed energy companies, stores it in an AWS S3 bucket, and transforms it in a Snowflake data warehouse. AWS Lambda triggers a Python script that runs the pipeline on a schedule. |
| RAG based Document Retrieval with ChromaDB and Vector Embeddings | Python, OpenAI API, ChromaDB, LangChain, BeautifulSoup, Requests | ChromaQuery is an AI-powered knowledge retrieval system that combines retrieval-augmented generation (RAG), web scraping, and ChromaDB. It scrapes content from the web, stores articles as chunks in a database, and uses OpenAI embeddings and vector-based search to retrieve relevant articles and generate contextual answers. |
| FabricOps MCP-Driven Data Architecture | Python, Azure Functions, Model Context Protocol (MCP), VNet, Microsoft Fabric REST API, Power BI | A custom remote MCP server hosted on Azure Functions that acts as a data architecture integration layer for AI agents. It exposes MCP tools to enable secure orchestration of Microsoft Fabric and Power BI operations, such as workspace creation, report analysis, DAX execution, and automated fixes via REST APIs and enterprise-grade governance. |
| ERC20 Data Ingestion Pipeline | Python, Airflow (DAGs), PostgreSQL, Docker, Hadoop | Extracts ERC20 token data from Web3 using the Etherscan API and builds an ETL pipeline with Apache Airflow. The extracted data is loaded into a local PostgreSQL database on a daily schedule. The project involves Docker, Airflow DAGs, PostgreSQL, and HDFS. |
| DataOps Fabric Provisioner IaC | GitHub Actions, Python, Jinja2, Microsoft Fabric API | A CI/CD-style automation framework for provisioning Microsoft Fabric workspaces and lakehouses. It uses GitHub Actions pipelines for orchestration, Python scripts for authentication and deployment, and Jinja2 templates for dynamic configuration rendering. The solution enables Infrastructure-as-Code practices in Fabric, delivering reproducible, version-controlled, and scalable environment deployments across multiple clients and projects. |
| Real time Streaming of ERC20 Transactions with Kafka and Python | Python, SQL, Kafka, Docker, Web3 | Demonstrates how to build a real-time data pipeline for ERC20 token transactions. The project uses Apache Kafka, an open-source distributed streaming platform, to stream real-time data from the Etherscan API, a blockchain explorer for the Ethereum network, and stores the data in a local CSV file. |
| Thesis Code - Motion Heatmap and Machine Learning for Stair Climbing Detection | PySpark, Pandas, scikit-learn, TensorFlow, Matplotlib, Seaborn | This repository contains the code used to generate the results presented in my thesis titled "Motion Heatmap and Machine Learning for Stair Climbing Detection." In this thesis, we present a dataset of video data that includes bounding-box information and silhouette images, along with the methods used to process this data to detect human movements, trajectories over time, and the usage of each room in the home environment. |
| ERC20 MyToken | Solidity, Python, Web3, Blockchain | A simple ERC20 token contract written in Solidity. It allows for the creation, transfer, and burning of tokens. The contract also includes an onlyOwner modifier to restrict access to certain functions. |
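
To make the ERC20 MyToken entry above more concrete, here is a minimal Python sketch of the behaviour its description lists: creating, transferring, and burning tokens, with an owner-only guard. It is a plain in-memory model, not the repository's Solidity contract. The names `ToyToken`, `only_owner`, `mint`, and `burn` are hypothetical, and the choice to make only minting owner-restricted is an assumption, since the description does not say which functions the modifier protects.

```python
from functools import wraps


def only_owner(method):
    """Decorator standing in for a Solidity onlyOwner modifier (hypothetical name)."""
    @wraps(method)
    def guarded(self, sender, *args, **kwargs):
        if sender != self.owner:
            raise PermissionError("caller is not the owner")
        return method(self, sender, *args, **kwargs)
    return guarded


class ToyToken:
    """In-memory model of ERC20-style balances, for illustration only."""

    def __init__(self, owner, initial_supply):
        self.owner = owner
        self.total_supply = initial_supply
        self.balances = {owner: initial_supply}

    def balance_of(self, account):
        return self.balances.get(account, 0)

    def _debit(self, account, amount):
        # Shared check: amounts must be non-negative and covered by the balance.
        if amount < 0:
            raise ValueError("amount must be non-negative")
        if self.balance_of(account) < amount:
            raise ValueError("insufficient balance")
        self.balances[account] = self.balance_of(account) - amount

    def transfer(self, sender, recipient, amount):
        self._debit(sender, amount)
        self.balances[recipient] = self.balance_of(recipient) + amount

    @only_owner
    def mint(self, sender, recipient, amount):
        # Assumption: token creation is the owner-restricted function.
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self.balances[recipient] = self.balance_of(recipient) + amount
        self.total_supply += amount

    def burn(self, sender, amount):
        # Assumption: any holder may burn their own tokens.
        self._debit(sender, amount)
        self.total_supply -= amount


token = ToyToken(owner="alice", initial_supply=1000)
token.transfer("alice", "bob", 250)
token.burn("bob", 50)
token.mint("alice", "carol", 100)
print(token.balance_of("alice"), token.balance_of("bob"), token.total_supply)
# prints: 750 200 1050
```

Passing the sender explicitly plays the role of `msg.sender` in Solidity; a non-owner calling `mint` raises `PermissionError`, which is the Python analogue of the modifier reverting the transaction.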

💻 Tech Stack:

Cloud

Azure Google Cloud AWS

Languages

Python PowerShell Solidity SAS

Databases

Microsoft SQL Server MySQL Oracle SQL BigQuery Snowflake

Data Engineering

Docker Apache Airflow Apache Kafka Azure Data Factory Azure DevOps PySpark Terraform

Data Science & Machine Learning

NumPy Pandas PyTorch scikit-learn SciPy TensorFlow RAG

Visualization

Power BI Matplotlib Looker

Version Control

Git GitHub

Certification

Azure Data Engineer

Azure DP203 cert Azure DP600 cert

Pinned

  1. EnergyStocks_HistoricalPrice_DataPipeline (Python)
  2. Data_Pipeline_ETL_Pipeline_Web3_Token (Python): Data Pipeline Project
  3. Real-time-Streaming-of-ERC20-Transactions-with-Kafka-and-Python (Python)
  4. RAG-based-Document-Retrieval-with-ChromaDB-and-Vector-Embeddings (Python)
  5. Stair-Climbing-Descending-Analysis-using-Silhouettes (Python): Machine Learning Project
  6. Oracle-Fusion-Role-Automation-Pipeline (Python): Oracle Cloud Uploader