
Docker

  • Provides software in isolated containers

Data Pipelines

  • A data pipeline is a process that transforms data: it takes some input data and produces output data (see the sketch below)
  • Data pipelines can be concatenated
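
A minimal sketch of such a pipeline in Python (the file and column names are made up for illustration):

```python
import pandas as pd

def run_pipeline(input_path: str, output_path: str) -> None:
    """Input data -> transformation -> output data."""
    df = pd.read_csv(input_path)            # input data
    df = df[df["total_amount"] > 0]         # transformation: drop rows with a non-positive amount
    df.to_csv(output_path, index=False)     # output data, could be the input of the next pipeline

run_pipeline("raw_trips.csv", "clean_trips.csv")
```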

Postgres

  • Name: PostgreSQL
  • Free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance

Docker Image

  • Snapshot of a container; new, identical containers can be created from an image

Spark

  • Apache Spark is an open-source unified analytics engine for large-scale data processing
  • Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance
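
Spark is covered in a later module; just to illustrate the "interface for programming clusters" idea, a minimal PySpark sketch (assuming pyspark is installed and the CSV is available locally) could look like this:

```python
from pyspark.sql import SparkSession

# the same code runs locally or on a cluster; Spark parallelizes the work implicitly
spark = SparkSession.builder.appName("taxi-example").getOrCreate()

df = spark.read.csv("yellow_tripdata_2021-01.csv", header=True, inferSchema=True)
df.groupBy("payment_type").count().show()

spark.stop()
```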

Serverless

Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers. "Serverless" is a misnomer in the sense that servers are still used by cloud service providers to execute code for developers. However, developers of serverless applications are not concerned with capacity planning, configuration, management, maintenance, fault tolerance, or scaling of containers, VMs, or physical servers. Serverless computing does not hold resources in volatile memory; computing is rather done in short bursts with the results persisted to storage. When an app is not in use, there are no computing resources allocated to the app. Pricing is based on the actual amount of resources consumed by an application. It can be a form of utility computing. (Wikipedia)

(diagram: data pipeline)

(diagram: container example)

  • You can have multiple containers running Postgres; the databases in the containers are isolated from each other and will not interfere with each other
  • You don't need to install the software on your host machine, only Docker
  • Docker makes your work portable: you can run the Docker image in a different environment (e.g. in Google Cloud (Kubernetes), AWS Batch)

Why should we care about Docker?

  • Reproducibility
  • Local Experiments
  • Integration tests (CI/CD, use e.g. github actions, not covered in this course)
  • Running pipelines on the cloud (AWS Batch, Kubernetes jobs)
  • Spark (for defining data pipelines)
  • Serverless (useful for processing data, one record at a time; AWS Lambda, Google Cloud Functions)

Examples using Docker

  • docker run -it ubuntu bash
    • runs the ubuntu image in interactive mode and executes bash
  • docker run -it python:3.9
    • starts the python 3.9 image
    • drops you directly into the Python interpreter
  • docker run -it --entrypoint=bash python:3.9
    • same as before, but now it starts with bash
    • can e.g. be used to install packages
    • we can then start python with the command "python"
    • Problem: when we install a package here and exit the container, the package will not be there the next time we run the container, because changes to a running container are not saved in the image
  • In the Dockerfile we can specify all the things we want to have in our container
  • Build the image from a Dockerfile using docker build:
    • docker build -t test:pandas .
    • the . means that the image is built from the Dockerfile in the current folder (the build context)
  • After building the image you can run it:
    • docker run -it test:pandas
  • To run the example pipeline with the "day" argument:
    • docker run -it test:pandas 2023-01-13
    • everything after the image name is passed as an argument to the pipeline script (see the sketch below)
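
A hypothetical pipeline.py to illustrate how the argument arrives in the container: the Dockerfile's ENTRYPOINT runs the script, so everything after the image name ends up in sys.argv. The actual pipeline.py in the course repo may differ; this is only a sketch of the argument passing.

```python
import sys
import pandas as pd  # installed via the Dockerfile

# "docker run -it test:pandas 2023-01-13" -> sys.argv[1] == "2023-01-13"
day = sys.argv[1]

print(f"job started for day = {day}")
# ... some pandas transformation for that day would go here ...
print(f"job finished successfully for day = {day}")
```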

Practice SQL with postgres

  • run postgres in Docker
  • put some data to postgres with a python script
  • Use official docker image for postgres
    • postgres needs some configurations:
      ```bash
      # -e: environment configuration (user, password and the name of the database)
      # -v: mount a volume, i.e. map a folder on the host machine into the container
      # -p: map a port on the host machine to a port of the container
      #     (we send SQL requests to the database and it replies)
      docker run -it \
        -e POSTGRES_USER="root" \
        -e POSTGRES_PASSWORD="root" \
        -e POSTGRES_DB="ny_taxi" \
        -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data:rw \
        -p 5432:5432 \
        postgres:13
      ```
    
  • Using the volume allows us to preserve the changes in our database, i.e. they are not deleted when we stop the container
  • After running the container, Postgres has created some files in "ny_taxi_postgres_data"
  • We can start using it now!
  • pgcli is used as a client
  • Use pgcli to connect to Postgres: pgcli -h localhost -p 5432 -U root -d ny_taxi
    • -h: host, -p: port, -U: user, -d: database
  • If a locally installed Postgres is already running on port 5432, stop it first: sudo service postgresql stop
  • To test the connection, type \dt in the terminal. This won't show any tables yet, since our database is still empty.

Have a look at the Dataset

  • Download the data and upload it to the database.
  • Use Python to read the dataset and upload it to the database (a condensed sketch follows below this list)
    • Data upload: upload-data.ipynb / upload-data-parquet.ipynb (for parquet data)
  • Use SELECT count(1) FROM yellow_taxi_data to check the number of rows in the table
  • Use \dt to get an overview of the tables in the database
  • After uploading the data \dt shows us the created table
  • More exploration in pgcli, e.g.
    • SELECT max(tpep_pickup_datetime), min(tpep_pickup_datetime), max(total_amount) FROM yellow_taxi_data
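
A condensed sketch of what the upload notebook does (assuming pandas, SQLAlchemy and a Postgres driver such as psycopg2 are installed, and the CSV has been downloaded):

```python
import pandas as pd
from sqlalchemy import create_engine

# connection details match the docker run command above
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# read the CSV in chunks so it does not have to fit into memory at once
df_iter = pd.read_csv("yellow_tripdata_2021-01.csv", iterator=True, chunksize=100_000)

first_chunk = True
for df in df_iter:
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    # create/replace the table on the first chunk, append afterwards
    df.to_sql("yellow_taxi_data", con=engine, if_exists="replace" if first_chunk else "append")
    first_chunk = False
```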

Use a Jupyter Notebook to check the connection to the database

  • As an alternative to pgcli, we can use a jupyter notebook to check the connection to the database (check.conection.ipynb)
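
A minimal version of such a connection check (reusing the connection details from above) might look like this:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# roughly what \dt shows in pgcli: the tables in the public schema
print(pd.read_sql("SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public'", con=engine))

# quick sanity check on the uploaded data
print(pd.read_sql("SELECT count(1) FROM yellow_taxi_data", con=engine))
```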

Use pgAdmin

  • pgcli is a command line interface, good for a quick look at the data, but not very convenient
  • Use the web application pgAdmin instead
    • pgAdmin: feature rich Open Source administration and development platform for PostgreSQL
  • Use the official Docker image to run pgAdmin:

        ```bash
        docker run -it \
          -e PGADMIN_DEFAULT_EMAIL="[email protected]" \
          -e PGADMIN_DEFAULT_PASSWORD="root" \
          -p 8080:80 \
          dpage/pgadmin4
        ```
    
  • We now have 2 containers: 1 for postgres and 1 for pgadmin4. We need to connect them
      1. Create a network: docker network create pg-network
      2. When we run postgres, we have to specify that this container runs in this network:

          ```bash
          # -e: environment configuration (user, password and the name of the database)
          # -v: mount a volume, i.e. map a folder on the host machine into the container
          # -p: map a port on the host machine to a port of the container
          #     (we send SQL requests to the database and it replies)
          docker run -it \
            -e POSTGRES_USER="root" \
            -e POSTGRES_PASSWORD="root" \
            -e POSTGRES_DB="ny_taxi" \
            -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data:rw \
            -p 5432:5432 \
            --network=pg-network \
            --name pg-database \
            postgres:13
          ```
      
      3. Run pgAdmin in the same network:

          ```bash
          docker run -it \
            -e PGADMIN_DEFAULT_EMAIL="[email protected]" \
            -e PGADMIN_DEFAULT_PASSWORD="root" \
            -p 8080:80 \
            --network=pg-network \
            --name=admin \
            dpage/pgadmin4
          ```
      
  • Now go to "localhost:8080" in the browser
    • pgadmin should open
    • Login with: email: [email protected], password: root
    • Create a new server:
      • On the left-hand side, right-click on "Servers" -> Create -> Server
      • make the following specifications (on the "Connection" tab use the settings of the Postgres container: host pg-database, port 5432, username root, password root):

(screenshots: "General" and "Connection" tabs of the Create Server dialog)

  • press "Save", then it should look like this:

(screenshot: the new server shown in pgAdmin)

  • Now we can navigate to our data on the left hand side:

(screenshots: navigating to the yellow_taxi_data table)

  • For example we can have a look at the first 100 rows:

(screenshot: the first 100 rows of the table)

  • We can use the "Query Tool" to write queries:

(screenshots: the Query Tool)

Put Commands from the Notebook to a Script

  • put the commands from the jupyter notebook into a script called pipeline.py
  • In order to do that use jupyter: jupyter nbconvert --to=script upload-data.ipynb
  • Rename script to "ingest_data.py"
    • Clean the script
    • Add argparse and replace the hard-coded names with command-line parameters (see the sketch after this list)
  • run the script:

    ```bash
    URL="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
    python ingest_data.py \
      --user=root \
      --password=root \
      --host=localhost \
      --port=5432 \
      --db=ny_taxi \
      --table_name=yellow_taxi_trips \
      --url=${URL}
    ```
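
A sketch of the argparse part of ingest_data.py; the parameter names are taken from the call above, and the rest of the script (download, pandas upload) is omitted:

```python
import argparse

parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")
parser.add_argument("--user", help="user name for postgres")
parser.add_argument("--password", help="password for postgres")
parser.add_argument("--host", help="host for postgres")
parser.add_argument("--port", help="port for postgres")
parser.add_argument("--db", help="database name for postgres")
parser.add_argument("--table_name", help="name of the table to write the results to")
parser.add_argument("--url", help="url of the csv file")

args = parser.parse_args()
# args.user, args.password, ... are then used to build the connection string
# and to run the same upload logic as in the notebook
```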
    

Now use this in Docker

  • write Dockerfile
  • Build the image: docker build -t taxi_ingest:001 .
  • Run it:

    ```bash
    docker run -it \
      --network=pg-network \
      taxi_ingest:001 \
      --user=root \
      --password=root \
      --host=pg-database \
      --port=5432 \
      --db=ny_taxi \
      --table_name=yellow_taxi_trips \
      --url=${URL}
    ```

Docker compose

  • Specify one YAML file with the configuration of the two containers above
  • In general, the configurations of multiple related containers (services) are put into one file
```yaml
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw" # host path:container path:mode (read-write)
    ports:
      - "5432:5432" # host port:container port
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - [email protected]
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"
```
  • The two containers are then automatically part of one network
  • Run docker-compose up in the terminal
  • Run docker-compose down to end it
  • It can be run in detached mode, so that we get the terminal back after starting the containers: docker-compose up -d

SQL

Terraform

  • What is Terraform?
    • open source by HashiCorp
    • used for provisioning infrastructure resources (e.g. virtual machines, storage, networking resources, ...)
    • Provides a consistent CLI workflow to manage hundreds of cloud services
    • Supports DevOps best practices, with IaC approach
    • Manages configuration files in source control to maintain an ideal provisioning state for testing and production environments
  • What is IaC
    • Infrastructure-as-code
    • A framework that allows you to build, change, and manage your infrastructure in a safe, consistent and repeatable way by defining resource configurations that you can version, reuse, and share
    • Like git version control, but for infrastructure
  • Some Advantages
    • Infrastructure lifecycle management
    • version control commits
    • Very useful for stack-based deployments, and with cloud providers such as AWS, GCP, Azure, K8S, ...
    • State-based approach to track resource changes throughout deployments

Local Setup for Terraform and GCP

Terraform

GCP

For this course the free version (up to 300 Euro credits) is used.

  1. Create an account with your Google E-mail ID

  2. Setup your first project, e.g. "DTC DE Course", and note down the project ID

    • Make the project ID unique
    • The project ID can be changed by pressing the refresh button
    • Go to this project
    • Go to IAM & Admin -> Service Accounts
    • A "Service Account" is an account used by services (e.g. our pipelines) rather than by a person
  3. Set up a service account & authentication for this project, and download the auth-keys: Service Account

    • Use role "Viewer" to begin
    • When pressing "done":

(screenshot: the KEYS tab of the service account)

* There are no keys generated yet
* Go to actions -> Manage keys -> ADD KEY
* keep json
* Download the generated .json
  4. Download SDK for local setup

    • Check if gcloud is already installed: gcloud -v
    • If not, install it: SDK
  5. Set environment variable to point to your downloaded GCP auth-keys:

    • Go to the folder containing the downloaded key
    ```bash
    export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

    # Refresh the token and verify authentication
    gcloud auth application-default login
    ```
    
    • This is needed so that Google knows that it is your key
    • This will open the browser to authenticate

When this is done the local computer can interact with the cloud!

Exercise

Create infrastructure for our project with Terraform

Create two resources in the Google environment:

  • Google Cloud Storage (GCS): Data Lake
    • In a "Data Lake" all the raw data is stored
    • In an organized way
  • BigQuery: Data Warehouse
    • Data is more structured (e.g. tables,...)

Add permissions for your service account

Creating GCP Structure with Terraform

  • How to define resources in terraform?
    • main.tf
    • to start, you need a terraform block (terraform {}), which contains:
      • the terraform version
      • the backend, in this example "local"; in production you use your cloud environment
      • required providers, this is optional (think of it as importing a library in Python)
    • provider section
      • terraform relies on plugins called providers to interact with cloud providers
      • adds a set of predefined resources and data sources
    • resource
      • contains arguments to configure the resource
    • the main.tf file contains variables from the variables.tf file, written as "var."
      • variables.tf
        • contains "locals", which are constants
        • variables are generally passed at run-time
        • variables that contain "default" are optional run-time variables
        • variables without "default" are mandatory
  • Terraform has only a few execution commands:
    1. terraform init: initialize the working directory and install the required providers
    2. terraform plan: match the changes against the previous state (e.g. if you add another resource)
    3. terraform apply: apply the changes to the cloud
    4. terraform destroy: remove your stack from the cloud (usually it is advisable to destroy your resources when you don't need them, until the next time)
      • The next time you use "terraform apply" you will have back your state
      • This is a great advantage of using terraform, which has a state file
  • After running "terraform init" some more files are in the folder
    • .terraform: like any package manager, manages all packages
    • .terraform-version
    • .terraform.lock.hcl
  • When running "terraform plan" you need to type the project id, because this is a variable where no default was set
    • terraform plan -var="project=<project-id>"
  • Then two new resources are created
    • BigQuery
    • Bucket
  • Then in "terraform apply" you have to confirm the changes and then the new resources are created
    • To confirm: 1. type the project id, 2. type "yes"

How to set up an Environment in Google Cloud

Setup an Instance in Google Cloud

  • Go to the project
  • On the left-hand side menu select: Compute Engine -> VM Instances (when you do this the first time the API needs to be activated)
  • Generate SSH key:
    • https://cloud.google.com/compute/docs/connect/create-ssh-keys
    • cd .ssh
    • ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USER -b 2048
    • This creates a public and a private key
    • Add the public key to Google Cloud
    • On the left-hand side panel choose "Metadata"
    • Choose "SSH KEYS" -> "ADD SSH KEY" -> add the public key there -> "SAVE"
  • Create an instance
    • On the left-hand side panel move to "VM instances"
    • Click on "CREATE INSTANCE"
    • Set configurations: Choose Name, Region, Zone
    • Then we already see an estimate of the monthly costs
    • Choose instance type: "e2-standard-4"
    • Optional: change the Boot disk to "Ubuntu", 20.04 LTS, size: 30GB
    • "CREATE" (you can also see how to create it from the command line: "EQUIVALENT COMMAND LINE")
    • Now you can ssh to the instance using the external IP:
      • Move to the home directory
      • ssh -i ~/.ssh/gcp <user>@<external-ip> (e.g. ssh -i ~/.ssh/gcp froukje@<external-ip>)
  • Configure the instance
    • On the instance nothing is installed yet, except gcloud (check e.g. gcloud -v)
    • Install Anaconda
      • Download Anaconda: https://www.anaconda.com/products/distribution
      • Version for Linux: 64-Bit (x86), copy link
      • In terminal, that is connected to gcp: wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh
      • bash Anaconda3-2022.10-Linux-x86_64.sh
  • In the folder .ssh create a config file (touch config) with the following content:

     ```
     Host de-zoomcamp
         HostName 34.88.152.223
         User froukje
         IdentityFile /home/frauke/.ssh/gcp
     ```

     • Then we can use ssh de-zoomcamp
    
  • Install Docker
  • Install docker compose:
    • go to github: https://github.com/docker/compose/
    • select release on the right hand side (e.g. latest release)
    • choose "docker-compose-linux-x86_64" and copy the link
    • in our VM: create folder: mkdir bin and move there
    • wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-linux-x86_64 -O docker-compose
    • Make the file executable: chmod +x docker-compose
    • Make it visible/executable from everywhere -> add the bin directory to the PATH variable in .bashrc
      • open .bashrc and add at the end:
      • export PATH="${HOME}/bin:${PATH}"
  • Configure vs code to access the VM
    • Install the "Remote - SSH" extension
    • "Connect to Host": choose the host defined in the config file created above
  • Clone course repo:
    • git clone https://github.com/DataTalksClub/data-engineering-zoomcamp.git
  • Now we can move to the directory: data-engineering-zoomcamp/week_1_basics_n_setup/2_docker_sql and run docker-compose:
    • docker-compose up -d
  • Install pgcli: pip install pgcli
    • We can then use: pgcli -h localhost -U root -d ny_taxi (pw: root)
    • Use e.g. \dt
    • Alternatively, install it with conda: conda install -c conda-forge pgcli (run pip uninstall pgcli first)
      • Additionally install: pip install -U mycli
  • Forward postgres port (5432) to our local machine
    • In VS Code open a terminal (connected to GCP) -> go to the tab "PORTS" -> "Forward a Port"
    • Add 5432
    • Now we are able to access this port from our local machine
    • open new terminal and execute pgcli -h localhost -U root -d ny_taxi
    • Add port 8080
    • We can then access pgAdmin via localhost:8080 (login: [email protected], password: root)
  • Start Jupyter from GCP
    • In terminal type: jupyter-notebook &
    • Forward port 8888 (as before)
    • copy link (e.g.: http://localhost:8888/?token=68cabc42701839c0de3ecc8e0e367f3504868ce0b012db24) and open in browser
    • Alternative: login from new terminal and forward the port: ssh -L localhost:8888:localhost:8888 froukje@<external ip>
    • Note: a specific port can be chosen by starting jupyter-notebook --port=8888
    • Then in the browser type: localhost:8888/tree, if needed copy the token to access jupyter notebooks
  • Install terraform:
    • move to directory "bin"
    • https://developer.hashicorp.com/terraform/downloads
    • wget https://releases.hashicorp.com/terraform/1.3.7/terraform_1.3.7_linux_amd64.zip
    • sudo apt-get install unzip
    • unzip terraform_1.3.7_linux_amd64.zip
    • rm terraform_1.3.7_linux_amd64.zip
    • To use terraform we need to get our gcp credentials to the VM:
      • copy the file with e.g. sftp
      • sftp de-zoomcamp
      • cd to folder where it should be stored
      • put stoked-mode-375206-e914d123bc05.json
    • set up the Google credentials environment variable:
      • export GOOGLE_APPLICATION_CREDENTIALS=~/.gcp/stoked-mode-375206-e914d123bc05.json
    • Use this .json file to authenticate our CLI:
      gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
    • Now we can use terraform init, terraform plan ...
  • To stop the instance:
    • From the website: click "stop"
    • From the terminal: sudo shutdown now