Commit 24b4329

Merge pull request #3 from end-to-end-mlops-databricks/week1
Week1
2 parents 25dfb31 + 6d3a178 commit 24b4329

29 files changed, +10751 -24 lines changed
.github/workflows/ci.yml

Lines changed: 5 additions & 3 deletions
@@ -18,8 +18,10 @@ jobs:
   run: uv python install 3.11

 - name: Install the dependencies
-  run: uv sync
+  run: uv sync
+
+- name: Install pre-commit
+  run: pip install pre-commit

 - name: Run pre-commit checks
-  run: |
-    pre-commit run --all-files
+  run: pre-commit run --all-files

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -95,3 +95,8 @@ dmypy.json
 # VS code configuration
 .vscode
 .history
+
+.databricks
+
+# Ignore all .parquet files
+*.parquet

README.md

Lines changed: 65 additions & 3 deletions
@@ -3,17 +3,17 @@ Marvelous MLOps End-to-end MLOps with Databricks course

 ## Practical information
 - Weekly lectures on Wednesdays 16:00-18:00 CET.
-- Code for the lecture is shared before the lecture.
+- Code for the lecture is shared before the lecture.
 - Presentation and lecture materials are shared right after the lecture.
 - Video of the lecture is uploaded within 24 hours after the lecture.

-- Every week we set up a deliverable, and you implement it with your own dataset.
+- Every week we set up a deliverable, and you implement it with your own dataset.
 - To submit the deliverable, create a feature branch in that repository, and a PR to main branch. The code can be merged after we review & approve & CI pipeline runs successfully.
 - The deliverables can be submitted with a delay (for example, lecture 1 & 2 together), but we expect you to finish all assignments for the course before the 25th of November.


 ## Set up your environment
-In this course, we use Databricks 15.4 LTS runtime, which uses Python 3.11.
+In this course, we use Databricks 15.4 LTS runtime, which uses Python 3.11.
 In our examples, we use UV. Check out the documentation on how to install it: https://docs.astral.sh/uv/getting-started/installation/

 To create a new environment and create a lockfile, run:
@@ -24,3 +24,65 @@ source venv/bin/activate
 uv pip install -r pyproject.toml --all-extras
 uv lock
 ```
+
+
+
+Here’s an enhanced explanation of your dataset based on the information from `data_dictionary.csv`, covering each instrument's purpose and the fields it includes:
+
+### 1. **Identifier**
+- **`id`**: The unique identifier assigned to each participant, which is used to match records across different files and data sources.
+
+### 2. **Demographics**
+- **`Basic_Demos-Enroll_Season`**: The season during which a participant enrolled in the study, which may help in analyzing seasonal trends or impacts.
+- **`Basic_Demos-Age`**: The participant’s age, likely a key demographic feature.
+- **`Basic_Demos-Sex`**: Gender of the participant, encoded as `0` for Male and `1` for Female.
+
+### 3. **Internet Use and Educational History**
+- **`PreInt_EduHx-computerinternet_hoursday`**: Measures daily internet/computer usage hours before any intervention. This could provide a baseline for understanding internet dependency.
+- **`Parent-Child Internet Addiction Test (PCIAT)`**: Includes **`PCIAT-PCIAT_Total`**, a total score measuring the severity of internet addiction (compulsivity, escapism, and dependency). This score is pivotal as the **target variable `sii`** is derived from it, categorizing internet addiction into four levels:
+- `0`: None
+- `1`: Mild
+- `2`: Moderate
+- `3`: Severe
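The four `sii` levels listed in this README hunk can be turned into readable labels with a small lookup. The following is a minimal sketch, not code from the commit: the names `SII_LABELS` and `decode_sii` are hypothetical, and because the README does not state the `PCIAT-PCIAT_Total` cut-offs behind each level, the sketch only decodes codes that already exist.

```python
import math

# Label names are taken from the README text; the dict and function
# names are illustrative and do not appear in the repository.
SII_LABELS = {0: "None", 1: "Mild", 2: "Moderate", 3: "Severe"}


def decode_sii(value):
    """Return the severity label for an `sii` code, or None if it is missing."""
    # Treat both None and NaN as "no label available".
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return SII_LABELS[int(value)]


print([decode_sii(v) for v in (0, 1, 2, 3, None)])
```

Treating missing codes as `None` rather than as level `0` keeps unlabeled participants separate from participants who scored in the lowest category.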
+
+### 4. **Children's Global Assessment Scale (CGAS)**
+- **`CGAS-Season`**: Season when the assessment was conducted.
+- **`CGAS-CGAS_Score`**: A numerical scale used by mental health clinicians to assess general functionality in youth, with higher scores indicating better functioning.
+
+### 5. **Physical Measures**
+- **`Physical-Season`**: The season of data collection, which could affect measures like weight or blood pressure.
+- **`Physical-BMI`, `Physical-Height`, `Physical-Weight`, `Physical-Waist_Circumference`**: These biometric indicators measure aspects of the participant's physical health.
+- **`Physical-Diastolic_BP`, `Physical-HeartRate`, `Physical-Systolic_BP`**: Blood pressure and heart rate measurements are vital for understanding cardiovascular health.
+
+### 6. **FitnessGram and Treadmill Data**
+- **FitnessGram Vitals and Treadmill**: Cardiovascular fitness assessments, likely involving treadmill-based tests to evaluate endurance and physical capacity.
+- **FitnessGram Child**: Measures various aspects of physical fitness, including:
+- **Aerobic capacity**, **muscular strength**, **muscular endurance**, **flexibility**, and **body composition**.
+- These fields help assess the participant's overall fitness and physical health, relevant for understanding correlations with internet use or sleep quality.
+
+### 7. **Bio-electric Impedance Analysis (BIA)**
+- Provides in-depth body composition data, including:
+- **BMI**, **body fat percentage**, **lean muscle mass**, and **water content**.
+- These measurements are essential for a comprehensive view of physical health and can be related to other health metrics, such as sleep or mental well-being.
+
+### 8. **Physical Activity Questionnaire (PAQ)**
+- **`PAQ_A` and `PAQ_C`**: Both assess the participant’s physical activity level over the last week, specifically focusing on vigorous activities. This is relevant for gauging overall physical engagement and comparing it with sedentary behaviors like internet use.
+
+### 9. **Sleep Disturbance Scale (SDS)**
+- Designed to categorize sleep disorders in children, this scale includes **Sleep Disturbance Scores** that could help in analyzing the relationship between sleep quality and variables like screen time or physical fitness.
+
+### 10. **Actigraphy Data**
+- **Accelerometer Data**: Includes continuous measurements for up to 30 days, capturing data on physical movement and activity trends in natural settings.
+- **X, Y, Z axes**: Measure acceleration along each axis to capture movement intensity.
+- **ENMO**: Calculates net motion, where zero indicates inactivity, which could correspond to periods of sleep or rest.
+- **Angle-Z**: Measures the angle of the arm relative to a horizontal plane, which could help in detecting activity types.
+- **Non-wear flag**: Identifies when the accelerometer wasn’t worn, aiding in filtering out non-activity data.
+- **Ambient Light, Battery Voltage, Time of Day, Weekday, Quarter, Relative Date**: Provides contextual data that can be used to understand behavioral and temporal patterns.
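To make the ENMO and non-wear entries above concrete, here is a hedged sketch rather than the course's implementation. It uses the usual definition of ENMO (Euclidean Norm Minus One: the magnitude of the acceleration vector in g, minus 1 g for gravity, with negative results clipped to zero). The dictionary keys, the assumption that a flag of `0` means "worn", and the three sample readings are all made up for illustration; the README does not specify the file's column names or flag encoding.

```python
import math


def enmo(x, y, z):
    """Euclidean Norm Minus One, in g, with negative values clipped to zero."""
    return max(0.0, math.sqrt(x * x + y * y + z * z) - 1.0)


def mean_enmo_when_worn(samples):
    """Average ENMO over samples flagged as worn; None if no sample remains."""
    # Assumption: non_wear_flag == 0 means the device was being worn.
    worn = [enmo(s["x"], s["y"], s["z"]) for s in samples if s["non_wear_flag"] == 0]
    return sum(worn) / len(worn) if worn else None


# Toy readings invented for this example, not taken from the dataset.
samples = [
    {"x": 0.0, "y": 0.0, "z": 1.0, "non_wear_flag": 0},  # at rest: gravity only
    {"x": 0.0, "y": 3.0, "z": 4.0, "non_wear_flag": 0},  # strong movement
    {"x": 0.0, "y": 0.0, "z": 1.0, "non_wear_flag": 1},  # not worn, so dropped
]
print(mean_enmo_when_worn(samples))
```

Dropping non-wear samples before averaging matters because a device lying on a table also reads an ENMO of zero, which would otherwise be counted as rest or sleep.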
+
+### Summary of Data Utility
+This dataset provides a holistic view of each participant’s demographic, physical, mental, and behavioral characteristics. By combining data on internet use, sleep disturbance, physical fitness, body composition, and actigraphy, the study is positioned to explore the relationships between sedentary behaviors, physical health, mental well-being, and potential internet addiction.
+
+This setup could support various analyses:
+1. **Predicting Internet Addiction Levels**: Using `PCIAT` scores and demographic/health data.
+2. **Correlating Physical Activity with Internet Use or Sleep**: Using actigraphy and PAQ data.
+3. **Analyzing Sleep and Health Relationships**: Leveraging SDS data with physical and mental health scores.
