Week1 #3

Merged
merged 36 commits into from
Nov 20, 2024
59b642b
added notebooks src project_config.yml data
javedhassans Oct 27, 2024
5cb3447
added visualization feature plot importance code in utils.py
javedhassans Oct 27, 2024
1fa18e5
added new build
javedhassans Oct 28, 2024
830c17d
added new build for databricks
javedhassans Oct 28, 2024
0aac699
added files in notebook named 02 & 03 from trainig. Also childHealth_…
javedhassans Oct 28, 2024
dae772a
fixed dbconnect and ran files till 03 with dbconnect
javedhassans Oct 29, 2024
1476b30
deleted package to rebuild again
javedhassans Oct 30, 2024
028c957
deleted package to rebuild again
javedhassans Oct 30, 2024
31c3985
added ReadME.md gitignore to avoid parquet files added datadictionary…
javedhassans Oct 30, 2024
6a0b5dc
restructured folder and created week1 directorty
javedhassans Oct 30, 2024
c0279f8
added tqdm in project.toml making 01.dataPreprocessing.py
javedhassans Oct 30, 2024
a38f48e
just making commit
javedhassans Oct 31, 2024
bb7494b
working on preprocessing parquet files
javedhassans Oct 31, 2024
9de1dd6
added main codes in the 00.dataexploration.ipynb modified project.tom…
javedhassans Oct 31, 2024
1f1a36a
added featue table code
javedhassans Nov 3, 2024
6e1a9ed
modifed childHealth_model.py
javedhassans Nov 5, 2024
335569a
deleted mlruns
javedhassans Nov 5, 2024
3da76c3
made uv build
javedhassans Nov 5, 2024
34c4ff5
rebuild the wheel
javedhassans Nov 9, 2024
dfc2359
rebuild the wheel
javedhassans Nov 11, 2024
d6ac83e
rebuild the wheel
javedhassans Nov 11, 2024
0927a01
rebuild the wheel
javedhassans Nov 11, 2024
3a5d926
rebuild the wheel
javedhassans Nov 11, 2024
3dba4f0
added create feature table
javedhassans Nov 11, 2024
00b1028
modififed create feature table code
javedhassans Nov 11, 2024
22032b1
modififed create feature table code
javedhassans Nov 11, 2024
6a42c65
updated feature table
javedhassans Nov 11, 2024
0871415
updated feature table
javedhassans Nov 12, 2024
695512c
finished week3
javedhassans Nov 13, 2024
6ba90c6
finished week3
javedhassans Nov 13, 2024
5811b46
finished week2
javedhassans Nov 13, 2024
8c35306
fixing ci.yml for pr
javedhassans Nov 13, 2024
92a82b1
fixed week 2 notebooks commented pip install line 1
javedhassans Nov 13, 2024
2f38f32
fixed week 2 notebooks commented pip install line 1
javedhassans Nov 17, 2024
fd41ed9
fixed week1
javedhassans Nov 19, 2024
6d3a178
fixed week1
javedhassans Nov 19, 2024
8 changes: 5 additions & 3 deletions .github/workflows/ci.yml
@@ -18,8 +18,10 @@ jobs:
run: uv python install 3.11

- name: Install the dependencies
run: uv sync

- name: Install pre-commit
run: pip install pre-commit
Comment on lines +23 to +24
🛠️ Refactor suggestion

Use uv instead of pip for consistency

The workflow is using uv as the package manager, but this step uses pip. For consistency and to avoid potential dependency resolution issues, consider using uv here as well.

- - name: Install pre-commit
-   run: pip install pre-commit
+ - name: Install pre-commit
+   run: uv pip install pre-commit
  - name: Run pre-commit checks
-   run: |
-     pre-commit run --all-files
+   run: pre-commit run --all-files
5 changes: 5 additions & 0 deletions .gitignore
@@ -95,3 +95,8 @@ dmypy.json
# VS code configuration
.vscode
.history

.databricks

# Ignore all .parquet files
*.parquet
68 changes: 65 additions & 3 deletions README.md
@@ -3,17 +3,17 @@ Marvelous MLOps End-to-end MLOps with Databricks course

## Practical information
- Weekly lectures on Wednesdays 16:00-18:00 CET.
- Code for the lecture is shared before the lecture.
- Presentation and lecture materials are shared right after the lecture.
- Video of the lecture is uploaded within 24 hours after the lecture.

- Every week we set up a deliverable, and you implement it with your own dataset.
- To submit the deliverable, create a feature branch in that repository and open a PR to the main branch. The code can be merged after we review and approve it and the CI pipeline runs successfully.
- The deliverables can be submitted with a delay (for example, lecture 1 & 2 together), but we expect you to finish all assignments for the course before the 25th of November.


## Set up your environment
In this course, we use Databricks 15.4 LTS runtime, which uses Python 3.11.
In our examples, we use UV. Check out the documentation on how to install it: https://docs.astral.sh/uv/getting-started/installation/

To create a new environment and create a lockfile, run:
@@ -24,3 +24,65 @@ source venv/bin/activate
uv pip install -r pyproject.toml --all-extras
uv lock
```



The dataset is described below, based on `data_dictionary.csv`, covering each instrument's purpose and the fields it includes:

### 1. **Identifier**
- **`id`**: The unique identifier assigned to each participant, which is used to match records across different files and data sources.

### 2. **Demographics**
- **`Basic_Demos-Enroll_Season`**: The season during which a participant enrolled in the study, which may help in analyzing seasonal trends or impacts.
- **`Basic_Demos-Age`**: The participant’s age, likely a key demographic feature.
- **`Basic_Demos-Sex`**: Sex of the participant, encoded as `0` for Male and `1` for Female.

### 3. **Internet Use and Educational History**
- **`PreInt_EduHx-computerinternet_hoursday`**: Measures daily internet/computer usage hours before any intervention. This could provide a baseline for understanding internet dependency.
- **Parent-Child Internet Addiction Test (PCIAT)**: Includes **`PCIAT-PCIAT_Total`**, a total score measuring the severity of internet addiction (compulsivity, escapism, and dependency). The **target variable `sii`** is derived from this score and categorizes internet addiction into four levels:
  - `0`: None
  - `1`: Mild
  - `2`: Moderate
  - `3`: Severe
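
As a minimal sketch of working with these codes, the snippet below turns the `Basic_Demos-Sex` and `sii` values into readable labels. The label maps come from the descriptions above; the helper name `decode_record`, the output keys, and the example record are hypothetical and only for illustration.

```python
# Label maps taken from the field descriptions above.
SEX_LABELS = {0: "Male", 1: "Female"}
SII_LABELS = {0: "None", 1: "Mild", 2: "Moderate", 3: "Severe"}


def decode_record(record: dict) -> dict:
    """Return a copy of `record` with readable sex and sii labels added.

    Missing or unrecognised codes map to None instead of raising,
    because participants may have no recorded value for a field.
    """
    decoded = dict(record)
    decoded["sex_label"] = SEX_LABELS.get(record.get("Basic_Demos-Sex"))
    decoded["sii_label"] = SII_LABELS.get(record.get("sii"))
    return decoded


# Hypothetical record, for illustration only (not real study data).
example = {"id": "example-001", "Basic_Demos-Sex": 1, "sii": 2}
print(decode_record(example)["sii_label"])
```

Keeping the raw integer codes and adding labels as separate keys leaves `sii` usable as an ordinal target while still making plots and reports readable.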

### 4. **Children's Global Assessment Scale (CGAS)**
- **`CGAS-Season`**: Season when the assessment was conducted.
- **`CGAS-CGAS_Score`**: A numerical scale used by mental health clinicians to assess general functionality in youth, with higher scores indicating better functioning.

### 5. **Physical Measures**
- **`Physical-Season`**: The season of data collection, which could affect measures like weight or blood pressure.
- **`Physical-BMI`, `Physical-Height`, `Physical-Weight`, `Physical-Waist_Circumference`**: These biometric indicators measure aspects of the participant's physical health.
- **`Physical-Diastolic_BP`, `Physical-HeartRate`, `Physical-Systolic_BP`**: Blood pressure and heart rate measurements are vital for understanding cardiovascular health.
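
Because BMI is a function of height and weight, the recorded `Physical-BMI` can be sanity-checked against the other two fields. The sketch below assumes nothing about the dataset's units (they are not stated above, so check `data_dictionary.csv` first) and provides both the metric formula and the conventional pounds/inches form with the factor 703. The function names and the tolerance are hypothetical.

```python
def bmi_metric(weight_kg: float, height_m: float) -> float:
    """BMI from kilograms and metres: weight / height squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lb: float, height_in: float) -> float:
    """BMI from pounds and inches, using the conventional factor 703."""
    if weight_lb <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    return 703.0 * weight_lb / height_in ** 2


def bmi_matches(recorded: float, computed: float, tolerance: float = 0.5) -> bool:
    """True when a recorded BMI agrees with a recomputed one within `tolerance`."""
    return abs(recorded - computed) <= tolerance


# Hypothetical values, for illustration only.
print(round(bmi_metric(70.0, 1.75), 1))
```

Rows where the recorded and recomputed values disagree are candidates for a unit mix-up or a data-entry error rather than something to drop automatically.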

### 6. **FitnessGram and Treadmill Data**
- **FitnessGram Vitals and Treadmill**: Cardiovascular fitness assessments, likely involving treadmill-based tests to evaluate endurance and physical capacity.
- **FitnessGram Child**: Measures various aspects of physical fitness, including **aerobic capacity**, **muscular strength**, **muscular endurance**, **flexibility**, and **body composition**.
- These fields help assess the participant's overall fitness and physical health, relevant for understanding correlations with internet use or sleep quality.

### 7. **Bio-electric Impedance Analysis (BIA)**
- Provides in-depth body composition data, including **BMI**, **body fat percentage**, **lean muscle mass**, and **water content**.
- These measurements are essential for a comprehensive view of physical health and can be related to other health metrics, such as sleep or mental well-being.

### 8. **Physical Activity Questionnaire (PAQ)**
- **`PAQ_A` and `PAQ_C`**: The adolescent (`PAQ_A`) and child (`PAQ_C`) versions of the questionnaire. Both assess the participant’s physical activity level over the last week, specifically focusing on vigorous activities. This is relevant for gauging overall physical engagement and comparing it with sedentary behaviors like internet use.

### 9. **Sleep Disturbance Scale (SDS)**
- Designed to categorize sleep disorders in children, this scale includes **Sleep Disturbance Scores** that could help in analyzing the relationship between sleep quality and variables like screen time or physical fitness.

### 10. **Actigraphy Data**
- **Accelerometer Data**: Includes continuous measurements for up to 30 days, capturing data on physical movement and activity trends in natural settings.
  - **X, Y, Z axes**: Measure acceleration along each axis to capture movement intensity.
  - **ENMO** (Euclidean Norm Minus One): Net motion, computed as the magnitude of the three-axis acceleration minus 1 g with negative values set to zero. Values near zero indicate inactivity, which could correspond to periods of sleep or rest.
  - **Angle-Z**: Measures the angle of the arm relative to a horizontal plane, which could help in detecting activity types.
  - **Non-wear flag**: Identifies when the accelerometer wasn’t worn, aiding in filtering out non-activity data.
  - **Ambient Light, Battery Voltage, Time of Day, Weekday, Quarter, Relative Date**: Provide contextual data that can be used to understand behavioral and temporal patterns.
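
To make the two derived signals concrete, here is a minimal sketch of the conventional definitions of ENMO and the z-angle, computed from raw X, Y, Z readings in units of g. It is an assumption that the dataset's own columns follow these exact formulas (they may include extra calibration or smoothing). The dictionary keys and the convention that a flag of `0` means "worn" are also assumptions to verify against the data.

```python
import math


def enmo(x: float, y: float, z: float) -> float:
    """Euclidean Norm Minus One, in g, with negative values clipped to zero."""
    return max(0.0, math.sqrt(x * x + y * y + z * z) - 1.0)


def angle_z(x: float, y: float, z: float) -> float:
    """Angle in degrees between the z-axis reading and the horizontal plane.

    atan2 keeps the result in [-90, 90] and avoids dividing by zero
    when x and y are both zero.
    """
    return math.degrees(math.atan2(z, math.hypot(x, y)))


def worn_samples(samples: list[dict]) -> list[dict]:
    """Keep samples whose non-wear flag is 0 (assumed to mean 'worn')."""
    return [s for s in samples if s.get("non_wear_flag") == 0]


# Hypothetical samples, for illustration only.
samples = [
    {"X": 0.0, "Y": 0.0, "Z": 1.0, "non_wear_flag": 0},  # at rest, z axis up
    {"X": 0.6, "Y": 0.8, "Z": 1.0, "non_wear_flag": 0},  # moving
    {"X": 0.0, "Y": 0.0, "Z": 1.0, "non_wear_flag": 1},  # device not worn
]
for s in worn_samples(samples):
    print(round(enmo(s["X"], s["Y"], s["Z"]), 3), round(angle_z(s["X"], s["Y"], s["Z"]), 1))
```

Filtering on the non-wear flag before aggregating matters because a device lying on a table also produces near-zero ENMO, which would otherwise be counted as rest or sleep.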

### Summary of Data Utility
This dataset provides a holistic view of each participant’s demographic, physical, mental, and behavioral characteristics. By combining data on internet use, sleep disturbance, physical fitness, body composition, and actigraphy, the study is positioned to explore the relationships between sedentary behaviors, physical health, mental well-being, and potential internet addiction.

This setup could support various analyses:
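1. **Predicting Internet Addiction Levels**: Predicting `sii` from demographic and health data. Because `sii` is derived from the `PCIAT` scores, those scores define the target and should not be used as input features.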
2. **Correlating Physical Activity with Internet Use or Sleep**: Using actigraphy and PAQ data.
3. **Analyzing Sleep and Health Relationships**: Leveraging SDS data with physical and mental health scores.
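
For the correlation analyses listed above, a minimal sketch of a Pearson correlation that skips participants with a missing value in either variable could look like this. The function name `pearson` and the example numbers are hypothetical; the values are made up for illustration and are not study data.

```python
import math


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation over pairs where both values are present (not None)."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: daily internet hours and a physical activity score.
internet_hours = [0.5, 1.0, None, 2.0, 3.0]
activity_score = [3.2, 2.9, 2.5, 2.1, 1.4]
print(round(pearson(internet_hours, activity_score), 3))
```

Dropping incomplete pairs is the simplest choice here; with many missing values it is worth checking how many participants remain before interpreting the coefficient.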