ML Foundations
s2t2 committed Sep 17, 2024
1 parent a88c433 commit f6f5afa
Showing 13 changed files with 227 additions and 120 deletions.
3 changes: 3 additions & 0 deletions docs/_quarto.yml
@@ -268,6 +268,9 @@ website:
- section:
href: notes/predictive-modeling/ml-foundations/data-encoding.qmd
text: "Data Encoding"
- section:
href: notes/applied-stats/data-scaling.qmd
text: "Data Scaling"

- "--------------"
- section:
Binary file added docs/images/features-labels.png
Binary file added docs/images/ml-vs-software-bw.png
Binary file added docs/images/ml-vs-software.png
8 changes: 4 additions & 4 deletions docs/notes/applied-stats/data-scaling.qmd
@@ -50,7 +50,7 @@ Scaling the data will make it easier to plot all these different series on a graph.

## Min-Max Scaling

One scaling approach is by dividing each value over the maximum value in that column, essentially expressing each value as a percentage of the greatest value.
One scaling approach, called **min-max scaling**, calls for dividing each value by the maximum value in that column, essentially expressing each value as a percentage of the greatest value.
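The diff truncates the notebook's own code cell, so here is a minimal sketch of the idea, assuming a pandas DataFrame `df` of numeric columns (the data below is hypothetical):

```python
import pandas as pd

# Hypothetical data: two series on very different scales
df = pd.DataFrame({
    "price": [10.0, 25.0, 50.0],
    "volume": [200.0, 800.0, 1000.0],
})

# Divide each column by its maximum, so each value becomes
# a fraction of that column's greatest value
scaled_df = df.copy()
for col in scaled_df.columns:
    scaled_df[col] = scaled_df[col] / scaled_df[col].max()
```

After scaling, every column tops out at 1.0, so the different series can share a single axis on one chart.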

```{python}
scaled_df = df.copy()
@@ -70,7 +70,7 @@ When we use min-max scaling, resulting values will be expressed on a scale between 0 and 1.

## Standard Scaling

An alternative, more rigorous, scaling approach mean-centers the data and normalizes by the standard deviation:
An alternative, more rigorous scaling approach, called **standard scaling** or **z-score normalization**, mean-centers the data and normalizes by the standard deviation:
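Since the diff truncates this code cell as well, here is a hedged sketch of the approach, again assuming a pandas DataFrame `df` of numeric columns with hypothetical data:

```python
import pandas as pd

# Hypothetical data: two series on very different scales
df = pd.DataFrame({
    "price": [10.0, 25.0, 50.0],
    "volume": [200.0, 800.0, 1000.0],
})

# Subtract each column's mean, then divide by its standard deviation,
# so each value becomes a z-score (distance from the mean in std units)
scaled_df = df.copy()
for col in scaled_df.columns:
    scaled_df[col] = (scaled_df[col] - scaled_df[col].mean()) / scaled_df[col].std()
```

Each scaled column now has mean 0 and standard deviation 1, regardless of its original units.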

```{python}
scaled_df = df.copy()
@@ -93,6 +93,6 @@ Now that we have scaled the data, we can more easily compare the movements of all the series.

## Importance for Machine Learning

Data scaling is important in machine learning because many algorithms are sensitive to the range of the input data. Algorithms including gradient descent-based methods (e.g. neural networks, logistic regression) and distance-based models (e.g. k-nearest neighbors, support vector machines) perform better when features are on a similar scale.
Data scaling is relevant in machine learning when using multiple input features that have different scales. Many algorithms are sensitive to the range of the input data, and perform better when all features are on a similar scale. If features are not scaled, those features with larger ranges may disproportionately influence the model, leading to biased predictions or slower convergence during training.

If features are not scaled, those with larger ranges may disproportionately influence the model, leading to biased predictions or slower convergence during training. By scaling data—whether through techniques like min-max scaling or z-score normalization, we ensure that each feature contributes equally, improving model performance, training efficiency, and the accuracy of predictions.
By scaling data using techniques such as min-max scaling or standard scaling, we ensure that each feature contributes equally, improving model performance, training efficiency, and the accuracy of predictions.
25 changes: 25 additions & 0 deletions docs/notes/predictive-modeling/classification/index.qmd
@@ -1 +1,26 @@
# Classification

## Classification Objectives



## Classification Models

Classification Models:

+ Logistic Regression (despite its name, this is a classification model, not a regression model)
+ Decision Tree
+ Random Forest
+ etc.
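These models share a common fit/predict interface in `scikit-learn`. A minimal sketch, using a synthetic dataset and labels made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary labels: class 1 when the feature sum is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(int)

# Each classifier is trained the same way, via fit()
scores = {}
for model in (LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier()):
    model.fit(X, y)
    scores[type(model).__name__] = model.score(X, y)  # training accuracy
```

Because the interface is uniform, swapping one model for another is a one-line change, which makes it easy to compare candidates.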

## Classification Metrics


Classification Metrics:

+ Accuracy
+ Precision
+ Recall
+ F1 Score
+ ROC AUC
+ etc.
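To make these concrete, here is a small sketch computing several of the listed metrics with `scikit-learn`, on made-up true and predicted labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: one actual positive (index 2) is missed by the model
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(y_true, y_pred)    # fraction of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of predicted positives, fraction truly positive
recall = recall_score(y_true, y_pred)        # of actual positives, fraction found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```

Here the model never predicts a false positive, so precision is perfect, but it misses one true positive, which lowers recall.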
11 changes: 11 additions & 0 deletions docs/notes/predictive-modeling/dimensionality-reduction/index.qmd
@@ -0,0 +1,11 @@
# Dimensionality Reduction


## Dimensionality Reduction Models

Dimensionality Reduction Models:

+ Principal Component Analysis (PCA)
+ t-SNE
+ UMAP
+ etc.
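As a brief sketch of the shared interface, here is PCA in `scikit-learn` reducing a hypothetical five-feature dataset to two components (the data is synthetic, built from two latent variables so the reduction loses little information):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples, 5 features derived from 2 latent variables
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 correlated columns

# Project onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```

The `explained_variance_ratio_` attribute reports how much of the original variance each retained component captures, which helps decide how many components to keep.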
59 changes: 59 additions & 0 deletions docs/notes/predictive-modeling/index.qmd
@@ -1 +1,60 @@
# Predictive Modeling in Python (for Finance)

## What is Predictive Modeling?

**Predictive modeling** refers to the use of statistical techniques and machine learning algorithms to predict future outcomes based on historical data. It involves creating models that learn patterns from past observations and use them to forecast future trends or behavior. At its core, predictive modeling is about understanding the relationships between variables and using these relationships to make informed predictions.




## Predictive Modeling Process

The process of predictive modeling can generally be broken down into several steps:

1. **Data Collection and Preparation**: Gather historical data and prepare it for analysis. This involves cleaning the data, handling missing values, and transforming features for better interpretability and accuracy.

2. **Model Selection**: Choose the right algorithm for the problem, whether it's a regression model, a classification algorithm, or a time-series forecasting model.

3. **Model Training**: Fit the model to the data by using training datasets to find patterns and relationships.

4. **Model Evaluation**: Validate the model to ensure that it generalizes well to new data. This typically involves splitting the data into training and testing sets or using cross-validation techniques.

5. **Prediction and Forecasting**: Once validated, the model can be used to predict outcomes on new or unseen data, providing valuable foresight for decision-making.

In predictive modeling, the quality of the predictions is closely linked to the quality of the input data, the robustness of the algorithms, and the appropriate handling of uncertainties inherent in the real world.
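The five steps above can be sketched end-to-end with `scikit-learn`; the dataset here is synthetic, and the model choice (linear regression) is just one of many options:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. Data collection and preparation (synthetic data for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 2. Model selection: an ordinary least squares regression
model = LinearRegression()

# 3. Model training on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

# 4. Model evaluation on the held-out test split
r2 = r2_score(y_test, model.predict(X_test))

# 5. Prediction on new, unseen data
y_new = model.predict(rng.normal(size=(1, 3)))
```

Evaluating on data the model never saw during training is what gives the evaluation step its meaning: a high score on the test split suggests the model generalizes rather than memorizes.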

## Relevance of Predictive Modeling in Finance

Predictive modeling has become increasingly relevant in the financial industry due to its ability to analyze vast amounts of data and provide forecasts that support strategic decision-making. In finance, the ability to make accurate predictions can provide a significant competitive edge. Whether it's forecasting stock prices, assessing credit risk, or predicting customer behavior, predictive models have transformed how financial professionals make decisions.

Here are some key areas where predictive modeling is applied in finance:

+ **Risk Management**: Predictive models are used to assess the likelihood of defaults in loans or investments. By analyzing historical data on borrowers, lenders can develop models that predict the probability of default, helping them manage credit risk effectively.

+ **Market Forecasting**: Traders and analysts use predictive models to forecast stock prices, commodity trends, or exchange rates. These models incorporate historical price movements, trading volume, and external factors to provide insights into future market behavior.

+ **Fraud Detection**: Predictive analytics plays a crucial role in identifying fraudulent transactions. By examining patterns in transaction data, algorithms can detect anomalies that suggest fraudulent behavior, enabling financial institutions to respond quickly.

+ **Customer Behavior Prediction**: Banks and other financial institutions use predictive models to anticipate customer needs, identify upsell opportunities, or predict churn. This helps tailor services and products to individual customers, improving retention and profitability.

+ **Portfolio Management**: In portfolio optimization, predictive models forecast asset returns and risks. These models assist investors in making data-driven decisions, optimizing portfolios to meet their risk preferences and return objectives.

## Why Python for Predictive Modeling in Finance?

Python has emerged as a dominant language in the finance industry for predictive modeling due to its simplicity, flexibility, and vast ecosystem of libraries. Here's why Python is particularly suited for finance:

+ **Ease of Use**: Python's simple syntax makes it easy for both beginners and experienced developers to build complex models without getting bogged down by the intricacies of coding.

+ **Rich Libraries**: Python has a comprehensive set of libraries for data analysis and machine learning, such as `pandas`, `numpy`, `scikit-learn`, `statsmodels`, and `tensorflow`. These libraries provide pre-built functions that simplify the implementation of predictive models.

+ **Visualization Capabilities**: With libraries like `matplotlib`, `seaborn`, and `plotly`, Python excels in visualizing financial data, helping professionals interpret and present their findings effectively.

+ **Community Support**: Python has a large, active community that contributes to continuous development and troubleshooting, making it a great choice for building and maintaining predictive models in a fast-moving industry.

+ **Integration with Financial Systems**: Python seamlessly integrates with databases, APIs, and financial platforms, enabling real-time data analysis and model deployment in production environments.

In this course, we'll explore how to leverage Python's capabilities to build predictive models tailored to various financial applications, starting from data preparation to model validation and prediction.

## Summary

Predictive modeling is a powerful tool for making data-driven decisions, and its importance in finance cannot be overstated. With growing amounts of data and advances in machine learning, the ability to make accurate financial predictions is now more accessible than ever. In the following chapters, we will dive deep into the practical aspects of building these models using Python, from basic concepts to advanced techniques, helping you develop the skills to apply predictive analytics in real-world finance.
