Data Privacy and Ethics: Post Module
Capstone: One America Works
Abhijit Choudhary, Kristen Dardia Turner, Jared Lee, Somna Pati, Yogesh Phadnis
Capstone Background
Our Capstone team is addressing a socio-economic issue in which cities with great value and potential in middle America are overlooked by growth companies in favor of big cities such as New York City, San Francisco, and Austin. This creates a disparity in invested resources and development between these more "desirable" cities and cities in middle America. We are therefore partnering with a non-profit organization called One America Works (OAW) to create a city-specific, web-based recommendation tool and dashboard. We anticipate that the tool will have three main users:
• Growth Companies: to determine what cities would be an ideal location for their operations.
• City Administrators: to gain insight on how to make their cities more attractive and to encourage new businesses to select them as a location for their operations.
• Job Seekers: to decide where they can find gainful employment that aligns with their personal interests and professional goals.
The tool helps these audiences with their decision making by letting users input priorities and select cities, returning information dynamically. Users can adjust the value they assign to certain decision-making criteria (talent, connectivity, quality of life, and cost) to align with their priorities and objectives. The tool also recommends cities that are "nearest neighbors" of the city the user selects, given those objectives. Further, we have created dashboards where the user can interactively compare cities, industries, talent, and other considerations at a granular level.
Our hope is that the resulting analytical tool will have a meaningful impact by directing growth companies and talent to geographies that are underserved because investment and talent flock to traditional tech hubs. However, we have identified ethical issues in our project. They span various aspects of our research, from the type of data we gathered, to the determination of the target variables, the modeling and analytics techniques, and the tool deployment itself. This paper addresses each issue and provides potential solutions that attempt to minimize any ethical concerns.
Data Sources/Collection
We gathered the data through a combination of API connections, ordinary downloads, and manual methods. The following resources were used to assemble a comprehensive set of variables that are available and useful for our purposes:
• American Census Data
• American Fact Finder
• American Payroll Association
• Bureau of Economic Analysis
• Business Insider
• Citylab
• Governing
• Federal Aviation Administration
• Henry J Kaiser Family Foundation
• MIT International Center for Air Transportation
• Numbeo
• Tax Foundation
• U.S. Energy Information Administration
• U.S. News
• United States Patent and Trademark Office
• Workers Compensation Class Codes
Our data was collected from publicly available sources and consolidated at the American Metropolitan Statistical Area (MSA) level. The following sections describe some of the ethical issues related to the dataset: data collection, validation of data sources, unfairness to cities, and the handling of missing data.
Data Collection
When sourcing data, we took an all-encompassing approach by collecting data that we deemed relevant. Given that one of our goals was to predict potential tech cities, our thought process was geared towards identifying variables that correlate with being a tech city, i.e., a city already established as a known tech hub. Collecting only variables that favor tech cities could lead us to perpetuate the disparity between tech cities and underserved cities, as the tool would then only recommend established tech hubs. Since the objective of our project was to recommend other potential cities that do not necessarily have all the attributes of current tech cities, such as higher population, we also considered other variables that are likely important to our target audience to create a more comprehensive analysis. These variables include pollution, traffic, CO2 index, cost of living, etc., and were incorporated into the analytics to prevent the pro-tech variables (population, talent, etc.) from masking cities that are not currently classified as "tech hubs".
Because we were limited to publicly available data, we did not have access to other potential variables that may be helpful for the user. For instance, a variable capturing how many tech companies currently exist within an MSA might help a growth company using the tool decide whether it wants to locate in an MSA where growth companies already exist or be a pioneer in an MSA where they do not.
Validating Data Sources
Our data was collected from publicly available sources such as the Census Bureau, Department of Transportation, Tax Foundation, etc. We also utilized sources such as Numbeo, a crowd-sourced database, for city-specific information, and Business Insider to determine tech hubs. Our predominant reason for using these sources is that we focused on data that is updated year over year, which keeps the tool current because it uses the latest year's data. Furthermore, we did not have funding and were therefore limited to publicly available data. There is a risk that our data collection methodology may be seen as a convenience sample, since these sources are available through a simple Google search and we cannot necessarily validate the data or each source's collection methodology. For instance, because Numbeo is a crowd-sourced database, the data it receives may be biased and representative only of Numbeo users' perspectives rather than the US population as a whole. Our solution to this risk was to primarily use reputable data sources, such as those associated with government data. In addition, we corroborated subjective variables such as quality of life, top schools, tech hubs, etc. with other publicly available data to gain comfort with the validity of the classifications. Finally, we provided citations for all sources of data to allow users to draw their own conclusions about the credibility of our data.
MSA Level Data Causes Unfairness to Certain Cities/Geographies
As noted above, our tool is based on the entire population of MSAs in the US. According to the Census Bureau, an MSA is generally a "core area containing a substantial population nucleus, together with adjacent communities having a high degree of economic and social integration with that core". We therefore determined that representing the data at the MSA level would give the most accurate and practical depiction of a geography that a company or individual might consider as a "place to locate". However, there is a risk of unfairness, since some cities are not considered MSAs or part of one. In other words, cities not in close proximity to an area with a population the Census Bureau deems "substantial" are not considered by our tool.
Cities that do not meet the criteria to be considered an MSA may still have other favorable characteristics that companies consider when expanding, such as low cost of living or high quality of life. This may be unfair to a geography that is looking to attract more businesses but is not even on the "radar" of the tool because it is not included in the data that powers it.
In addition, we noted that there were instances of MSAs that were heterogeneous and included multiple large localities which had distinct value propositions within the same MSA. Examples of these instances would include the following:
• New York City/Newark/Jersey City
• Washington DC/Arlington/Alexandria
• Miami/Ft. Lauderdale/West Palm Beach
In partnership with our sponsor at OAW, we determined that the best course of action was to separate the significantly distinct locales within one MSA. For the most part, data was available at the Micropolitan Statistical Area level, which allowed us to separate the above cities without reducing confidence in our data quality.
Although a more detailed data collection process could be undertaken to gather data at the city level and alleviate this issue, it was out of scope for this project because using the MSA level was a requirement from our sponsor at OAW. We have informed OAW of the importance of considering additional geographies and suggested that this be a consideration for future iterations of the tool. If OAW prefers to continue operating at the MSA level, we recommended consistent, periodic reviews of the MSA list, as new MSAs may be formed over the years based on population and economic growth. Currently, our solution to this issue is to be transparent about the decisions related to gathering data at the MSA level and to recommend that OAW gather more granular, city-level data.
Missing Data Solutions
The proportion of missing values varied among the 132 variables in our dataset: 82 had no missing values, 9 had less than 15% missing, 16 had less than 50% missing, and only 25 had greater than 50% missing. The amount of missing data was related to MSA size, with smaller MSAs tending to have more missing values. To avoid favoring larger MSAs, which would be counterintuitive to the purpose of our tool since it would perpetuate the bias that already exists towards established tech hubs, we determined that the best course of action was to develop a methodology for handling the missing values without losing the informational value of the variables. Given the low number of variables with a high percentage missing, we removed all variables with less than a 50% fill rate (i.e., more than 50% missing) from our analysis.
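For illustration, the fill-rate cutoff can be sketched with pandas as below; the file name and columns are hypothetical, not our production pipeline.

    import pandas as pd

    # Hypothetical consolidated MSA dataset; file name and columns are illustrative.
    msa_df = pd.read_csv("msa_variables.csv")

    # Share of missing values per variable.
    missing_rate = msa_df.isna().mean()

    # Keep variables with at least a 50% fill rate (under 50% missing),
    # mirroring the cutoff described above.
    kept = missing_rate[missing_rate < 0.50].index
    msa_df = msa_df[kept]
    print(f"Retained {len(kept)} of {len(missing_rate)} variables")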
Because missing values are more common for MSAs with smaller populations, the analytics for these small or underserved MSAs could be less accurate. To address this, we used a multiple imputation technique to fill in the missing values for the variables with less than 50% missing. The resulting imputations were carefully inspected to ensure healthy convergence of the iterative procedure and that the imputed values followed distributions similar in shape to the observed values.
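As an illustration only, a MICE-style multiple imputation could be sketched with scikit-learn's IterativeImputer as below; the specific package, parameters, and data names are assumptions, not a record of our exact implementation.

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Hypothetical MSA dataset (same illustrative file as the previous sketch).
    msa_df = pd.read_csv("msa_variables.csv")
    numeric = msa_df.select_dtypes(include=[np.number])

    # Iterative, MICE-style imputation; max_iter and random_state are arbitrary.
    imputer = IterativeImputer(max_iter=25, sample_posterior=True, random_state=0)
    imputed = pd.DataFrame(imputer.fit_transform(numeric),
                           columns=numeric.columns, index=numeric.index)

    # Sanity check: compare observed vs. imputed distributions for one variable.
    col = numeric.columns[0]
    observed = numeric[col].dropna()
    filled = imputed.loc[numeric[col].isna(), col]
    print(observed.describe(), filled.describe(), sep="\n")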
Twenty of the twenty-five variables with greater than 50% missing values came from the same source, CBRE Research's "2018 Scoring Tech Talent" report. In order not to lose the value of the CBRE data, we used it to validate our scoring methodology and analysis.
Target Variables
Different aspects of our analysis utilized two different, but related, target variables:
• The total revenue generated by all industries within an MSA
• The total revenue generated by tech-related industries within an MSA
Revenue collected by corporations located in a city can be viewed as a quantitative measure of that city's attractiveness to corporations and businesses. As the purpose of this model was to inform city administrators how to make their cities more attractive to corporations, this serves as a logical target variable. As a corollary, if a city wanted to make itself more attractive specifically to the tech industry, that would justify the use of the second target listed above: total revenue generated by tech-related industries within an MSA.
There is no known or scientific measure of a city's attractiveness, so we determined it was best to identify a variable that would be highly correlated with a theoretical measure of attractiveness. However, this raises ethical concerns, as conclusions drawn from our analysis would assume that our target variable is a true representation of city attractiveness. Due to the subjective nature of the variable, such conclusions should be drawn with caution.
Additionally, the total revenue variables came from census data, which includes subset variables for revenue by industry in each MSA. There is a data integrity risk that newly established businesses may not yet be reflected in the government survey and therefore not represented in our data; in these cases, our data would be biased towards larger and older businesses. Another risk is the misclassification of businesses within industries and ambiguity in the industry classification itself. This may lead to inaccuracies in the percentage of industries per MSA represented on our dashboard.
As a means of validation and an attempt to address the ethical issues caused by the use of proxy targets, we tested alternative proxy variables, such as a binary tech-hub indicator, to validate the original target variables. These efforts produced similar results, giving us comfort in proceeding with our proxy targets.
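For illustration, one simple form of such a consistency check (assumed here, not our exact procedure) is to verify that MSAs flagged as tech hubs rank high on the revenue proxy; the data below is simulated.

    import numpy as np
    import pandas as pd

    # Simulated stand-in for the MSA-level revenue proxy.
    rng = np.random.default_rng(3)
    df = pd.DataFrame({"total_revenue": rng.lognormal(mean=10, sigma=1, size=356)})

    # Purely for illustration, flag the top decile of revenue as "tech hubs".
    df["tech_hub"] = (df["total_revenue"]
                      > df["total_revenue"].quantile(0.9)).astype(int)

    # If the revenue proxy is consistent with the binary indicator, flagged MSAs
    # should sit much higher in the revenue ranking.
    df["revenue_rank"] = df["total_revenue"].rank(pct=True)
    print(df.groupby("tech_hub")["revenue_rank"].mean())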
Modeling/Analytics
The underlying analytics that power the tool include:
• Multiple Linear Regression: provides insight into which variables will increase a city's desirability.
• Scoring: creates an interactive ranking of the four overall categories that define a city's attractiveness (Connectivity, Talent, Quality of Life, and Cost).
• Nearest Neighbor Analysis (KNN): identifies the cities with attributes similar to the currently selected city, based on user preferences (a sketch of this step follows the list).
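For illustration, a minimal sketch of the weighted scoring and Euclidean nearest-neighbor lookup is shown below; the category scores, weights, and city names are hypothetical, and the real tool returns ten neighbors from the full MSA dataset.

    import numpy as np
    import pandas as pd
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    # Hypothetical category scores per MSA; real scores come from the scoring step.
    scores = pd.DataFrame(
        {"talent": [0.8, 0.6, 0.4],
         "connectivity": [0.7, 0.5, 0.6],
         "quality_of_life": [0.5, 0.9, 0.7],
         "cost": [0.3, 0.8, 0.9]},
        index=["MSA_A", "MSA_B", "MSA_C"])

    # User-supplied weights (hypothetical) scale each category before the search.
    weights = np.array([1.0, 0.5, 2.0, 1.5])   # talent, connectivity, QoL, cost
    X = StandardScaler().fit_transform(scores) * weights

    # Euclidean nearest neighbors; the closest match is the selected city itself,
    # so it is dropped from the recommendations.
    nn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(X)
    _, idx = nn.kneighbors(X[[0]])
    print("Cities similar to MSA_A:", list(scores.index[idx[0][1:]]))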
Although our analysis was helpful in creating the tool and dashboards, there were inherent risks that may lead to unfairness and bias, such as using a small dataset, feature selection bias, and perpetuation of bias via the recommendation engine.
Small Dataset Limitations
Our entire dataset consists of 356 MSAs (after separating certain MSAs into smaller units, as mentioned in a previous section) and 132 variables. Because this dataset represents the entire population of MSAs, we are not concerned with sampling bias (beyond the fact that we do not consider non-MSA geographies at all). However, it is not a large dataset, and there may be bias caused by modeling error, which, all else equal, is higher for smaller datasets. With smaller datasets there is also greater concern that outliers will distort results. To combat this, we transformed certain variables prior to modeling and utilized Cook's Distance measures after modeling. We performed sensitivity testing, both including and excluding potential outliers, to assess the potential bias.
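For illustration, a minimal sketch of the Cook's Distance screening and the include/exclude sensitivity check is shown below, with simulated data standing in for the MSA dataset; the cutoff of 4/n is a common rule of thumb, not necessarily the one we applied.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Simulated design matrix and target; real inputs come from the MSA dataset.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(356, 3)),
                     columns=["talent", "cost", "connectivity"])
    y = 2 * X["talent"] - X["cost"] + rng.normal(size=356)

    model = sm.OLS(y, sm.add_constant(X)).fit()

    # Cook's Distance flags observations with outsized influence on the fit.
    cooks_d = model.get_influence().cooks_distance[0]
    influential = np.where(cooks_d > 4 / len(y))[0]

    # Sensitivity check: refit without the influential MSAs and compare coefficients.
    model_wo = sm.OLS(y.drop(index=influential),
                      sm.add_constant(X.drop(index=influential))).fit()
    print(model.params, model_wo.params, sep="\n")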
One way to produce a better-performing model would be to increase the number of records, i.e., cities. Theoretically, the ideal scenario would be to raise funding to gain access to city-specific data within the U.S., or to collect the data firsthand from the cities prior to our analysis (though self-reported data has its own issues). As this could take at least a year of effort, it was not feasible for our Capstone project, but it can be a future consideration for OAW.
Feature Selection Bias
To begin our analysis, we ran a multiple linear regression that included all of the variables after the preprocessing steps. As could be expected, there was a significant degree of multicollinearity. To reduce the dimensionality and multicollinearity, we opted to use ridge regression. The first run of the procedure resulted in keeping 25 variables, of which 11 described the revenue generated by specific industries. Together, these 11 represent over half of all the variables used to quantify the revenue generated from all industries. To keep the necessary predictive information while continuing to address the model's high dimensionality and multicollinearity, we utilized Principal Component Analysis.
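For illustration, a minimal sketch of this reduction pipeline is shown below; the ridge-based screening rule (keeping the 25 largest absolute coefficients) and the simulated data are assumptions for the example, not our exact procedure.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import RidgeCV
    from sklearn.preprocessing import StandardScaler

    # Simulated stand-in for the 356 x 132 MSA matrix and the revenue target.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(356, 132))
    y = X[:, :5].sum(axis=1) + rng.normal(size=356)

    # Ridge with a cross-validated penalty on standardized features.
    Xs = StandardScaler().fit_transform(X)
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(Xs, y)

    # Assumed screening rule: keep the 25 variables with the largest
    # absolute ridge coefficients.
    top25 = np.argsort(np.abs(ridge.coef_))[::-1][:25]

    # PCA on the retained variables to further reduce dimensionality and
    # multicollinearity; keep enough components to explain 90% of the variance.
    components = PCA(n_components=0.9).fit_transform(Xs[:, top25])
    print(components.shape)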
In this process, we may have eliminated variables that are more important with respect to the target variable. To address this, we tested alternative variable reduction techniques such as Lasso regression and tree-based methods, and used Variance Inflation Factors (VIF) to gain more insight into variable importance. All methods led to similar conclusions.
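For illustration, VIF values can be computed with statsmodels as in the sketch below; the predictors are simulated, and the near-duplicate column simply demonstrates how large VIF values flag multicollinearity.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Simulated predictors with a near-duplicate column to illustrate redundancy.
    rng = np.random.default_rng(2)
    X = pd.DataFrame(rng.normal(size=(356, 4)),
                     columns=["talent", "cost", "connectivity", "quality_of_life"])
    X["cost_copy"] = X["cost"] + rng.normal(scale=0.05, size=356)

    Xc = sm.add_constant(X)
    vif = pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns)
    print(vif.sort_values(ascending=False))   # large values indicate multicollinearity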
Ultimately, the conclusion of this regression analysis was that the revenue generated by specific industries per MSA is highly predictive of the revenue per capita each MSA will bring in. Therefore, the insight provided in the tool is an MSA-level, industry-profiler dashboard. Because this tool does not depend on specific modeling coefficients, we are comfortable concluding that our regression work robustly supports the value of the industry benchmarking tool. From our perspective, this alleviates the concern of bias caused by our feature selection techniques.
Recommendations Perpetuating Bias
Our recommendation tool uses the nearest neighbor methodology to determine ten cities with attributes similar to the city the user selects. It is possible that this recommendation engine will simply recommend other tech hubs, which is counter to our goal. There are two related concerns here: we used Euclidean distance, which is a straight-line distance measure, and the data we sourced might be biased towards data elements that favor tech hubs.
The decision to use Euclidean distance was somewhat subjective; one could argue that a different distance measure, or a different set of dimensions within the KNN analysis, would produce different recommendations. However, since we give the user the ability to interact with the model via the weights, interpretability is important, and we therefore felt that Euclidean distance was the most appropriate measure.
As mentioned above, our source data may be biased towards data elements that favor tech hubs. As such, we were careful to source a broad range of data elements to describe a geography, including ones that would not favor tech hubs (pollution, traffic, CO2 index, etc.). We also adjusted certain measures in light of this, e.g., population-adjusted (per capita) figures and cost-of-living-adjusted income. Finally, because users can adjust the weights of the scores (talent, cost, connectivity, and quality of life), the nearest neighbor recommendations adjust accordingly. Therefore, a user who is very cost conscious would not be recommended an established tech hub, as those geographies are often well above average in terms of expense.
Deployment
We have built an interactive web-based platform that combines analytics with numerous Tableau-based visualizations and consists of three sections customized to their respective user profiles: growth companies, city administrators, and the talent/workforce viewpoint. Tailoring our deliverables to these audiences matches the goals of OAW.
As the ultimate goal is to hand the tool off to OAW, there are handoff risks that may pose data integrity issues. The top risks are our sponsor's potential lack of familiarity with the analytics behind the tool and with the tool's maintenance.
Familiarity with Analytics
The tool relies on analytical techniques, and our sponsors at OAW have relied on us for this expertise, as they do not have in-depth analytics knowledge. Therefore, there is a risk that OAW may not be able to properly deploy the tool or deliver the right message to users after the formal handoff.
As such, we have taken multiple steps to address the handoff risks. First, we have held regular, recurring meetings with OAW to explain the details of our analytics and our findings. In addition, we have streamlined our models and commented our code with written explanations so that they can follow along in a logical manner. We will also provide our final capstone report, which details our thought process and the background of the analytics; essentially, it can serve as a user guide when they have questions about our logic after the handoff. Lastly, we will hold a final meeting to ensure knowledge transfer from our Capstone team to OAW. If we feel that the knowledge is not being transferred to someone with basic analytics knowledge, we will recommend that they assign an existing OAW team member with analytics understanding or consider hiring someone with a relevant background.
Tool Maintenance
Our tool's success depends on the underlying data, which describes geographies that are constantly changing. As such, maintenance of the tool and a regular refresh of the data are important. If not done, the tool will quickly become outdated and provide inaccurate recommendations to users.
To make updating the data easier, we are documenting the deliverables in GitHub so that there is full transparency into our data, data sources, and code. We are also providing clear instructions on how to gather the data via API where possible, and we will emphasize the importance of gathering updated data and disclosing the last-updated date on the tool itself to give users a reference point for their decision making.
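For illustration, an API-based data pull might look like the sketch below, which queries the Census Bureau API for MSA-level total population; the dataset year, variable code, and key placeholder are assumptions for the example and would need to be adapted to whichever series is being refreshed.

    import requests

    # Example pull from the Census Bureau API; an API key from census.gov
    # is assumed (placeholder below).
    URL = "https://api.census.gov/data/2019/acs/acs5"
    params = {
        "get": "NAME,B01003_001E",  # B01003_001E: ACS total population
        "for": "metropolitan statistical area/micropolitan statistical area:*",
        "key": "YOUR_CENSUS_API_KEY",
    }
    resp = requests.get(URL, params=params, timeout=30)
    resp.raise_for_status()
    rows = resp.json()   # first row is the header, remaining rows are data
    print(rows[0], rows[1], sep="\n")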
Closing Remarks
The ultimate goal of our Capstone project was to partner with OAW to deliver a web-based tool and dashboard to address the issue of select cities experiencing high growth while a vast majority of the U.S. cities have remained stagnant. This website is directed towards companies, cities, and job seekers to help them gain insight to make better decisions to achieve their objectives.
Our project depends on publicly available data, which is combined with analytical methods and fed into the dashboard. Building the dashboard can raise fairness and bias issues stemming from the type of data we gathered, the determination of the target variables, the modeling and analytics techniques, and the tool deployment. While we admittedly cannot address all possible concerns, we have taken action to reduce potential unfairness and bias wherever possible within our data and analytics.