This repository contains Python code and data used for the research paper "Distribution of Risk Information among Industry Peers and the Informativeness of Textual Risk Disclosures". The paper investigates how risk information is shared across industry peers and examines the informativeness of textual risk disclosures. The full paper is available on SSRN: https://ssrn.com/abstract=4474130.
The Item 1A sections of the 10-K filings are collected from edgar database using SEC EDGAR Filings API
- Extracting risk factors: Run the script in
RFs_prep.py
to read the collected Item 1A files and extract the individual risk factors and save them in a pandas DataFrame - Data Preprocessing: Run the scripts
clean_docs.py
to preprocess the raw risk factors: extract named entities and transform risk factors into a list of tokens after removing numbers, stop words and special charactors. - Train Top2Vec model: Run the script in
Top2Vec_train.py
to train the Top2Vec model on the cleaned risk factors.