|
| 1 | +# AstroNet: A Neural Network for Identifying Exoplanets in Light Curves |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +## Code Author |
| 6 | + |
| 7 | +Chris Shallue: [@cshallue](https://github.com/cshallue) |
| 8 | + |
| 9 | +## Background |
| 10 | + |
| 11 | +This directory contains TensorFlow models and data processing code for |
| 12 | +identifying exoplanets in astrophysical light curves. For complete background, |
| 13 | +see [our paper](http://adsabs.harvard.edu/abs/2018AJ....155...94S) in |
| 14 | +*The Astronomical Journal*. |
| 15 | + |
| 16 | +For shorter summaries, see: |
| 17 | + |
| 18 | +* ["Earth to Exoplanet"](https://www.blog.google/topics/machine-learning/hunting-planets-machine-learning/) on the Google blog |
| 19 | +* [This blog post](https://www.cfa.harvard.edu/~avanderb/page1.html#kepler90) by Andrew Vanderburg |
| 20 | +* [This great article](https://milesobrien.com/artificial-intelligence-gains-intuition-hunting-exoplanets/) by Fedor Kossakovski |
| 21 | +* [NASA's press release](https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star)
| 22 | + |
| 23 | +## Citation |
| 24 | + |
| 25 | +If you find this code useful, please cite our paper: |
| 26 | + |
| 27 | +Shallue, C. J., & Vanderburg, A. (2018). Identifying Exoplanets with Deep |
| 28 | +Learning: A Five-planet Resonant Chain around Kepler-80 and an Eighth Planet |
| 29 | +around Kepler-90. *The Astronomical Journal*, 155(2), 94. |
| 30 | + |
| 31 | +Full text available at [*The Astronomical Journal*](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta). |
| 32 | + |
| 33 | +## Walkthrough |
| 34 | + |
| 35 | +### Required Packages |
| 36 | + |
| 37 | +First, ensure that you have installed the |
| 38 | +[required packages](../README.md#required-packages) and that the |
| 39 | +[unit tests](../README.md#run-unit-tests) pass. |
| 40 | + |
| 41 | +### Download Kepler Data |
| 42 | + |
| 43 | +A *light curve* is a plot of the brightness of a star over time. We will be |
| 44 | +focusing on light curves produced by the Kepler space telescope, which monitored |
| 45 | +the brightness of 200,000 stars in our Milky Way galaxy for 4 years. An example
| 46 | +light curve produced by Kepler is shown below. |
| 47 | + |
| 48 | + |
| 49 | + |
| 50 | +To train a model to identify planets in Kepler light curves, you will need a |
| 51 | +training set of labeled *Threshold Crossing Events* (TCEs). A TCE is a periodic |
| 52 | +signal that has been detected in a Kepler light curve, and is associated with a |
| 53 | +*period* (the number of days between each occurrence of the detected signal), |
| 54 | +a *duration* (the time taken by each occurrence of the signal), an *epoch* (the |
| 55 | +time of the first observed occurrence of the signal), and possibly additional |
| 56 | +metadata like the signal-to-noise ratio. An example TCE is shown below. The |
| 57 | +labels are ground truth classifications (decided by humans) that indicate which |
| 58 | +TCEs in the training set are actual planet signals and which are caused by
| 59 | +other phenomena. |
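To make the period and epoch concrete, here is a minimal sketch that lists the times at which a periodic signal is expected to recur inside an observation window. The function name `event_times` and the example numbers are illustrative assumptions and are not part of this repository.

```python
import math


def event_times(period, epoch, t_start, t_end):
    """Return the predicted event times that fall in [t_start, t_end].

    period: days between consecutive occurrences of the signal.
    epoch: time of one reference occurrence, in the same time system.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    # Index of the first occurrence at or after t_start.
    n = math.ceil((t_start - epoch) / period)
    times = []
    while epoch + n * period <= t_end:
        times.append(epoch + n * period)
        n += 1
    return times


# Hypothetical TCE: 10-day period, first occurrence at t = 2.5 days.
print(event_times(period=10.0, epoch=2.5, t_start=0.0, t_end=40.0))
# -> [2.5, 12.5, 22.5, 32.5]
```

The duration then describes how wide the dip around each of these times is.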
| 60 | + |
| 61 | + |
| 62 | + |
| 63 | +You can download the DR24 TCE Table in CSV format from the [NASA Exoplanet |
| 64 | +Archive](https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=q1_q17_dr24_tce). Ensure the following columns are selected: |
| 65 | + |
| 66 | +* `rowid`: Integer ID of the row in the TCE table. |
| 67 | +* `kepid`: Kepler ID of the target star. |
| 68 | +* `tce_plnt_num`: TCE number within the target star. |
| 69 | +* `tce_period`: Period of the detected event, in days. |
| 70 | +* `tce_time0bk`: The time corresponding to the center of the first detected |
| 71 | + event in Barycentric Julian Day (BJD) minus a constant offset of |
| 72 | + 2,454,833.0 days. |
| 73 | +* `tce_duration`: Duration of the detected event, in hours. |
| 74 | +* `av_training_set`: Autovetter training set label; one of PC (planet candidate), |
| 75 | + AFP (astrophysical false positive), NTP (non-transiting phenomenon), |
| 76 | + UNK (unknown). |
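As a quick sanity check on the downloaded table, the columns above can be read with the standard `csv` module. The sketch below converts `tce_time0bk` to BJD and `tce_duration` to days. It assumes that any non-data lines in the export start with `#`; the helper name `read_tce_rows` and the sample rows are made up for illustration.

```python
import csv
import io

BKJD_OFFSET = 2454833.0  # Constant offset (in days) described for tce_time0bk.


def read_tce_rows(lines):
    """Parse TCE CSV lines into dicts with converted units.

    Lines starting with '#' are skipped (an assumption about the export).
    """
    data_lines = (line for line in lines if not line.startswith("#"))
    rows = []
    for row in csv.DictReader(data_lines):
        rows.append({
            "kepid": int(row["kepid"]),
            "tce_plnt_num": int(row["tce_plnt_num"]),
            "period_days": float(row["tce_period"]),
            "epoch_bjd": float(row["tce_time0bk"]) + BKJD_OFFSET,
            "duration_days": float(row["tce_duration"]) / 24.0,  # hours -> days
            "label": row["av_training_set"],
        })
    return rows


# Made-up rows for illustration only; real values come from the archive.
SAMPLE = """# hypothetical comment line
rowid,kepid,tce_plnt_num,tce_period,tce_time0bk,tce_duration,av_training_set
1,1000001,1,10.0,132.5,3.0,PC
2,1000002,1,2.5,140.25,6.0,AFP
"""

rows = read_tce_rows(io.StringIO(SAMPLE))
for r in rows:
    print(r["kepid"], r["label"], r["epoch_bjd"], r["duration_days"])
```

With a real file, pass an open file object instead of the `io.StringIO` wrapper.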
| 77 | + |
| 78 | +Next, you will need to download the light curves of the stars corresponding to |
| 79 | +the TCEs in the training set. These are available at the |
| 80 | +[Mikulski Archive for Space Telescopes](https://archive.stsci.edu/). However, |
| 81 | +you almost certainly don't want all of the Kepler data, which consists of almost |
| 82 | +3 million files, takes up over a terabyte of space, and may take several weeks |
| 83 | +to download! To train our model, we only need to download the subset of light |
| 84 | +curves that are associated with TCEs in the DR24 file. To download just those |
| 85 | +light curves, follow these steps: |
| 86 | + |
| 87 | +**NOTE:** Even though we are only downloading a subset of the entire Kepler |
| 88 | +dataset, the files downloaded by the following script take up about **90 GB**. |
| 89 | + |
| 90 | +```bash |
| 91 | +# Filename containing the CSV file of TCEs in the training set. |
| 92 | +TCE_CSV_FILE="${HOME}/astronet/dr24_tce.csv" |
| 93 | + |
| 94 | +# Directory to download Kepler light curves into. |
| 95 | +KEPLER_DATA_DIR="${HOME}/astronet/kepler/" |
| 96 | + |
| 97 | +# Generate a bash script that downloads the Kepler light curves in the training set. |
| 98 | +python astronet/data/generate_download_script.py \ |
| 99 | + --kepler_csv_file=${TCE_CSV_FILE} \ |
| 100 | + --download_dir=${KEPLER_DATA_DIR} |
| 101 | + |
| 102 | +# Run the download script to download Kepler light curves. |
| 103 | +./get_kepler.sh |
| 104 | +``` |
| 105 | + |
| 106 | +The final line should read: `Finished downloading 12669 Kepler targets to |
| 107 | +${KEPLER_DATA_DIR}` |
| 108 | + |
| 109 | +Let's explore the downloaded light curve of the Kepler-90 star! Note that Kepler |
| 110 | +light curves are divided into |
| 111 | +[four quarters each year](https://keplerscience.arc.nasa.gov/data-products.html#kepler-data-release-notes), which are separated by the quarterly rolls that the spacecraft |
| 112 | +made to reorient its solar panels. In the downloaded light curves, each `.fits` |
| 113 | +file corresponds to a specific Kepler quarter, but some quarters are divided |
| 114 | +into multiple `.fits` files. |
| 115 | + |
| 116 | +```python |
| 117 | +# Launch iPython (or Python) from the tensorflow_models/astronet/ directory. |
| 118 | +ipython |
| 119 | + |
| 120 | +In[1]: |
| 121 | +from light_curve import kepler_io |
| 122 | +import matplotlib.pyplot as plt |
| 123 | +import numpy as np |
| 124 | + |
| 125 | +In[2]: |
| 126 | +KEPLER_DATA_DIR = "/path/to/kepler/" |
| 127 | +KEPLER_ID = 11442793 # Kepler-90. |
| 128 | + |
| 129 | +In[3]: |
| 130 | +# Read the light curve. |
| 131 | +file_names = kepler_io.kepler_filenames(KEPLER_DATA_DIR, KEPLER_ID) |
| 132 | +assert file_names, "Failed to find .fits files in {}".format(KEPLER_DATA_DIR) |
| 133 | +all_time, all_flux = kepler_io.read_kepler_light_curve(file_names) |
| 134 | +print("Read light curve with {} segments".format(len(all_time))) |
| 135 | + |
| 136 | +In[4]: |
| 137 | +# Plot the fourth segment. |
| 138 | +plt.plot(all_time[3], all_flux[3], ".") |
| 139 | +plt.show() |
| 140 | + |
| 141 | +In[5]: |
| 142 | +# Plot all light curve segments. We first divide by the median flux in each |
| 143 | +# segment, because the segments are on different scales. |
| 144 | +for f in all_flux: |
| 145 | + f /= np.median(f) |
| 146 | +plt.plot(np.concatenate(all_time), np.concatenate(all_flux), ".") |
| 147 | +plt.show() |
| 148 | +``` |
| 149 | +The output plots should look something like this: |
| 150 | + |
| 151 | + |
| 152 | + |
| 153 | + |
| 154 | + |
| 155 | +The first plot is a single segment of approximately 20 days. You can see a |
| 156 | +planet transit --- that's Kepler-90 g! Also, notice that the brightness of the |
| 157 | +star is not flat over time --- there is natural variation in the brightness, |
| 158 | +even away from the planet transit. |
| 159 | + |
| 160 | +The second plot is the full light curve over the entire Kepler mission |
| 161 | +(approximately 4 years). You can easily see two transiting planets by eye ---
| 162 | +they are Kepler-90 h (the biggest known planet in the system with the deepest |
| 163 | +transits) and Kepler-90 g (the second biggest known planet in the system with |
| 164 | +the second deepest transits). |
| 165 | + |
| 166 | + |
| 167 | +### Process Kepler Data |
| 168 | + |
| 169 | +To train a model to identify exoplanets, you will need to provide TensorFlow |
| 170 | +with training data in |
| 171 | +[TFRecord](https://www.tensorflow.org/programmers_guide/datasets) format. The |
| 172 | +TFRecord format consists of a set of sharded files containing serialized |
| 173 | +`tf.Example` [protocol buffers](https://developers.google.com/protocol-buffers/). |
| 174 | + |
| 175 | +The command below will generate a set of sharded TFRecord files for the TCEs in |
| 176 | +the training set. Each `tf.Example` proto will contain the following light curve |
| 177 | +representations: |
| 178 | + |
| 179 | +* `global_view`: Vector of length 2001: a "global view" of the TCE. |
| 180 | +* `local_view`: Vector of length 201: a "local view" of the TCE. |
| 181 | + |
| 182 | +In addition, each `tf.Example` will contain the value of each column in the |
| 183 | +input TCE CSV file. The columns include: |
| 184 | + |
| 185 | +* `rowid`: Integer ID of the row in the TCE table. |
| 186 | +* `kepid`: Kepler ID of the target star. |
| 187 | +* `tce_plnt_num`: TCE number within the target star. |
| 188 | +* `av_training_set`: Autovetter training set label. |
| 189 | +* `tce_period`: Period of the detected event, in days. |
| 190 | + |
| 191 | +```bash |
| 192 | +# Use Bazel to create executable Python scripts. |
| 193 | +# |
| 194 | +# Alternatively, since all code is pure Python and does not need to be compiled, |
| 195 | +# we could invoke the source scripts with the following addition to PYTHONPATH: |
| 196 | +# export PYTHONPATH="/path/to/source/dir/:${PYTHONPATH}" |
| 197 | +bazel build astronet/... |
| 198 | + |
| 199 | +# Directory to save output TFRecord files into. |
| 200 | +TFRECORD_DIR="${HOME}/astronet/tfrecord" |
| 201 | + |
| 202 | +# Preprocess light curves into sharded TFRecord files using 5 worker processes. |
| 203 | +bazel-bin/astronet/data/generate_input_records \ |
| 204 | + --input_tce_csv_file=${TCE_CSV_FILE} \ |
| 205 | + --kepler_data_dir=${KEPLER_DATA_DIR} \ |
| 206 | + --output_dir=${TFRECORD_DIR} \ |
| 207 | + --num_worker_processes=5 |
| 208 | +``` |
| 209 | + |
| 210 | +When the script finishes you will find 8 training files, 1 validation file and |
| 211 | +1 test file in `TFRECORD_DIR`. The files will match the patterns |
| 212 | +`train-0000?-of-00008`, `val-00000-of-00001` and `test-00000-of-00001` |
| 213 | +respectively. |
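If you want to confirm that every shard was written, the naming pattern above is easy to reproduce. This is a small sketch; `shard_names` and `missing_shards` are hypothetical helpers, not part of this repository.

```python
import fnmatch


def shard_names(prefix, num_shards):
    """Return names like 'train-00000-of-00008' for every shard index."""
    return ["{}-{:05d}-of-{:05d}".format(prefix, i, num_shards)
            for i in range(num_shards)]


# 8 training shards, 1 validation shard and 1 test shard.
EXPECTED = shard_names("train", 8) + shard_names("val", 1) + shard_names("test", 1)


def missing_shards(found_names):
    """Return the expected shard names that are absent from found_names."""
    return sorted(set(EXPECTED) - set(found_names))


# Every generated training name matches the glob pattern quoted above.
print(all(fnmatch.fnmatch(n, "train-0000?-of-00008")
          for n in shard_names("train", 8)))

# Example: pretend the last training shard was never written.
found = [name for name in EXPECTED if name != "train-00007-of-00008"]
print(missing_shards(found))
```

In practice you would pass the result of `os.listdir` on your `TFRECORD_DIR` to `missing_shards`.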
| 214 | + |
| 215 | +Here's a quick description of what the script does. For a full description, see |
| 216 | +Section 3 of [our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta). |
| 217 | + |
| 218 | +For each light curve, we first fit a normalization spline to remove any |
| 219 | +low-frequency variability (that is, the natural variability in light from the star)
| 220 | +without removing any deviations caused by planets or other objects. For example, |
| 221 | +the following image shows the normalization spline for the segment of Kepler-90 |
| 222 | +that we considered above: |
| 223 | + |
| 224 | + |
| 225 | + |
| 226 | +Next, we divide by the spline to make the star's baseline brightness |
| 227 | +approximately flat. Notice that after normalization the transit of Kepler-90 g |
| 228 | +is still preserved: |
| 229 | + |
| 230 | + |
| 231 | + |
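The idea of dividing by a smooth baseline can be illustrated without a spline. The sketch below uses a running median as a simple stand-in for the normalization spline; it is not the method used by the preprocessing script, and the window size and toy flux values are assumptions chosen for the example.

```python
import statistics


def running_median(flux, half_window):
    """Estimate a smooth baseline with a centered running median."""
    n = len(flux)
    baseline = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        baseline.append(statistics.median(flux[lo:hi]))
    return baseline


def normalize(flux, half_window):
    """Divide the flux by its baseline so the out-of-transit level is near 1."""
    baseline = running_median(flux, half_window)
    return [f / b for f, b in zip(flux, baseline)]


# Toy segment: a slowly rising baseline with a 1% dip at index 5.
flux = [100.0 + 0.01 * i for i in range(11)]
flux[5] *= 0.99
norm = normalize(flux, half_window=3)
print(["{:.4f}".format(x) for x in norm])
```

Because the window is wider than the dip, the median follows the slow trend and ignores the dip, so the dip survives the division while the trend is removed.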
| 232 | +Finally, for each TCE in the input CSV table, we generate two representations of |
| 233 | +the light curve of that star. Both representations are *phase-folded*, which |
| 234 | +means that we combine all periods of the detected TCE into a single curve, with |
| 235 | +the detected event centered. |
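Phase folding itself is a small calculation. Here is a minimal sketch under the assumption that times, period and epoch share the same units; the function `phase_fold` and the toy data are illustrative and are not taken from the preprocessing code.

```python
def phase_fold(times, flux, period, t0):
    """Fold a light curve on `period`, centering the event at phase 0.

    Returns (phase, flux) pairs sorted by phase, with phase in
    [-period / 2, period / 2).
    """
    half = period / 2.0
    folded = [(((t - t0 + half) % period) - half, f)
              for t, f in zip(times, flux)]
    folded.sort(key=lambda pair: pair[0])
    return folded


# Toy light curve: dips at t = 2.5 and t = 12.5 (period 10, epoch 2.5).
times = [0.0, 2.5, 5.0, 12.5, 15.0]
flux = [1.0, 0.99, 1.0, 0.99, 1.0]
for phase, f in phase_fold(times, flux, period=10.0, t0=2.5):
    print(phase, f)
```

Both dips land at phase 0, so repeated occurrences of the event stack on top of each other in the folded curve.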
| 236 | + |
| 237 | +Let's explore the generated representations of Kepler-90 g in the output. |
| 238 | + |
| 239 | +```python |
| 240 | +# Launch iPython (or Python) from the tensorflow_models/astronet/ directory. |
| 241 | +ipython |
| 242 | + |
| 243 | +In[1]: |
| 244 | +import matplotlib.pyplot as plt |
| 245 | +import numpy as np |
| 246 | +import os.path |
| 247 | +import tensorflow as tf |
| 248 | + |
| 249 | +In[2]: |
| 250 | +KEPLER_ID = 11442793 # Kepler-90 |
| 251 | +TFRECORD_DIR = "/path/to/tfrecords/dir" |
| 252 | + |
| 253 | +In[3]: |
| 254 | +# Helper function to find the tf.Example corresponding to a particular TCE. |
| 255 | +def find_tce(kepid, tce_plnt_num, filenames): |
| 256 | + for filename in filenames: |
| 257 | + for record in tf.python_io.tf_record_iterator(filename): |
| 258 | + ex = tf.train.Example.FromString(record) |
| 259 | + if (ex.features.feature["kepid"].int64_list.value[0] == kepid and |
| 260 | + ex.features.feature["tce_plnt_num"].int64_list.value[0] == tce_plnt_num): |
| 261 | + print("Found {}_{} in file {}".format(kepid, tce_plnt_num, filename)) |
| 262 | + return ex |
| 263 | + raise ValueError("{}_{} not found in files: {}".format(kepid, tce_plnt_num, filenames)) |
| 264 | + |
| 265 | +In[4]: |
| 266 | +# Find Kepler-90 g. |
| 267 | +filenames = tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*")) |
| 268 | +assert filenames, "No files found in {}".format(TFRECORD_DIR) |
| 269 | +ex = find_tce(KEPLER_ID, 1, filenames) |
| 270 | + |
| 271 | +In[5]: |
| 272 | +# Plot the global and local views. |
| 273 | +global_view = np.array(ex.features.feature["global_view"].float_list.value) |
| 274 | +local_view = np.array(ex.features.feature["local_view"].float_list.value) |
| 275 | +fig, axes = plt.subplots(1, 2, figsize=(20, 6)) |
| 276 | +axes[0].plot(global_view, ".") |
| 277 | +axes[1].plot(local_view, ".") |
| 278 | +plt.show() |
| 279 | +``` |
| 280 | + |
| 281 | +The output should look something like this: |
| 282 | + |
| 283 | + |
| 284 | + |
| 285 | +### Train an AstroNet Model |
| 286 | + |
| 287 | +This directory contains several types of neural network architecture and various |
| 288 | +configuration options. To train a convolutional neural network to classify |
| 289 | +Kepler TCEs as either "planet" or "not planet", using the best configuration |
| 290 | +from |
| 291 | +[our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta), |
| 292 | +run the following training script: |
| 293 | + |
| 294 | +```bash |
| 295 | +# Directory to save model checkpoints into. |
| 296 | +MODEL_DIR="${HOME}/astronet/model/" |
| 297 | + |
| 298 | +# Run the training script. |
| 299 | +bazel-bin/astronet/train \ |
| 300 | + --model=AstroCNNModel \ |
| 301 | + --config_name=local_global \ |
| 302 | + --train_files=${TFRECORD_DIR}/train* \ |
| 303 | + --eval_files=${TFRECORD_DIR}/val* \ |
| 304 | + --model_dir=${MODEL_DIR} |
| 305 | +``` |
| 306 | + |
| 307 | +Optionally, you can also run a [TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard) |
| 308 | +server in a separate process for real-time |
| 309 | +monitoring of training progress and evaluation metrics. |
| 310 | + |
| 311 | +```bash |
| 312 | +# Launch TensorBoard server. |
| 313 | +tensorboard --logdir ${MODEL_DIR} |
| 314 | +``` |
| 315 | + |
| 316 | +The TensorBoard server will show a page like this: |
| 317 | + |
| 318 | + |
| 319 | + |
| 320 | +### Evaluate an AstroNet Model |
| 321 | + |
| 322 | +Run the following command to evaluate a model on the test set. The result will |
| 323 | +be printed on the screen, and a summary file will also be written to the model |
| 324 | +directory, which will be visible in TensorBoard. |
| 325 | + |
| 326 | +```bash |
| 327 | +# Run the evaluation script. |
| 328 | +bazel-bin/astronet/evaluate \ |
| 329 | + --model=AstroCNNModel \ |
| 330 | + --config_name=local_global \ |
| 331 | + --eval_files=${TFRECORD_DIR}/test* \ |
| 332 | + --model_dir=${MODEL_DIR} |
| 333 | +``` |
| 334 | + |
| 335 | +The output should look something like this: |
| 336 | + |
| 337 | +```bash |
| 338 | +INFO:tensorflow:Saving dict for global step 10000: accuracy/accuracy = 0.9625159, accuracy/num_correct = 1515.0, auc = 0.988882, confusion_matrix/false_negatives = 10.0, confusion_matrix/false_positives = 49.0, confusion_matrix/true_negatives = 1165.0, confusion_matrix/true_positives = 350.0, global_step = 10000, loss = 0.112445444, losses/weighted_cross_entropy = 0.11295206, num_examples = 1574. |
| 339 | +``` |
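The confusion-matrix counts in that line are enough to recompute the accuracy and to derive precision and recall, which the log does not print directly. The sketch below uses the numbers from the example output above; the helper name `summarize` is an assumption for illustration.

```python
def summarize(tp, fp, tn, fn):
    """Compute basic classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "num_examples": total,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # Fraction of predicted planets that are planets.
        "recall": tp / (tp + fn),     # Fraction of true planets that were recovered.
    }


# Counts copied from the example evaluation output above.
metrics = summarize(tp=350.0, fp=49.0, tn=1165.0, fn=10.0)
for name, value in metrics.items():
    print("{} = {:.4f}".format(name, value))
```

For the example output this gives an accuracy of about 0.9625 (matching `accuracy/accuracy`), a precision of about 0.877 and a recall of about 0.972.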
| 340 | + |
| 341 | +### Make Predictions |
| 342 | + |
| 343 | +Suppose you detect a weak TCE in the light curve of the Kepler-90 star, with |
| 344 | +period 14.44912 days and duration 2.70408 hours (0.11267 days), beginning 2.2 days
| 345 | +after 12:00 on 1/1/2009 (the year the Kepler telescope launched). To run this
| 346 | +TCE through your trained model, execute the following command:
| 347 | + |
| 348 | +```bash |
| 349 | +# Generate a prediction for a new TCE. |
| 350 | +bazel-bin/astronet/predict \ |
| 351 | + --model=AstroCNNModel \ |
| 352 | + --config_name=local_global \ |
| 353 | + --model_dir=${MODEL_DIR} \ |
| 354 | + --kepler_data_dir=${KEPLER_DATA_DIR} \ |
| 355 | + --kepler_id=11442793 \ |
| 356 | + --period=14.44912 \ |
| 357 | + --t0=2.2 \ |
| 358 | + --duration=0.11267 \ |
| 359 | + --output_image_file="${HOME}/astronet/kepler-90i.png" |
| 360 | +``` |
| 361 | + |
| 362 | +The output should look like this: |
| 363 | + |
| 364 | +```Prediction: 0.9480018``` |
| 365 | + |
| 366 | +This means the model is about 95% confident that the input TCE is a planet. |
| 367 | +Of course, this is only a small step in the overall process of discovering and |
| 368 | +validating an exoplanet: the model’s prediction is not proof one way or the |
| 369 | +other. The process of validating this signal as a real exoplanet requires |
| 370 | +significant follow-up work by an expert astronomer --- see Sections 6.3 and 6.4 |
| 371 | +of [our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta) |
| 372 | +for the full details. In this particular case, our follow-up analysis validated |
| 373 | +this signal as a bona fide exoplanet: it’s now called |
| 374 | +[Kepler-90 i](https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star), |
| 375 | +and is the record-breaking eighth planet discovered around the Kepler-90 star! |
| 376 | + |
| 377 | +In addition to the output prediction, the script will also produce a plot of the |
| 378 | +input representations. For Kepler-90 i, the plot should look something like |
| 379 | +this: |
| 380 | + |
| 381 | + |