# AstroNet: A Neural Network for Identifying Exoplanets in Light Curves

![Transit Animation](docs/transit.gif)

## Code Author

Chris Shallue: [@cshallue](https://github.com/cshallue)

## Background

This directory contains TensorFlow models and data processing code for
identifying exoplanets in astrophysical light curves. For complete background,
see [our paper](http://adsabs.harvard.edu/abs/2018AJ....155...94S) in
*The Astronomical Journal*.

For shorter summaries, see:

* ["Earth to Exoplanet"](https://www.blog.google/topics/machine-learning/hunting-planets-machine-learning/) on the Google blog
* [This blog post](https://www.cfa.harvard.edu/~avanderb/page1.html#kepler90) by Andrew Vanderburg
* [This great article](https://milesobrien.com/artificial-intelligence-gains-intuition-hunting-exoplanets/) by Fedor Kossakovski
* [NASA's press release](https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star) article

## Citation

If you find this code useful, please cite our paper:

Shallue, C. J., & Vanderburg, A. (2018). Identifying Exoplanets with Deep
Learning: A Five-planet Resonant Chain around Kepler-80 and an Eighth Planet
around Kepler-90. *The Astronomical Journal*, 155(2), 94.

Full text available at [*The Astronomical Journal*](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta).

## Walkthrough

### Required Packages

First, ensure that you have installed the
[required packages](../README.md#required-packages) and that the
[unit tests](../README.md#run-unit-tests) pass.

### Download Kepler Data

A *light curve* is a plot of the brightness of a star over time. We will be
focusing on light curves produced by the Kepler space telescope, which monitored
the brightness of 200,000 stars in our Milky Way galaxy for 4 years. An example
light curve produced by Kepler is shown below.

![Kepler-943](docs/kepler-943.png)

To train a model to identify planets in Kepler light curves, you will need a
training set of labeled *Threshold Crossing Events* (TCEs). A TCE is a periodic
signal that has been detected in a Kepler light curve, and is associated with a
*period* (the number of days between each occurrence of the detected signal),
a *duration* (the time taken by each occurrence of the signal), an *epoch* (the
time of the first observed occurrence of the signal), and possibly additional
metadata like the signal-to-noise ratio. An example TCE is shown below. The
labels are ground-truth classifications (decided by humans) that indicate which
TCEs in the training set are actual planet signals and which are caused by
other phenomena.

![Kepler-943 Transits](docs/kepler-943-transits.png)
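
In code, a TCE is just a small bundle of ephemeris values plus a label. Purely
as an illustration (placeholder values, with field names matching the TCE table
described next):

```python
# Illustrative only: a TCE is essentially a label plus the ephemeris needed to
# locate the signal in the star's light curve. Values below are placeholders.
example_tce = {
    "kepid": 11442793,        # Kepler ID of the target star.
    "tce_plnt_num": 1,        # TCE number within that star.
    "tce_period": 10.5,       # Period of the detected signal, in days.
    "tce_duration": 3.0,      # Duration of each occurrence, in hours.
    "tce_time0bk": 140.0,     # Epoch of first occurrence (BJD - 2,454,833.0).
    "av_training_set": "PC",  # Ground-truth label, e.g. planet candidate.
}
```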

You can download the DR24 TCE Table in CSV format from the [NASA Exoplanet
Archive](https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=q1_q17_dr24_tce). Ensure the following columns are selected:

* `rowid`: Integer ID of the row in the TCE table.
* `kepid`: Kepler ID of the target star.
* `tce_plnt_num`: TCE number within the target star.
* `tce_period`: Period of the detected event, in days.
* `tce_time0bk`: The time corresponding to the center of the first detected
  event in Barycentric Julian Day (BJD) minus a constant offset of
  2,454,833.0 days.
* `tce_duration`: Duration of the detected event, in hours.
* `av_training_set`: Autovetter training set label; one of PC (planet candidate),
  AFP (astrophysical false positive), NTP (non-transiting phenomenon),
  UNK (unknown).
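
Once the table is downloaded, a quick sanity check from Python is
straightforward. This snippet is not part of the repo; it assumes you saved the
export as `~/astronet/dr24_tce.csv` and that the archive's comment lines begin
with `#` (adjust for your export settings):

```python
# Minimal sanity check of the downloaded TCE table (not part of the repo).
import collections
import csv
import os

tce_csv_file = os.path.expanduser("~/astronet/dr24_tce.csv")
with open(tce_csv_file) as f:
  reader = csv.DictReader(line for line in f if not line.startswith("#"))
  labels = collections.Counter(row["av_training_set"] for row in reader)

print(labels)  # Expect counts for PC, AFP, NTP and UNK.
```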

Next, you will need to download the light curves of the stars corresponding to
the TCEs in the training set. These are available at the
[Mikulski Archive for Space Telescopes](https://archive.stsci.edu/). However,
you almost certainly don't want all of the Kepler data, which consists of almost
3 million files, takes up over a terabyte of space, and may take several weeks
to download! To train our model, we only need to download the subset of light
curves that are associated with TCEs in the DR24 file. To download just those
light curves, follow these steps:

**NOTE:** Even though we are only downloading a subset of the entire Kepler
dataset, the files downloaded by the following script take up about **90 GB**.

```bash
# Filename containing the CSV file of TCEs in the training set.
TCE_CSV_FILE="${HOME}/astronet/dr24_tce.csv"

# Directory to download Kepler light curves into.
KEPLER_DATA_DIR="${HOME}/astronet/kepler/"

# Generate a bash script that downloads the Kepler light curves in the training set.
python astronet/data/generate_download_script.py \
  --kepler_csv_file=${TCE_CSV_FILE} \
  --download_dir=${KEPLER_DATA_DIR}

# Run the download script to download Kepler light curves.
./get_kepler.sh
```

The final line should read: `Finished downloading 12669 Kepler targets to
${KEPLER_DATA_DIR}`
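
Before moving on, you may want to confirm that the files actually landed on
disk. This optional check (not part of the repo's tooling) just counts `.fits`
files recursively:

```python
# Optional check (not part of the repo): count downloaded light-curve files.
# get_kepler.sh creates nested directories, so glob recursively.
import glob
import os

kepler_data_dir = os.path.expanduser("~/astronet/kepler/")
fits_files = glob.glob(
    os.path.join(kepler_data_dir, "**", "*.fits"), recursive=True)
print("Found {} .fits files".format(len(fits_files)))
```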

Let's explore the downloaded light curve of the Kepler-90 star! Note that Kepler
light curves are divided into
[four quarters each year](https://keplerscience.arc.nasa.gov/data-products.html#kepler-data-release-notes), which are separated by the quarterly rolls that the spacecraft
made to reorient its solar panels. In the downloaded light curves, each `.fits`
file corresponds to a specific Kepler quarter, but some quarters are divided
into multiple `.fits` files.

```python
# Launch iPython (or Python) from the tensorflow_models/astronet/ directory.
ipython

In[1]:
from light_curve import kepler_io
import matplotlib.pyplot as plt
import numpy as np

In[2]:
KEPLER_DATA_DIR = "/path/to/kepler/"
KEPLER_ID = 11442793  # Kepler-90.

In[3]:
# Read the light curve.
file_names = kepler_io.kepler_filenames(KEPLER_DATA_DIR, KEPLER_ID)
assert file_names, "Failed to find .fits files in {}".format(KEPLER_DATA_DIR)
all_time, all_flux = kepler_io.read_kepler_light_curve(file_names)
print("Read light curve with {} segments".format(len(all_time)))

In[4]:
# Plot the fourth segment.
plt.plot(all_time[3], all_flux[3], ".")
plt.show()

In[5]:
# Plot all light curve segments. We first divide by the median flux in each
# segment, because the segments are on different scales.
for f in all_flux:
  f /= np.median(f)
plt.plot(np.concatenate(all_time), np.concatenate(all_flux), ".")
plt.show()
```

The output plots should look something like this:

![Kepler 90 Q4](docs/kep90-q4-raw.png)

![Kepler 90 All](docs/kep90-all.png)

The first plot is a single segment of approximately 20 days. You can see a
planet transit --- that's Kepler-90 g! Also, notice that the brightness of the
star is not flat over time --- there is natural variation in the brightness,
even away from the planet transit.

The second plot is the full light curve over the entire Kepler mission
(approximately 4 years). You can easily see two transiting planets by eye ---
they are Kepler-90 h (the biggest known planet in the system with the deepest
transits) and Kepler-90 g (the second biggest known planet in the system with
the second deepest transits).

### Process Kepler Data

To train a model to identify exoplanets, you will need to provide TensorFlow
with training data in
[TFRecord](https://www.tensorflow.org/programmers_guide/datasets) format. The
TFRecord format consists of a set of sharded files containing serialized
`tf.Example` [protocol buffers](https://developers.google.com/protocol-buffers/).

The command below will generate a set of sharded TFRecord files for the TCEs in
the training set. Each `tf.Example` proto will contain the following light curve
representations:

* `global_view`: Vector of length 2001: a "global view" of the TCE.
* `local_view`: Vector of length 201: a "local view" of the TCE.

In addition, each `tf.Example` will contain the value of each column in the
input TCE CSV file. The columns include:

* `rowid`: Integer ID of the row in the TCE table.
* `kepid`: Kepler ID of the target star.
* `tce_plnt_num`: TCE number within the target star.
* `av_training_set`: Autovetter training set label.
* `tce_period`: Period of the detected event, in days.
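
To make the file format concrete, here is roughly the shape of one of these
protos if you were to build it by hand. This is purely illustrative, with
placeholder values; the real protos are written by the `generate_input_records`
command below:

```python
# Purely illustrative: the rough shape of a single serialized tf.Example.
# Placeholder values; the real protos are written by generate_input_records.
import tensorflow as tf

example = tf.train.Example(features=tf.train.Features(feature={
    "global_view": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.0] * 2001)),
    "local_view": tf.train.Feature(
        float_list=tf.train.FloatList(value=[0.0] * 201)),
    "kepid": tf.train.Feature(int64_list=tf.train.Int64List(value=[11442793])),
    "tce_plnt_num": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "av_training_set": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"PC"])),
}))
serialized_bytes = example.SerializeToString()
```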

```bash
# Use Bazel to create executable Python scripts.
#
# Alternatively, since all code is pure Python and does not need to be compiled,
# we could invoke the source scripts with the following addition to PYTHONPATH:
# export PYTHONPATH="/path/to/source/dir/:${PYTHONPATH}"
bazel build astronet/...

# Directory to save output TFRecord files into.
TFRECORD_DIR="${HOME}/astronet/tfrecord"

# Preprocess light curves into sharded TFRecord files using 5 worker processes.
bazel-bin/astronet/data/generate_input_records \
  --input_tce_csv_file=${TCE_CSV_FILE} \
  --kepler_data_dir=${KEPLER_DATA_DIR} \
  --output_dir=${TFRECORD_DIR} \
  --num_worker_processes=5
```

When the script finishes you will find 8 training files, 1 validation file and
1 test file in `TFRECORD_DIR`. The files will match the patterns
`train-0000?-of-00008`, `val-00000-of-00001` and `test-00000-of-00001`,
respectively.
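
As a quick check that the preprocessing produced what you expect, you can count
the serialized examples in each file. This optional snippet (not part of the
repo) uses the same TF 1.x `tf.python_io` API as the exploration code below:

```python
# Optional check: count the examples in each output TFRecord file.
import os.path
import tensorflow as tf

TFRECORD_DIR = "/path/to/tfrecords/dir"
for filename in sorted(tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*"))):
  n = sum(1 for _ in tf.python_io.tf_record_iterator(filename))
  print("{}: {} examples".format(os.path.basename(filename), n))
```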

Here's a quick description of what the script does. For a full description, see
Section 3 of [our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta).

For each light curve, we first fit a normalization spline to remove any
low-frequency variability (that is, the natural variability in light from the
star) without removing any deviations caused by planets or other objects. For
example, the following image shows the normalization spline for the segment of
Kepler-90 that we considered above:

![Kepler 90 Q4 Spline](docs/kep90-q4-spline.png)

Next, we divide by the spline to make the star's baseline brightness
approximately flat. Notice that after normalization the transit of Kepler-90 g
is still preserved:

![Kepler 90 Q4 Normalized](docs/kep90-q4-normalized.png)
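
The actual spline fitting in this repository is more involved, but the core
idea is simply "fit a smooth trend to the baseline, then divide it out." Below
is a rough stand-in on synthetic data that uses a low-order polynomial in place
of the spline; it is only meant to illustrate the divide-by-trend step:

```python
# Rough illustration of "fit a smooth trend and divide it out". This is a
# stand-in using a polynomial, NOT the repo's spline-fitting code.
import numpy as np

# Synthetic segment: a slow trend, plus noise, plus a box-shaped "transit".
time = np.linspace(0.0, 20.0, 1000)
trend = 1.0 + 0.005 * (time - 10.0) ** 2 / 100.0
flux = trend + 0.0005 * np.random.randn(time.size)
flux[480:520] -= 0.01

# Fit a smooth curve to the baseline and divide it out.
coeffs = np.polyfit(time, flux, deg=3)
normalized_flux = flux / np.polyval(coeffs, time)  # Baseline is now ~1.
```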

Finally, for each TCE in the input CSV table, we generate two representations of
the light curve of that star. Both representations are *phase-folded*, which
means that we combine all periods of the detected TCE into a single curve, with
the detected event centered.
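
Phase folding itself is a small amount of NumPy. The following is a minimal
sketch of the idea (shift the time axis so the event lands at phase zero, then
wrap modulo the period), not the repo's processing code; the period and epoch
values are the ones used in the prediction example later in this walkthrough:

```python
# Minimal sketch of phase folding: wrap times onto [-period/2, period/2) so
# that the detected event (epoch t0) lands at phase 0. Not the repo's code.
import numpy as np

def phase_fold(time, period, t0):
  half_period = period / 2
  return np.mod(time + (half_period - t0), period) - half_period

time = np.arange(0.0, 100.0, 0.02)  # Hypothetical observation times, in days.
folded = phase_fold(time, period=14.44912, t0=2.2)
```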

Let's explore the generated representations of Kepler-90 g in the output.

```python
# Launch iPython (or Python) from the tensorflow_models/astronet/ directory.
ipython

In[1]:
import matplotlib.pyplot as plt
import numpy as np
import os.path
import tensorflow as tf

In[2]:
KEPLER_ID = 11442793  # Kepler-90
TFRECORD_DIR = "/path/to/tfrecords/dir"

In[3]:
# Helper function to find the tf.Example corresponding to a particular TCE.
def find_tce(kepid, tce_plnt_num, filenames):
  for filename in filenames:
    for record in tf.python_io.tf_record_iterator(filename):
      ex = tf.train.Example.FromString(record)
      if (ex.features.feature["kepid"].int64_list.value[0] == kepid and
          ex.features.feature["tce_plnt_num"].int64_list.value[0] == tce_plnt_num):
        print("Found {}_{} in file {}".format(kepid, tce_plnt_num, filename))
        return ex
  raise ValueError("{}_{} not found in files: {}".format(kepid, tce_plnt_num, filenames))

In[4]:
# Find Kepler-90 g.
filenames = tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*"))
assert filenames, "No files found in {}".format(TFRECORD_DIR)
ex = find_tce(KEPLER_ID, 1, filenames)

In[5]:
# Plot the global and local views.
global_view = np.array(ex.features.feature["global_view"].float_list.value)
local_view = np.array(ex.features.feature["local_view"].float_list.value)
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
axes[0].plot(global_view, ".")
axes[1].plot(local_view, ".")
plt.show()
```

The output should look something like this:

![Kepler 90 g Processed](docs/kep90h-localglobal.png)

### Train an AstroNet Model

This directory contains several neural network architectures and various
configuration options. To train a convolutional neural network to classify
Kepler TCEs as either "planet" or "not planet", using the best configuration
from
[our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta),
run the following training script:

```bash
# Directory to save model checkpoints into.
MODEL_DIR="${HOME}/astronet/model/"

# Run the training script.
bazel-bin/astronet/train \
  --model=AstroCNNModel \
  --config_name=local_global \
  --train_files=${TFRECORD_DIR}/train* \
  --eval_files=${TFRECORD_DIR}/val* \
  --model_dir=${MODEL_DIR}
```

Optionally, you can also run a
[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard)
server in a separate process for real-time monitoring of training progress and
evaluation metrics.

```bash
# Launch TensorBoard server.
tensorboard --logdir ${MODEL_DIR}
```

The TensorBoard server will show a page like this:

![TensorBoard](docs/tensorboard.png)

### Evaluate an AstroNet Model

Run the following command to evaluate a model on the test set. The result will
be printed on the screen, and a summary file will also be written to the model
directory, which will be visible in TensorBoard.

```bash
# Run the evaluation script.
bazel-bin/astronet/evaluate \
  --model=AstroCNNModel \
  --config_name=local_global \
  --eval_files=${TFRECORD_DIR}/test* \
  --model_dir=${MODEL_DIR}
```

The output should look something like this:

```bash
INFO:tensorflow:Saving dict for global step 10000: accuracy/accuracy = 0.9625159, accuracy/num_correct = 1515.0, auc = 0.988882, confusion_matrix/false_negatives = 10.0, confusion_matrix/false_positives = 49.0, confusion_matrix/true_negatives = 1165.0, confusion_matrix/true_positives = 350.0, global_step = 10000, loss = 0.112445444, losses/weighted_cross_entropy = 0.11295206, num_examples = 1574.
```
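
The confusion-matrix entries in that log line are enough to work out precision
and recall for the "planet" class by hand:

```python
# Precision and recall for the "planet" class, computed from the confusion
# matrix values in the example log line above.
true_positives = 350.0
false_positives = 49.0
false_negatives = 10.0

precision = true_positives / (true_positives + false_positives)  # ~0.877
recall = true_positives / (true_positives + false_negatives)     # ~0.972
print("precision = {:.3f}, recall = {:.3f}".format(precision, recall))
```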

### Make Predictions

Suppose you detect a weak TCE in the light curve of the Kepler-90 star, with
period 14.44912 days, duration 2.70408 hours (0.11267 days) beginning 2.2 days
after 12:00 on 1/1/2009 (the year the Kepler telescope launched). To run this
TCE through your trained model, execute the following command:

```bash
# Generate a prediction for a new TCE.
bazel-bin/astronet/predict \
  --model=AstroCNNModel \
  --config_name=local_global \
  --model_dir=${MODEL_DIR} \
  --kepler_data_dir=${KEPLER_DATA_DIR} \
  --kepler_id=11442793 \
  --period=14.44912 \
  --t0=2.2 \
  --duration=0.11267 \
  --output_image_file="${HOME}/astronet/kepler-90i.png"
```
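
A note on units in the example above: `--period` and `--duration` are given in
days, and `--t0` is on the same scale as `tce_time0bk` in the TCE table (days
since 12:00 on 1/1/2009). Since the TCE table reports durations in hours, the
only conversion needed is:

```python
# The TCE table reports durations in hours; the predict command above expects
# days (2.70408 hours -> 0.11267 days).
duration_hours = 2.70408
duration_days = duration_hours / 24.0
print(round(duration_days, 5))  # 0.11267
```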

The output of the command above should look like this:

```Prediction: 0.9480018```

This means the model is about 95% confident that the input TCE is a planet.
Of course, this is only a small step in the overall process of discovering and
validating an exoplanet: the model’s prediction is not proof one way or the
other. The process of validating this signal as a real exoplanet requires
significant follow-up work by an expert astronomer --- see Sections 6.3 and 6.4
of [our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta)
for the full details. In this particular case, our follow-up analysis validated
this signal as a bona fide exoplanet: it’s now called
[Kepler-90 i](https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star),
and is the record-breaking eighth planet discovered around the Kepler-90 star!

In addition to the output prediction, the script will also produce a plot of the
input representations. For Kepler-90 i, the plot should look something like
this:

![Kepler 90 i Processed](docs/kep90i-localglobal.png)