Commit 61c91e9

Split the README files between the top level Exoplanet ML directory and the AstroNet subdirectory.
PiperOrigin-RevId: 223437592
1 parent 9b287ea commit 61c91e9

14 files changed: +422 -374 lines

research/astronet/README.md

+19-368
Large diffs are not rendered by default.

research/astronet/astronet/README.md

+381
@@ -0,0 +1,381 @@
# AstroNet: A Neural Network for Identifying Exoplanets in Light Curves

![Transit Animation](docs/transit.gif)

## Code Author

Chris Shallue: [@cshallue](https://github.com/cshallue)

## Background

This directory contains TensorFlow models and data processing code for
identifying exoplanets in astrophysical light curves. For complete background,
see [our paper](http://adsabs.harvard.edu/abs/2018AJ....155...94S) in
*The Astronomical Journal*.

For shorter summaries, see:

* ["Earth to Exoplanet"](https://www.blog.google/topics/machine-learning/hunting-planets-machine-learning/) on the Google blog
* [This blog post](https://www.cfa.harvard.edu/~avanderb/page1.html#kepler90) by Andrew Vanderburg
* [This great article](https://milesobrien.com/artificial-intelligence-gains-intuition-hunting-exoplanets/) by Fedor Kossakovski
* [NASA's press release](https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star) article

## Citation

If you find this code useful, please cite our paper:

Shallue, C. J., & Vanderburg, A. (2018). Identifying Exoplanets with Deep
Learning: A Five-planet Resonant Chain around Kepler-80 and an Eighth Planet
around Kepler-90. *The Astronomical Journal*, 155(2), 94.

Full text available at [*The Astronomical Journal*](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta).

## Walkthrough

### Required Packages

First, ensure that you have installed the
[required packages](../README.md#required-packages) and that the
[unit tests](../README.md#run-unit-tests) pass.

### Download Kepler Data

A *light curve* is a plot of the brightness of a star over time. We will be
focusing on light curves produced by the Kepler space telescope, which monitored
the brightness of 200,000 stars in our Milky Way galaxy for 4 years. An example
light curve produced by Kepler is shown below.

![Kepler-943](docs/kepler-943.png)

To train a model to identify planets in Kepler light curves, you will need a
training set of labeled *Threshold Crossing Events* (TCEs). A TCE is a periodic
signal that has been detected in a Kepler light curve and is associated with a
*period* (the number of days between each occurrence of the detected signal),
a *duration* (the time taken by each occurrence of the signal), an *epoch* (the
time of the first observed occurrence of the signal), and possibly additional
metadata like the signal-to-noise ratio. An example TCE is shown below. The
labels are ground truth classifications (decided by humans) that indicate which
TCEs in the training set are actual planet signals and which are caused by
other phenomena.

![Kepler-943 Transits](docs/kepler-943-transits.png)

You can download the DR24 TCE Table in CSV format from the [NASA Exoplanet
Archive](https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=q1_q17_dr24_tce). Ensure the following columns are selected
(a quick way to sanity-check the downloaded file is sketched just after this list):

* `rowid`: Integer ID of the row in the TCE table.
* `kepid`: Kepler ID of the target star.
* `tce_plnt_num`: TCE number within the target star.
* `tce_period`: Period of the detected event, in days.
* `tce_time0bk`: The time corresponding to the center of the first detected
  event in Barycentric Julian Day (BJD) minus a constant offset of
  2,454,833.0 days.
* `tce_duration`: Duration of the detected event, in hours.
* `av_training_set`: Autovetter training set label; one of PC (planet candidate),
  AFP (astrophysical false positive), NTP (non-transiting phenomenon),
  UNK (unknown).

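As a quick sanity check, you can load the CSV and look at how the Autovetter
labels are distributed. The sketch below assumes `pandas` is installed, that you
saved the table to the path used later in this walkthrough, and that the export
includes the Archive's usual `#`-prefixed comment lines; the exact counts depend
on the table you downloaded.

```python
# Sanity-check the downloaded TCE table.
import os.path
import pandas as pd

TCE_CSV_FILE = os.path.expanduser("~/astronet/dr24_tce.csv")

tce_table = pd.read_csv(TCE_CSV_FILE, comment="#")
print("{} TCEs".format(len(tce_table)))

# Distribution of Autovetter labels (PC / AFP / NTP / UNK).
print(tce_table["av_training_set"].value_counts())
```
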
Next, you will need to download the light curves of the stars corresponding to
the TCEs in the training set. These are available at the
[Mikulski Archive for Space Telescopes](https://archive.stsci.edu/). However,
you almost certainly don't want all of the Kepler data, which consists of almost
3 million files, takes up over a terabyte of space, and may take several weeks
to download! To train our model, we only need to download the subset of light
curves that are associated with TCEs in the DR24 file. To download just those
light curves, follow these steps:

**NOTE:** Even though we are only downloading a subset of the entire Kepler
dataset, the files downloaded by the following script take up about **90 GB**.

```bash
# Filename containing the CSV file of TCEs in the training set.
TCE_CSV_FILE="${HOME}/astronet/dr24_tce.csv"

# Directory to download Kepler light curves into.
KEPLER_DATA_DIR="${HOME}/astronet/kepler/"

# Generate a bash script that downloads the Kepler light curves in the training set.
python astronet/data/generate_download_script.py \
  --kepler_csv_file=${TCE_CSV_FILE} \
  --download_dir=${KEPLER_DATA_DIR}

# Run the download script to download Kepler light curves.
./get_kepler.sh
```

The final line should read: `Finished downloading 12669 Kepler targets to
${KEPLER_DATA_DIR}`

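If you want to double-check the download, the following sketch counts the
distinct Kepler targets among the downloaded `.fits` files. It assumes Python 3
and the standard MAST filename convention (`kplr` followed by the 9-digit Kepler
ID); the count should match the number reported by `get_kepler.sh`.

```python
# Count distinct Kepler IDs among the downloaded light curve files.
import glob
import os
import re

KEPLER_DATA_DIR = os.path.expanduser("~/astronet/kepler/")

kepids = set()
for path in glob.glob(os.path.join(KEPLER_DATA_DIR, "**", "kplr*.fits"),
                      recursive=True):
  match = re.match(r"kplr(\d{9})", os.path.basename(path))
  if match:
    kepids.add(match.group(1))

print("Light curves downloaded for {} unique Kepler targets".format(len(kepids)))
```
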
Let's explore the downloaded light curve of the Kepler-90 star! Note that Kepler
light curves are divided into
[four quarters each year](https://keplerscience.arc.nasa.gov/data-products.html#kepler-data-release-notes),
which are separated by the quarterly rolls that the spacecraft made to reorient
its solar panels. In the downloaded light curves, each `.fits` file corresponds
to a specific Kepler quarter, but some quarters are divided into multiple
`.fits` files.

```python
# Launch IPython (or Python) from the tensorflow_models/astronet/ directory.
ipython

In[1]:
from light_curve import kepler_io
import matplotlib.pyplot as plt
import numpy as np

In[2]:
KEPLER_DATA_DIR = "/path/to/kepler/"
KEPLER_ID = 11442793  # Kepler-90.

In[3]:
# Read the light curve.
file_names = kepler_io.kepler_filenames(KEPLER_DATA_DIR, KEPLER_ID)
assert file_names, "Failed to find .fits files in {}".format(KEPLER_DATA_DIR)
all_time, all_flux = kepler_io.read_kepler_light_curve(file_names)
print("Read light curve with {} segments".format(len(all_time)))

In[4]:
# Plot the fourth segment.
plt.plot(all_time[3], all_flux[3], ".")
plt.show()

In[5]:
# Plot all light curve segments. We first divide by the median flux in each
# segment, because the segments are on different scales.
for f in all_flux:
  f /= np.median(f)
plt.plot(np.concatenate(all_time), np.concatenate(all_flux), ".")
plt.show()
```

The output plots should look something like this:

![Kepler 90 Q4](docs/kep90-q4-raw.png)

![Kepler 90 All](docs/kep90-all.png)

The first plot is a single segment of approximately 20 days. You can see a
planet transit --- that's Kepler-90 g! Also, notice that the brightness of the
star is not flat over time --- there is natural variation in the brightness,
even away from the planet transit.

The second plot is the full light curve over the entire Kepler mission
(approximately 4 years). You can easily see two transiting planets by eye ---
they are Kepler-90 h (the biggest known planet in the system with the deepest
transits) and Kepler-90 g (the second biggest known planet in the system with
the second deepest transits).

### Process Kepler Data

To train a model to identify exoplanets, you will need to provide TensorFlow
with training data in
[TFRecord](https://www.tensorflow.org/programmers_guide/datasets) format. The
TFRecord format consists of a set of sharded files containing serialized
`tf.Example` [protocol buffers](https://developers.google.com/protocol-buffers/).

The command below will generate a set of sharded TFRecord files for the TCEs in
the training set. Each `tf.Example` proto will contain the following light curve
representations:

* `global_view`: Vector of length 2001: a "global view" of the TCE.
* `local_view`: Vector of length 201: a "local view" of the TCE.

In addition, each `tf.Example` will contain the value of each column in the
input TCE CSV file. The columns include:

* `rowid`: Integer ID of the row in the TCE table.
* `kepid`: Kepler ID of the target star.
* `tce_plnt_num`: TCE number within the target star.
* `av_training_set`: Autovetter training set label.
* `tce_period`: Period of the detected event, in days.

```bash
# Use Bazel to create executable Python scripts.
#
# Alternatively, since all code is pure Python and does not need to be compiled,
# we could invoke the source scripts with the following addition to PYTHONPATH:
# export PYTHONPATH="/path/to/source/dir/:${PYTHONPATH}"
bazel build astronet/...

# Directory to save output TFRecord files into.
TFRECORD_DIR="${HOME}/astronet/tfrecord"

# Preprocess light curves into sharded TFRecord files using 5 worker processes.
bazel-bin/astronet/data/generate_input_records \
  --input_tce_csv_file=${TCE_CSV_FILE} \
  --kepler_data_dir=${KEPLER_DATA_DIR} \
  --output_dir=${TFRECORD_DIR} \
  --num_worker_processes=5
```

When the script finishes you will find 8 training files, 1 validation file and
1 test file in `TFRECORD_DIR`. The files will match the patterns
`train-0000?-of-00008`, `val-00000-of-00001` and `test-00000-of-00001`
respectively.

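To confirm the split, you can count the serialized `tf.Example` protos in each
shard. This short sketch uses the TF 1.x utilities `tf.gfile.Glob` and
`tf.python_io.tf_record_iterator` (the same calls that appear in the exploration
snippet further below); `TFRECORD_DIR` is a placeholder for your output
directory, and the per-shard counts will vary with the TCE table you used.

```python
# Count the tf.Examples in each output TFRecord shard.
import os.path
import tensorflow as tf

TFRECORD_DIR = "/path/to/tfrecords/dir"

for filename in sorted(tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*"))):
  num_examples = sum(1 for _ in tf.python_io.tf_record_iterator(filename))
  print("{}: {} examples".format(os.path.basename(filename), num_examples))
```
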
Here's a quick description of what the script does. For a full description, see
Section 3 of [our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta).

For each light curve, we first fit a normalization spline to remove any
low-frequency variability (that is, the natural variability in the light from
the star) without removing any deviations caused by planets or other objects.
For example, the following image shows the normalization spline for the segment
of Kepler-90 that we considered above:

![Kepler 90 Q4 Spline](docs/kep90-q4-spline.png)

Next, we divide by the spline to make the star's baseline brightness
approximately flat. Notice that after normalization the transit of Kepler-90 g
is still preserved:

![Kepler 90 Q4 Normalized](docs/kep90-q4-normalized.png)

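If you want to get a feel for this step, the toy sketch below divides one
segment by a low-order polynomial trend. This is only a crude stand-in for the
spline fitting described in the paper, but it illustrates the
divide-out-the-trend idea on the arrays loaded earlier; `time` and `flux` are
assumed to be 1D numpy arrays for a single segment with NaN cadences removed.

```python
# Toy normalization of one light curve segment: fit a smooth trend and divide
# it out, so the baseline sits near 1 while transit dips are preserved.
import numpy as np

def normalize_segment(time, flux, poly_order=3):
  # Low-order polynomial as a stand-in for the paper's normalization spline.
  coeffs = np.polyfit(time, flux, poly_order)
  trend = np.polyval(coeffs, time)
  return flux / trend

# Example (using the arrays from the earlier snippet):
# normalized_flux = normalize_segment(all_time[3], all_flux[3])
```
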
Finally, for each TCE in the input CSV table, we generate two representations of
the light curve of that star. Both representations are *phase-folded*, which
means that we combine all periods of the detected TCE into a single curve, with
the detected event centered.

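As a minimal illustration of what phase folding means (this is not the
repository's preprocessing code), the sketch below maps each observation time
into a single period with the detected event at phase 0; sorting by the folded
time and binning the flux is essentially what produces the global and local
views. Here `period` and `t0` are assumed to come from the TCE table
(`tce_period` and `tce_time0bk`).

```python
# Phase-fold observation times so the detected event lands at phase 0.
import numpy as np

def phase_fold(time, period, t0):
  half_period = period / 2
  # Result lies in [-period / 2, period / 2), with the event centered at 0.
  return np.mod(time - t0 + half_period, period) - half_period
```
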
Let's explore the generated representations of Kepler-90 g in the output.

```python
# Launch IPython (or Python) from the tensorflow_models/astronet/ directory.
ipython

In[1]:
import matplotlib.pyplot as plt
import numpy as np
import os.path
import tensorflow as tf

In[2]:
KEPLER_ID = 11442793  # Kepler-90
TFRECORD_DIR = "/path/to/tfrecords/dir"

In[3]:
# Helper function to find the tf.Example corresponding to a particular TCE.
def find_tce(kepid, tce_plnt_num, filenames):
  for filename in filenames:
    for record in tf.python_io.tf_record_iterator(filename):
      ex = tf.train.Example.FromString(record)
      if (ex.features.feature["kepid"].int64_list.value[0] == kepid and
          ex.features.feature["tce_plnt_num"].int64_list.value[0] == tce_plnt_num):
        print("Found {}_{} in file {}".format(kepid, tce_plnt_num, filename))
        return ex
  raise ValueError("{}_{} not found in files: {}".format(kepid, tce_plnt_num, filenames))

In[4]:
# Find Kepler-90 g.
filenames = tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*"))
assert filenames, "No files found in {}".format(TFRECORD_DIR)
ex = find_tce(KEPLER_ID, 1, filenames)

In[5]:
# Plot the global and local views.
global_view = np.array(ex.features.feature["global_view"].float_list.value)
local_view = np.array(ex.features.feature["local_view"].float_list.value)
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
axes[0].plot(global_view, ".")
axes[1].plot(local_view, ".")
plt.show()
```

The output should look something like this:

![Kepler 90 g Processed](docs/kep90h-localglobal.png)

### Train an AstroNet Model

This directory contains several neural network architectures and various
configuration options. To train a convolutional neural network to classify
Kepler TCEs as either "planet" or "not planet", using the best configuration
from
[our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta),
run the following training script:

```bash
# Directory to save model checkpoints into.
MODEL_DIR="${HOME}/astronet/model/"

# Run the training script.
bazel-bin/astronet/train \
  --model=AstroCNNModel \
  --config_name=local_global \
  --train_files=${TFRECORD_DIR}/train* \
  --eval_files=${TFRECORD_DIR}/val* \
  --model_dir=${MODEL_DIR}
```

Optionally, you can also run a
[TensorBoard](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard)
server in a separate process for real-time monitoring of training progress and
evaluation metrics.

```bash
# Launch TensorBoard server.
tensorboard --logdir ${MODEL_DIR}
```

The TensorBoard server will show a page like this:

![TensorBoard](docs/tensorboard.png)

### Evaluate an AstroNet Model

Run the following command to evaluate a model on the test set. The result will
be printed on the screen, and a summary file will also be written to the model
directory, which will be visible in TensorBoard.

```bash
# Run the evaluation script.
bazel-bin/astronet/evaluate \
  --model=AstroCNNModel \
  --config_name=local_global \
  --eval_files=${TFRECORD_DIR}/test* \
  --model_dir=${MODEL_DIR}
```

The output should look something like this:

```bash
INFO:tensorflow:Saving dict for global step 10000: accuracy/accuracy = 0.9625159, accuracy/num_correct = 1515.0, auc = 0.988882, confusion_matrix/false_negatives = 10.0, confusion_matrix/false_positives = 49.0, confusion_matrix/true_negatives = 1165.0, confusion_matrix/true_positives = 350.0, global_step = 10000, loss = 0.112445444, losses/weighted_cross_entropy = 0.11295206, num_examples = 1574.
```

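The evaluation script reports accuracy and AUC directly; if you also want
precision and recall on the test set, they follow from the confusion-matrix
entries in that log line. A quick back-of-the-envelope sketch using the example
numbers above (your own numbers will differ):

```python
# Precision and recall from the confusion-matrix values in the example log line.
true_positives = 350.0
false_positives = 49.0
false_negatives = 10.0

precision = true_positives / (true_positives + false_positives)  # ~0.877
recall = true_positives / (true_positives + false_negatives)     # ~0.972
print("precision = {:.3f}, recall = {:.3f}".format(precision, recall))
```
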
### Make Predictions

Suppose you detect a weak TCE in the light curve of the Kepler-90 star, with
period 14.44912 days, duration 2.70408 hours (0.11267 days), beginning 2.2 days
after 12:00 on 1/1/2009 (the year the Kepler telescope launched); in the
`tce_time0bk` time system (BJD minus 2,454,833.0 days), this epoch is t0 = 2.2.
To run this TCE through your trained model, execute the following command:

```bash
# Generate a prediction for a new TCE.
bazel-bin/astronet/predict \
  --model=AstroCNNModel \
  --config_name=local_global \
  --model_dir=${MODEL_DIR} \
  --kepler_data_dir=${KEPLER_DATA_DIR} \
  --kepler_id=11442793 \
  --period=14.44912 \
  --t0=2.2 \
  --duration=0.11267 \
  --output_image_file="${HOME}/astronet/kepler-90i.png"
```

The output should look like this:

```
Prediction: 0.9480018
```

This means the model is about 95% confident that the input TCE is a planet.
Of course, this is only a small step in the overall process of discovering and
validating an exoplanet: the model's prediction is not proof one way or the
other. The process of validating this signal as a real exoplanet requires
significant follow-up work by an expert astronomer --- see Sections 6.3 and 6.4
of [our paper](http://iopscience.iop.org/article/10.3847/1538-3881/aa9e09/meta)
for the full details. In this particular case, our follow-up analysis validated
this signal as a bona fide exoplanet: it's now called
[Kepler-90 i](https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star),
and is the record-breaking eighth planet discovered around the Kepler-90 star!

In addition to the output prediction, the script will also produce a plot of the
input representations. For Kepler-90 i, the plot should look something like
this:

![Kepler 90 i Processed](docs/kep90i-localglobal.png)

research/astronet/astrowavenet/README.md

+3-6
```diff
@@ -15,13 +15,10 @@ Chris Shallue: [@cshallue](https://github.com/cshallue)
 
 ## Additional Dependencies
 
-This package requires TensorFlow 1.12 or greater. As of October 2018, this
-requires the **TensorFlow nightly build**
-([instructions](https://www.tensorflow.org/install/pip)).
-
-In addition to the dependencies listed in the top-level README, this package
-requires:
+In addition to the [required packages](../README.md#required-packages) listed in
+the top-level README, this package requires:
 
+* **TensorFlow 1.12 or greater** ([instructions](https://www.tensorflow.org/install/))
 * **TensorFlow Probability** ([instructions](https://www.tensorflow.org/probability/install))
 * **Six** ([instructions](https://pypi.org/project/six/))
```