Commit cfbba16

Merge pull request #9560 from kyleziegler:pretraining_updates
PiperOrigin-RevId: 348278830
2 parents d57ba59 + ec29c2f

1 file changed: +17 -0 lines changed

official/nlp/bert/README.md

Lines changed: 17 additions & 0 deletions
@@ -129,6 +129,23 @@ which is essentially branched from [BERT research repo](https://github.com/googl
 to get processed pre-training data and it adapts to TF2 symbols and python3
 compatibility.
 
+Running the pre-training script requires an input and output directory, as well as a vocab file. Note that max_seq_length will need to match the sequence length parameter you specify when you run pre-training.
+
+Example shell script to call create_pretraining_data.py
+```
+export WORKING_DIR='local disk or cloud location'
+export BERT_DIR='local disk or cloud location'
+python models/official/nlp/data/create_pretraining_data.py \
+  --input_file=$WORKING_DIR/input/input.txt \
+  --output_file=$WORKING_DIR/output/tf_examples.tfrecord \
+  --vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
+  --do_lower_case=True \
+  --max_seq_length=512 \
+  --max_predictions_per_seq=76 \
+  --masked_lm_prob=0.15 \
+  --random_seed=12345 \
+  --dupe_factor=5
+```
 
 ### Fine-tuning
 
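For context beyond the commit itself: the added note warns that max_seq_length must match the sequence length used later for pre-training, and the example's max_predictions_per_seq=76 follows the usual rule of thumb of roughly max_seq_length x masked_lm_prob (512 x 0.15 = 76.8). A minimal sketch of a matching pre-training invocation follows; the entry point official/nlp/bert/run_pretraining.py and the flag names shown are assumptions to verify against the Model Garden, not commands documented by this commit.

```
# Sketch only. The script path and flag names (run_pretraining.py,
# --input_files, --model_dir, --bert_config_file, --train_batch_size,
# --max_seq_length, --max_predictions_per_seq) are assumptions; verify
# them against the repository before use.
export WORKING_DIR='local disk or cloud location'  # same locations as in the snippet above
export BERT_DIR='local disk or cloud location'

python models/official/nlp/bert/run_pretraining.py \
  --input_files=$WORKING_DIR/output/tf_examples.tfrecord \
  --model_dir=$WORKING_DIR/model \
  --bert_config_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=512 \
  --max_predictions_per_seq=76
```

The last two values intentionally repeat what was passed to create_pretraining_data.py: the TFRecord features are written as fixed-length lists of those sizes, so a reader configured with different values will typically fail to parse them.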