
Commit 46ec74f

Merge pull request Azure#627 from jingyanwangms/jingywa/lightgbm-notebook
add Lightgbm Estimator notebook
2 parents 86c1b3d + 8d2e362 commit 46ec74f

File tree

6 files changed: +15381 −0 lines changed

contrib/gbdt/lightgbm/binary0.test: +500 lines (large diff not rendered by default)
contrib/gbdt/lightgbm/binary0.train: +7,000 lines (large diff not rendered by default)
contrib/gbdt/lightgbm/binary1.test: +500 lines (large diff not rendered by default)
contrib/gbdt/lightgbm/binary1.train: +7,000 lines (large diff not rendered by default)

contrib/gbdt/lightgbm/lightgbm-example.ipynb

+270
@@ -0,0 +1,270 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Copyright (c) Microsoft Corporation. All rights reserved. \n",
        "Licensed under the MIT License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "![Impressions](https://PixelServer20190423114238.azurewebsites.net/api/impressions/MachineLearningNotebooks/how-to-use-azureml/contrib/gbdt/lightgbm/lightgbm-example.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Use LightGBM Estimator in Azure Machine Learning\n",
        "In this notebook we demonstrate how to run a training job using the LightGBM Estimator. [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Prerequisites\n",
        "This notebook uses the azureml-contrib-gbdt package. If you don't already have it, install it by uncommenting the cell below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "#!pip install azureml-contrib-gbdt --extra-index-url https://azuremlsdktestpypi.azureedge.net/LightGBMPrivateRelease"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core import Workspace, Run, Experiment\n",
        "import shutil, os\n",
        "from azureml.widgets import RunDetails\n",
        "from azureml.contrib.gbdt import LightGBM\n",
        "from azureml.train.dnn import Mpi\n",
        "from azureml.core.compute import AmlCompute, ComputeTarget\n",
        "from azureml.core.compute_target import ComputeTargetException"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you are using an AzureML Compute Instance, you are all set. Otherwise, go through the [configuration.ipynb](../../../configuration.ipynb) notebook to install the Azure Machine Learning Python SDK and create an Azure ML Workspace."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Set up machine learning resources"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "ws = Workspace.from_config()\n",
        "\n",
        "print('Workspace name: ' + ws.name, \n",
        "      'Azure region: ' + ws.location, \n",
        "      'Subscription id: ' + ws.subscription_id, \n",
        "      'Resource group: ' + ws.resource_group, sep = '\\n')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "cluster_vm_size = \"STANDARD_DS14_V2\"\n",
        "cluster_min_nodes = 0\n",
        "cluster_max_nodes = 20\n",
        "cpu_cluster_name = 'TrainingCompute'\n",
        "\n",
        "try:\n",
        "    cpu_cluster = AmlCompute(ws, cpu_cluster_name)\n",
        "    if cpu_cluster and type(cpu_cluster) is AmlCompute:\n",
        "        print('found compute target: ' + cpu_cluster_name)\n",
        "except ComputeTargetException:\n",
        "    print('creating a new compute target...')\n",
        "    provisioning_config = AmlCompute.provisioning_configuration(vm_size = cluster_vm_size, \n",
        "                                                                vm_priority = 'lowpriority', \n",
        "                                                                min_nodes = cluster_min_nodes, \n",
        "                                                                max_nodes = cluster_max_nodes)\n",
        "    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, provisioning_config)\n",
        "\n",
        "    # can poll for a minimum number of nodes and for a specific timeout.\n",
        "    # if no min node count is provided it will use the scale settings for the cluster\n",
        "    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)\n",
        "\n",
        "    # For a more detailed view of current Azure Machine Learning Compute status, use get_status()\n",
        "    print(cpu_cluster.get_status().serialize())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From this point, you can either upload the training data files directly or use a Datastore for training data storage.\n",
        "## Upload training files from local"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "scripts_folder = \"scripts_folder\"\n",
        "if not os.path.isdir(scripts_folder):\n",
        "    os.mkdir(scripts_folder)\n",
        "shutil.copy('./train.conf', os.path.join(scripts_folder, 'train.conf'))\n",
        "shutil.copy('./binary0.train', os.path.join(scripts_folder, 'binary0.train'))\n",
        "shutil.copy('./binary1.train', os.path.join(scripts_folder, 'binary1.train'))\n",
        "shutil.copy('./binary0.test', os.path.join(scripts_folder, 'binary0.test'))\n",
        "shutil.copy('./binary1.test', os.path.join(scripts_folder, 'binary1.test'))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "training_data_list = [\"binary0.train\", \"binary1.train\"]\n",
        "validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
        "lgbm = LightGBM(source_directory=scripts_folder, \n",
        "                compute_target=cpu_cluster, \n",
        "                distributed_training=Mpi(),\n",
        "                node_count=2,\n",
        "                lightgbm_config='train.conf',\n",
        "                data=training_data_list,\n",
        "                valid=validation_data_list)\n",
        "experiment_name = 'lightgbm-estimator-test'\n",
        "experiment = Experiment(ws, name=experiment_name)\n",
        "run = experiment.submit(lgbm, tags={\"test public docker image\": None})\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Use data reference"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from azureml.core.datastore import Datastore\n",
        "from azureml.data.data_reference import DataReference\n",
        "datastore = ws.get_default_datastore()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "datastore.upload(src_dir='.',\n",
        "                 target_path='.',\n",
        "                 show_progress=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "training_data_list = [\"binary0.train\", \"binary1.train\"]\n",
        "validation_data_list = [\"binary0.test\", \"binary1.test\"]\n",
        "lgbm = LightGBM(source_directory='.', \n",
        "                compute_target=cpu_cluster, \n",
        "                distributed_training=Mpi(),\n",
        "                node_count=2,\n",
        "                inputs=[datastore.as_mount()],\n",
        "                lightgbm_config='train.conf',\n",
        "                data=training_data_list,\n",
        "                valid=validation_data_list)\n",
        "experiment_name = 'lightgbm-estimator-test'\n",
        "experiment = Experiment(ws, name=experiment_name)\n",
        "run = experiment.submit(lgbm, tags={\"use datastore.as_mount()\": None})\n",
        "RunDetails(run).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "run.wait_for_completion(show_output=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# uncomment below and run if compute resources are no longer needed\n",
        "# cpu_cluster.delete()"
      ]
    }
  ],
  "metadata": {
    "authors": [
      {
        "name": "jingywa"
      }
    ],
    "kernelspec": {
      "display_name": "Python 3.6",
      "language": "python",
      "name": "python36"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
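
A note on retrieving results (not part of this commit): once run.wait_for_completion returns, the run's uploaded artifacts can be listed and downloaded with the standard azureml.core.Run API. A minimal sketch follows; it assumes the estimator uploads the trained model under outputs/ with the name LightGBM_model.txt, which is the output_model value set in train.conf below.

    import os

    # Sketch: pull the trained model back from the completed run.
    # Assumption: the model was uploaded to the run as outputs/LightGBM_model.txt.
    run.wait_for_completion(show_output=False)

    # List everything the run uploaded, to locate the model artifact.
    for name in run.get_file_names():
        print(name)

    os.makedirs("model", exist_ok=True)
    run.download_file(name="outputs/LightGBM_model.txt",
                      output_file_path="model/LightGBM_model.txt")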

contrib/gbdt/lightgbm/train.conf

+111
@@ -0,0 +1,111 @@
# task type, supports train and predict
task = train

# boosting type, supports gbdt for now, alias: boosting, boost
boosting_type = gbdt

# application type, supports the following applications:
# regression , regression task
# binary , binary classification task
# lambdarank , lambdarank task
# alias: application, app
objective = binary

# eval metrics, supports multiple metrics delimited by ',', supports the following metrics:
# l1
# l2 , default metric for regression
# ndcg , default metric for lambdarank
# auc
# binary_logloss , default metric for binary
# binary_error
metric = binary_logloss,auc

# frequency of metric output
metric_freq = 1

# true to also output metrics for the training data, alias: training_metric, train_metric
is_training_metric = true

# number of bins for feature buckets; 255 is a recommended setting, it saves memory and gives good accuracy
max_bin = 255

# training data
# if a weight file exists, it should be named "binary.train.weight"
# alias: train_data, train
data = binary.train

# validation data, supports multiple validation files, separated by ','
# if a weight file exists, it should be named "binary.test.weight"
# alias: valid, test, test_data
valid_data = binary.test

# number of trees (iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds
num_trees = 100

# shrinkage rate, alias: shrinkage_rate
learning_rate = 0.1

# number of leaves for one tree, alias: num_leaf
num_leaves = 63

# type of tree learner, supports the following types:
# serial , single machine version
# feature , use feature parallelism to train
# data , use data parallelism to train
# voting , use voting-based parallelism to train
# alias: tree
tree_learner = feature

# number of threads for multi-threading. One thread will use one CPU; the default is set to the number of CPUs.
# num_threads = 8

# feature sub-sampling: randomly select 80% of features to train on at each iteration
# alias: sub_feature
feature_fraction = 0.8

# Supports bagging (data sub-sampling): perform bagging every 5 iterations
bagging_freq = 5

# Bagging fraction: randomly select 80% of the data for bagging
# alias: sub_row
bagging_fraction = 0.8

# minimal number of data points in one leaf, use this to deal with over-fitting
# alias: min_data_per_leaf, min_data
min_data_in_leaf = 50

# minimal sum of Hessians in one leaf, use this to deal with over-fitting
min_sum_hessian_in_leaf = 5.0

# save memory and gain speed for sparse features, alias: is_sparse
is_enable_sparse = true

# when the data is bigger than memory, set this to true; otherwise false gives faster loading
# alias: two_round_loading, two_round
use_two_round_loading = false

# true to save the data to a binary file; the application will auto-load the binary file next time
# alias: is_save_binary, save_binary
is_save_binary_file = false

# output model file
output_model = LightGBM_model.txt

# supports continued training from a trained gbdt model
# input_model = trained_model.txt

# output prediction file for the predict task
# output_result = prediction.txt

# supports continued training from an initial score file
# input_init_score = init_score.txt


# number of machines in parallel training, alias: num_machine
num_machines = 2

# local listening port in parallel training, alias: local_port
local_listen_port = 12400

# machine list file for parallel training, alias: mlist
machine_list_file = mlist.txt