Commit dd6317a: merge pull request Azure#336 from rastala/master, "adding work-with-data" (2 parents 59a01c1 + 82d8353; 66 files changed, +18422 −0 lines)

Diff for: how-to-use-azureml/work-with-data/dataprep/README.md (+177 lines)

# Azure Machine Learning Data Prep SDK

You will find in this repo:

- [How-To Guide Notebooks](how-to-guides) for more in-depth feature examples.
- [Case Study Notebooks](case-studies/new-york-taxi) for in-depth, scenario-based examples of features.
- [Getting Started Tutorial](tutorials/getting-started/getting-started.ipynb) for a quick introduction to the Data Prep SDK and some of its main features.

## Installation
Here are the [SDK installation steps](https://docs.microsoft.com/python/api/overview/azure/dataprep/intro?view=azure-dataprep-py#install).

## Documentation
Here is more information on how to use the Data Prep SDK:

- [SDK overview and API reference docs](http://aka.ms/data-prep-sdk) covering the SDK's classes, methods, and function parameters.
- [Tutorial: Prep NYC taxi data](https://docs.microsoft.com/azure/machine-learning/service/tutorial-data-prep) for regression modeling, followed by automated machine learning to build the model.
- [How to load data](https://docs.microsoft.com/azure/machine-learning/service/how-to-load-data): an overview of loading data with the Data Prep SDK.
- [How to transform data](https://docs.microsoft.com/azure/machine-learning/service/how-to-transform-data): an overview of transforming data.
- [How to write data](https://docs.microsoft.com/azure/machine-learning/service/how-to-write-data): an overview of writing data to different storage locations.

## Known Issues

- **If running version 0.1.0**: To fix "Error Message: Cannot run the event loop while another loop is running", downgrade Tornado to version 4.5.3. Restart any running kernels for the change to take effect.

  ```
  pip install -U tornado==4.5.3
  ```

## Release Notes

### 2019-03-25 (version 1.1.0)

Breaking changes
- The concept of the Data Prep Package has been deprecated and is no longer supported. Instead of persisting multiple Dataflows in one Package, you can persist Dataflows individually.
  - How-to guide: [Opening and Saving Dataflows notebook](https://aka.ms/aml-data-prep-open-save-dataflows-nb)

New features
- Data Prep can now recognize columns that match a particular Semantic Type and split them accordingly. The STypes currently supported include: email address, geographic coordinates (latitude & longitude), IPv4 and IPv6 addresses, US phone number, and US zip code.
  - How-to guide: [Semantic Types notebook](https://aka.ms/aml-data-prep-semantic-types-nb)
- Data Prep now supports the following operations to generate a resultant column from two numeric columns: subtract, multiply, divide, and modulo.
- You can call `verify_has_data()` on a Dataflow to check whether the Dataflow would produce records if executed.

Bug fixes and improvements
- You can now specify the number of bins to use in a histogram for numeric column profiles.
- The `read_pandas_dataframe` transform now requires the DataFrame to have string- or byte-typed column names.
- Fixed a bug in the `fill_nulls` transform where values were not correctly filled in if the column was missing.

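As an illustration of the derived-column operations above (independent of the Data Prep SDK itself; the column names and sample data are hypothetical), the idea of generating a resultant column from two numeric columns can be sketched in plain Python:

```python
import operator

# Hypothetical sample rows with two numeric columns, "fare" and "passengers".
rows = [{"fare": 10.0, "passengers": 2}, {"fare": 9.0, "passengers": 4}]

# The four operations named in the release note.
OPS = {"subtract": operator.sub, "multiply": operator.mul,
       "divide": operator.truediv, "modulo": operator.mod}

def derive_column(rows, new_name, op, left, right):
    """Return rows extended with a resultant column computed from two numeric columns."""
    return [dict(r, **{new_name: OPS[op](r[left], r[right])}) for r in rows]

derived = derive_column(rows, "fare_per_passenger", "divide", "fare", "passengers")
```

The SDK applies such operations per record across a Dataflow; the sketch only shows the arithmetic semantics.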
### 2019-03-11 (version 1.0.17)

New features
- Now supports adding two numeric columns to generate a resultant column using the expression language.

Bug fixes and improvements
- Improved the documentation and parameter checking for `random_split`.

### 2019-02-27 (version 1.0.16)

Bug fix
- Fixed a Service Principal authentication issue that was caused by an API change.

### 2019-02-25 (version 1.0.15)

New features
- Data Prep now supports writing file streams from a dataflow, and provides the ability to manipulate the file stream names to create new file names.
  - How-to guide: [Working With File Streams notebook](https://aka.ms/aml-data-prep-file-stream-nb)

Bug fixes and improvements
- Improved performance of t-Digest on large data sets.
- Data Prep now supports reading data from a DataPath.
- One-hot encoding now works on boolean and numeric columns.
- Other miscellaneous bug fixes.

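The one-hot encoding change above can be illustrated without the SDK. This plain-Python sketch (hypothetical values and a hypothetical `is_<value>` column-naming scheme) shows how boolean and numeric columns both map to 0/1 indicator columns:

```python
def one_hot_encode(values):
    """Map each distinct value in a column to its own 0/1 indicator column."""
    categories = sorted(set(values), key=repr)  # deterministic column order
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Works for boolean and numeric columns alike, per the release note above.
encoded_bool = one_hot_encode([True, False, True])
encoded_num = one_hot_encode([3, 1, 3])
```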
### 2019-02-11 (version 1.0.12)

New features
- Data Prep now supports reading from an Azure SQL database using Datastore.

Changes
- Significantly improved the memory performance of certain operations on large data.
- `read_pandas_dataframe()` now requires `temp_folder` to be specified.
- The `name` property on `ColumnProfile` has been deprecated; use `column_name` instead.

### 2019-01-28 (version 1.0.8)

Bug fixes
- Significantly improved the performance of getting data profiles.
- Fixed minor bugs related to error reporting.

86+
### 2019-01-14 (version 1.0.7)

New features
- Datastore improvements (documented in the [Datastore how-to guide](https://aka.ms/aml-data-prep-datastore-nb)):
  - Added the ability to read from and write to Azure File Share and ADLS Datastores in scale-up.
  - When using Datastores, Data Prep now supports service principal authentication instead of interactive authentication.
  - Added support for wasb and wasbs URLs.

### 2019-01-09 (version 1.0.6)

Bug fixes
- Fixed a bug with reading from publicly readable Azure Blob containers on Spark.

### 2018-12-19 (version 1.0.4)

New features
- The `to_bool` function now allows mismatched values to be converted to Error values. This is the new default mismatch behavior for `to_bool` and `set_column_types`; the previous default behavior was to convert mismatched values to False.
- When calling `to_pandas_dataframe`, there is a new option to interpret null/missing values in numeric columns as NaN.
- Added the ability to check the return type of some expressions to ensure type consistency and fail early.
- You can now call `parse_json` to parse values in a column as JSON objects and expand them into multiple columns.

Bug fixes
- Fixed a bug that crashed `set_column_types` in Python 3.5.2.
- Fixed a crash when connecting to Datastore using an AML image.

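To make the `parse_json` behavior concrete, here is a plain-Python sketch of the idea (an illustration, not the SDK's implementation; the rows and the `<column>.<key>` naming scheme are hypothetical):

```python
import json

def expand_json_column(rows, column):
    """Parse a string column as JSON and expand its keys into new columns."""
    expanded = []
    for row in rows:
        obj = json.loads(row[column])
        new_row = {k: v for k, v in row.items() if k != column}
        # Each top-level JSON key becomes a "<column>.<key>" column.
        new_row.update({f"{column}.{k}": v for k, v in obj.items()})
        expanded.append(new_row)
    return expanded

rows = [{"id": 1, "payload": '{"lat": 40.7, "lon": -74.0}'}]
result = expand_json_column(rows, "payload")
```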
### 2018-12-07 (version 0.5.3)

Fixed a missing dependency issue for .NET Core 2 on Ubuntu 16.

### 2018-12-03 (version 0.5.2)

Breaking changes
- `SummaryFunction.N` was renamed to `SummaryFunction.Count`.

Bug fixes
- Use the latest AML Run Token when reading from and writing to datastores on remote runs. Previously, if the AML Run Token was updated in Python, the Data Prep runtime would not pick up the updated token.
- Clearer error messages.
- `to_spark_dataframe()` no longer crashes when Spark uses Kryo serialization.
- The Value Count Inspector can now show more than 1,000 unique values.
- Random Split no longer fails if the original Dataflow doesn't have a name.

### 2018-11-19 (version 0.5.0)

New features
- Created a new DataPrep CLI to execute DataPrep packages and view the data profile for a dataset or dataflow.
- Redesigned the SetColumnType API to improve usability.
- Renamed `smart_read_file` to `auto_read_file`.
- The Data Profile now includes skew and kurtosis.
- Can sample with stratified sampling.
- Can read from zip files that contain CSV files.
- Can split datasets row-wise with Random Split (e.g., into test and train sets).
- Can get all the column data types from a dataflow or a data profile by calling `.dtypes`.
- Can get the row count from a dataflow or a data profile by calling `.row_count`.

Bug fixes
- Fixed long-to-double conversion.
- Fixed an assertion failure after any add-column step.
- Fixed an issue with FuzzyGrouping where it would not detect groups in some cases.
- Fixed the sort function to respect multi-column sort order.
- Fixed and/or expressions to behave similarly to how pandas handles them.
- Fixed reading from DBFS paths.
- Made error messages more understandable.
- No longer fails when reading on a remote compute target using an AML token.
- No longer fails on the Linux DSVM.
- No longer crashes when non-string values appear in string predicates.
- Now handles assertion errors correctly when a Dataflow should fail.
- Now supports dbutils-mounted storage locations on Azure Databricks.

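Row-wise random splitting, as used above for test-train sets, can be sketched in plain Python (the ratio, seed, and helper name here are arbitrary illustrations, not the SDK API):

```python
import random

def random_split(rows, percentage, seed=None):
    """Split rows into two lists; each row independently lands in the
    first list with the given probability."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second

# An ~80/20 split of 1000 hypothetical rows, made repeatable with a seed.
train, test = random_split(list(range(1000)), 0.8, seed=42)
```

Note the split is probabilistic per row, so the first list holds roughly, not exactly, 80% of the rows.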
### 2018-11-05 (version 0.4.0)

New features
- Type Count added to the Data Profile.
- Value Count and Histogram are now available.
- More percentiles in the Data Profile.
- The median is available in Summarize.
- Python 3.7 is now supported.
- When you save a dataflow that contains datastores to a Data Prep package, the datastore information is persisted as part of the package.
- Writing to datastores is now supported.

Bug fixes
- 64-bit unsigned integer overflows are now handled properly on Linux.
- Fixed an incorrect text label for plain text files in smart_read.
- The String column type now shows up in the metrics view.
- Type Count now shows ValueKinds mapped to a single FieldType instead of individual ones.
- `write_to_csv` no longer fails when the path is provided as a string.
- When using Replace, leaving "find" blank no longer fails.

## Datasets License Information

IMPORTANT: Please read the notice and find out more about this NYC Taxi and Limousine Commission dataset here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

IMPORTANT: Please read the notice and find out more about this Chicago Police Department dataset here: https://catalog.data.gov/dataset/crimes-2001-to-present-398a4