|
| 1 | +# Schema Discovery |
| 2 | + |
| 3 | +## Automated Discovery Behavior |
| 4 | + |
| 5 | +When the JDBC driver connection needs to get the schema for the collection in the database, |
| 6 | +it will poll for all the collections in the database. |
| 7 | + |
| 8 | +The driver will determine if a cached version of the schema for that collection already exists. |
| 9 | +If a cached version does not exist, it will sample the collection for documents and create a schema |
| 10 | +based on the following behavior. |
| 11 | + |
| 12 | +### Schema Generation Limitations |
| 13 | + |
| 14 | +The DocumentDB JDBC driver imposes a limit on the length of identifiers at 128 characters. |
| 15 | +The schema generator may truncate the length of generated identifiers (table names and column names) |
| 16 | +to ensure they fit that limit. |
| 17 | + |
| 18 | +### Scanning Method Options |
| 19 | + |
| 20 | +The sampling behavior can be modified using connection string or datasource options. |
| 21 | + |
| 22 | +- `scanMethod=<option>` |
| 23 | + - `random` - (default) The sample documents are returned in _random_ order. |
| 24 | + - `idForward` - The sample documents are returned in order of id. |
| 25 | + - `idReverse` - The sample documents are returned in reverse order of id. |
| 26 | + - `all` - Sample all the documents in the collection. |
| 27 | +- `scanLimit=<n>` - The number of documents to sample. The value must be a positive integer. |
| 28 | + The default value is `1000`. If `scanMethod` is set to `all`, this option is ignored. |
| 29 | + |
| 30 | +### DocumentDB Data Types |
| 31 | + |
| 32 | +The DocumentDB server supports a number of MongoDB data types. Listed below are the supported data |
| 33 | +types, and their associated JDBC data types. |
| 34 | + |
| 35 | +| MongoDB Data Type | Supported in DocumentDB | JDBC Data Type | |
| 36 | +| ----------- | ----------- | ----------- | |
| 37 | +| Binary Data| Yes | VARBINARY | |
| 38 | +| Boolean | Yes | BOOLEAN | |
| 39 | +| Double | Yes | DOUBLE | |
| 40 | +| 32-bit Integer | Yes | INTEGER | |
| 41 | +| 64-bit Integer | Yes | BIGINT | |
| 42 | +| String | Yes | VARCHAR | |
| 43 | +| ObjectId | Yes | VARCHAR | |
| 44 | +| Date | Yes | TIMESTAMP | |
| 45 | +| Null | Yes | VARCHAR | |
| 46 | +| Regular Expression | Yes | VARCHAR | |
| 47 | +| Timestamp | Yes | VARCHAR | |
| 48 | +| MinKey | Yes | VARCHAR | |
| 49 | +| MaxKey | Yes | VARCHAR | |
| 50 | +| Object | Yes | _virtual table_ | |
| 51 | +| Array | Yes | _virtual table_ | |
| 52 | +| Decimal128 | No | DECIMAL | |
| 53 | +| JavaScript | No | VARCHAR | |
| 54 | +| JavaScript (with scope) | No | VARCHAR | |
| 55 | +| Undefined | No | VARCHAR | |
| 56 | +| Symbol | No | VARCHAR | |
| 57 | +| DBPointer (4.0+) | No | VARCHAR | |
| 58 | + |
| 59 | +### Mapping Scalar Document Fields |
| 60 | + |
| 61 | +When scanning a sample of documents from a collection, the JDBC driver will create one or more |
| 62 | +schema to represent the samples in the collection. In general, a scalar field in the document |
| 63 | +maps to a column in the table schema. For example, in a collection named `team`, and a single |
| 64 | +document `{ "_id" : "112233", "name" : "Alastair", "age" : 25 }`, this would map to schema: |
| 65 | + |
| 66 | +| Table Name | Column Name | Data Type | Key | |
| 67 | +| ---| --- | --- | --- | |
| 68 | +| team | _**team__id**_ | VARCHAR | PK | |
| 69 | +| team | name | VARCHAR | | |
| 70 | +| team | age | INTEGER | | |
| 71 | + |
| 72 | +### Data Type Conflict Promotion |
| 73 | + |
| 74 | +When scanning the sampled documents, it is possible that the data types for a field are not |
| 75 | +consistent from document to document. In this case, the JDBC driver will _promote_ the JDBC data |
| 76 | +type to a common data type that will suit all data types from the sampled documents. |
| 77 | + |
| 78 | +#### Example |
| 79 | + |
| 80 | +```json |
| 81 | +{ |
| 82 | + "_id" : "112233", |
| 83 | + "name" : "Alastair", |
| 84 | + "age" : 25 |
| 85 | +} |
| 86 | +``` |
| 87 | + |
| 88 | +```json |
| 89 | +{ |
| 90 | + "_id" : "112244", |
| 91 | + "name" : "Benjamin", |
| 92 | + "age" : "32" |
| 93 | +} |
| 94 | +``` |
| 95 | + |
| 96 | +The `age` field is of type _32-bit integer_ in the first document but _string_ in the second document. |
| 97 | +Here the JDBC driver will promote the JDBC data type to VARCHAR to handle either data type when |
| 98 | +encountered. |
| 99 | + |
| 100 | +| Table Name | Column Name | Data Type | Key | |
| 101 | +| ---| --- | --- | --- | |
| 102 | +| team | _**team__id**_ | VARCHAR | PK | |
| 103 | +| team | name | VARCHAR | | |
| 104 | +| team | age | VARCHAR | | |
| 105 | + |
| 106 | +### Scalar-Scalar Conflict Promotion |
| 107 | + |
| 108 | +The following diagram shows the way in which scalar-scalar data type conflicts are resolved. |
| 109 | + |
| 110 | + |
| 111 | + |
| 112 | +### Object and Array Data Type Handling |
| 113 | + |
| 114 | +So far, we've only described how scalar data types are mapped. Object and Array data types are |
| 115 | +(currently) mapped to virtual tables. The JDBC driver will create a virtual table to represent |
| 116 | +either object or array fields in a document. The name of the mapped virtual table will concatenate the |
| 117 | +original collection's name followed by the field's name separated by an underscore character ("_"). |
| 118 | + |
| 119 | +The base table's primary key ("_id") takes on a new name in the new |
| 120 | +virtual table and is provided as a foreign key to the associated base table. |
| 121 | + |
| 122 | +For embedded array type fields, index columns are generated to represent the |
| 123 | +index into the array at each level of the array. |
| 124 | + |
| 125 | +#### Embedded Object Field Example |
| 126 | + |
| 127 | +For object fields in a document, a mapping to a virtual table is created by the |
| 128 | +JDBC driver. |
| 129 | + |
| 130 | +Collection: `customer` |
| 131 | + |
| 132 | +```json |
| 133 | +{ |
| 134 | + "_id" : "112233", |
| 135 | + "name" : "George Jackson", |
| 136 | + "address" : { |
| 137 | + "address1" : "123 Avenue Way", |
| 138 | + "address2" : "Apt. 5", |
| 139 | + "city" : "Hollywood", |
| 140 | + "region" : "California", |
| 141 | + "country" : "USA", |
| 142 | + "code" : "90210" |
| 143 | + } |
| 144 | +} |
| 145 | +``` |
| 146 | + |
| 147 | +maps to schema for `customer` table, ... |
| 148 | + |
| 149 | +| Table Name | Column Name | Data Type | Key | |
| 150 | +| --- | --- | --- | --- | |
| 151 | +| customer | _**customer__id**_ | VARCHAR | PK | |
| 152 | +| customer | name | VARCHAR | | |
| 153 | + |
| 154 | +... and the `customer_address` virtual table |
| 155 | + |
| 156 | +| Table Name | Column Name | Data Type | Key | |
| 157 | +| --- | --- | --- | --- | |
| 158 | +| customer_address | _**customer__id**_ | VARCHAR | PK/FK | |
| 159 | +| customer_address | address1 | VARCHAR | | |
| 160 | +| customer_address | address2 | VARCHAR | | |
| 161 | +| customer_address | city | VARCHAR | | |
| 162 | +| customer_address | region | VARCHAR | | |
| 163 | +| customer_address | country | VARCHAR | | |
| 164 | +| customer_address | code | VARCHAR | | |
| 165 | + |
| 166 | +So the resulting data in the two tables would look like this... |
| 167 | + |
| 168 | +#### Table: customer |
| 169 | + |
| 170 | +| _**customer__id**_ | name | |
| 171 | +| --- | --- | |
| 172 | +| "112233" | "George Jackson" | |
| 173 | + |
| 174 | +#### Virtual Table: customer_address |
| 175 | + |
| 176 | +| _**customer__id**_ | address1 | address2 | city | region | country | code | |
| 177 | +| --- | --- | --- | --- | --- | --- | --- | |
| 178 | +| "112233" | "123 Avenue Way" | "Apt. 5" | "Hollywood" | "California" | "USA" | "90210" | |
| 179 | + |
| 180 | +To query the data and return all columns, use the following query with a |
| 181 | +JOIN statement to get the matching address data. |
| 182 | + |
| 183 | +```mysql-sql |
| 184 | +SELECT * FROM "customer" |
| 185 | + INNER JOIN "customer_address" |
| 186 | + ON "customer"."customer__id" = "customer_address"."customer__id" |
| 187 | +``` |
| 188 | + |
| 189 | +#### Embedded Array Field Example |
| 190 | + |
| 191 | +For array fields in a document, a mapping to a virtual table is also created by the |
| 192 | +JDBC driver. |
| 193 | + |
| 194 | +Collection: `customer1` |
| 195 | + |
| 196 | +```json |
| 197 | +{ |
| 198 | + "_id" : "112233", |
| 199 | + "name" : "George Jackson", |
| 200 | + "subscriptions" : [ |
| 201 | + "Vogue", |
| 202 | + "People", |
| 203 | + "USA Today" |
| 204 | + ] |
| 205 | +} |
| 206 | +``` |
| 207 | + |
| 208 | +maps to schema for the `customer1` table, ... |
| 209 | + |
| 210 | +| Table Name | Column Name | Data Type | Key | |
| 211 | +| --- | --- | --- | --- | |
| 212 | +| customer1 | _**customer1__id**_ | VARCHAR | PK | |
| 213 | +| customer1 | name | VARCHAR | | |
| 214 | + |
| 215 | +... and the `customer1_subscriptions` virtual table |
| 216 | + |
| 217 | +| Table Name | Column Name | Data Type | Key | |
| 218 | +| --- | --- | --- | --- | |
| 219 | +| customer1_subscriptions | _**customer1__id**_ | VARCHAR | PK/FK | |
| 220 | +| customer1_subscriptions | subscriptions_index_lvl0 | BIGINT | PK | |
| 221 | +| customer1_subscriptions | value | VARCHAR | | |
| 222 | + |
| 223 | +So the resulting data in the two tables would look like this... |
| 224 | + |
| 225 | +#### Table: customer1 |
| 226 | + |
| 227 | +| _**customer1__id**_ | name | |
| 228 | +| --- | --- | |
| 229 | +| "112233" | "George Jackson" | |
| 230 | + |
| 231 | +#### Virtual Table: customer1_subscriptions |
| 232 | + |
| 233 | +| _**customer1__id**_ | subscriptions_index_lvl0 | value | |
| 234 | +| --- | --- | --- | |
| 235 | +| "112233" | 0 | "Vogue" | |
| 236 | +| "112233" | 1 | "People" | |
| 237 | +| "112233" | 2 | "USA Today" | |
| 238 | + |
| 239 | +To query the data and return all columns, use the following query with a |
| 240 | +JOIN statement to get the matching _subscriptions_ data. |
| 241 | + |
| 242 | +```mysql-sql |
| 243 | +SELECT * FROM "customer1" |
| 244 | + INNER JOIN "customer1_subscriptions" |
| 245 | + ON "customer"."customer1__id" = "customer_address"."customer1__id" |
| 246 | +``` |
| 247 | + |
| 248 | +### Scalar-Complex Type Conflict Promotion |
| 249 | + |
| 250 | +Like the scalar-scalar type conflicts, the same field in different documents can have conflicting |
| 251 | +data types between complex (array and object) and scalar (integer, boolean, etc.). All of these |
| 252 | +conflicts are resolved (promoted) to VARCHAR for those fields. In this case, array and object data |
| 253 | +is returned as the JSON representation. |
| 254 | + |
| 255 | +#### Embedded Array - String Field Conflict Example |
| 256 | + |
| 257 | +Collection: `customer2` |
| 258 | + |
| 259 | +```json |
| 260 | +{ |
| 261 | + "_id" : "112233", |
| 262 | + "name" : "George Jackson", |
| 263 | + "subscriptions" : [ |
| 264 | + "Vogue", |
| 265 | + "People", |
| 266 | + "USA Today" |
| 267 | + ] |
| 268 | +} |
| 269 | +``` |
| 270 | + |
| 271 | +```json |
| 272 | +{ |
| 273 | + "_id" : "112244", |
| 274 | + "name" : "Joan Starr", |
| 275 | + "subscriptions" : 1 |
| 276 | +} |
| 277 | +``` |
| 278 | + |
| 279 | +maps to schema for the `customer2` table, ... |
| 280 | + |
| 281 | +| Table Name | Column Name | Data Type | Key | |
| 282 | +| --- | --- | --- | --- | |
| 283 | +| customer2 | _**customer2__id**_ | VARCHAR | PK | |
| 284 | +| customer2 | name | VARCHAR | | |
| 285 | +| customer2 | subscription | VARCHAR | | |
| 286 | + |
| 287 | +So the resulting data in the table would look like this... |
| 288 | + |
| 289 | +#### Table: customer2 |
| 290 | + |
| 291 | +| _**customer2__id**_ | name | subscriptions | |
| 292 | +| --- | --- | --- | |
| 293 | +| "112233" | "George Jackson" | "\[ \\"Vogue\\", \\"People\\", \\"USA Today\\" \]" | |
| 294 | +| "112244" | "Joan Starr" | "1" | |
0 commit comments