@@ -32,7 +32,7 @@ Among the motivations for this project, it is possible to highlight:
3232
3333- Supports REDEFINES, OCCURS and DEPENDING ON fields (e.g. unchecked unions and variable-size arrays)
3434
35- - Supports nested structures and arrays (including "flattened" nested names)
35+ - Supports nested structures and arrays
3636
3737- Supports HDFS as well as local file systems
3838
@@ -350,8 +350,18 @@ Currently, specifying multiple paths in `load()` is not supported. Use the follo
350350### Spark SQL schema extraction
351351This library also provides convenient methods to extract Spark SQL schemas and Cobol layouts from copybooks.
352352
353- If you want to extract a Spark SQL schema from a copybook:
353+ If you want to extract a Spark SQL schema from a copybook by providing the same options you pass to Spark:
354+ ``` scala
355+ // Same options that you use for spark.read.format("cobol").option()
356+ val options = Map("schema_retention_policy" -> "keep_original")
357+
358+ val cobolSchema = CobolSchema.fromSparkOptions(Seq(copybook), options)
359+ val sparkSchema = cobolSchema.getSparkSchema.toString()
360+
361+ println(sparkSchema)
362+ ```
354363
364+ If you want to extract a Spark SQL schema from a copybook using the Cobol parser directly:
355365``` scala
356366import za.co.absa.cobrix.cobol.parser.CopybookParser
357367import za.co.absa.cobrix.cobol.reader.policies.SchemaRetentionPolicy
@@ -1397,6 +1407,74 @@ When using `9` 8 refers to the number of digits the number has. Here, the size o
13971407```
13981408You can have decimals when using COMP-3 as well.
13991409
1410+ ### Flattening schema with GROUPs and OCCURS
1411+ Flattening can be helpful when migrating mainframe data with fields that have OCCURS (arrays) to relational
1412+ databases that do not support nested arrays.
1413+
1414+ Cobrix has a method that can flatten the schema automatically given a DataFrame produced by `spark-cobol`.
1415+
1416+ Spark Scala example:
1417+ ``` scala
1418+ val dfFlat = SparkUtils.flattenSchema(df, useShortFieldNames = false)
1419+ ```
1420+
1421+ PySpark example:
1422+ ``` python
1423+ from pyspark.sql import SparkSession, DataFrame, SQLContext
1424+ from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
1425+ from py4j.java_gateway import java_import
1426+
1427+ schema = StructType([
1428+ StructField("id", IntegerType(), True),
1429+ StructField("name", StringType(), True),
1430+ StructField("subjects", ArrayType(StringType()), True)
1431+ ])
1432+
1433+ # Sample data
1434+ data = [
1435+ (1, "Alice", ["Math", "Science"]),
1436+ (2, "Bob", ["History", "Geography"]),
1437+ (3, "Charlie", ["English", "Math", "Physics"])
1438+ ]
1439+
1440+ # Create a test DataFrame
1441+ df = spark.createDataFrame(data, schema)
1442+
1443+ # Show the Dataframe before flattening
1444+ df.show()
1445+
1446+ # Flatten the schema using Cobrix Scala 'SparkUtils.flattenSchema' method
1447+ sc = spark.sparkContext
1448+ java_import(sc._gateway.jvm, "za.co.absa.cobrix.spark.cobol.utils.SparkUtils")
1449+ dfFlatJvm = spark._jvm.SparkUtils.flattenSchema(df._jdf, False)
1450+ dfFlat = DataFrame(dfFlatJvm, SQLContext(sc))
1451+
1452+ # Show the Dataframe after flattening
1453+ dfFlat.show(truncate=False)
1454+ dfFlat.printSchema()
1455+ ```
1456+
1457+ The output looks like this:
1458+ ```
1459+ # Before flattening
1460+ +---+-------+------------------------+
1461+ |id |name   |subjects                |
1462+ +---+-------+------------------------+
1463+ |1  |Alice  |[Math, Science]         |
1464+ |2  |Bob    |[History, Geography]    |
1465+ |3  |Charlie|[English, Math, Physics]|
1466+ +---+-------+------------------------+
1467+
1468+ # After flattening
1469+ +---+-------+----------+----------+----------+
1470+ |id |name   |subjects_0|subjects_1|subjects_2|
1471+ +---+-------+----------+----------+----------+
1472+ |1  |Alice  |Math      |Science   |null      |
1473+ |2  |Bob    |History   |Geography |null      |
1474+ |3  |Charlie|English   |Math      |Physics   |
1475+ +---+-------+----------+----------+----------+
1476+ ```
1476+ ```
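
To make the shape of the flattened output concrete, here is a minimal pure-Python sketch of the same transformation on plain dictionaries. It is only an illustration of the result shown above, not how Cobrix implements `flattenSchema`. The function name `flatten_arrays` is hypothetical, and the sketch handles only top-level arrays of primitives, not nested GROUPs.

```python
from typing import Any, Dict, List


def flatten_arrays(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Expand list-valued fields into indexed columns (name_0, name_1, ...).

    Hypothetical helper for illustration only; it is not part of Cobrix.
    """
    # Find the longest array per field so that every row gets the same columns.
    max_len: Dict[str, int] = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, list):
                max_len[name] = max(max_len.get(name, 0), len(value))

    flat_rows: List[Dict[str, Any]] = []
    for row in rows:
        flat: Dict[str, Any] = {}
        for name, value in row.items():
            if name in max_len:
                items = value if isinstance(value, list) else []
                # Pad shorter arrays with None, mirroring the nulls in the output above.
                for i in range(max_len[name]):
                    flat[f"{name}_{i}"] = items[i] if i < len(items) else None
            else:
                flat[name] = value
        flat_rows.append(flat)
    return flat_rows


rows = [
    {"id": 1, "name": "Alice", "subjects": ["Math", "Science"]},
    {"id": 2, "name": "Bob", "subjects": ["History", "Geography"]},
    {"id": 3, "name": "Charlie", "subjects": ["English", "Math", "Physics"]},
]

flat = flatten_arrays(rows)
for flat_row in flat:
    print(flat_row)
```

The number of generated columns is driven by the longest array in this sketch, which is why shorter arrays are padded with `None`.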
1477+
14001478## Summary of all available options
14011479
14021480##### File reading options