
Reading Variable Length File with OCCURS DEPENDING #666

Closed
pinakigit opened this issue Apr 10, 2024 · 12 comments
Labels: accepted, enhancement, question

Comments

@pinakigit

Hi,

We are sending files from the Mainframe to ADLS through FTP in binary mode. We read the binary data with Cobrix and create Parquet files out of it. The FB files are working like a charm.

spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").load("/data/example1/data")

We can also read VB files with an OCCURS clause:

spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("variable_size_occurs", "true").load("/data/example1/data")

Question

Below is a sample copybook for a variable-length file. It has 2 record formats: if the value of the field PKLR1-PAR-PEN-REG-CODE is 'N', the data has values only up to the field PKLR1-VALUATION-CODE; if it is 'Y', the record maps to the entire copybook, including the OCCURS DEPENDING clause.

01 PKLR1-DETAIL-LOAN-RECORD.
   10 PKLR1-BASIC-SECTION.
      20 PKLR1-SORT-CONTROL-FIELD.
         30 PKLR1-USER-IDENT          PIC X(1).
         30 PKLR1-EXTRACT-CODE        PIC X(1).
            88 PKLR1-DATA-RECORD      VALUE '0'.
            88 PKLR1-END-OF-FILE      VALUE '9'.
         30 PKLR1-SECTION             PIC X(1).
         30 PKLR1-TYPE                PIC X(1).
         30 PKLR1-NUMERIC-STATE-CODE  PIC X(2).
         30 PKLR1-CONTRACT-NUMBER     PIC X(10).
      20 PKLR1-PAR-PEN-REG-CODE       PIC X(1).
         88 PKLR1-PAR                 VALUE 'Y'.
         88 PKLR1-NAPR                VALUE 'N'.
      20 PKLR1-VALUATION-CODE.
         30 PKLR1-MORTALITY-TABLE     PIC X(2).
         30 PKLR1-LIVES-CODE          PIC X(1).
         30 PKLR1-FUNCTION            PIC X(1).
         30 PKLR1-VAL-INTEREST        PIC S9(2)V9(3) COMP-3.
         30 PKLR1-MODIFICATION        PIC X(1).
         30 PKLR1-INSURANCE-CLASS     PIC X(1).
         30 PKLR1-SERIES              PIC X(5).
      20 PKLR1-POLICY-STATUS          PIC X(2).
      20 PKLR1-PAR-CODES.
         30 PKLR1-PAR-TYPE            PIC X(1).
         30 PKLR1-DIVIDEND-OPTION     PIC X(1).
         30 PKLR1-OTHER-OPTION        PIC X(1).
      20 PKLR1-ALPHA-STATE-CODE       PIC X(2).
      20 PKLR1-OUT-LOC-DTLS OCCURS 1 TO 5 TIMES
            DEPENDING ON PKLR1-OUT-NO-OF-LOC.
         30 PKLR1-OUT-LOC             PIC X(10).
         30 PKLR1-OUT-LOC-QTY         PIC S9(9) COMP-3.

Query 1:
How can I read this file? I tried the things below and nothing seems to work.

  1. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("variable_size_occurs", "true").load("/data/example1/data") - Doesn't pull any records
  2. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("record_format", "V").option("variable_size_occurs", "true").load("/data/example1/data") - Doesn't pull any records
  3. spark.read.format("cobol").option("copybook", "/data/example1/test3_copybook.cob").option("PKLR1-PAR-PEN-REG-CODE", "Y").option("variable_size_occurs", "true").load("/data/example1/data") - Pulls records properly until it encounters a record with PKLR1-PAR-PEN-REG-CODE = 'N', and it doesn't pull any records after that
  4. Created another copybook, test4_copybook, which has fields only up to PKLR1-SERIES:
    spark.read.format("cobol").option("copybook", "/data/example1/test4_copybook.cob").option("PKLR1-PAR-PEN-REG-CODE", "Y").option("variable_size_occurs", "true").load("/data/example1/data") - Pulls records properly until it encounters a record with PKLR1-PAR-PEN-REG-CODE = 'Y', and it doesn't pull any records after that

How can I read this file and create a Parquet file out of it?

Query 2:
When writing the Parquet file, the field is getting created as an array of structs. Is there a way to flatten it, i.e. always create 5 occurrences of the fields PKLR1-OUT-LOC and PKLR1-OUT-LOC-QTY (PKLR1-OUT-LOC1, PKLR1-OUT-LOC-QTY1, PKLR1-OUT-LOC2, PKLR1-OUT-LOC-QTY2, PKLR1-OUT-LOC3, PKLR1-OUT-LOC-QTY3, PKLR1-OUT-LOC4, PKLR1-OUT-LOC-QTY4, PKLR1-OUT-LOC5, PKLR1-OUT-LOC-QTY5), so that depending on PKLR1-OUT-NO-OF-LOC these fields are either populated or set to NULL?

Query 3:
When I receive the file in ADLS, how do I know whether it's coming as VB or FB? I tried using the VB header examples (with both BDW and RDW headers) and it throws an error because the BDW header has non-zero values.

@yruslan
Collaborator

yruslan commented Apr 12, 2024

Hi @pinakigit

.option("record_format", "V")

is only for files that have an RDW header for each record.

.option("record_format", "VB")

is only for files that have a BDW for each record block, and an RDW for each record.
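For illustration, here is a minimal sketch of how the two record formats would be specified, reusing the copybook and data paths from your examples (whether either applies depends on the headers actually being present in the file):

// "V": each record is prefixed by a 4-byte RDW
val dfV = spark.read
  .format("cobol")
  .option("copybook", "/data/example1/test3_copybook.cob")
  .option("record_format", "V")
  .load("/data/example1/data")

// "VB": each block is prefixed by a BDW, and each record within it by an RDW
val dfVB = spark.read
  .format("cobol")
  .option("copybook", "/data/example1/test3_copybook.cob")
  .option("record_format", "VB")
  .load("/data/example1/data")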

Do I understand correctly that the records have variable size, but there is no numeric field that specifies the record size?

If PEN-REG-CODE='N' the record has one fixed size, but when PEN-REG-CODE='Y' the size is different?

@pinakigit
Author

Hi @yruslan

Yes. If PEN-REG-CODE='N' the record has one size, which is a kind of fixed record length within the VB file, but when PEN-REG-CODE='Y' the size is variable, depending on the OCCURS clause.

It is a single file which has 2 different types of records: one with PEN-REG-CODE='N' and the other with PEN-REG-CODE='Y', which has the OCCURS DEPENDING clause.

The file on the Mainframe is VB, but I am not sure whether the BDW and RDW are retained after FTP from the Mainframe. Also, I am not sure how to check the RDW and BDW values on the Mainframe as well as in ADLS.

I am able to read the records without .option("record_format", "V") or .option("record_format", "VB"), using .option("PKLR1-PAR-PEN-REG-CODE", "Y") or .option("PKLR1-PAR-PEN-REG-CODE", "N"), but it shows records only until a record of the other type appears.

If I put .option("record_format", "V") it doesn't return any records. If I put .option("record_format", "VB") it throws the error below.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 255.0 failed 4 times, most recent failure: Lost task 0.3 in stage 255.0 (TID 2441) (10.189.204.152 executor 16): java.lang.IllegalStateException: BDW headers contain non-zero values where zeros are expected (check 'rdw_big_endian' flag. Header: 64,193,194,64, offset: 0.

yruslan added the 'enhancement' and 'accepted' labels Apr 15, 2024
@yruslan
Collaborator

yruslan commented Apr 15, 2024

Files with RDW headers have the record length as a binary field in the first 4 bytes of each record. Also, either the first 2 bytes are zeros or the last 2 bytes are zeros, depending on the endianness. You can check whether your file has RDWs by looking at the first 4 bytes.

Here are more details on record formats: https://www.ibm.com/docs/en/zos/2.3.0?topic=files-selecting-record-formats-non-vsam-data-sets
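As a quick local check, the first 4 bytes can be inspected like this (a plain Scala sketch; the file path is a placeholder for a local copy of the data):

import java.io.FileInputStream

// Read only the first 4 bytes of the file and print them in hex.
val in = new FileInputStream("/data/example1/data")  // placeholder path
val header = new Array[Byte](4)
in.read(header)
in.close()
println(header.map(b => f"${b & 0xFF}%02X").mkString(" "))
// A big-endian RDW typically looks like "LL LL 00 00" (length first, then zeros);
// the little-endian variant has the zeros first.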

In the meantime, I'm going to implement an option that allows mapping between a field and record size.

yruslan added a commit that referenced this issue Apr 17, 2024
@pinakigit
Author

Hi yruslan,

How will the option that allows mapping between a field and record size help us? I hope it's related to Query 1. Did you have a chance to look at Query 2, i.e. flattening the OCCURS clause values for OCCURS 1 TO 5 TIMES DEPENDING ON PKLR1-OUT-NO-OF-LOC? If PKLR1-OUT-NO-OF-LOC has the value 2, then have values for 2 occurrences and set the other 3 occurrences to NULL.

@yruslan
Collaborator

yruslan commented Apr 18, 2024

Hi @pinakigit ,

For Query 2, please see #668.

It is a similar issue and was fixed using a Cobrix option. Let me know if it works for you as well.

Query 3: No matter which filesystem the file comes from, if you are using 'V' or 'VB' you need to ensure the headers are in place. Otherwise these record formats can't be used, and you need to use some other record format and parsing options instead.

@pinakigit
Author

Thanks yruslan. I have rephrased my question in #668. We have already implemented the solution provided there, but we need some additional capability. Let me know if the comments there are clear enough; otherwise I will create some data and provide examples. Basically, we want to fit that record into an RDBMS-like layout without splitting one record into multiple records.

And I hope the option that allows mapping between a field and record size will help us resolve the main issue reported here. In the meantime I will do more research on RDW. I have to somehow read the binary file in Unix to see whether it still has the RDW that is present on the mainframe.

@yruslan
Collaborator

yruslan commented Apr 19, 2024

You can flatten arrays in the output dataframe using one of these options:
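As a generic illustration in plain Spark (not necessarily one of the Cobrix options referred to above; the underscored column names assume Cobrix's default renaming of dashes in field names), an array of up to 5 structs can be expanded into fixed, nullable columns like this:

import org.apache.spark.sql.functions.col

// df is assumed to be the DataFrame loaded with variable_size_occurs;
// out-of-range indexes yield NULL, so shorter arrays leave the later columns empty.
val flattened = (1 to 5).foldLeft(df) { (d, i) =>
  d.withColumn(s"PKLR1_OUT_LOC$i",
      col("PKLR1_OUT_LOC_DTLS").getItem(i - 1).getField("PKLR1_OUT_LOC"))
   .withColumn(s"PKLR1_OUT_LOC_QTY$i",
      col("PKLR1_OUT_LOC_DTLS").getItem(i - 1).getField("PKLR1_OUT_LOC_QTY"))
}.drop("PKLR1_OUT_LOC_DTLS")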

@pinakigit
Author

Thanks yruslan. It worked like a charm. Coming to the original question, our FTP from the Mainframe is dropping the RDW and BDW, and that's probably the reason I am not able to use the VB option. Is there a way to FTP from the Mainframe and retain the RDW? I tried LOCSITE RDW but it's not working.

@yruslan
Collaborator

yruslan commented Apr 22, 2024

I'm glad variable OCCURS worked for you. Regarding retention of RDWs, it all depends on the tools used to load files from mainframes. I can't advise you on any particular tool, unfortunately.

But the record length field to size mapping that is being developed should help you even if you don't have RDW headers in the mainframe files.

@yruslan
Collaborator

yruslan commented Apr 22, 2024

The mapping between record length field values and record sizes is now merged to the master branch: #674

Please let me know if it works.

Example:

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("record_format", "F")
  .option("record_length_field", "SEG-ID")
  .option("record_length_map", """{"A":4,"B":7,"C":8}""") // <---- this
  .load(tempFile)

@pinakigit
Author

Thanks yruslan. Will test and get back to you. In the meantime, I manually added the 4-byte RDW to the original Mainframe file (i.e. 2 times the RDW). So my original VB file was 100 bytes (4 bytes of RDW and 96 bytes of data); I recreated it as 104 bytes (4 bytes RDW, 4 bytes RDW and 96 bytes of data). Now when I SFTP, I get 100 bytes of data along with the RDW, the original RDW being dropped. My RDW has hex values "01AB0000", which translates to 427, and hex "00B00000", which translates to 176. I am able to read the file without the segment option with the code below:

val cobolDataframe = spark
  .read
  .format("cobol")
  .option("copybook", "data/test1_copybook.cob")
  .option("record_format", "V")
  .option("is_rdw_big_endian", "true")
  .option("rdw_adjustment", -4)
  .option("variable_size_occurs", "true")
  .load("data/test2_data")

I am not sure though why record_format V instead of VB and rdw_adjustment -4 worked.

@yruslan
Collaborator

yruslan commented Apr 23, 2024

The record length field value to record size mapping is available in spark-cobol:2.7.0

yruslan closed this as completed Apr 23, 2024