Binary file comparison

Introduction

The binary file comparison compares files that have the same structure and sit in two folders. More precisely, it compares files that have the same relative path inside the two folders.

Binary files are arbitrary files whose format is described in a separate file, called a fileformat file. Fileformat files are stored in a dedicated fileformat folder.

Fileformat files

Fileformat files are XML description files that define the fields in the file, their length and their format. A minimal file looks like the example below, although other attributes are allowed:


<?xml version="1.0" encoding="ASCII"?> 
<FileFormat ConversionTable="Cp037"> 
<RecordFormat> 
<FieldFormat Name="NAME" Type="X" Size="15" /> 
<FieldFormat Name="CODE" Type="9" Size="5" /> 
<FieldFormat Name="AMOUNT" Type="9" Size="12" Decimal="2" Signed="true" /> 
</RecordFormat> 
</FileFormat> 
FileFormat: Mandatory. The main XML tag.
 ConversionTable: Mandatory. The character set used.
 headerSize: Optional, numeric, defaults to 0. Instructs the parser to skip the first x characters, which form a file header and do not follow the record format.
 newLineSize: Optional, numeric, defaults to 0. Instructs the parser to skip x characters after each record, for files that store one record per line. When used, usually set to 1 (Unix \n) or 2 (Windows \r\n).
 distinguishedFieldSize: Optional, numeric, defaults to 0. Mandatory if several record formats are used. See Multiple record formats.
RecordFormat: Mandatory. There may be several of them if multiple record formats are used. See Multiple record formats.
 cobolRecordName: Optional string. Mandatory if several record formats are used; identifies the record format. See Multiple record formats.
 distinguishedFieldValue: Optional pattern. Mandatory if several record formats are used; determines the correct format for a record. See Multiple record formats.
FieldsGroup: Optional. Even if there are groups in the legacy structure, this tag is optional unless it bears an ‘Occurs’ attribute. Otherwise, the group can be “flattened”. See Repeated data.
 Name: Mandatory in a group. Identifies the group.
 Occurs: Optional, numeric, defaults to 1. Used when the group is repeated. See Repeated data.
 DependingOn: Optional string. Used when the number of occurrences is variable. See Repeated data.
FieldFormat: Mandatory. Describes each field to compare.
 Name: Mandatory. Identifies the field.
 Type: Mandatory. Determines the field type, based on Cobol pictures. See Field formats.
 Size: Mandatory. Determines the total field size (integral + decimal). See Field formats.
 Decimal: Optional, numeric, defaults to 0. Determines the number of decimal digits. See Field formats.
 Signed: Optional, boolean, defaults to false. Determines whether the number is signed. See Field formats.
 Occurs: Optional, numeric, defaults to 1. Used when the field is repeated. See Repeated data.
 DependingOn: Optional string. Used when the number of occurrences is variable. See Repeated data.
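
To illustrate how headerSize, newLineSize and the field Size attributes interact, here is a minimal Python sketch. It is not the tool's implementation: the function names split_records and slice_fields are hypothetical, and the sketch assumes that sizes are counted in bytes, that every record has the same fixed length, and that fields are returned as raw text without interpreting numeric types or signs.

```python
from typing import Dict, List, Tuple


def split_records(data: bytes, record_size: int,
                  header_size: int = 0, new_line_size: int = 0) -> List[bytes]:
    """Cut raw file content into fixed-size records (illustrative only)."""
    records = []
    pos = header_size  # skip the file header, which does not follow the format
    while pos + record_size <= len(data):
        records.append(data[pos:pos + record_size])
        # Skip the line terminator after each record, if any.
        pos += record_size + new_line_size
    return records


def slice_fields(record: bytes, fields: List[Tuple[str, int]],
                 encoding: str = "cp037") -> Dict[str, str]:
    """Decode a record and cut it into fields according to their Size."""
    text = record.decode(encoding)
    values = {}
    pos = 0
    for name, size in fields:
        values[name] = text[pos:pos + size]
        pos += size
    return values


# Field sizes taken from the example fileformat above: 15 + 5 + 12 = 32.
EXAMPLE_FIELDS = [("NAME", 15), ("CODE", 5), ("AMOUNT", 12)]
```

With this sketch, a file that has a 3-byte header and one record per line would be read with split_records(data, 32, header_size=3, new_line_size=1), and each record would then be passed to slice_fields together with EXAMPLE_FIELDS.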

Field formats

Field formats use terminology similar to Cobol's. The main difference is that the “Size” attribute is the total size (integral digits plus decimal digits).  
For instance, this definition: NUM1 PIC S9(6)V9(2) corresponds to (notice the 8): <FieldFormat Name="NUM1" Type="9" Size="8" Decimal="2" Signed="true" />  

The handled types are: X, 9, 1, 2, 3, 5, 7, 8, B, H, T, Z.  
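
The Size convention described above can be checked with a small Python sketch that converts a numeric Cobol picture into the corresponding attributes. The function name pic_to_field_format is hypothetical, and the sketch only handles simple numeric pictures made of S, 9, V and repetition counts such as 9(6); it is an illustration, not part of the tool.

```python
import re


def count_digits(part: str) -> int:
    """Count the digit positions in a picture fragment such as '9(6)' or '99'."""
    total = 0
    for match in re.finditer(r"9(?:\((\d+)\))?", part):
        total += int(match.group(1)) if match.group(1) else 1
    return total


def pic_to_field_format(picture: str) -> dict:
    """Translate a simple numeric picture into FieldFormat attribute values."""
    pic = picture.upper().replace(" ", "")
    signed = pic.startswith("S")
    if signed:
        pic = pic[1:]
    # 'V' marks the implied decimal point; it occupies no storage.
    integral, _, decimal = pic.partition("V")
    decimal_digits = count_digits(decimal)
    return {
        "Type": "9",
        # Size is the total: integral digits plus decimal digits.
        "Size": count_digits(integral) + decimal_digits,
        "Decimal": decimal_digits,
        "Signed": signed,
    }
```

Calling pic_to_field_format("S9(6)V9(2)") yields Size 8, Decimal 2 and Signed true, which matches the NUM1 definition above.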
 

Repeated data

The “Occurs” attribute declares that a field is repeated.  
<FieldFormat Name="ADDRLINE" Type="X" Size="30" Occurs="3" />  

The “DependingOn” attribute can be used if the number of occurrences depends on another field.  
<FieldFormat Name="CNT" Type="9" Size="2" />  
<FieldFormat Name="ARRAY" Type="X" Size="20" Occurs="50" DependingOn="CNT" />  

In this case, the “Occurs” value is the maximum number of occurrences.  
If the file contains an RDW (Record Descriptor Word), the tool must be instructed to skip it. See Additional properties for more details.  

These attributes can also apply to a fields group:  
<FieldFormat Name="CNT" Type="9" Size="2" />  
<FieldsGroup Name="ARRAY" Occurs="50" DependingOn="CNT">  
<FieldFormat Name="KEY" Type="9" Size="1" />  
<FieldFormat Name="VALUE" Type="X" Size="10" />  
</FieldsGroup>  
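
The effect of “DependingOn” on the record layout can be sketched in Python as follows. This is only an illustration of the group example above: the function name parse_variable_group is hypothetical, and the sketch assumes that the record has already been decoded to text and that only CNT occurrences are physically present in the record.

```python
from typing import List, Tuple


def parse_variable_group(text: str, max_occurs: int = 50) -> Tuple[List[Tuple[str, str]], int]:
    """Read CNT, then CNT occurrences of the ARRAY group (KEY + VALUE)."""
    cnt = int(text[0:2])  # CNT: Type 9, Size 2
    if not 0 <= cnt <= max_occurs:
        raise ValueError(f"CNT={cnt} exceeds the declared Occurs of {max_occurs}")
    entries = []
    pos = 2
    for _ in range(cnt):
        key = text[pos:pos + 1]          # KEY: Type 9, Size 1
        value = text[pos + 1:pos + 11]   # VALUE: Type X, Size 10
        entries.append((key, value))
        pos += 11
    # pos is the number of characters actually used by this record.
    return entries, pos
```

Under these assumptions, a record with CNT set to 2 uses 2 + 2 × 11 = 24 characters, whereas the declared maximum of 50 occurrences would use 2 + 50 × 11 = 552.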
 

Multiple record formats

When records can have different formats, multiple record formats can be used. In this case, the tool first reads the beginning of each record and uses a pattern to determine which record format applies.  

By default, one HTML report page is generated for each record format. This way, the HTML table describing the columns is consistent within each page.  

The amount of data that must be read to check the pattern is set at the <FileFormat> level, in the distinguishedFieldSize attribute, and each <RecordFormat> must carry a pattern in its distinguishedFieldValue attribute.  

In the example below, the tool checks whether the 16th character of the record is an A or a B (in which case the record matches RF1) or a Z (in which case it matches RF2). Sixteen characters must therefore be read, but the first 15 can be anything. 


<FileFormat ConversionTable="Cp037" distinguishFieldSize="16"> 
<RecordFormat cobolRecordName="RF1" distinguishFieldValue=".{15}(A|B)"> 
... 
</RecordFormat> 
<RecordFormat cobolRecordName="RF2" distinguishFieldValue=".{15}(Z)"> 
... 
</RecordFormat> 
</FileFormat> 
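
The selection logic of this example can be approximated with the Python sketch below. It is an assumption-laden illustration: the function name select_record_format is hypothetical, the record is assumed to be already decoded to text, and the pattern is assumed to be a regular expression that must match the whole distinguished prefix. The tool's actual regular expression dialect and matching rules may differ.

```python
import re
from typing import Optional

# Values copied from the example above.
DISTINGUISHED_FIELD_SIZE = 16
RECORD_FORMATS = [
    ("RF1", r".{15}(A|B)"),
    ("RF2", r".{15}(Z)"),
]


def select_record_format(record_text: str) -> Optional[str]:
    """Return the name of the first record format whose pattern matches."""
    prefix = record_text[:DISTINGUISHED_FIELD_SIZE]
    for name, pattern in RECORD_FORMATS:
        # DOTALL lets '.' match any character, including a newline.
        if re.fullmatch(pattern, prefix, flags=re.DOTALL):
            return name
    return None  # no record format matches this record
```

A record whose 16th character is B is assigned to RF1, one whose 16th character is Z is assigned to RF2, and any other record is left unmatched in this sketch.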

Configuration file

A minimal bacmp configuration file for binary files is shown below. It scans the default left and right folders and compares each pair of files with the same name, using the fileformat file with the matching name found in the default fileformat folder. 

{"comparisonType": "Binary"} 

The files to compare and the corresponding fileformat files will be matched this way: 

  • Try to match a file like ‘test.ebc’ with a fileformat named ‘test.ebc.fileformat’
  • Try to match a file like ‘test.ebc’ with a fileformat named ‘test.fileformat’
  • Use the fileformat in the folder, whatever its name, if it is unique in its folder
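
The three matching rules above can be sketched in Python as follows. The function name find_fileformat is hypothetical, and the sketch assumes that the rules are tried in the order listed and that fileformat files end with “.fileformat”; it is not the tool's actual lookup code.

```python
from pathlib import Path
from typing import List, Optional


def find_fileformat(data_file: str, available: List[str]) -> Optional[str]:
    """Pick the fileformat file name to use for a data file, or None."""
    name = Path(data_file).name
    candidates = [
        name + ".fileformat",             # test.ebc -> test.ebc.fileformat
        Path(name).stem + ".fileformat",  # test.ebc -> test.fileformat
    ]
    for candidate in candidates:
        if candidate in available:
            return candidate
    # Fall back to the only fileformat in the folder, whatever its name.
    fileformats = [f for f in available if f.endswith(".fileformat")]
    if len(fileformats) == 1:
        return fileformats[0]
    return None
```

Here, available stands for the list of file names found in the fileformat folder.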

Business keys, removed and rejected columns

You can specify, for each file, which columns are business keys, rejected columns or removed columns. You can also specify the fileformat file manually, which overrides the matching logic explained in the previous section.  

Business keys let you specify which records in the left and right files correspond to each other (see Business keys for an example; they work the same way as in databases).  

Removed columns do not appear at all in the comparison; rejected columns appear, but their values are replaced by ‘(REJECTED)’.
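
The difference between removed and rejected columns can be illustrated with a short Python sketch. The function name apply_column_rules and the dictionary representation of a record are assumptions made for the example only.

```python
from typing import Dict, Iterable


def apply_column_rules(record: Dict[str, str],
                       removed: Iterable[str] = (),
                       rejected: Iterable[str] = ()) -> Dict[str, str]:
    """Return the record as it would be shown in the comparison."""
    removed_set = set(removed)
    rejected_set = set(rejected)
    result = {}
    for name, value in record.items():
        if name in removed_set:
            continue  # removed columns disappear entirely
        # Rejected columns stay visible, but their value is masked.
        result[name] = "(REJECTED)" if name in rejected_set else value
    return result
```

For a record with the columns NAME, CODE and AMOUNT, removing CODE and rejecting AMOUNT leaves NAME unchanged, drops CODE and shows AMOUNT as ‘(REJECTED)’.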

image.png

Please note that the file list that appears in the ‘details’ here does not restrict the list of compared files. All files in the folders are compared; the files listed are simply those that need additional details. To compare only some files, change the left and right folders or use a file filter. See Additional properties for more details.

Additional properties

In addition to the settings above, further properties are available. Most have a default value that is used when the property does not appear in the file, and that can be overridden by specifying another value. Others are simply not used when left unspecified.  

Default values are written in green below when they exist, a red ‘no value’ otherwise. 

BinaryFileComaprison-AdditionalProp.PNG
fileFilter: Comma-separated list of extensions to compare. For instance, setting it to "ebc" compares only the ebc files in the folders.
leftFolder: Customizes the left folder for the files to compare.
rightFolder: Customizes the right folder for the files to compare.
fileformatFolder: Customizes the folder containing the fileformat files for the files to compare.
splitRecordFormats: Instructs the tool to handle, compare and report distinct record formats separately. The value is true by default, and it is highly recommended to leave it this way. See Multiple record formats.
hasRdw: Specifies that the file uses an RDW. Defaults to false. The RDW itself is not parsed; it is skipped. To determine the record length, the tool uses the DependingOn attributes in the fileformat file. See Repeated data.
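
As an illustration of the fileFilter property, the Python sketch below shows one possible way to apply a comma-separated extension list. The function name matches_file_filter is hypothetical, and the case-insensitive comparison and the handling of an empty filter are assumptions; the tool's actual behaviour may differ.

```python
def matches_file_filter(file_name: str, file_filter: str) -> bool:
    """Tell whether a file would be kept by a comma-separated extension list."""
    extensions = [
        ext.strip().lstrip(".").lower()
        for ext in file_filter.split(",")
        if ext.strip()
    ]
    if not extensions:
        return True  # assumption: an empty filter keeps every file
    # Take the text after the last dot as the file extension.
    suffix = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return suffix in extensions
```

With the filter "ebc", the file test.ebc is kept and test.txt is skipped.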