Understanding how to get NLS data into Infosphere DataStage using ODBC Enterprise Stage and ODBC Connector Stage
Infosphere DataStage parallel canvas has two types of string datatypes , string and ustring. string is the datatype used when char/varchar is selected on the DataStage Designer canvas. ustring is the datatype used when nchar/nvarchar is selected , or when char/varchar with extended option unicode is selected. Both the types carry data differently. For string datatype the data is in a single byte character set . This can be single byte char code page like ISO-8859 or multi byte char code page like SJIS. The ustring data type carries the data as 2 byte unicode. The default conversion from string->ustring and vice versa is governed by the NLS Map defined at the job level. This could be the project default if setting is not done at the job level.
For ODBC Enterprise stage source stage, the schema for char/varchar columns is determined as string data type. If the char/varchar columns contains NLS data in some charset apart from ISO-8859-1 , the customer may get invalid data due to conversion issues. Since the column is defined as char/varchar in the table, ODBC Enterprise Stage will determine its type as SQL_CHAR and read it as SQL_C_CHAR. When the data is read as char , then IANAAppCodePage will determine the code page in which the application gets the data. This is easy to understand when retrieving data from DB2 z/OS. Since the z/OS is EBCDIC and Infosphere DataStage is an ASCII application, data conversion has to be done from EBCDIC to ASCII. This is accomplished by the ODBC driver using IANAAppCodePage. For EBCDIC Latin characters code page 37, the default IANAAppCodePage=4 is sufficient. For Other EBCDIC NLS character sets, equivalent ASCII character set will have to be determined and IANAAppCodePage updated.
For ODBC Connector, selecting nchar/nvarchar on the GUI will result in the column binding becoming SQL_C_WCHAR . When column binding is SQL_C_WCHAR, the ODBC driver will convert all data into unicode and hence IANAAppCodePage is not used. In case of ODBC Enterprise Stage selecting nchar/nvarchar on the GUI will not result in the column binding becoming SQL_C_WCHAR, as the stage does not get the column definition entered in the GUI.
A related DSN option for NLS data is CharsetFor65535. This is required to be set in the DSN for DB2 in order to read character data stored in column defined as CHAR/VARCHAR CCSID 65535. When the column is defined as CCSID 65535, the DB2 database does not store the data as any particular character set. Instead the binary values submitted by the end user are stored as such. When reading, the data is read as binary unless CharsetFor65535 is set in the DSN. This indicates to the ODBC driver how to interpret the data and hence get it as char/varchar.
e.g if a table on DB2 z/OS has a column with CCSID 65535 and contains data in EBCDIC Latin code page 37, then the DSN has to contain
CharsetFor65535=2028 ( for Progress ODBC driver version 6.0 and above)
IANAAppCodePage=4
This would be required for both ODBC Connector as well as ODBC Enterprise Stage as otherwise the column cannot be read as char/varchar, since the database can return the data only as binary.
Points to note:
Infosphere DataStage parallel canvas has two types of string datatypes , string and ustring. string is the datatype used when char/varchar is selected on the DataStage Designer canvas. ustring is the datatype used when nchar/nvarchar is selected , or when char/varchar with extended option unicode is selected. Both the types carry data differently. For string datatype the data is in a single byte character set . This can be single byte char code page like ISO-8859 or multi byte char code page like SJIS. The ustring data type carries the data as 2 byte unicode. The default conversion from string->ustring and vice versa is governed by the NLS Map defined at the job level. This could be the project default if setting is not done at the job level.
For ODBC Enterprise stage source stage, the schema for char/varchar columns is determined as string data type. If the char/varchar columns contains NLS data in some charset apart from ISO-8859-1 , the customer may get invalid data due to conversion issues. Since the column is defined as char/varchar in the table, ODBC Enterprise Stage will determine its type as SQL_CHAR and read it as SQL_C_CHAR. When the data is read as char , then IANAAppCodePage will determine the code page in which the application gets the data. This is easy to understand when retrieving data from DB2 z/OS. Since the z/OS is EBCDIC and Infosphere DataStage is an ASCII application, data conversion has to be done from EBCDIC to ASCII. This is accomplished by the ODBC driver using IANAAppCodePage. For EBCDIC Latin characters code page 37, the default IANAAppCodePage=4 is sufficient. For Other EBCDIC NLS character sets, equivalent ASCII character set will have to be determined and IANAAppCodePage updated.
For ODBC Connector, selecting nchar/nvarchar on the GUI will result in the column binding becoming SQL_C_WCHAR . When column binding is SQL_C_WCHAR, the ODBC driver will convert all data into unicode and hence IANAAppCodePage is not used. In case of ODBC Enterprise Stage selecting nchar/nvarchar on the GUI will not result in the column binding becoming SQL_C_WCHAR, as the stage does not get the column definition entered in the GUI.
A related DSN option for NLS data is CharsetFor65535. This is required to be set in the DSN for DB2 in order to read character data stored in column defined as CHAR/VARCHAR CCSID 65535. When the column is defined as CCSID 65535, the DB2 database does not store the data as any particular character set. Instead the binary values submitted by the end user are stored as such. When reading, the data is read as binary unless CharsetFor65535 is set in the DSN. This indicates to the ODBC driver how to interpret the data and hence get it as char/varchar.
e.g if a table on DB2 z/OS has a column with CCSID 65535 and contains data in EBCDIC Latin code page 37, then the DSN has to contain
CharsetFor65535=2028 ( for Progress ODBC driver version 6.0 and above)
IANAAppCodePage=4
This would be required for both ODBC Connector as well as ODBC Enterprise Stage as otherwise the column cannot be read as char/varchar, since the database can return the data only as binary.
Points to note:
- ODBC Enterprise Stage in source context determines the column definition of the query/table using SQLDescribeCol. It does not receive the column definition entered on the GUI. A environment variable APT_ODBC_USE_USTRINGS when set can force all char/varchar columns to get data as unicode, but individual columns cannot be controlled. This restriction is removed in ODBC Connector stage.
- ODBC Enterprise Stage and ODBC Connector stage do not do any conversion internally. Conversion is performed by ODBC driver using IANAAppCodePage and CharsetFor65535. Hence NLS map entered on the DataStage GUI does not impact the column data. It is used however to convert table , query and column names etc to unicode and vice versa.
No comments:
Post a Comment