Using PSQL Bulk Load
PSQL Bulk Load offers a one-stop approach for all of your bulk load needs, with a simple, consistent way to do bulk load operations for PSQL and for all of the DataDirect Connect products that support this bulk load feature. This means that you can write your bulk load applications using the standards-based API bulk interfaces, and then, just plug in the database data providers or drivers to do the work for you.
Suppose you need to load data into PSQL, Oracle, DB2, and Sybase. In the past, you probably had to use a proprietary tool from each database vendor for bulk load operations, or write your own tool. Now, because of the interoperability built into PSQL Bulk Load, your task is much easier. Another advantage is that PSQL Bulk Load uses 100% managed code, and requires no underlying utilities or libraries from other vendors.
Bulk load operations between dissimilar data stores are accomplished by persisting the results of the query in a comma-separated value (CSV) format file, a bulk load data file. The file can be used between the PSQL ADO.NET Data Provider and any DataDirect Connect for ADO.NET data providers that support bulk load. In addition, the bulk load data file can be used with any DataDirect Connect product driver or data provider that supports the Bulk load functionality. For example, the CSV file generated by the PSQL SQL data provider can be used by a DataDirect Connect for ODBC driver that supports bulk load.
Use Scenarios for PSQL Bulk Load
Following are two of ways you can use PSQL Bulk Load with the PSQL ADO.NET Data Provider are:
Figure 3 Using PSQL Bulk Load Between Two DataSources
After exporting data from an Oracle database, you migrate the results to a PSQL database. Figure 4 shows an ODBC environment copying data to an ADO.NET database server.
Figure 4 Copying Data from ODBC to ADO.NET
In this figure, the ODBC application includes code to export data to the CSV file, and the ADO.NET application includes code to specify and open the CSV file. Because the PSQL ADO.NET Data Provider and the DataDirect ODBC drivers use a consistent format, interoperability is supported via these standard interfaces.
PSQL Common Assembly
The PSQL Bulk Load implementation for ADO.NET uses the de facto standard defined by the Microsoft SqlBulkCopy classes, and adds powerful built-in features to enhance interoperability as well as the flexibility to make bulk operations more reliable.
The data provider includes provider-specific classes to support PSQL Bulk Load. See Data Provider-specific Classes for more information. If you use the Common Programming Model, you can use the classes in the PSQL Common Assembly (see PSQL Common Assembly Classes).
The Pervasive.Data.Common assembly includes classes that support PSQL Bulk Load, such as the CsvDataReader and CsvDataWriter classes that provide functionality between bulk data formats.
The Common assembly also extends support for bulk load classes that use the Common Programming Model. This means that the SqlBulkCopy patterns can now be used in a new DbBulkCopy hierarchy.
Future versions of the data provider will include other features that enhance the Common Programming Model experience. See PSQL Common Assembly Classes for more information on the classes supported by the Pervasive.Data.Common assembly.
Bulk Load Data File
The results of queries between dissimilar data stores are persisted in a comma-separated value (CSV) format file, a bulk load data file. The file name, which is defined by the BulkFile property, is using for writing and reading the bulk data. If the file name does not contain an extension, the ".csv" extension is assumed.
Example
The PSQL source table GBMAXTABLE contains four columns. The following C# code fragment writes the GBMAXTABLE.csv and GBMAXTABLE.xml files that will be created by the CsvDataWriter. Note that this example uses the DbDataReader class.
cmd.CommandText = "SELECT * FROM GBMAXTABLE ORDER BY INTEGERCOL";
DbDataReader reader = cmd.ExecuteReader();
CsvDataWriter csvWriter = new CsvDataWriter();
csvWriter.WriteToFile("\\NC1\net\PSQL\GBMAXTABLE\GBMAXTABLE.csv", reader);
The bulk load data file GBMAXTABLE.csv contains the results of the query:
1,0x6263,"bc","bc"
2,0x636465,"cde","cde"
3,0x64656667,"defg","defg"
4,0x6566676869,"efghi","efghi"
5,0x666768696a6b,"fghijk","fghijk"
6,0x6768696a6b6c6d,"ghijklm","ghijklm"
7,0x68696a6b6c6d6e6f,"hijklmno","hijklmno"
8,0x696a6b6c6d6e6f7071,"ijklmnopq","ijklmnopq"
9,0x6a6b6c6d6e6f70717273,"jklmnopqrs","jklmnopqrs"
10,0x6b,"k","k"
The GBMAXTABLE.xml file, which is the bulk load configuration file that provides the format of this bulk load data file, is described in the following section.
Bulk Load Configuration File
A bulk load configuration file is produced when the CsvDataWriter.WriteToFile method is called (see CsvDataWriter for more information).
The bulk load configuration file defines the names and data types of the columns in the bulk load data file. These names and data types are defined the same way as the table or result set from which the data was exported.
If the bulk data file cannot be created or does not comply with the schema described in the XML configuration file, an exception is thrown. See XML Schema Definition for a Bulk Data Configuration File for more information about using an XML schema definition.
If a bulk load data file that does not have a configuration file is read, the following defaults are assumed:
The bulk load configuration file describes the bulk data file and is supported by an underlying XML Schema defined at:
http://www.datadirect.com/ns/bulk/BulkData.xsd.
Example
The format of the bulk load data file shown in the previous section is defined by the bulk load configuration file, GBMAXTABLE.xml. The file describes the data type and other information about each of the four columns in the table.
<?xml version="1.0" encoding="utf-8"?>
<table codepage="UTF-16LE" xsi:noNamespaceSchemaLocation="http://www.datadirect.com/ns/bulk/BulkData.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<column datatype="DECIMAL" precision="38" scale="0" nullable=
"false">INTEGERCOL</column>
<column datatype="VARBINARY" length="10" nullable=
"true">VARBINCOL</column>
<column datatype="VARCHAR" length="10" sourcecodepage="Windows-1252"
externalfilecodepage="Windows-1252" nullable="true">VCHARCOL</column>
<column datatype="VARCHAR" length="10" sourcecodepage="Windows-1252"
externalfilecodepage="Windows-1252" nullable="true">UNIVCHARCOL</column>
</row>
</table>
Determining the Bulk Load Protocol
Bulk operations can be performed using dedicated bulk protocol, that is, the data provider uses the protocol of the underlying database. In some cases, the dedicated bulk protocol is not available, for example, when the data to be loaded is in a data type not supported by the dedicated bulk protocol. Then, the data provider automatically uses a non-bulk method such as array binding to perform the bulk operation, maintaining optimal application uptime.
Character Set Conversions
At times, you might need to bulk load data between databases that use different character sets.
For the PSQL ADO.NET Data Provider, the default source character data, that is, the output from the CsvDataReader and the input to the CsvDataWriter, is in Unicode (UTF-16) format. The source character data is always transliterated to the code page of the CSV file. If the threshold is exceeded and data is written to the external overflow file, the source character data is transliterated to the code page specified by the externalfilecodepage attribute defined in the bulk configuration XML schema (see XML Schema Definition for a Bulk Data Configuration File). If the configuration file does not define a value for externalfilecodepage, the CSV file code page is used.
To avoid unnecessary transliteration, it's best for the CSV and external file character data to be stored in Unicode (UTF-16). You might want your applications to store the data in another code page in one of the following scenarios:
The configuration file may optionally define a second code page for each character column. When character data exceeds the value defined by the CharacterThreshold property and is stored in a separate file (see External Overflow File), the value defines the code page for that file.
If the value is omitted or if the code page defined by the source column is unknown, the code page defined for the CSV file will be used.
External Overflow File
If the value of the BinaryThreshold or CharacterThreshold property of the CsvDataWriter object is exceeded, separate files are generated to store the binary or character data. These overflow files are located in the same directory as the bulk data file.
If the overflow file contains character data, the character set of the file is governed by the character set specified in the CSV bulk configuration file.
The filename contains the CSV filename and a ".lob" extension (for example, CSV_filename_nnnnnn.lob). These files exist in the same location as the CSV file. Increments start at _000001.lob.
Bulk Copy Operations and Transactions
By default, bulk copy operations are performed as isolated operations and are not part of a transaction. This means there is no opportunity for rolling the operation back if an error occurs.
PSQL allows bulk copy operations to take place within an existing transaction. You can define the bulk copy operation to be part of a transaction that occurs in multiple steps. Using this approach enables you to perform more than one bulk copy operation within the same transaction, and commit or roll back the entire transaction.
Refer to the Microsoft online help topic "Transaction and Bulk Copy Operations (ADO.NET)" for information about rolling back all or part of the bulk copy operation when an error occurs.