Schema Discovery
As with formats, DataFlow can analyze a delimited text file to propose a schema. This process is called schema discovery. The delimited text reader accepts a TextRecordDiscoverer that implements this process.
When the schema is needed, an initial chunk of the source is read and parsed, and the resulting rows are passed to the discoverer. The discoverer is expected to compute an appropriate schema for the given data.
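The sample-then-discover flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: SampleRows, RowDiscoverer, readSample, and discoverFromSample are hypothetical names, not part of the DataFlow API, and the parsing is a naive split on the delimiter that ignores quoting and escaping.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names belong to DataFlow.
final class SampleRows {

    // Hypothetical stand-in for a discoverer: rows in, one type name per field out.
    interface RowDiscoverer {
        List<String> discover(List<String[]> rows);
    }

    // Reads at most maxRows lines and splits each on the literal delimiter.
    // Assumption: no quoted fields or escape sequences.
    static List<String[]> readSample(Reader source, String delimiter, int maxRows) {
        List<String[]> rows = new ArrayList<>();
        try {
            BufferedReader reader = new BufferedReader(source);
            String line;
            while (rows.size() < maxRows && (line = reader.readLine()) != null) {
                // Limit -1 keeps trailing empty fields.
                rows.add(line.split(Pattern.quote(delimiter), -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Parses the initial chunk and hands the rows to the discoverer.
    static List<String> discoverFromSample(Reader source, String delimiter,
                                           int maxRows, RowDiscoverer discoverer) {
        return discoverer.discover(readSample(source, delimiter, maxRows));
    }
}
```

Only the sampled rows influence the result in this sketch, so records beyond the initial chunk are never inspected.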
A default, pattern-based schema discovery mechanism is provided. It applies regular expressions to field values: if a pattern matches a field’s value, the field is predicted to be of the associated text type. The process is conservative: a type is chosen only if its pattern matches the field’s value in every record examined. If no type matches, the field is assumed to be string valued.
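As a minimal sketch of the conservative rule described above (not DataFlow's actual implementation), the following picks a type for one field only when a single pattern matches every sampled value. The class name, the type names, and the three patterns are assumptions made for illustration; they are not the default pattern set. Empty input and missing values get no special treatment here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the DataFlow API.
final class ConservativeTypeGuess {

    // Illustrative patterns only; these are NOT DataFlow's defaults.
    // LinkedHashMap preserves insertion order, so patterns are tried in this order.
    static Map<String, Pattern> examplePatterns() {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("int", Pattern.compile("[+-]?\\d+"));
        patterns.put("double", Pattern.compile("[+-]?\\d+\\.\\d+"));
        patterns.put("boolean", Pattern.compile("(?i)true|false"));
        return patterns;
    }

    // Returns the first type whose pattern matches every value,
    // or "string" when no pattern matches all of them.
    static String guessType(List<String> values, Map<String, Pattern> patterns) {
        if (values.isEmpty()) {
            return "string"; // assumption: no evidence means no specific type
        }
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            boolean allMatch = true;
            for (String value : values) {
                if (!entry.getValue().matcher(value).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return entry.getKey();
            }
        }
        return "string";
    }
}
```

With these example patterns, a column mixing "1" and "2.5" falls back to string, because neither the integer nor the decimal pattern matches both values.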
The default pattern set is able to detect the following types. For simplicity, patterns are intended as rough approximations of the valid set of values; they may include some values that cannot be parsed for the associated type or exclude some values that can.
The default pattern set can be extended to produce a custom discoverer. TextRecord.extendDefault() takes a list of additional pattern-to-type mappings to apply in the discovery process and returns a discoverer using the specified patterns in addition to the defaults.
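The idea of layering extra pattern-to-type mappings on top of a default set can be sketched as below. This does not call TextRecord.extendDefault(), whose exact signature is not shown here; ExtendedPatternSet, defaults, extend, and typeOf are hypothetical names, the single default pattern is a stand-in, and trying the additional patterns before the defaults is an assumption rather than documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the DataFlow API.
final class ExtendedPatternSet {

    // Stand-in for a default pattern set (illustrative, not DataFlow's).
    static Map<String, Pattern> defaults() {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("int", Pattern.compile("[+-]?\\d+"));
        return patterns;
    }

    // Builds a new ordered map: extras first, then any defaults whose type
    // name the extras did not already supply. Neither input map is modified.
    static Map<String, Pattern> extend(Map<String, Pattern> defaults,
                                       Map<String, Pattern> extras) {
        Map<String, Pattern> combined = new LinkedHashMap<>(extras);
        for (Map.Entry<String, Pattern> entry : defaults.entrySet()) {
            combined.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return combined;
    }

    // Returns the first type whose pattern matches the whole value, else "string".
    static String typeOf(String value, Map<String, Pattern> patterns) {
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(value).matches()) {
                return entry.getKey();
            }
        }
        return "string";
    }
}
```

For example, adding a hypothetical "date" mapping for the pattern \d{4}-\d{2}-\d{2} lets a value such as 2024-01-31 be recognized, while 42 is still matched by the stand-in default.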
TextRecordDiscoverer is also used to support automatic generation of schemas for writing delimited text. In this case, the discoverer is provided the record token type of the input and is expected to produce a schema. The default implementation uses TextRecord.convert() to create the schema.