Debugging Techniques
Debug Logging
The standard configuration of DataFlow uses log4j for logging and is configured using the log4j.properties file contained in the conf/ directory of the DataFlow installation. This configuration will be used by any DataFlow application started using the dr command (see Command Line Usage: dr).
Logging is configured to log at the INFO level by default. Logs are written to two targets: the console and a log file. The console is the displayed output of a job when run from the command line. Also, some IDEs such as Eclipse will capture the console logs and display them within a console window. Log files are configured by default to be written into the user’s home directory. The file name used by default is dataflow.log.
As log files fill up, the current log file is archived using a different name, discarding any previously archived log. A new empty file is then created, replacing the current log file. Change the trace level for logs by changing the log level of the root logger. Alternatively, if fine-grained control of logging is desired, add specific properties for named loggers.
Most operators follow the convention of using a logger with a name matching the fully qualified class name of the operator itself. Refer to the log4j documentation for more information about configuring logging facilities.
Note: When collecting logging at debug level, bear in mind that this may produce large amounts of information. It may also include information related to exception handling. Do not be alarmed by exceptions and stack traces appearing in the log file; these do not indicate an issue unless they also appear at the CLI or GUI. The data in the log file is strictly informational and intended for debugging purposes.
The following is an example of the log4j.properties file. Edit the file and change the INFO setting for the log4j.rootLogger property to DEBUG. This will enable debug logging for both the console and the log file.
log4j.properties file
# Set root logger
log4j.rootLogger=INFO, console, file
# Logging to console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-4r %-5p %c{1} %x - %m%n
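# Logging to a file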
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=${user.home}/dataflow.log
log4j.appender.file.MaxFileSize=100MB
log4j.appender.file.MaxBackupIndex=1
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d %p %t %c - %m%n
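For fine-grained control, a named logger property can be added instead of (or in addition to) changing the root level. Because most operators log under their fully qualified class name, an entry such as the following would enable debug output for a single operator only (the Join operator from the example later in this section is used here for illustration):
# Enable debug logging for one operator by its fully qualified class name
log4j.logger.com.pervasive.datarush.operators.join.Join=DEBUG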
Injecting Write Operators
When creating a complex DataFlow application, it can be hard to determine where in the pipeline a problem is occurring. If you have inspected an application and cannot determine why its results are not correct, the next course of action is to determine where the data is going wrong. Since DataFlow supports having multiple readers on each of the output ports of your operators, it is easy to add additional operators to persist the data in some form for later inspection. In this way, you can determine whether data results are valid at each step within the data pipeline.
Useful operators for persisting data at different stages of a data pipeline include:
• LogRows: This operator will dump data values from data rows using the configured logging system. The operator logs the schema of the data in addition to the individual rows; the frequency and total count of the rows logged are configurable. The amount of output from LogRows can be overwhelming if large amounts of data are processed, so use it with discretion.
• WriteDelimitedText: This operator will persist the data into files in CSV format. The exact format is configurable. These files are human readable.
• WriteStagingDataset: This operator will write data into files in binary form. It is the fastest way to persist data, but the data is not in human-readable form.
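Because DataFlow supports multiple readers on each output port, one of these writers can be attached to an intermediate port without disturbing the existing flow. The following is a minimal sketch of tapping a port with WriteDelimitedText; the operator name suspectOperator and the output path are hypothetical placeholders:
// Attach an extra writer to the port under suspicion; the existing downstream
// connection on the same port is unaffected.
WriteDelimitedText debugWriter =
    graph.add(new WriteDelimitedText("results/debug-tap.txt", WriteMode.OVERWRITE));
debugWriter.setHeader(true);           // include field names for readability
debugWriter.setWriteSingleSink(true);  // write a single output file
graph.connect(suspectOperator.getOutput(), debugWriter.getInput());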
The following example application reads in two data files, joins the contents, aggregates the results of the join, and writes the aggregated data to a file.
However, when we execute the application we notice that the movie name with the highest count is a null value. Scanning our input data, we see no null values for movie name. Given that this is a small application, our choices of where to inject a writer to validate its output are limited. The output of the Group operator is already being persisted, so let’s inject a LogRows operator after the Join operator. The log frequency will be set to 1 to log every row. This is acceptable since the input files are relatively small.
Sample DataRush Application
import static com.pervasive.datarush.types.TokenTypeConstant.INT;
import static com.pervasive.datarush.types.TokenTypeConstant.STRING;
import static com.pervasive.datarush.types.TokenTypeConstant.record;
import com.pervasive.datarush.graphs.LogicalGraph;
import com.pervasive.datarush.graphs.LogicalGraphFactory;
import com.pervasive.datarush.io.WriteMode;
import com.pervasive.datarush.operators.group.Group;
import com.pervasive.datarush.operators.io.textfile.ReadDelimitedText;
import com.pervasive.datarush.operators.io.textfile.WriteDelimitedText;
import com.pervasive.datarush.operators.join.Join;
import com.pervasive.datarush.operators.join.JoinMode;
import com.pervasive.datarush.schema.TextRecord;
import com.pervasive.datarush.tokens.record.JoinKey;
import com.pervasive.datarush.types.RecordTokenType;
/**
* Join the ratings and movie data sets
*/
public class JoinRatings {
public static void main(String[] args) {
// Create an empty logical graph
LogicalGraph graph = LogicalGraphFactory.newLogicalGraph("JoinRatings");
// Create a delimited text reader for the "ratings.txt" file
ReadDelimitedText ratingsReader = graph.add(new ReadDelimitedText("data/ratings.txt"));
ratingsReader.setFieldSeparator("::");
ratingsReader.setHeader(true);
RecordTokenType ratingsType = record(INT("userID"), INT("movieID"), INT("rating"), STRING("timestamp"));
ratingsReader.setSchema(TextRecord.convert(ratingsType));
// Create a delimited text reader for the "movies.txt" file
ReadDelimitedText moviesReader = graph.add(new ReadDelimitedText("data/movies.txt"));
moviesReader.setFieldSeparator("::");
moviesReader.setHeader(true);
RecordTokenType moviesType = record(INT("movieID"), STRING("movieName"), STRING("genre"));
moviesReader.setSchema(TextRecord.convert(moviesType));
// Join the ratings and the movie definitions
Join joiner = graph.add(new Join());
joiner.setJoinKeys(JoinKey.keys("movieID")); // Join on movieID field
joiner.setJoinMode(JoinMode.LEFT_OUTER);
joiner.setUseHashJoinHint(true);
// Connect ratings reader, movies reader to join
graph.connect(ratingsReader.getOutput(), joiner.getLeft());
graph.connect(moviesReader.getOutput(), joiner.getRight());
// Group the joined data by movieName
Group group = graph.add(new Group());
group.setKeys(new String[] {"movieName"});
group.setAggregations("count(rating) as count_ratings, min(rating) as min_rating, max(rating) as max_rating, avg(rating) as avg_rating");
// Connect join and group
graph.connect(joiner.getOutput(), group.getInput());
// Create a delimited text writer
WriteDelimitedText writer = graph.add(new WriteDelimitedText("results/join-ratings.txt", WriteMode.OVERWRITE));
writer.setFieldDelimiter("");
writer.setHeader(true);
writer.setWriteSingleSink(true);
// Connect group to the writer
graph.connect(group.getOutput(), writer.getInput());
// Compile and run the graph
graph.run();
}
}
Following is the code fragment to inject the LogRows operator.
Fragment Adding LogRows Operator
// Log output of the join operator (frequency 1 implies all rows logged).
// Requires an import of com.pervasive.datarush.operators.sink.LogRows.
LogRows logger = graph.add(new LogRows(1));
graph.connect(joiner.getOutput(), logger.getInput());
Now, when we run the application we will see rows of the join output logged to the console. A sample of the console output is shown below:
INFO: 3: row 443 is [32, 32, 5, 4, 978244962, Twelve Monkeys (1995), Drama|Sci-Fi]
com.pervasive.datarush.operators.sink.LogRows execute
INFO: 3: row 444 is [1243, , 5, 3, 978244001, , ]
com.pervasive.datarush.operators.sink.LogRows execute
INFO: 3: row 445 is [1897, , 5, 4, 978246426, , ]
com.pervasive.datarush.operators.sink.LogRows execute
INFO: 3: row 446 is [1171, , 5, 4, 978244852, , ]
com.pervasive.datarush.operators.sink.LogRows execute
INFO: 3: row 447 is [3786, , 5, 3, 978241981, , ]
com.pervasive.datarush.operators.sink.LogRows execute
INFO: 3: row 448 is [34, 34, 5, 4, 978244603, Babe (1995), Children's|Comedy|Drama]
Inspecting the output more closely, it becomes clear that the ratings data from the left side of the join is always present. However, the movie data from the right side of the join is not always present. So the Join operator is sometimes outputting null values in the right hand side data (the movie data).
Taking a look back at our application code, we can see from the call to setJoinMode that a left outer join is being used. A left outer join always provides the left side data of the join, injecting null values if no right-hand side data with matching key values is available.
Thinking back, the test data used to create this example was sampled. It looks like we do not have a full set of movie data, which is causing the null values during the data join. To resolve this issue we can change the Join operator to use an inner join or update our movie data to contain the full set of movies.
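For the first option, only the join mode needs to change. A minimal sketch, assuming the JoinMode enumeration also defines an INNER constant:
// Use an inner join so rating rows without a matching movie are dropped
// instead of being emitted with null movie fields.
joiner.setJoinMode(JoinMode.INNER);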
Injecting operators to inspect data at different stages of the application pipeline can be a very useful debugging exercise when the problem being experienced is data related.
Validating Key Properties
When data problems occur with a DataFlow application, it is always a good idea to review the settings of operator properties that define key fields, such as the join keys of the Join operator or the aggregation keys of the Group operator.
For example, a common mistake with the Group operator is to aggregate at the wrong level by excluding required key fields or including additional key fields. These types of errors can be discovered through code inspection and by inspecting the data results from operators using the injection method described in the section above.
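As an illustration based on the earlier example, the aggregation level is determined entirely by the keys passed to the Group operator, so it is worth confirming that they match the intent:
// Intended level: one aggregate row per movie
group.setKeys(new String[] {"movieName"});
// Accidentally including another key field (for example "genre") would
// aggregate at a finer level than intended and change the resulting counts:
// group.setKeys(new String[] {"movieName", "genre"});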
Using a Debugger
Because DataFlow is a Java library, applications can be debugged just like any other Java program. Unfortunately, debuggers are designed for traditional control flow, which can make it difficult to trace a problem in a DataFlow application this way, particularly when the problem is caused by a specific data record, such as the millionth or ten millionth record. Even so, the execute() method of operators can be debugged just like any other Java code.
The simplest case is when the application is developed using an IDE such as Eclipse or NetBeans. Both provide a facility for launching applications directly from the IDE under a debugging session; simply launch the application being debugged in this way.
It is not necessary to launch programs from the IDE to debug them. Both Eclipse and NetBeans provide a way to start a debugging session that attaches to an already running application. To do this, pass the -agentlib:jdwp option to the JVM, specifying the debugger connection information. One possible configuration, which causes the JVM to open a socket on port 8000 and suspend on loading of the main class until a debugger connects, is shown below. An in-depth description of the options accepted by this flag can be found at http://docs.oracle.com/javase/6/docs/technotes/guides/jpda/conninv.html#Invocation.
Example JVM debug settings
-agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=y
If the dr command is being used, it supports passing JVM options; see Command Line Usage: dr for details. If the application is started some other way, ensure that this option is passed to the JVM. Once the application has been started, a debugging session can be started in the IDE and attached to the application using the information specified with the flag.
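For example, if the application were launched directly with the java command rather than through dr, the option would be added before the main class; the classpath shown here is only a placeholder:
java -agentlib:jdwp=transport=dt_socket,server=y,address=8000,suspend=y \
     -cp <application-and-DataFlow-classpath> JoinRatings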
How to start and use a debugging session in an IDE is highly dependent on the IDE being used and is outside the scope of this document. See the documentation provided by the IDE for information about these topics.