Debugging Distributed Applications
In attempting to debug a distributed graph, an understanding of the material discussed in the
Application Model and
Execution Modes sections is helpful.
As described there, a distributed graph is converted into a sequence of graphs representing different phases of processing with each phase executing several copies of the graph on the nodes of the cluster. This implies that all of the techniques discussed previously can also be applied to debugging a graph executed on a cluster.
Unfortunately, the debugging process is made more complex by the fact that code is running on several remote machines instead of just the local one. So while the concepts behind the techniques remain the same, some details change.
Combining Logs
The challenge introduced for logging in a distributed scenario is that there is no single log file containing all of the data, unlike the case where you are executing locally. These files will be spread across all the machines executing the graph. To get a total view requires looking at all of these log files.
Setting Up Clusters discusses where the log files produced by cluster execution can be found.
Each graph will use a globally unique identifier, or GUID, which identifies the log file. By looking at log files on different machines with the same GUID, it is possible to produce a unified view of the logging for a given phase.
Note that this issue does not necessarily apply to data logged using an explicit operator as described in
Injecting Write Operators. Because these are operators like any other, these can be made to run locally (at a cost of performance) by calling the
disableParallelism() method, producing a unified output. However, while this may make things more convenient, it makes it impossible to determine which partition produced any given record. Therefore, this is not a recommended practice.
For data being written to disk instead of a log this is not typically a problem, as a parallel write will produce a set of files in a single directory on a shared file system. This makes it much easier to locate the files in question.
Remote Debugging
The key difference between debugging a local graph and a distributed graph is that multiple JVMs will be running. Therefore, there will need to be multiple concurrent debugging sessions, one per JVM.
To use a debugger with distributed graph execution, the JVM debugging flag, discussed in
Using a Debugger, must be passed to the remote executors. This can be accomplished through cluster configuration as described in
Setting Up Clusters. Only one JVM will be started per node, so sharing the same debugging port setting in all JVMs should not be a problem.
This approach is not generally recommended for distributed graphs, as it can quickly become difficult to manage as the number of nodes increases. Try using one of the other debugging approaches first before attempting this one.
Executing Locally
An alternative approach to distributed debugging is to try executing the graph locally, removing the complexity introduced by running across multiple machines. As execution in local mode emulates distributed execution inside a single JVM (though with some optimizations applied), any problems are likely to occur in both modes of execution. Attempting this approach does not require a rewrite of the application. It only requires changing the mode specified at execution; graphs for distributed and local execution are composed exactly the same.
One possible stumbling block to this approach is the size of the input. If the input data is too large to be processed on a single machine, try scaling the data down.