Performance and Scalability
When talking about computing, performance usually means how long it takes a job to execute, as measured by a clock. How long does it take to get a result for a given input? Performance is usually constrained by some finite resource—disk, CPU, memory, or network bandwidth—limiting the speed of the program. You can use a profiling tool to identify these bottlenecks, which hopefully can be addressed, yielding a performance gain.
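As a minimal sketch of what "measured by a clock" means in practice, the following Java snippet times a job end to end. The TimedJob class and the sort workload are hypothetical stand-ins for any unit of work and are not part of the DataFlow API.

import java.util.Arrays;
import java.util.Random;

public class TimedJob {
    public static void main(String[] args) {
        // Hypothetical workload standing in for any job: sort a large array.
        int[] data = new Random(42).ints(10_000_000).toArray();

        long start = System.nanoTime();   // wall-clock start
        Arrays.sort(data);                // the "job"
        long elapsed = System.nanoTime() - start;

        // Performance here is simply elapsed wall-clock time for this input.
        System.out.printf("Sorted %d ints in %.3f s%n",
                data.length, elapsed / 1e9);
    }
}

A profiler answers the follow-up question: where inside that elapsed time the program actually spends its effort, and which resource is the bottleneck.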
Scalability is a measure of how performance changes as either available resources or the volume of input data changes. A scalable solution exhibits a proportional change in performance in response to a change in either of those two variables. While ideally doubling CPUs would halve the run time, in practice, the relationship may not be linear. Problems may consist of scalable and nonscalable portions or portions of varying scalability.
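One standard way to quantify why doubling CPUs rarely halves run time is Amdahl's law: if only a fraction p of the work can use additional processors, the best-case speedup on n processors is 1 / ((1 - p) + p/n). The sketch below makes the effect concrete; the 90 percent parallel fraction is an assumed example value, not a measurement, and the class is illustrative only.

public class SpeedupEstimate {
    // speedup = 1 / ((1 - p) + p / n), where p is the parallel
    // fraction of the work and n is the number of processors
    static double amdahl(double parallelFraction, int processors) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    public static void main(String[] args) {
        // Example: 90% of the work scales, 10% is serial.
        // Doubling from 1 to 2 CPUs yields roughly 1.8x, not 2x.
        for (int n : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("%2d CPUs -> %.2fx speedup%n", n, amdahl(0.9, n));
        }
    }
}

The serial portion eventually dominates, which is why a problem's nonscalable parts, not its scalable ones, set the ceiling on overall performance.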
When adding resources to a system, there are two directions to consider:
• Scaling up refers to increasing the available resources on a single node of the system. This is the traditional SMP (symmetric multi-processor) model of adding more CPUs, more disk, or more memory to a machine to increase performance.
• Scaling out refers to increasing available system resources by adding additional nodes. This is the typical distributed cluster model, enlarging the cluster size to increase performance.
DataFlow is designed to handle scaling in either direction. Applications built using DataFlow scale equally well on "fat" SMP machines and on clusters of "skinny" commodity PCs.
Scalability is typically of greater concern than raw run time, especially when dealing with large amounts of data. First, with a scalable solution, performance can always be improved by providing more resources; performance tuning cannot offer such a guarantee. Second, if the processing does not scale as data volume grows, the solution will eventually fail, and at that point execution time becomes irrelevant.