Version 35 (modified by chak, 5 years ago)
Preventing space blow-up due to replicate
The vectorisation transformation lifts scalar computations into vector space. In the course of this lifting, scalar constants are duplicated to fill an array, using the function replicateP. Array computations are lifted in a similar manner, which leads to array constants being replicated to form arrays of arrays, which are represented as segmented arrays. A simple example is our 'smvm' example code:
smvm :: [:[: (Int, Double) :]:] -> [:Double:] -> [:Double:]
smvm m v = [: sumP [: x * (v !: i) | (i,x) <- row :] | row <- m :]
Here the variable 'v' is constant in the array comprehensions and will be replicated while lifting the expression v !: i. In other words, for every single element in a row, lifting implies the allocation of a separate copy of the entire array v, and this only to perform a single indexing operation on that copy of v. More precisely, in the lifted code, lifted indexing (which we usually denote by (!:^)) is applied to a nested array consisting of multiple copies of v; i.e., it is applied to the result of replicatePA (lengthPA row) v.
This is clearly wasted effort and space. However, the situation is even worse in Ben's pathological example:
treeLookup :: [:Int:] -> [:Int:] -> [:Int:]
treeLookup table xx
  | lengthP xx == 1 = [: table !: (xx !: 0) :]
  | otherwise
  = let len  = lengthP xx
        half = len `div` 2
        s1   = sliceP 0 half xx
        s2   = sliceP half half xx
    in concatP (mapP (treeLookup table) [:s1, s2:])
    -- or better: concatP (mapP (treeLookup table) (segmentP [:half, len - half:] xx))
Here table is constant in mapP (treeLookup table) [:s1, s2:]; hence, the entire table gets duplicated on each level of the recursion, leading to space consumption that is exponential in the depth of the recursion.
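The growth can be made concrete with a back-of-the-envelope model. The sketch below assumes that every level of the recursion doubles the number of table copies (one copy per element of [:s1, s2:]). The names tableCopies and replicatedElems are hypothetical and exist only for this illustration.

```haskell
-- Number of copies of the table alive at a given recursion depth, assuming
-- each level of mapP (treeLookup table) [:s1, s2:] doubles the count.
tableCopies :: Integer -> Integer
tableCopies depth = 2 ^ depth

-- Total number of table elements materialised at that depth.
replicatedElems :: Integer -> Integer -> Integer
replicatedElems tableLen depth = tableLen * tableCopies depth
```

Under this assumption, a table of 1000 elements already occupies more than 10^9 elements at depth 20.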
What's happening here?
Replication of scalars and arrays is always a waste of time and space. However, it is particularly problematic if the replicated structure is consumed by an indexing operation, as it can change the asymptotic work complexity of the vectorised program. This holds not only for indexing, but for any operation that consumes only a small part of its input array(s). In contrast, if a replicated structure is consumed in its entirety (for example by a fold), the asymptotic work complexity of replication matches that of consuming the structure. For operations that only consume a small part of their input, that is not the case. Hence, lifting, which introduces the replication, can increase asymptotic work.
Generally, we don't want to copy replicated data; it's a waste of space! We definitely don't want to do it for large data structures, and in particular, we don't want to do it for arrays. After all, that can change the asymptotic work complexity of a program. To keep it simple, we are for the moment focusing on avoiding replicating arrays, as this is where our current practical problems are coming from. (Independently of our current plans, some cases of replicated scalars are eliminated by fusion.)
To clarify the scope of the present work:
- Avoid physically creating multiple copies of the same array due to replicateP and replicateP^ introduced by vectorisation.
- Ensure that consumers of replicated arrays only perform work in proportion to the size of the arrays before replication.
Non-goals (they are worthwhile goals, but we don't attempt them at the moment):
- Avoid physically creating multiple copies of scalars due to replicateP (this includes large scalars, such as lists or trees).
- Avoid deep traversals of arrays of trees for packP and similar.
The main difference to Roman's original approach is that he included the above non-goals as goals. We agreed to leave them as non-goals for the time being and to return to them once we are confident that we have eliminated the main space blow-up. However, Roman pointed out that it might be easier to prove that the new approach preserves work complexity through vectorisation, which is something we will eventually have to show.
NB: We will have to revisit replication of scalar structures as such scalar structures may be large trees.
Where does the problematic replication originate?
The applications of replicatePA and expandPA that introduce the problematic replication of arrays are in the definition of mapP; specifically, replicatePA is used in mapP_S and expandPA in mapP_L (see Page 403 of HtM):
mapP_S :: (a :-> b) -> PA a :-> PA b
mapP_S (Clo env _ fl) xss = fl (replicatePA (lengthPA xss) env) xss
mapP_L :: PA (a :-> b) -> PA (PA a) -> PA (PA b)
mapP_L (AClo env _ fl) xss = unconcatPA xss (fl (expandPA xss env) (concatPA xss))
In both cases, we replicate the environment of a closure before we apply the lifted version of the function represented by the closure. This is important as it guarantees that the consumer of these occurrences of replicatePA and expandPA processes the (topmost) segment structure in a homomorphic manner (after all, we are implementing a map function here)!
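To see where the replication comes from, the following is a minimal sketch of the mapP_S pattern on plain Haskell lists. It is a toy model, not the DPH implementation: the type Fun and the functions mapS and indexClo are hypothetical names, lists stand in for PA arrays, and the existential type requires the ExistentialQuantification extension.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Toy closure: an environment, a scalar function, and its lifted version.
-- Lists stand in for PA arrays; Fun plays the role of (:->).
data Fun a b = forall env. Clo env (env -> a -> b) ([env] -> [a] -> [b])

-- Analogue of mapP_S: the environment is replicated once per element
-- before the lifted function is applied.
mapS :: Fun a b -> [a] -> [b]
mapS (Clo env _ fl) xs = fl (replicate (length xs) env) xs

-- Closure for \i -> v !! i, with the array v captured in the environment.
indexClo :: [Double] -> Fun Int Double
indexClo v = Clo v (!!) (zipWith (!!))
```

For example, mapS (indexClo [10, 20, 30]) [2, 0] indexes into two logical copies of the environment. With boxed Haskell lists the copies are shared pointers, so the toy model hides the cost; with flat unboxed arrays, as in DPH, each copy would be materialised.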
Our basic plan to avoid array duplication is to change replicatePA and expandPA such that they produce a segmented array that encodes the replication without unnecessarily copying data and that the consumer —the lifted function fl— processes segmented arrays with encoded replication in a special manner. As we will see, that also leads to the requirement that index transformations on replicated arrays, such as packP, need to preserve the compact encoding.
Fixing the problem: avoid repeating segments
A replicated array is always represented by a segmented array; more precisely, by a segmented array where a contiguous sequence of segments contains the same data. For example, we have
replicatePA 3 [:1, 2, 3:] = [:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:]
  where [:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:] = ([:3, 3, 3:], [:1, 2, 3, 1, 2, 3, 1, 2, 3:])
expandPA [:2, 3:] [:[:1, 2:], [:3:]:] = [:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:]
  where [:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:] = ([:2, 2, 1, 1, 1:], [:1, 2, 1, 2, 3, 3, 3:])
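The two equations above can be reproduced with a small list-based model of the conventional, copying representation. This is only a sketch: the names Nested, replicateCopy, expandCopy and splitSegs are hypothetical, and plain lists stand in for parallel arrays.

```haskell
-- Toy model: a segmented array is a pair of segment lengths and flat data.
type Nested a = ([Int], [a])

-- Cut flat data into segments according to the segment lengths.
splitSegs :: [Int] -> [a] -> [[a]]
splitSegs []       _  = []
splitSegs (l : ls) xs = let (seg, rest) = splitAt l xs in seg : splitSegs ls rest

-- Copying replication: n copies of xs, stored back to back.
replicateCopy :: Int -> [a] -> Nested a
replicateCopy n xs = (replicate n (length xs), concat (replicate n xs))

-- Copying lifted replication: segment i of the input is copied (ns !! i) times.
expandCopy :: [Int] -> Nested a -> Nested a
expandCopy ns (lens, dat) = (concat lensOut, concat (concat segsOut))
  where
    lensOut = zipWith replicate ns lens
    segsOut = zipWith replicate ns (splitSegs lens dat)
```

Both functions allocate data in proportion to the size of the result, which is exactly the cost that the representations discussed next try to avoid.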
NB: expandPA is lifted replication; expandPA is the name we used in HtM.
Collapse repeated segments
In practice, segment descriptors store more information than just the segment length. They at least additionally store the start position of each segment in the data array. In the conventional representation, an invariant is that the start positions are such that the segments don't overlap. To represent arrays with repeated segments more efficiently, we might relax that invariant. Specifically, all start positions of a contiguous sequence of repeated segments may be the same; i.e., the segment data is stored only once per sequence of repeated segments.
Then, we have for [:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:],
(start: [:0, 0, 0:]
 len:   [:3, 3, 3:]
 data:  [:1, 2, 3:])
and for [:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:],
(start: [:0, 0, 2, 2, 2:]
 len:   [:2, 2, 1, 1, 1:]
 data:  [:1, 2, 3:])
This is merely a change in the array representation that does not affect vectorisation.
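A minimal sketch of this relaxed representation, again on plain lists, might look as follows. The record SegArr and the functions toNested and replicateShared are hypothetical names introduced for illustration only.

```haskell
-- Toy model of a segment descriptor with explicit start positions.
-- Segments are allowed to overlap in the flat data.
data SegArr a = SegArr { starts :: [Int], lens :: [Int], flat :: [a] }

-- Recover the logical nested array; shared data is simply read repeatedly.
toNested :: SegArr a -> [[a]]
toNested (SegArr ss ls dat) = [ take l (drop s dat) | (s, l) <- zip ss ls ]

-- Replication that shares data: every segment starts at position 0.
replicateShared :: Int -> [a] -> SegArr a
replicateShared n xs = SegArr (replicate n 0) (replicate n (length xs)) xs
```

Applying toNested to the second descriptor above (starts 0, 0, 2, 2, 2 and lengths 2, 2, 1, 1, 1 over the data 1, 2, 3) yields the five logical segments, although only three elements are stored.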
TODO I am not sure whether we want to talk about this representation at all. Maybe just go to the one with virtual segments straight away. I suspect the latter will be less confusing in the paper.
Segment descriptor representation with virtual segments
Instead of repeating the start indices in a segment descriptor, we alternatively might want to represent a segmented array with repeated segments by distinguishing its physical from its logical (or virtual) representation. Specifically, instead of representing [:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:] as
(start: [:0, 0, 0:]
 len:   [:3, 3, 3:]
 data:  [:1, 2, 3:])
we might instead represent it as
(vsegs:  [:0, 0, 0:]
 pstart: [:0:]
 plen:   [:3:]
 data:   [:1, 2, 3:])
where pstart, plen, and data represent the underlying segmented array (with non-overlapping segments) and vsegs specifies the logical segments of the array, where physical segments may occur not at all, once, or multiple times. In this example, the only physical segment is repeated three times.
Our second example, [:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:], which we previously represented as
(start: [:0, 0, 2, 2, 2:]
 len:   [:2, 2, 1, 1, 1:]
 data:  [:1, 2, 3:])
will now be
(vsegs:  [:0, 0, 1, 1, 1:]
 pstart: [:0, 2:]
 plen:   [:2, 1:]
 data:   [:1, 2, 3:])
We choose the representation with virtual segments over the one with repeated start indices for the following reasons:
- packP on a nested array that arose from replication only needs to operate on the virtual segments.
- Clean separation between the physical representation and the logical.
- Probably easier to subsequently move to Roman's proposal that moves replication information to the top of the representation.
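The virtual-segment representation can be sketched as a record with the four fields used in the examples above. This is a toy list model under the assumption that vsegs holds, for each logical segment, the index of its physical segment; the names VSegArr, pdata, toNestedV, replicateV and expandV are hypothetical.

```haskell
-- Toy model of a segmented array with virtual segments.
data VSegArr a = VSegArr
  { vsegs  :: [Int]  -- for each logical segment, the index of its physical segment
  , pstart :: [Int]  -- start of each physical segment in the data
  , plen   :: [Int]  -- length of each physical segment
  , pdata  :: [a]    -- flat data of the non-overlapping physical segments
  }

-- Recover the logical nested array.
toNestedV :: VSegArr a -> [[a]]
toNestedV (VSegArr vs ps pl dat) =
  [ take (pl !! v) (drop (ps !! v) dat) | v <- vs ]

-- Replication stores the payload once and refers to it n times.
replicateV :: Int -> [a] -> VSegArr a
replicateV n xs = VSegArr (replicate n 0) [0] [length xs] xs

-- Lifted replication only rewrites the virtual segments.
expandV :: [Int] -> VSegArr a -> VSegArr a
expandV ns arr = arr { vsegs = concat (zipWith replicate ns (vsegs arr)) }
```

In this model, neither replicateV nor expandV touches the flat data, so their cost depends on the number of virtual segments only.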
Operations on arrays with virtual segments
Array consumers need to be adapted to work correctly on virtual segments. We also need to make sure that index space transformations introduced by vectorisation, such as packP, operate with a work complexity in proportion to the number of virtual segments (and not in proportion to the virtual number of elements). Moreover, other consumers, such as folds, must operate with a work complexity in proportion to the physical size of the segmented array (and not in proportion to its virtual size).
In the smvm example, a replicated array is consumed by lifted indexing to extract matching elements of the vector for all non-zero elements of the matrix. Using just a length array as a segment descriptor without virtual segments, lifted indexing might be implemented as follows:
(as_len, as_data) !:^ is = bpermutePA as_data ((prescanPA (+) 0 as_len) +^ is)
With virtual segments, we have
(as_vsegs, as_pstart, as_plen, as_data) !:^ is = bpermutePA as_data (bpermutePA as_pstart as_vsegs +^ is)
In the case of smvm, where the first argument is produced by replicatePA (lengthPA row) v, we have as_vsegs = replicatePA (lengthPA row) 0 and as_data = v. In other words, lifted indexing draws from a single copy of v, which is what we wanted.
We need to be careful when reducing an array with virtual segments, as we want the work complexity to be in proportion to the physical segments and not in proportion to the virtual segments. Consider
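Both variants of lifted indexing can be tried out on lists. The sketch below is a toy model, assuming that bpermutePA is a backwards permutation, that prescanPA (+) 0 is an exclusive prefix sum, and that (+^) is element-wise addition; the names bpermute, indexLens and indexVSegs are hypothetical.

```haskell
-- Backwards permutation: pick the element of xs at each given index.
bpermute :: [a] -> [Int] -> [a]
bpermute xs = map (xs !!)

-- Lifted indexing with a plain length descriptor: segment offsets come
-- from an exclusive prefix sum over the segment lengths.
indexLens :: ([Int], [a]) -> [Int] -> [a]
indexLens (lens, dat) is =
  bpermute dat (zipWith (+) (init (scanl (+) 0 lens)) is)

-- Lifted indexing with virtual segments: segment offsets are looked up
-- through vsegs in the physical start positions.
indexVSegs :: ([Int], [Int], [Int], [a]) -> [Int] -> [a]
indexVSegs (vsegs, pstart, _plen, dat) is =
  bpermute dat (zipWith (+) (bpermute pstart vsegs) is)
```

For three logical copies of the segment 10, 20, 30 and the indices 2, 0, 1, both functions return the same elements, but indexVSegs reads from a data array of length 3 rather than 9.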
f xs is = mapP f is
  where f i = sumP xs + i
As xs is free in f, it will be in the environment of the array closure constructed by the vectorised code. Hence, given the implementation of mapP discussed earlier, xs will be replicated length is times. However, we want the work complexity of sumP xs to be in proportion to length xs and not to length xs * length is.
To this end, sumP^ performs a reduction on the physical segments only. Afterwards, the per-segment results are distributed over a vector of the length of the vsegs descriptor using bpermutePA.
As we will see in the next subsection, some physical segments may not appear as virtual segments at all (if a replicated array was shrunk by applying a pack operation). Hence, we need to be careful that sumP^ only reduces those physical segments of the array that are used by one or more virtual segments.
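The two-step scheme for sumP^ described above (reduce the used physical segments, then distribute the results) can be sketched on lists as follows. This is only an illustration under the assumptions of the toy model; sumSegsV is a hypothetical name, and the association list stands in for whatever lookup structure a real implementation would use.

```haskell
import Data.List (nub)

-- Segmented sum over an array with virtual segments (toy list model).
-- The tuple is (vsegs, pstart, plen, data), as in the text.
sumSegsV :: ([Int], [Int], [Int], [Int]) -> [Int]
sumSegsV (vsegs, pstart, plen, dat) =
  -- Distribute the per-segment results over the virtual segments
  -- (this plays the role of bpermutePA).
  [ s | v <- vsegs, Just s <- [lookup v sums] ]
  where
    -- Reduce each physical segment that is referenced by some virtual
    -- segment exactly once; unreferenced physical segments are skipped.
    sums = [ (p, sum (take (plen !! p) (drop (pstart !! p) dat)))
           | p <- nub vsegs ]
```

With vsegs 1, 1 over the physical segments [1, 2] and [3], only the second physical segment is summed, and its result is delivered twice.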
Splitting and combining (for lifted conditions)
Due to lifted conditionals (or, generally, case constructs), arrays with repeated segments may be split (packed) and combined. Arrays with repeated segments can be split (or packed) by including a repeated segment in the result exactly if one of its repetitions is included. This can be determined by disjunctively combining all flags for one sequence of repeated segments.
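In the virtual-segment representation, this packing rule might be sketched as below. The functions packV and usedPhysical are hypothetical names; flags is assumed to hold one Boolean per logical segment.

```haskell
-- Pack the virtual segments: keep exactly those whose flag is set.
-- The physical segments and their data are left untouched.
packV :: [Bool] -> [Int] -> [Int]
packV flags vsegs = [ v | (True, v) <- zip flags vsegs ]

-- A physical segment is still in use exactly if at least one of its
-- repetitions is kept, i.e. the disjunction of the corresponding flags.
usedPhysical :: Int -> [Bool] -> [Int] -> [Bool]
usedPhysical nPhys flags vsegs =
  [ or [ f | (f, v) <- zip flags vsegs, v == p ] | p <- [0 .. nPhys - 1] ]
```

The work of packV is proportional to the number of virtual segments, independent of how much data the physical segments contain.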
Arrays with repeated segments cannot always be combined without duplicating the data corresponding to repeated segments (after all, a disjoint segment may be inserted into a sequence of repeated segments). For simplicity, we may want to expand all repeated segments in combinePA. (It seems that this should not lead to unexpected blow-ups as the repeated data now is part of a function's result and should be accounted for in its complexity estimate.)
Splitting and joining (for distribution across threads)
Our idea is to continue to split the data array evenly across threads. That may spread out the segment descriptor (length and starts arrays) very unevenly as repeated segments have an entry for each repetition in the segmentation information, but not in the data.
TODO Roman believes splitting arrays with repeated segments is a problem. To me it doesn't seem to be that much of a problem. (Keep in mind that we only need a restricted number of operations on arrays with repeated segments; all their consumers are homomorphisms as discussed in Plan B.)
Multiple levels of nesting (unconcat and concat)
TODO What if we have segment descriptors on top of one with repeated segments?
The bigger picture
It makes sense to see this work and the concepts behind the Repa library as part of a bigger picture. In both cases, we want to avoid the overhead of index space transformations. In Repa, we use delayed arrays —i.e., arrays represented as functionals— to delay the execution of index transformations (as well as maps) and to fuse them into consumers. In Repa, we do that for index transformations explicitly specified by the programmer and we rely on the programmer to be aware of situations, where delayed arrays need to be forced into manifest form before they are consumed by an array operation that cannot be represented in delayed form — e.g., in matrix-matrix multiplication, we need to force the transposed array to improve cache behaviour.
In DPH, we are first of all concerned about chains of index transformations that begin with a lifted replicate as these lead to an asymptotic increase of work complexity as discussed above. However, other index transformations, such as non-lifted replicate, are of concern as well, and will need to be addressed eventually.
Despite the conceptual similarity, there are two big differences between the situation in Repa and DPH:
- In Repa, arbitrary user-specified index transformations are being delayed and we rely on the programmer to force these explicitly where necessary — i.e., a user needs to be aware of this whole mechanism. In contrast, in DPH, we aim at eliminating the index transformations introduced by the vectoriser (of which many programmers will not be aware); hence, the elimination also needs to be without user intervention.
- In Repa, we have no segmented arrays; hence, functions are sufficient to represent delayed arrays. In contrast, in DPH, segmented arrays are central and we need auxiliary data structures —such as virtual segment descriptors— to delay index transformations efficiently.
A question for the future is whether we can find a uniform framework that works for both Repa's regular arrays and DPH's nested arrays. This would provide a key to an integrated system.
Related work
- The work-efficient vectorisation paper by Prins et al. Their approach only works in a first-order language.
- Blelloch's work on the space-behaviour of data-parallel programs.
A plan to fix the problem
Generally, we don't want to copy replicated data — it's a waste of space! We definitely don't want to do it for large data structures, and in particular, we don't want to do it for arrays. After all, that can change the asymptotic work complexity of a program. So, instead of having replicateP allocate and fill a new array with multiple copies of the same data, we use a special array representation that stores the data (once!) together with the replication count. This is surely the most space efficient representation for a replicated array.
The downside of a special representation is that we now also need to modify all consumers of replicated arrays to accept this new representation and to handle it specially. This leads to some code blow up (each array consumer needs to be able to dynamically decide between two input array representations), and we need to be careful not to disturb fusion.
The trouble with indices
Although a replicated array stores its replicated payload only once, it needs to be handled with care. When indexing into a replicated array, we still use the same indices as if the array data had been copied multiple times. That can be a problem in examples such as treeLookup above, where the iteration space of the replicated array grows exponentially; even 64-bit indices will eventually overflow. However, we can circumvent that problem by taking care in code that consumes replicated arrays.
In the treeLookup example, the table is replicated and grows exponentially. But it is a segmented structure (one segment per copy of the original array) and it is accessed in the base case by a lifted index operation. When you look at the input to that application of lifted indexing, its first argument is huge (the replicated table), but the second argument contains the same data as the original value of xx, albeit segmented into an array with one segment per element. So we have effectively got
[:table, table, ...., table:] !^ [:[:xx_1:], [:xx_2:], ..., [:xx_n:]:]
Note how the xx_i are unchanged. It is only in the implementation of (!^) that the xx_i are blown up to index into the data vector of [:table, table, ...., table:] (which is concatP [:table, table, ...., table:]). It is that multiplication of the xx_i with the prescanned segment descriptor of [:table, table, ...., table:] that will overflow. Notice how that is internal to the implementation of (!^). If the first argument of (!^) is a replicated structure, there is no need whatsoever to perform that multiplication (and subsequent division) and the indices never overflow!
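The idea of a consumer that dispatches on the representation can be sketched as follows. This is a toy list model, not the DPH data structure: NestedArr, Manifest, Replicated, indexL and sumL are hypothetical names, and the replication count is kept as an Int for simplicity.

```haskell
-- Toy nested array with two representations (lists stand in for arrays).
data NestedArr a
  = Manifest [[a]]      -- every segment stored explicitly
  | Replicated Int [a]  -- replication count plus the single shared segment

-- Lifted indexing: one segment-local index per segment.
indexL :: NestedArr a -> [Int] -> [a]
indexL (Manifest segs)   is = zipWith (!!) segs is
indexL (Replicated _ xs) is = map (xs !!) is  -- no global offsets involved

-- Lifted sum: the shared segment is reduced once and the result is reused.
sumL :: NestedArr Int -> [Int]
sumL (Manifest segs)   = map sum segs
sumL (Replicated n xs) = replicate n (sum xs)
```

In the Replicated case of indexL, the segment-local indices are used directly, so no index ever grows beyond the length of the original table. The price is that every consumer needs a second code path.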
Never take the length of a replicated array
Unfortunately, not only indices blow out; the length of replicated arrays may also overflow 64-bit integers. Hence, the consuming code must carefully avoid taking the length of such arrays. This is only the case for replicatePs introduced by the vectoriser. It is the responsibility of the DPH user to ensure that replicatePs that are explicit in the user code do not blow out. (We may want to switch to 64-bit indices, at least on 64-bit builds, anyway.)
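As an illustration of the overflow hazard, the following sketch computes the flat length of a replicated array with arbitrary-precision arithmetic and reports when it does not fit into an Int. The function safeReplicatedLength is a hypothetical helper, not something the DPH library is known to provide.

```haskell
-- Flat length of an array consisting of `count` copies of a segment of
-- length `segLen`, or Nothing if that length does not fit into an Int.
safeReplicatedLength :: Int -> Int -> Maybe Int
safeReplicatedLength count segLen
  | total > toInteger (maxBound :: Int) = Nothing
  | otherwise                           = Just (fromInteger total)
  where
    -- Integer arithmetic cannot overflow, so the comparison is reliable.
    total = toInteger count * toInteger segLen
```

A consumer that never asks for the flat length, such as lifted indexing on the replicated representation, sidesteps the problem entirely.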
Concrete implementation of replicated arrays
The DPH library is built on the vector package (that provides high-performance sequential arrays). This package heavily relies on a cheap representation of sliced arrays — i.e., arrays of which a subarray is extracted. Such array slices are not copied, but represented by a reference to the original array together with markers for the start and end of the slice.
Replicating and slicing
When implementing replicated arrays, we need to take into account that (a) a replicated array may be a sliced vector and (b) the partitioning of a parallel array across multiple threads requires slicing that parallel array.
* Is this really an issue? After all, we don't replicate thread-local slices of parallel arrays (which in turn may be sliced vectors), but replicate parallel arrays (which are already distributed over multiple threads).