[CakeML-dev] Let's remove max_app

Mon Mar 6 06:56:15 UTC 2017

> There are really two competing problems:
>  a) we need max_app to be high to get good code
>  b) we need max_app to be low to avoid huge stubs code

Supposing we don't remove max_app, and make it high to address (a) above.
Then is it possible to address (b) by doing dead code elimination on
the stubs that aren't necessary?
I guess the problem is that what's necessary is not statically known.
So maybe not.
But what is OCaml doing to be able to produce lots of stubs without
taking a code size performance hit?

On 6 March 2017 at 05:50, Scott Owens <S.A.Owens at kent.ac.uk> wrote:
> I’ve been thinking about this all day, just trying to cut the size and number of stubs, not get rid of max_app entirely yet. It’s not easy to get something that will work without slowing down some things that I’d rather not slow down: applications of functions with too many arguments, and applications of function that have been previously been partially applied, and are now given the rest of their arguments.
>
> I’m still working on it, but I think it’s worth recording my thoughts at this point. The difficulty is essentially due to the Call and Cons instructions needing to have all of their arguments listed out in the code.
>
> Consider the following case: I have a 5 argument function that has been given 3 arguments previously, and now it has been called with 2 more. This means that there is a stub in the partial application closure wrapper that contains a call with 5 argument that looks something like the following, using named variables:
>
> stub_5_3 <x, y, cl> =
>   Call None [x; y; El cl 3; El cl 4; El cl 5; El cl 2; El (El cl 2) 0]
>
> In order, the call has the two variables just passed in, the 3 variables stored in the partial application closure wrapper, the original closure, and finally, the code pointer of the actual function, which is what Call is calling.
>
> Of course, we need a stub for each size of call to make, and each different split between the current and stored arguments. There are O(max_app^2) such functions, and each one has O(max_app) arguments, so the whole thing is cubic. I don’t think there is a way to improve on this without changing BVL, and I double checked that OCaml does indeed generate all of these stubs, so there is some indication that they might not be a real problem.
>
> The only solution that I can think of is to introduce a new kind of Call into bvl to give me an alternate calling convention. Essentially, I would need a Call variant that could take a pointer to a block as an argument and then do the call with the elements of the block as argument. This would be implemented later on in the compiler once the calling convention is visible by emitting a loop that iterating over the block and pushing the contents on the stack. Then you could just have 1 stub per how many arguments are given immediately (the 2 arguments in the example).
>
> It doesn’t seem worth the disruption to add just for this, unless we would use it for other things too. Also there are some details to work out, like passing the code pointer closure record first so that the block pointer comes at the end. Also looping to extract the tuple’s contents might be slower than doing it directly.
>
> Once you have a quadratic number of stubs, you need a quadratic sized jump table to turn the runtime argument number calculations into the labels.
>
>
> Now let’s turn to reducing the n generic application functions to 1, as Magnus proposed below. Consider the following:
>
> We have a function of 3 arguments, and the application site has just passed in 5.
>
> Right now, this is handled by a function like the following:
>
> gen_app_5_3 <v,w,x,y,z,cl> =
>   Let [Call NONE [v; w; x; cl; El cl 0]]
>     mk_call cl [y;z]
>
> Again there are a quadratic number, since there are two calls of different lengths. In the real code, there is just gen_app5, and it contains a jump table to get the proper body based on the run-time information of how many arguments the closure value expects. If we just have gen_app and it takes a tuple of length 5, then instead of a Jump table followed by the code above, we have two options:
>
> - Generalise the extended Call with block argument that I mentioned above, so that I can specify a slice of the tuple to be used for arguments.
> - Build a jump table based on how many arguments the function expects, and then copy them out of the tuple. (A specialised mk_call can re-use the same tuple with an index for how much many arguments have been used up so far).
>
> In each case, we allocate a tuple on the heap just to immediately start copying them back out.
>
> Lastly, the code for building partial application wrappers has a nested jump table for getting label of the application stub to put into the record, in one case, this is triply nested, giving another cubic factor. I think can lift this to it’s own stub function and avoid massively duplicating it.
>
> The code for extending a partial application wrapper with a new partial application is also quadratic. This is where we differ from OCaml, where the wrappers nest, but this makes it slower to apply them as you have to unnest the wrappers. We want to do more work at closure creation time, to have faster applications, and so we copy the arguments from the old wrapper into the new one. But, similar to the function calls, the Cons instruction must statically know how many arguments it is getting, and so we need another jump table on the number of old arguments nested inside of the one for the number of new arguments. Similar to the hypothetical Call extension mentioned above, a Cons that could take a block as its last argument and copy the contents into the new block would make this work better, especially as performance of building the closure isn’t so important.
>
> Scott
>
>> On 2017/03/05, at 08:21, Magnus Myreen <magnus.myreen at gmail.com> wrote:
>>
>> Hi all,
>>
>> We should get rid of max_app. Reasons:
>>
>> 1. The benchmarks show that one hits a significant performance drop
>>    if the compiled code includes curried functions with more
>>    arguments than max_app. I believe the bootstrapped compiler would
>>    run a significant constant factor faster if it was compiled with a
>>    high (or non-existent) max_app value, since many functions take
>>    more than 5 arguments. (I was e.g. looking at assign_def in
>>    data_to_word which is run on every Op in the dataLang
>>    programs. The assign function takes 7 arguments).
>>
>> 2. The current compilation strategy produces too much stub code in
>>    clos_to_bvl.  If we really must have max_app (I argue below that
>>    we don't), then it should be something like 10 or 15. With the
>>    current compilation strategy, the binary produced for max_app=9 is
>>    176 kb larger than the binary for max_app=3. I wanted to try
>>    compiling with max_app=10, but that was taking more than 5
>>    minutes. Compiling with max_app=3 takes 7 seconds for me. I
>>    believe the slow down in compilation speed with differing max_app
>>    values is caused by the rapid increase in code size (number of
>>    stubs in clos_to_bvl) as max_app gets larger. My measurements
>>    are below while compiling factorial (and the basis library):
>>
>>      max_app=1:  10.810s  (* very little room for opts in basis *)
>>      max_app=2:  8.724s
>>      max_app=3:  7.872s  (* allows optimisations in basis code *)
>>      max_app=4:  8.935s
>>      max_app=5:  11.608s
>>      max_app=6:  21.771s
>>      max_app=7:  42.932s
>>      max_app=8:  1m24.872s
>>      max_app=9:  3m9.745s
>>
>>    One of the reasons the bootstrapped compilation times are slow is
>>    that closLang (and subsequent) optimisations could not fire due to
>>    clos_mti not doing enough (due to the low max_app) when
>>    compiling the compiler in HOL. As we know, if clos_mti doesn't
>>    do enough, a lot of other optimisations don't do much.
>>
>> So the question is: why do we have max_app currently? I believe it is
>> only there because of the way clos_to_bvl wants to jump to specialised
>> code for dealing with applying n arguments to an m-argument
>> closure. Since n and m need to be <= max_app, we have max_app *
>> max_app stubs (each being a non-trivial piece of code).
>>
>> My suggestion is to:
>>
>> 1. Remove max_app completely from the semantics, compiler and
>>    proofs. This should make various things neater and simpler.
>>
>> 2. Have only one general-purpose stub (gen_app_stub) in clos_to_bvl
>>    for dealing with implementing the semantics of applying an
>>    arbitrary closure value to any number of arguments. Currently the
>>    compilation of applications of unknown closures (case: loc_opt =
>>    NONE) is as follows:
>>
>>  (compile_exps max_app [App loc_opt x1 xs2] aux =
>>     let (c1,aux1) = compile_exps max_app [x1] aux in
>>     let (c2,aux2) = compile_exps max_app xs2 aux1 in
>>       ([dtcase loc_opt of
>>         | NONE =>
>>             Let (c2++c1) (mk_cl_call (Var (LENGTH c2)) (GENLIST Var
>> (LENGTH c2)))
>>         | SOME loc =>
>>             (Call (LENGTH c2 - 1) (SOME (loc + (num_stubs max_app)))
>> (c2 ++ c1))],
>>        aux2)) /\
>>
>>  mk_cl_call cl args =
>>    If (Op Equal [mk_const (LENGTH args - 1); mk_el cl (mk_const 1)])
>>       (Call (LENGTH args - 1) NONE (args ++ [cl] ++ [mk_el cl (mk_const 0)]))
>>       (Call 0 (SOME (LENGTH args - 1)) (args ++ [cl]))
>>
>>    I suggest we change the last line above to something that tuples
>>    up the given arguments and senda them to the gen_app_stub,
>>    i.e. something like:
>>
>>       (Call 0 (SOME gen_app_stub_location)
>>          [cl; Op (Cons 0) args; Op (Const (& (LENGTH args))) []])
>>
>>    The code above passes in LENGTH args to save gen_app_stub from
>>    having to compute it from the tuple. It should probably also be
>>    given the previously computed value of `mk_el cl (mk_const 1)` so
>>    that it can avoid looking up the expected number of arguments from
>>    the closure value. (This is a very minor optimisation.)
>>
>>    The code for gen_app_stub can of course include a few tests for
>>    common cases at the top, e.g. it should probably first check
>>    whether this is a two-argument closure applied to one argument (my
>>    guess is that this is the most common case of a partial
>>    application) and then run special-purpose code only for that
>>    common case, but all other cases would be dealt with using generic
>>    code.
>>
>> Thoughts? Am I missing some reason for why we have max_app? I suggest
>> we remove it.
>>
>> Cheers,
>> Magnus
>>
>> _______________________________________________
>> Developers mailing list
>> Developers at cakeml.org
>> https://lists.cakeml.org/listinfo/developers
>
> _______________________________________________
> Developers mailing list
> Developers at cakeml.org
> https://lists.cakeml.org/listinfo/developers