[CakeML-dev] Let's remove max_app

Magnus Myreen magnus.myreen at gmail.com
Mon Mar 6 10:39:36 UTC 2017


I had written the following before Scott's email, but I think it is
still just as relevant:

On rereading Scott's comments, I realised the implications of his
point about Call. I now agree that we can't remove max_app completely
from clos_to_bvl, because we can't decide at runtime how many
arguments to pass.

With the current Call, we need a stub that can explode a function's
arguments up to max_app, i.e. given a code_ptr and a Block 0 [arg1;
arg2; arg3] it will run:

  Call NONE [El 0 block; El 1 block; El 2 block; code_ptr]

This stub would consist of a Jump table and could be the only part
that grows with max_app.
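As a rough model of this stub (plain Python, not BVL or CakeML code; all names here are hypothetical), the dispatch on the block's runtime length looks like the following. Each branch of the if-chain plays the role of one jump-table entry, and only this dispatcher grows with max_app:

```python
def explode_stub(code_ptr, block):
    # BVL calls must list their arguments out syntactically, so we
    # need one branch per possible argument count (the jump table).
    # Shown here up to 3; the real stub would go up to max_app.
    n = len(block)
    if n == 1:
        return code_ptr(block[0])
    if n == 2:
        return code_ptr(block[0], block[1])
    if n == 3:
        return code_ptr(block[0], block[1], block[2])
    raise ValueError("argument count exceeds max_app")
```

In Python one would simply write code_ptr(*block), but BVL has no such spread operation, which is exactly why this stub is needed.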

With this approach, I would like max_app to be something like 20.

The reason I'm so keen on using a large value for max_app is that I
believe the bootstrapped compiler runs slowly because many of its
function calls currently get compiled to bad partial calls due to
hitting max_app. If they didn't hit max_app, many of them would be
calls to known functions that can be replaced with fast calls in
clos_call.



On 6 March 2017 at 11:37, Scott Owens <S.A.Owens at kent.ac.uk> wrote:
> I’ll reply in-line below. I still think we need a quadratic number of stubs for the cases where performance matters. I think we can have a really slow stub to handle all applications with > local_max_app arguments and that that would be a good idea. So I think we are still in agreement at the high level.
>
> Scott
>
>> On 2017/03/06, at 10:02, Magnus Myreen <magnus.myreen at gmail.com> wrote:
>>
>> Hi Scott and Ramana,
>>
>> I'm replying to Scott and Ramana's messages in one. I've inlined my
>> detailed comments at the bottom.
>>
>> I have one top-level reply to Scott:
>>
>> Scott's statement that there must be a quadratic number of stubs is
>> not true. I believe Scott has missed BVL's FromList which allows you
>> to produce an arbitrary Block at runtime:
>>
>> https://github.com/CakeML/cakeml/blob/master/compiler/backend/semantics/bvlSemScript.sml#L147
>>
>> The FromList operation would be used in the gen_app_stub that I was
>> outlining.
>
>> A reminder:
>>
>>>> There are really two competing problems:
>>>> a) we need max_app to be high to get good code
>>>> b) we need max_app to be low to avoid huge stubs code
>>
>> Note that a) refers to max_app as used in clos_mti, and b) refers to
>> max_app as used in clos_to_bvl. I want us to decouple these: remove
>> the current max_app from clos_to_bvl so that we can give it a huge
>> number in clos_mti (e.g. 1000000).
>>
>> In clos_to_bvl, we should have one stub that can handle application of
>> any closure to any number of arguments. Once that works, we can add
>> optimisations for small argument counts using a quadratic number of
>> stubs governed by a new configuration value, say fast_app_limit
>> (instead of the current max_app, which must coincide with the max_app
>> in the closLang semantics).
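For concreteness, the behaviour such a single gen_app stub must implement can be sketched in plain Python (a hypothetical model, not the actual clos_to_bvl code; real CakeML closures also receive the closure value itself as an extra argument, which this sketch omits). A closure carries its code pointer, arity, and any stored arguments, and one generic apply handles partial, exact, and over-application:

```python
def gen_app(clo, args):
    # clo models a closure value: (code pointer, arity, stored args)
    code, arity, stored = clo
    args = stored + args
    if len(args) < arity:
        return (code, arity, args)        # partial application: store args
    result = code(*args[:arity])          # saturated call
    if len(args) == arity:
        return result
    return gen_app(result, args[arity:])  # over-application: keep applying
```

The point is that one piece of code covers every arity/argument-count combination at runtime, rather than one stub per combination.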
>>
>> Summary:
>> - we can have a single gen_app_stub
>> - we should decouple the current max_app from clos_to_bvl
>>
>> Cheers,
>> Magnus
>>
>> -------
>>
>> From Ramana:
>>
>>>> There are really two competing problems:
>>>> a) we need max_app to be high to get good code
>>>> b) we need max_app to be low to avoid huge stubs code
>>>
>>> Supposing we don't remove max_app, and make it high to address (a) above.
>>> Then is it possible to address (b) by doing dead code elimination on
>>> the stubs that aren't necessary?
>>> I guess the problem is that what's necessary is not statically known.
>>> So maybe not.
>>
>> I really doubt you can do dead-code elimination on the stubs because
>> they (almost) all refer to each other. So if you have one unknown app
>> in your program, then practically all of these stubs will be
>> potentially live.
>
> Yeah, all you can do is note that if there are no 8-argument functions, then you don’t have to generate the 8-argument stubs. OCaml does this, but I don’t think it’s worth the effort for us.
>
>>> But what is OCaml doing to be able to produce lots of stubs without
>>> taking a code size performance hit?
>
> I think it’s just a fast compiler.
>
>> From Scott:
>>
>> On 5 March 2017 at 19:50, Scott Owens <S.A.Owens at kent.ac.uk> wrote:
>>> I’ve been thinking about this all day, just trying to cut the size and number of stubs, not get rid of max_app entirely yet. It’s not easy to get something that will work without slowing down some things that I’d rather not slow down: applications of functions with too many arguments, and applications of functions that have previously been partially applied and are now given the rest of their arguments.
>>>
>>> I’m still working on it, but I think it’s worth recording my thoughts at this point. The difficulty is essentially due to the Call and Cons instructions needing to have all of their arguments listed out in the code.
>>
>> What you say is only true of Cons. You should use FromList, which
>> builds a Block from a list. This is how arbitrary vectors are created.
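As a rough model of what FromList does (plain Python, with tuples standing in for Blocks and nested pairs for cons-lists; not the actual BVL semantics, which are in bvlSemScript.sml linked above):

```python
def from_list(tag, lst):
    # Walk the cons-list (nil = (), cons = (head, tail)) and build a
    # Block, modelled as a tag plus a tuple, of whatever length the
    # list happens to have at runtime.
    out = []
    while lst != ():
        head, tail = lst
        out.append(head)
        lst = tail
    return (tag, tuple(out))
```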
>
> Ok, but Call is the bigger problem. Also, allocating a bunch of pairs on the heap to form the list, then copying from a block into the pairs, then looping over the list again to form a block seems needlessly wasteful. The more specialised operation of building a block with some arguments directly and then copying in the rest of the block wouldn’t have this overhead, and it would be similar semantically and probably not much harder to verify than the list-building code.
>
>>> Consider the following case: I have a 5-argument function that has been given 3 arguments previously, and now it has been called with 2 more. This means that there is a stub in the partial application closure wrapper that contains a call with 5 arguments that looks something like the following, using named variables:
>>>
>>> stub_5_3 <x, y, cl> =
>>>  Call None [x; y; El cl 3; El cl 4; El cl 5; El cl 2; El (El cl 2) 0]
>>>
>>> In order, the call has the two variables just passed in, the 3 variables stored in the partial application closure wrapper, the original closure, and finally, the code pointer of the actual function, which is what Call is calling.
>>>
>>> Of course, we need a stub for each size of call to make, and each different split between the current and stored arguments. There are O(max_app^2) such functions, and each one has O(max_app) arguments, so the whole thing is cubic. I don’t think there is a way to improve on this without changing BVL, and I double checked that OCaml does indeed generate all of these stubs, so there is some indication that they might not be a real problem.
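The layout described above can be modelled in plain Python (hypothetical names; tuples standing in for closure records, and Python indexing standing in for El) to make the indexing concrete, together with a count showing where the cubic total code size comes from:

```python
def stub_5_3(x, y, cl):
    # cl is the partial-application wrapper: element 2 holds the
    # original closure, elements 3..5 the three stored arguments.
    orig = cl[2]                      # El cl 2
    a, b, c = cl[3], cl[4], cl[5]     # El cl 3 .. El cl 5
    code = orig[0]                    # El (El cl 2) 0: real code pointer
    return code(x, y, a, b, c, orig)  # the 5 args, then the closure

def total_stub_size(max_app):
    # One stub per (arity n, stored count k) pair with 0 < k < n,
    # i.e. O(max_app^2) stubs, each with ~n arguments: cubic overall.
    return sum(n for n in range(2, max_app + 1) for k in range(1, n))
```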
>>
>> See my comment above. BVL already supports producing arbitrary Block
>> values without Cons.
>
> I don’t see how this is relevant; the problem here is with the Call, and this code doesn’t make blocks.
>
>>> The only solution that I can think of is to introduce a new kind of Call into BVL to give me an alternate calling convention. Essentially, I would need a Call variant that could take a pointer to a block as an argument and then do the call with the elements of the block as arguments. This would be implemented later on in the compiler, once the calling convention is visible, by emitting a loop that iterates over the block and pushes the contents onto the stack. Then you could just have one stub per number of arguments given immediately (the 2 arguments in the example).
>>
>> This would be terrible to implement.
>
> I suspected as much.
>
>>> It doesn’t seem worth the disruption to add just for this, unless we would use it for other things too. Also there are some details to work out, like passing the code pointer closure record first so that the block pointer comes at the end. Also looping to extract the tuple’s contents might be slower than doing it directly.
>>>
>>> Once you have a quadratic number of stubs, you need a quadratic sized jump table to turn the runtime argument number calculations into the labels.
>>
>> You don't need a quadratic number of stubs, see comments above.
>
> I still think we do since FromList doesn’t help with Calls.
>
>>> Now let’s turn to reducing the n generic application functions to 1, as Magnus proposed below. Consider the following:
>>>
>>> We have a function of 3 arguments, and the application site has just passed in 5.
>>>
>>> Right now, this is handled by a function like the following:
>>>
>>> gen_app_5_3 <v,w,x,y,z,cl> =
>>>  Let [Call NONE [v; w; x; cl; El cl 0]]
>>>    mk_call cl [y;z]
>>>
>>> Again there are a quadratic number, since there are two calls of different lengths. In the real code, there is just gen_app5, and it contains a jump table to get the proper body based on the run-time information of how many arguments the closure value expects. If we just have gen_app and it takes a tuple of length 5, then instead of a Jump table followed by the code above, we have two options:
>>>
>>> - Generalise the extended Call with block argument that I mentioned above, so that I can specify a slice of the tuple to be used for arguments.
>>> - Build a jump table based on how many arguments the function expects, and then copy them out of the tuple. (A specialised mk_call can re-use the same tuple with an index for how many arguments have been used up so far.)
>>>
>>> In each case, we allocate a tuple on the heap just to immediately start copying them back out.
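The second option can be sketched in plain Python (a hypothetical model, not the real mk_call): the specialised mk_call keeps the single argument tuple and an index recording how many arguments have been consumed so far, copying out just each closure's share:

```python
def mk_call(clo, args, used):
    # clo models a closure as (code pointer, arity); the code also
    # receives the closure itself as its final argument.
    code, arity = clo
    share = args[used:used + arity]     # copy this call's share out
    result = code(*share, clo)
    used += arity
    if used == len(args):
        return result
    return mk_call(result, args, used)  # over-application: keep going
```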
>>
>> Regarding "allocate a tuple on the heap just to immediately start
>> copying them back out": this is what a lot of SML code does. I don't
>> see this as being a major hit: the reason is that the alternative is
>> to write these to memory on the stack. The only difference in
>> performance is that allocation on the heap brings the next GC cycle
>> closer. We will soon have a generational GC which I hope will help
>> ease the cost of having short lived heap allocated objects.
>
> It’s not just the extra allocation for heap vs stack, you are also not able to pass any of them in registers to the generic app if you tuple them all. But this might not be a big concern.
>
> Scott
