
Implementing "coroutines" in a custom language

Started by
16 comments, last by Juliean 3 years ago

Wow, I can't believe I'm actually saying this, but - I'm almost finished with the rewrite of my visual-scripting backend. I've only got one thing left - and unfortunately it's a spicy one.

So, unless you've read a few of my previous threads on that topic, here's a quick heads-up: I'm writing a custom interpreted language, which works a bit like Java/C# (without any JIT). You have a stack, you push and pop values, and you have local variables that are addressed relative to the current “frame”-pointer. It's also heavily tied to C++ right now - arrays are represented as std::vector (essentially), strings as std::wstring, and custom structs are supported.

Now to the main point of this thread. My visual script supports the concept of what are essentially coroutines - a function that can yield execution and resume it at a later point in time. While I have a general idea of how I want to handle this, I have some trouble with exactly where to store things when coroutines are involved. So let's look at a quick example to see what I mean. Here's some “code” from the VS:

Pretty simple stuff. Create a local variable, increment it, and then log the result “1”. Without any optimization, that would result in the following instructions:

AddSpace            40
PushIntConst        0
StoreWord           %rbp-4
LoadAsRef           %rbp-4
IncrementInt
LoadWord            %rbp-4
CallNative          40, IntToString[Inline]
CallNative          40, LogText[Inline]
DestroyString       %rbp-40
Pop                 40
Return

I hope this is somewhat comprehensible. Reserve space for the local variables, initialize the first variable at “frame-pointer - 4 bytes” to “0”, increment it, convert it to a temporary string (stored at frame-pointer - 40) and log it. Then, destroy the string, remove the space allocated for the local variables from the stack and return. So far, so good.

Now, let's look at the case of a coroutine:

Now things are getting a lot more interesting - but let's focus on one thing that I'm currently having trouble with: Where do I store the data of the call? Previously, I would store the first local variable after the frame-pointer. That would no longer work by itself: once the “wait” is reached, the function is suspended, the stack needs to be reused by other functions, and in the next update I need to try to resume the function until the 5 seconds pass, after which I need to increment the local variable and log it.

Now I know that usually, coroutines allocate a special frame where they put their data. However, I'm having one main problem with that: my visual-script doesn't require coroutines to be specially designated. What I mean is, you can simply place a suspension-call (like Wait) wherever you want, and it will become a coroutine. Combine that with the fact that I have virtual functions, and I cannot realistically know whether the entry-point into my bytecode will be a coroutine or not (I could sometimes tell that it won't be for sure, but as soon as a virtual function is used, this guarantee is out of the window).

Lastly, as you can see by the usage of the reference-type (the <>-pins), I can pass references to data around, which are pretty much just C++ references/pointers - which means that the addresses of elements need to be preserved between suspension and resume of coroutines as well.

_____________________________

Now with all that said - what options do I really have? I can think of two main ways to handle this, but both have downsides:

  1. Store everything on the stack as usual. When a suspension-point is reached the first time, I allocate a frame on the side and copy the whole content of the stack to that frame. On resume, I copy the whole frame back to exactly the same location it occupied before suspension. This is at least correct, based on my constraints on what custom types can or can't do. Also, at the point where the resume happens, the stack is guaranteed to be empty. There are just a few downsides:
    1. It's O(N) in the size of the stack - and even though the copies only happen for coroutines, as I said I use them quite frequently.
    2. Having to copy back to the exact stack-location (to preserve addresses) means that you could eventually just run out of stack-space without the stack actually becoming full (it's not always the full stack that gets preserved, e.g. when you call a bytecode-method from some native code that was itself called from bytecode, to put it simply).
    3. Technically, if I took a reference to a local from that method, stored it in some native method and used it while the coroutine is suspended, I'd get madness. This is not a realistic issue (and I could probably prevent it in the compiler), but it's still not that nice.
  2. My other idea - instead of having one stack, I have a pool of stacks which are all a bit smaller. When a call becomes suspended, I simply put that stack on the side and make the next one “active”; on resume I only need to put the other stack back as “active”, and mark it as “free” when the coroutine is done. This has the upside of being O(1) independent of stack size, and there is no issue with invalid references or running out of stack-space like in scenario 1). However, this also has a few downsides:
    1. Since the used stack is now no longer fixed, every call now essentially needs an additional indirection, essentially pessimising the entire interpreter (while the solution above would have only incurred overhead when a coroutine is used).
    2. Since a stack cannot be resized, this solution has serious issues with memory. The language should be able to run many coroutines at the same time - perhaps 100s or 1000s. If each of those needs a separate stack, it could easily use a few hundred megabytes, if not gigabytes (currently the stack is ~3MB; I could make it only 0.5-1MB for this option, but then I can see myself running into stack-overflows).

So yeah, that's where I'm at. I did initially plan to go for option 1), but I'm not so sure anymore. I think option 3) would be to just force coroutines to be declared explicitly - though that kind of goes against the idea I originally had for my visual-scripting, where those kinds of things should be very easy and quick for prototyping and writing gameplay-scripts and so on.

----------------------------

Sorry, this has been very long, but it's also a hard question to put out there without any context. I hope everything has been mostly understandable - if there are still questions, please let me know. Otherwise, does anybody have some clever idea that I've been missing so far? I tried looking at the generated assembly for C++-coroutines, but it's quite a lot of boilerplate and not that easy to understand (and it also doesn't solve my core problem, as coroutines in C++ have to be declared per method).
Thanks in advance for any advice!


Writing your own interpreted language seems like an impressive challenge to take on. Looks fun!

I don't really have experience with coroutines or writing my own interpreter for that matter, so I can't say I fully understand the issue. However, the problem at its core seems to be about memory management.

A couple of questions:

  1. since you're in control of the memory space, why do you need to copy the allocated stack back and forth? As I see it, the interpreter should already be aware of all local variables and their addresses, which you could simply modify to refer to a different base address. Alternatively, if your variables are relative offsets from a stack base pointer, then you shouldn't need to worry about them at all. You mentioned you're internally using wstring and std::vector. Unless you're somehow using a custom allocator to change the default behavior, these will already be stored on the heap and are resident at a fixed address somewhere in the host program's address space. Or am I reading this wrong?
  2. Again, I'm not really a hundred percent sure how the single fixed stack idea is supposed to work, but given the numbers you're mentioning, this does not seem like a good general-purpose idea at all. To me the question becomes - why overthink? It seems that if you want full control over the amount of memory you're taking up, then you need to go with resizable stacks, which means you'll be running into the same issue as in (1) anyway. Right?

In either case it seems you need some sort of memory virtualization (e.g. add interpreter overhead) and the easiest thing would be to know where your data is and update the addresses as you move the base pointer around (either due to the stack growing, storing it for later use, etc.). If your thinking is along the lines of “I don't know what could happen, but I also don't want to set any hard limits”, then you probably need to prepare for the worst and start thinking in terms of the user and offset the responsibility to them (e.g. if they run out of memory, then it's because they're over-taxing their particular system). Which to me seems to mean that all of your memory management needs to be dynamic and any sort of fixed-size stack simply won't do (unless you manage to somehow allocate/extend them in chunks or pools, but again - this leads to even more virtualization).

That being said - perhaps one idea would be to use some sort of hybrid approach. I don't know how this would work exactly in your case, but here's what I did with my hash map class: internally, I completely separated the hash from the data, which is a choice between either moving more memory around per sort or accepting the fixed cost of a redirection per access. Moreover, I was worried about keeping the table sorted in a worst case scenario where the user performs a lookup, then adds an object, performs another lookup etc., which would cause a sort after every addition. None of this is good for the cache. My approach to mitigate this was adding a “slack” to the hash map (user-definable, but by default a few dozen items), which is basically a regular unsorted array tacked on to the end of the table. The class simply hides the fact that until “slack” number of new items have been added to it, it first iterates through the unsorted array (sequential in memory) part and then performs a cache-unfriendly look-up. For a small number of items, it never performs a sort or a non-linear look-up to begin with.

I guess what I want to say with this example is that you could also implement both solutions and switch between the two as memory requirements demand. If you can make the code faster for smaller coroutines, then that's most likely already a win in a vast majority of cases.

In any case - I hope I'm not missing the point completely here.

I have no idea how feasible it is for your language, but there is another option. I ran into it while looking at JSP (Jackson Structured Programming) by Michael Jackson (not the famous singer) in 1980-something. It's called program inversion. I tried finding it again and ended up at https://en.wikipedia.org/wiki/Inversion_of_control which apparently is closely related to program inversion. In the Background section there, they make the connection. Reference 6 points at a PDF, where the inversion part starts at page 92. You'll likely need to understand some parts before it to be able to read it.

[[Note that Jackson “yields” at IO-access; he wants to interleave program code that produces a result with code that uses the result, iirc.]]

In a nut-shell, what happens is a conversion from

def f():
  statement1;
  statement2;
  yield
  statement3;

to something like

def f():
  static int counter = 0; // Ignoring re-entrancy here.
  switch (counter):
    0:
      statement1;
      statement2;
      counter = 1;
      return;
    1:
      statement3;
      counter = -1; // Or 0 if it should be available again.

At every “yield” you insert a “counter = <next-part>; return” sequence. At the start of the body you dispatch to the first statement of the part indicated by the counter. This should get rid of the computation stack, leaving only the variables and the additional “counter” variable (assuming you want re-entrancy).

You can also yield inside loops. You'd have to break the loops into a bunch of numbered blocks as well, and jump between them. Jackson explains this all.

I haven't implemented something like this but I have thought about this problem before.

I studied Jackson's method in college, including inversion of control. It's how I implemented lazy move generation in my chess program a few years ago, because of C++'s lack of support for coroutines. However, for a general implementation in a language, you have to make sure that all the state of the function is saved. I think this quickly leads to an implementation using a separate frame for the function, which is basically what Juliean was asking about.

I think this could be solved by giving up the notion that function frames are in a stack, and instead using a more general memory management solution. Perhaps for speed of execution a stack could be used for functions that are known not to yield. Does your language have a compilation step where this can be found out?

As I said, I haven't implemented this, so perhaps I am missing something important. I might try to write a tiny prototype of an interpreted language with this feature, as a proof of concept.

irreversible said:
Writing your own interpreted language seems like an impressive challenge to take on. Looks fun ?!

It was indeed fun, but also a very long process - I started about a year ago. I had to pause halfway through due to health-issues, and occasionally slack off to play video-games, but it's still an incredibly gargantuan task!

irreversible said:
since you're in control of the memory space, why do you need to copy the allocated stack back and forth? As I see it, the interpreter should already be aware of all local variables and their addresses, which you could simply modify to refer to a different base address. Alternatively, if your variables are relative offsets from a stack base pointer, then you shouldn't need to worry about them at all. You mentioned you're internally using wstring and std::vector. Unless you're somehow using a custom allocator to change the default behavior, these will already be stored on the heap and are resident at a fixed address somewhere in the host program's address space. Or am I reading this wrong?

The copying would happen in idea 1) to preserve all the values which are normally removed from the stack as the function unwinds. For a suspended function, I need to both keep the values around and remove them from the stack (as otherwise you would run into stack-overflow, as well as be unable to resume if two coroutines happened right after one another). Referring to a different base address seems like my idea #2. The main issue with that is that if pretty much every function can potentially be a coroutine, I would pretty much lose the ability to use the stack at all (since if I store the int from my example on the stack and then run into the “yield”-Wait, it would already be on the stack). Also, under the same precondition I would need to use some target-memory which in itself is not resizable (as references need to be preserved) - so that's how I arrived at idea #2.

Regarding std::wstring/vector: the content of those is on the heap, yes. The variable itself can be placed on the stack (as you can see in my code-example). That means that a later function which accepts a “string” will do so by taking a reference to that variable, which is on the stack.

irreversible said:
In either case it seems you need some sort of memory virtualization (e.g. add interpreter overhead) and the easiest thing would be to know where your data is and update the addresses as you move the base pointer around (either due to the stack growing, storing it for later use, etc.). If your thinking is along the lines of “I don't know what could happen, but I also don't want to set any hard limits”, then you probably need to prepare for the worst and start thinking in terms of the user and offset the responsibility to them (e.g. if they run out of memory, then it's because they're over-taxing their particular system). Which to me seems to mean that all of your memory management needs to be dynamic and any sort of fixed-size stack simply won't do (unless you manage to somehow allocate/extend them in chunks or pools, but again - this leads to even more virtualization).

Hm yeah, that's along the lines of what I thought. One of the benefits of having a compiled language is that you can predetermine a lot of things, like which variables are used for how long, etc… My old system pretty much had to allocate all the return-values at once, even if they were never used, or calculate control-flow dynamically. But now, if coroutines are able to happen pretty much whenever, I again run into a situation where I can either pessimise everything in preparation for when a coroutine happens (which it often won't), or have to accept some overhead, as in my idea #1.

irreversible said:
That being said - perhaps one idea would be to use some sort of hybrid approach. I don't know how this would work exactly in your case, but here's what I did with my hash map class: internally, I completely separated the hash from the data, which is a choice between either moving more memory around per sort or accepting the fixed cost of a redirection per access. Moreover, I was worried about keeping the table sorted in a worst case scenario where the user performs a lookup, then adds an object, performs another lookup etc., which would cause a sort after every addition. None of this is good for the cache. My approach to mitigate this was adding a “slack” to the hash map (user-definable, but by default a few dozen items), which is basically a regular unsorted array tacked on to the end of the table. The class simply hides the fact that until “slack” number of new items have been added to it, it first iterates through the unsorted array (sequential in memory) part and then performs a cache-unfriendly look-up. For a small number of items, it never performs a sort or a non-linear look-up to begin with.

Funny, I made something pretty similar. I made a small_map-class, which is actually nothing more than a vector with a map-like interface. It's actually way faster for small numbers of elements (even for insertion - when I know that elements shouldn't be duplicates, I can just assert that the key is missing in debug and then push_back). But I'm not quite sure how I could apply the same to the interpreter. What's making things pretty fast right now is that all the accesses to local variables can happen at a known address on the stack, which is a few indirections, but nothing too fancy (and it should be in the cache). I'm worried I'd need to add a lot of instructions and pessimise my default use-case if I end up with a hybrid structure.

Alberth said:
At every “yield” you insert a “counter = ; return” sequence. At the start of the body you dispatch to the first statement of the part indicated by the counter. This should get rid of the computation stack, leaving only the variables and the additional “counter” variable (assuming you want re-entrancy).

I'll have to read into it - but that does indeed seem like how the control-flow of a coroutine can work. Not sure about that “static” though - wouldn't that mean the function can only be called once at a time? Maybe it'll make more sense once I check out the sources.

alvaro said:
I think this could be solved by giving up the notion that function frames are in a stack, and instead using a more general memory management solution. Perhaps for speed of execution a stack could be used for functions that are known not to yield. Does your language have a compilation step where this can be found out?

Yes, the language is compiled, and I can figure out when a function can absolutely never be a coroutine. The problem is with other methods, especially virtual ones. For a normal method, I could recursively figure out whether the method contains any coroutine-statement, but as soon as “virtual” is in play, I'm at a loss. I could certainly treat any method that uses any “virtual” method as being a coroutine; but I'd have to evaluate what percentage of methods that would affect, and how many of those cases realistically are actual coroutines versus how many end up just being pessimisations.

I could perhaps just add a flag for methods that marks them as “never a coroutine”. For normal methods, this would yield a compile-error when any code inside of it produces a coroutine. For virtual functions, this would simply prohibit all overriding methods from being coroutines themselves. That doesn't sound halfway bad - at least it would be a better option than having to mark everything explicitly as a coroutine, as in my current situation. I think I'll consider that. I still have a small issue here: with virtual methods, I don't know exactly how much memory I'll need to allocate on the frame (otherwise, I could precompile this information as well). So perhaps in that system, a virtual method would have to open a separate stack-frame altogether. I'll have to think about it - if it isn't too much overhead (in terms of development-time), I'd really like to profile this against my idea #1 in the actual game. But in terms of working-efficiency, it would certainly be better if I just stuck to one for now.

Juliean said:
Not sure about that “static”, wouldn't that mean that the function can only be called once at a time? Maybe it'll make more sense if I check out the sources.

Don't get too hung up on the precise semantics, it's mostly a random mix of languages anyway.

The idea behind “static” is just that its value is preserved between calls (or rather, continuation of calls). In reality you likely want to handle its value elsewhere but how exactly depends on your requirements. For example, in a more realistic setting, something like

class X:
  def __init__(self):
    self.counter = 0

  def f(self):
    if self.counter == 0:
      statement1
      statement2
      self.counter = 1
      return False
    elif self.counter == 1:
      statement3
      self.counter = 2
      return True
    else:
      assert False, "Incorrect call."

Instantiate the class, and repeatedly call obj.f() until it returns ‘true’. Extensions are possible to store return values in the instance, etc. In your language, you likely want to manage the counter somewhere else, completely hidden from the user.

My main goal was to push you a bit away from the idea that you must keep the entire stack between parts being executed.

Alberth said:
The idea behind “static” is just that its value is preserved between calls (or rather, continuation of calls). In reality you likely want to handle its value elsewhere but how exactly depends on your requirements. For example, in a more realistic setting, something like

Ok, yeah, that makes sense. From what I've seen this is similar to how most languages implement coroutines - I think C# also essentially creates a state-machine internally.

Alberth said:
My main goal was to push you a bit away from the idea that you must keep the entire stack between parts being executed.

The main reason why I said that, though, is that my current system doesn't know what is a coroutine. If I did what you suggested and precompiled a known coroutine method, this would work perfectly. However, if I had a function like:

def f():
  statement1;
  statement2;
  object.Virtual(); // implicitly possible to yield
  statement3;

Then the “virtual” call could, in the current setup, potentially result in a yield - or it might not; it's up to the object. This makes it a lot harder to just transform everything upfront, as I would now have to transform essentially every method that uses a virtual method into a coroutine-statemachine. That's why my original idea was to preserve the stack - it's a solution for on-demand coroutine handling, without prematurely having to mark pretty much every method as potentially yielding. I'm not saying it's the best idea - it does have the advantage of being pretty much zero-overhead as long as no coroutine is involved, but it scales badly with stack-size when one occurs. I did discuss an alternative in my reply to alvaro though, which would then end up using a solution like the one you proposed.

Juliean said:
Then the “virtual” call could, in the current setup, potentially result in a yield - or it might not; it's up to the object. This makes it a lot harder to just transform everything upfront, as I would now have to transform essentially every method that uses a virtual method into a coroutine-statemachine

I would likely simply assume everything is a coroutine. The cost is basically a dispatch table and a block-counter for the next continuation. If your code never yields, you initialize a counter and use the dispatch table once.

switch (counter):
 0:
   statement1;
   statement2;
   // fall-through
 1:
   ret = subcall();
   if (ret == 'yielded'):
       counter = 1;
       return
   // else call finished, continue
   statement3;

If the call yields, you need something like the above. If the code never yields, after the initial dispatch the only extra costs are a counter and 1 dispatch inside the “subcall”, and the “if” branch here that is not taken.

If “subcall” does yield, I think you need something like this anyway.

For code that certainly doesn't yield, you can drop the table and the counter.

This idea may not mesh architecturally with your language, but in the scripting language we use at the studio I work at, we've simplified all these problems in a couple of ways:

  1. All execution occurs in a “thread” (or coroutine). Whenever you want to execute some script you either create a new thread (which includes bytecode, stack, etc.) or resume a previously-created thread.
  2. Yielding involves returning execution to the calling C++ with a “yielded” return code (optionally with a value). This may be returning to external C++ game code, or to some recursively-evaluating bytecode interpreter of a separate thread.
  3. Because the thread is self-contained, yielding doesn't require extra state. The instruction pointer and stack are just left as-is when yielding control and when resuming we just start executing again at the next instruction.
  4. Threads start with a relatively small stack. I think it's something like 1K? Certainly not on the order of MBs.
  5. Stacks are resizable. Resizing happens relatively infrequently in practice and does require us to fixup stack references, but only within that thread. Objects all live on the heap, so external code / other threads don't need to be fixed up since the actual object hasn't moved.

Maybe there are some bits here that are interesting. We've used this scripting language mostly in this form on console hardware across multiple generations and it's done the job just fine.

Alberth said:
If the call yields, you need something like the above. If the code never yields, after the initial dispatch the only extra costs are a counter and 1 dispatch inside the “subcall”, and the “if” branch here that is not taken.

Wouldn't there also be overhead for pretty much every “local” variable? I haven't yet had time to check your link, but in order to preserve the values of locals when I don't really know whether the yield will happen, I'd have to start by storing the variables somewhere other than the stack.

def f():
  int i = 0;
  potentiallyYielding();
  return i;

Granted, it seems like I won't be able to totally avoid that cost anyway, but it still makes me a bit uneasy (I might have to do a few synthetic benchmarks to see how much it all really adds up).

Valakor said:
This idea may not mesh architecturally with your language, but in the scripting language we use at the studio I work at we've simplified all these problems in a couple ways:

Thanks for your insights! Using threads is actually an interesting idea - I usually don't do much threading, but if it's just to wait until execution finishes, it wouldn't be a problem. Did you measure the impact on non-yielding executions? It seems a bit overkill having to start a thread for each native→bytecode call (again, I don't do much threading, so that impression might just be false).

Valakor said:
Threads start with a relatively small stack. I think it's something like 1K? Certainly not in the order of MB's.

Hm, interesting. If I could go with such a small stack-size, then the possibility of swapping stacks seems a lot more reasonable. In that case, I really don't think I would even need something like a thread - aside from having to preserve the content of the stack, there's no reason why I couldn't yield in the current system (actual native yield-statements will likely be implemented based on C++-coroutines). Though there's one problem I see:

Valakor said:
Stacks are resizable. Resizing happens relatively infrequently in practice and does require us to fixup stack references, but only within that thread. Objects all live on the heap, so external code / other threads don't need to be fixed up since the actual object hasn't moved.

I don't see that working in my case. Since the native side can take references to objects on the stack, something like this would break if those references were invalidated:

// in c++
yield_statement<> loadScene_event(const std::string& path)
{
	co_yield WaitUntilCanLoad();
	
	loadSceneImpl(path);
}

“path” can easily be a reference to a value that is a local stored on the interpreter's stack. Even if I did something like heap-copying the string before passing it into the function (which I would like to avoid), native functions are still allowed to take e.g. an “int&”, which would then no longer work, it appears.

This topic is closed to new replies.
