🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!


Implementing "coroutines" in a custom language

Julian Watzinger · 2021-06-08T06:20:14

Wow, I can't believe I'm actually saying this, but I'm almost finished with the rewrite of my visual-scripting backend. I've only got one thing left, and unfortunately it's a spicy one.

So, unless you've read a few of my previous threads on the topic, here's a quick heads-up: I'm writing a custom interpreted language, which works a bit like Java/C# (without any JIT). You have a stack, you push and pop values, and you have local variables that are addressed relative to the current “frame” pointer. It's also heavily tied to C++ right now: arrays are represented as std::vector (essentially), strings as std::wstring, and custom structs are supported.

Now to the main point of this thread. My visual script supports what are essentially coroutines: a function that can yield execution and resume it at a later point in time. While I have a general idea of how I want to handle this, I have some trouble with exactly where to store things when coroutines are involved. So let's look at a quick example to see what I mean. Here's some “code” from the VS:

Pretty simple stuff. Create a local variable, increment it, and then log the result “1”. Without any optimization, that would result in the following instructions:

AddSpace 40
PushIntConst 0
StoreWord %rbp-4
LoadAsRef %rbp-4
IncrementInt
LoadWord %rbp-4
CallNative 40, IntToString[Inline]
CallNative 40, LogText[Inline]
DestroyString %rbp-40
Pop 40
Return

I hope this is somewhat comprehensible. Reserve space for the local variables, init the first variable at “frame-pointer - 4 bytes” to “0”, increment it, convert it to a temporary string (stored at frame-pointer - 40) and log it. Then destroy the string, remove the space allocated for the local variables from the stack and return. So far, so good.

Now, let's look at the case of a coroutine:

Now things are getting a lot more interesting, but let's focus on one thing that I'm currently having trouble with: where do I store the data of the call?
Previously, I would store the first local variable after the frame-pointer. That would not work by itself now, since once the “wait” is reached the function is suspended, the stack needs to be reused by other functions, and in each following update I need to try to resume the function until the 5 seconds have passed, after which I need to increment the local variable and log it.

Now I know that usually, coroutines allocate a special frame where they put their data. However, I have one main problem with that: my visual script doesn't need coroutines to be specially designated. What I mean is, you can simply place a suspension call (like Wait) wherever you want, and the function will become a coroutine. Combine that with me having virtual functions, and I cannot realistically know whether the entry point into my bytecode will be a coroutine or not (I could sometimes tell for sure that it won't be, but as soon as a virtual function is used, this guarantee is out of the window).

Lastly, as you can see by the usage of the reference type (the <>-pins), I can pass references to data around, which are pretty much just C++ references/pointers. That means the addresses of elements need to be preserved between suspension and resumption of coroutines as well.

_____________________________

Now with all that said, what options do I really have? I can think of two main ways to handle this, but both have downsides:

1) Store everything on the stack as usual. When a suspension point is reached for the first time, I allocate a frame on the side and copy the whole content of the stack to that frame. On resume, I copy the whole frame back to exactly the same location it was at before suspension. This is at least correct, based on my constraints on what custom types can or can't do. Also, at the point where the resume happens, the stack is guaranteed to be empty.
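The copy-aside idea described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the stack is a single fixed buffer, and the names Stack, SavedFrame, suspend, resume and demoCopyRoundTrip are made up here and are not taken from the actual backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical interpreter stack: one fixed buffer that is never reallocated,
// so an offset into it always maps to the same address.
struct Stack {
    std::vector<std::uint8_t> bytes;
    std::size_t top = 0;  // offset of the first free byte
    explicit Stack(std::size_t size) : bytes(size) {}
};

// Hypothetical side frame holding the stack slice of a suspended call.
struct SavedFrame {
    std::size_t base = 0;            // offset the slice has to return to
    std::vector<std::uint8_t> data;  // copied stack contents
};

// Suspend: copy [base, top) aside and give the space back.
// The cost is O(N) in the size of the copied slice.
SavedFrame suspend(Stack& s, std::size_t base) {
    assert(base <= s.top);
    SavedFrame f;
    f.base = base;
    f.data.assign(s.bytes.data() + base, s.bytes.data() + s.top);
    s.top = base;
    return f;
}

// Resume: copy the slice back to the same offset, so addresses taken before
// the suspension are valid again. Fails if that region is currently in use.
bool resume(Stack& s, const SavedFrame& f) {
    if (s.top > f.base) return false;
    if (f.base + f.data.size() > s.bytes.size()) return false;
    std::copy(f.data.begin(), f.data.end(), s.bytes.data() + f.base);
    s.top = f.base + f.data.size();
    return true;
}

// Demo: a local at offset 0 survives a suspend/resume round trip even though
// the stack memory is overwritten in between.
int demoCopyRoundTrip() {
    Stack s(64);
    const int one = 1;
    std::memcpy(s.bytes.data(), &one, sizeof one);  // the local variable
    s.top = sizeof one;
    SavedFrame f = suspend(s, 0);           // the call yields
    std::memset(s.bytes.data(), 0xFF, 16);  // other calls reuse the stack
    if (!resume(s, f)) return -1;           // next update: restore in place
    int value = 0;
    std::memcpy(&value, s.bytes.data(), sizeof value);
    return value;
}
```

In this sketch resume refuses to run when the target region is occupied, which is the situation where the saved slice cannot go back to its original address.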
There are just a few downsides:

- It's O(N) in the size of the stack. Even though the copies only happen for coroutines, as I said, I use them quite frequently.
- Having to copy to the exact stack location (to preserve addresses) means that you could eventually run out of stack space without the stack actually being full (it's not always the full stack that gets preserved, for example when you call a bytecode method from some native code that was called from bytecode, to put it simply).
- Technically, if I took a reference to a local from that method, stored it in some native method and used it while the coroutine is suspended, I'd get madness. This is not a realistic issue (and I could probably prevent it in the compiler), but it's still not that nice.

2) My other idea: instead of having one stack, I have a pool of stacks which are all a bit smaller. When a call becomes suspended, I simply put that stack aside and make the next one “active”. On resume, I only need to make the parked stack “active” again, and mark it as “free” when the coroutine is done. This has the upside of being O(1) independent of the size of the stack, and there are no issues with invalid references or running out of stack space like in scenario 1).

However, this also has a few downsides:

- Since the stack in use is no longer fixed, every call now needs an additional indirection, essentially pessimising the entire interpreter (while the solution above would only incur overhead when a coroutine is used).
- Since a stack cannot be resized, this solution has serious issues with memory. The language should be able to run many coroutines at the same time, perhaps 100s or 1000s. If each of those needs a separate stack, it could easily use a few hundred megabytes, if not gigabytes (currently the stack is ~3MB; I could make it only 0.5-1MB for this option, but then I can see myself running into stack overflows).

So yeah, that's where I'm at.
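For comparison, the pool-of-stacks idea could look roughly like this. It is a sketch under assumptions, not the real design: ScriptStack, StackPool, Interpreter and demoPool are hypothetical names, and the pool has a fixed number of fixed-size stacks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical fixed-size script stack; its buffer never moves.
struct ScriptStack {
    std::unique_ptr<std::uint8_t[]> bytes;
    std::size_t size;
    std::size_t top = 0;
    explicit ScriptStack(std::size_t n) : bytes(new std::uint8_t[n]()), size(n) {}
};

// Pool handing out whole stacks in O(1).
class StackPool {
public:
    StackPool(std::size_t count, std::size_t stackSize) {
        for (std::size_t i = 0; i < count; ++i) {
            stacks_.push_back(std::make_unique<ScriptStack>(stackSize));
            free_.push_back(stacks_.back().get());
        }
    }
    // Returns nullptr when every stack is taken.
    ScriptStack* acquire() {
        if (free_.empty()) return nullptr;
        ScriptStack* s = free_.back();
        free_.pop_back();
        return s;
    }
    void release(ScriptStack* s) {
        s->top = 0;
        free_.push_back(s);
    }
    std::size_t freeCount() const { return free_.size(); }

private:
    std::vector<std::unique_ptr<ScriptStack>> stacks_;
    std::vector<ScriptStack*> free_;
};

struct Interpreter {
    StackPool pool;
    ScriptStack* active;  // the extra indirection every stack access pays

    Interpreter(std::size_t count, std::size_t stackSize)
        : pool(count, stackSize), active(pool.acquire()) {}

    // Suspend: park the active stack and continue on a fresh one. No copying.
    ScriptStack* suspend() {
        ScriptStack* parked = active;
        active = pool.acquire();
        return parked;
    }
    // Resume: assumes the current active stack is empty, so it can be freed.
    void resume(ScriptStack* parked) {
        if (active) pool.release(active);
        active = parked;
    }
};

// Demo: data keeps its value and its address across suspend/resume.
int demoPool() {
    Interpreter vm(2, 1024);
    vm.active->bytes[0] = 42;
    vm.active->top = 1;
    std::uint8_t* address = &vm.active->bytes[0];
    ScriptStack* parked = vm.suspend();
    if (!vm.active) return -1;  // pool exhausted
    vm.active->bytes[0] = 7;    // another script runs on the second stack
    vm.resume(parked);
    return (&vm.active->bytes[0] == address) ? vm.active->bytes[0] : -1;
}
```

The memory concern is easy to see from the numbers in the post: 1000 suspended coroutines with 0.5MB stacks each would already reserve roughly 500MB.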
I did initially plan to go for option 1), but I'm not so sure anymore. I think option 3) would be to just force coroutines to be declared explicitly, though that kind of goes against the idea I originally had for my visual scripting, where those kinds of things should be very easy and quick for prototyping, writing gameplay scripts and so on.

----------------------------

Sorry, this has been very long, but it's also a hard question to put out there without any context. I hope everything has been mostly understandable; if there are still questions, please let me know. Otherwise, does anybody have some clever idea that I've been missing so far? I tried looking at the generated assembly for C++ coroutines, but it's quite a lot of boilerplate and not that easy to understand (and it also does not solve my core problem, as coroutines in C++ have to be declared for a method).

Thanks in advance for any advice!

General and Gameplay Programming Programming

Started by Juliean June 04, 2021 12:25 PM

16 comments, last by Juliean 3 years ago

Valakor

June 05, 2021 06:31 PM

Ah sorry - terminology dies hard. We call our coroutines “threads”, which I'm using interchangeably with “coroutine”; they're not actual hardware threads. The language itself is single-threaded, but there are many execution contexts alive at once. Some execute to completion immediately, others yield until the following frame or until some condition is met, etc.

Juliean

7,351

Author

June 05, 2021 06:47 PM

Valakor said:
Ah sorry - terminology dies hard. We call our coroutines “threads”, which I'm using interchangeably with “coroutine”; they're not actual hardware threads. The language itself is single-threaded, but there are many execution contexts alive at once. Some execute to completion immediately, others yield until the following frame or until some condition is met, etc.

Ah yeah - then that's conceptually very close to what I'm trying to achieve, or what I had in my old backend. So if I could get away with using a small stack, I might end up doing something pretty similar. It's only that I'm almost certainly not able to resize the stack, otherwise I wouldn't be hesitant to try it out.

Valakor

June 05, 2021 06:54 PM

Could you make your stack out of a linked list of fixed-sized pages? You could allocate/free them very quickly and addresses into each page would always be stable. There'd be a little bit more overhead when computing the address of a stack variable but maybe that's acceptable? Could also maybe simplify things by ensuring a single function's stack space is always on one contiguous page rather than spanning multiple pages.
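A rough sketch of such a paged stack, assuming a page size of 4096 bytes and the rule that a single frame never spans two pages. PagedStack, Page, kPageSize and demoPagedStack are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// One fixed-size page; pages form a singly linked list from newest to oldest.
struct Page {
    std::uint8_t bytes[kPageSize] = {};
    std::size_t used = 0;
    std::unique_ptr<Page> prev;
};

class PagedStack {
public:
    PagedStack() : top_(std::make_unique<Page>()) {}

    // Reserve a frame. If it does not fit on the current page, a new page is
    // linked in, so the frame is always contiguous.
    std::uint8_t* pushFrame(std::size_t size) {
        assert(size <= kPageSize);
        if (top_->used + size > kPageSize) {
            auto page = std::make_unique<Page>();
            page->prev = std::move(top_);
            top_ = std::move(page);
        }
        std::uint8_t* frame = top_->bytes + top_->used;
        top_->used += size;
        return frame;
    }

    // Release the most recent frame; an emptied page is unlinked and freed.
    void popFrame(std::size_t size) {
        assert(size <= top_->used);
        top_->used -= size;
        if (top_->used == 0 && top_->prev) top_ = std::move(top_->prev);
    }

    std::size_t pageCount() const {
        std::size_t n = 0;
        for (const Page* p = top_.get(); p != nullptr; p = p->prev.get()) ++n;
        return n;
    }

private:
    std::unique_ptr<Page> top_;
};

// Demo: the first frame keeps its address and contents while a second frame
// forces a new page and is popped again.
bool demoPagedStack() {
    PagedStack s;
    std::uint8_t* a = s.pushFrame(3000);
    a[0] = 1;
    std::uint8_t* b = s.pushFrame(3000);  // does not fit, goes to a new page
    b[0] = 2;
    bool grew = (s.pageCount() == 2) && (a[0] == 1);
    s.popFrame(3000);
    return grew && s.pageCount() == 1 && a[0] == 1;
}
```

Because existing pages are never moved or resized, pointers into them stay valid while the stack grows. A real implementation would probably keep freed pages on a free list instead of releasing them at every page boundary.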

Juliean

7,351

Author

June 05, 2021 07:10 PM

Valakor said:
Could you make your stack out of a linked list of fixed-sized pages? You could allocate/free them very quickly and addresses into each page would always be stable. There'd be a little bit more overhead when computing the address of a stack variable but maybe that's acceptable? Could also maybe simplify things by ensuring a single function's stack space is always on one contiguous page rather than spanning multiple pages.

Perhaps something like this could work - I'm not 100% sure right now, I'd probably have to solve a few things first. For example, since I don't have registers, arguments and return values are placed right before the current stack frame and are addressed with negative offsets. Arguments/inputs would work relatively easily, but for return values I'd probably need to either copy all return values between both pages, or store references to all return values in the called function's frame (which is a bit suboptimal for primitive types as well).

I think the bottom line is that I'll need some benchmarks to see how the different approaches perform (or at least how much slower whichever approach I end up with is for non-coroutine functions). Performance has been a major reason why I pretty much had to do the whole bytecode thing in the first place, so I'm a bit over-sensitive on the topic.

Shaarigan

1,471

June 07, 2021 07:01 AM

What I came across when investigating coroutines (C#/Unity) and Tasks (C#/async-await) while researching how to write our engine's work scheduling is that both of those ‘concepts’ rely on compiler magic. The C# compiler creates a class when debugging and a struct in release code, which acts like a state machine. That means everything that is reused across suspension points, like your local variables, becomes a field in the class/struct, which is allocated when the function is invoked.

When Microsoft added the yield keyword to its language, which was with C# 2.0, the initial intention was to make writing iterator code easier. That's why those functions' ‘break-points’ always have to return an iterator object, which controls the state of the state machine. The function itself is split into its instructions and translated into a switch-case statement, where each case corresponds to one state of the state machine. When the function re-enters execution, the state of the iterator is fed into the switch statement to jump right back to the case that should execute next.
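To make the state-machine transformation more concrete, here is a hand-written C++ sketch of what such generated code could look like for the thread's example (set a local to 0, wait 5 seconds, increment it). The struct and its members are hypothetical, and this is not what the C# compiler literally emits.

```cpp
#include <cassert>

// Hand-written stand-in for a compiler-generated state machine.
// The former local variable lives in a field, so it survives suspension.
struct WaitThenIncrement {
    int state = 0;           // which case to continue at on the next call
    int counter = 0;         // former local variable
    double remaining = 0.0;  // seconds left to wait

    // Called once per update with the elapsed time in seconds.
    // Returns true once the function has run to completion.
    bool resume(double dt) {
        switch (state) {
        case 0:  // entry: initialise the local and start waiting
            counter = 0;
            remaining = 5.0;
            state = 1;
            return false;  // suspended at the wait
        case 1:  // re-entry: keep waiting until the time has passed
            remaining -= dt;
            if (remaining > 0.0) return false;
            ++counter;     // code after the wait
            state = 2;
            return true;
        default:  // already finished
            return true;
        }
    }
};

// Demo: drive the state machine with one-second updates until it completes.
int demoStateMachine() {
    WaitThenIncrement co;
    while (!co.resume(1.0)) {
        // the host would run other scripts here
    }
    return co.counter;
}
```

Nothing of the call remains on the interpreter stack between two resume calls, which is why this approach sidesteps the storage question, at the price of the compiler having to know up front which functions can suspend.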

Unity used that tech to write their coroutines, which act the same way and also have to return an IEnumerator, the C# iterator interface. Unlike plain .NET, Unity has a call to start a coroutine, which in fact just executes the iterator up to the point where the first yield is reached and then adds it to a queue that is processed at a certain time each frame, depending on the Unity-specific object returned by the yield.

Tasks, which were introduced in .NET 4, work somewhat similarly. An async function, which uses await instead of yielding an iterator, is split into a state machine in the same way and executed step by step. The difference here is that Tasks are not iterators, so the compiler has to create different objects to store local variables, but a class or a struct is generated as well. And Tasks are run by a scheduler instead of by simply calling the IEnumerator interface method ‘MoveNext’.

There is an interesting article about Tasks and how they work under the hood here:

https://devblogs.microsoft.com/premier-developer/dissecting-the-async-methods-in-c/

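Btw. C++ has std::async for launching asynchronous work, though that is future-based rather than a compiler-generated state machine; the closer counterpart to this principle is C++20 coroutines (co_await).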

https://en.cppreference.com/w/cpp/thread/async

But since we didn't want to write our own compiler or rely on Microsoft's state machine solution, one thing I came across was an implementation based on a GDC talk from Naughty Dog about cooperative multitasking on the PS4. This is what we use as our basis.

This solution allocates some memory pages from the OS in order to create stack frames on them. Those stack frames are then initialized from the current thread on the CPU. When you want to start something as a Task, you just switch context: load the Task's state into the CPU registers, set the stack pointer into its memory page, and the function is executed as if it were called from somewhere in your current program. Switching back to the calling method works the same way, but must be done from within the Task.

What we ended up with is a fixed number of pre-allocated threads running our scheduler, which bootstraps the switch into the Task and manages a few things: ensuring the thread returns to the scheduler after the Task has finished, maintaining the state of the Task (in case an ‘await’ was called), and context-bound variables.

We allocate a bunch of memory pages and return them to a stack whenever a Task has finished. The good thing about memory pages is that the OS manages them: they can be allocated but are only initialized on first use, so keeping a bunch of them in a list doesn't increase the memory consumption of the program in general.

Maybe you want to implement some hybrid solution. As you say your language can break at any point in time, maintaining a number of stacks in the background and performing a switch-over might be the way to go here. This way you can execute your scripts up to a certain point on the primary stack and copy everything necessary to a temporary stack when breaking. When your script is executed again, your runtime can check for stacks associated with a state object (which you need to keep somewhere in order to be able to rejoin the execution) and, if one exists, load everything from that stack and continue.

Juliean

7,351

Author

June 07, 2021 10:37 AM

Shaarigan said:
Maybe you want to implement some hybrid solution. As you say your language can break at any point in time, maintaining a number of stacks in the background and performing a switch-over might be the way to go here. This way you can execute your scripts up to a certain point on the primary stack and copy everything necessary to a temporary stack when breaking. When your script is executed again, your runtime can check for stacks associated with a state object (which you need to keep somewhere in order to be able to rejoin the execution) and, if one exists, load everything from that stack and continue.

Thanks for the detailed reply! Yeah, that's about what I originally had in mind with my idea #1. There are only a few issues with it (but after discussing everything, it seems no optimal solution exists unless I'm willing to require explicit marking, or at least explicit disabling, of coroutines).

The main issue with copying the stack is that references (= addresses) to values on the stack can be taken, both from the scripting runtime and from the native backend. Let's just say for now that I can ensure no stale references exist during the yield (which I can to a certain degree, though it's not 100% safe). That would still mean I need to copy everything back to exactly the same location on the stack on “resume” (which I could also do). However, I would pretty much need to do this every frame: since a yielding native call could take a “string” stored on the stack, like in my example, the stack would need to be in its original state during each check. Even if I worked around this, there are still cases where execution is resumed every frame (think of a while(true) yield return WaitForEndOfFrame kind of thing). That's where I'm afraid the constant cost of copying back and forth might be too much.

Now, admittedly, since this option is the one that requires the least amount of work, I think I'll just go ahead and start with it and see how much overhead there is. Everything else seems to require a lot more engineering: coming up with different data structures, requiring more instructions to be able to address things from different places, etc. I think that once I've got everything running and am able to benchmark my actual game, I should get a clearer picture of whether this approach is good enough or whether something more scalable is required.

Shaarigan

1,471

June 07, 2021 07:55 PM

The alternative is to fetch a stack on every execution and keep it for as long as the script is running. Variables should be secured, maybe by refcounting them, so they don't leave scope until the script is done with them. You could also use indices rather than plain pointers in your code, which is what I did for the DOM data structures I use for JSON, for example. This way you refer to the ‘location’ of your data rather than its exact memory address. Copying stuff around would then be much less painful and, to be honest, a memcpy call is about the fastest you can get in C++.
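As a small illustration of the index idea, assuming values live in a container and references are handles into it (ValueRef, ValueStore and demoIndices are made-up names for this sketch):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A handle that names a slot instead of a memory address.
struct ValueRef {
    std::size_t index;
};

// Hypothetical value storage that may be copied or moved around freely.
struct ValueStore {
    std::vector<int> slots;

    ValueRef push(int value) {
        slots.push_back(value);
        return ValueRef{slots.size() - 1};
    }
    int& get(ValueRef ref) {
        assert(ref.index < slots.size());
        return slots[ref.index];
    }
};

// Demo: the handle stays usable after the data has been copied elsewhere,
// which a raw pointer into the original storage would not.
int demoIndices() {
    ValueStore store;
    ValueRef ref = store.push(1);
    ValueStore relocated = store;  // copy the data somewhere else
    store.slots.clear();           // the original location is gone
    ++relocated.get(ref);          // the handle still finds the value
    return relocated.get(ref);
}
```

The trade-off is that every access needs to know which store the handle belongs to, and native functions can no longer receive plain C++ references to script values directly.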

Anyways, maybe you want to have a look at Rust and how they manage their different pointer types (the language, not the game perhaps)

Juliean

7,351

Author

June 08, 2021 06:20 AM

Shaarigan said:
The alternative is to fetch a stack on every execution and keep it for as long as the script is running. Variables should be secured, maybe by refcounting them, so they don't leave scope until the script is done with them. You could also use indices rather than plain pointers in your code, which is what I did for the DOM data structures I use for JSON, for example. This way you refer to the ‘location’ of your data rather than its exact memory address. Copying stuff around would then be much less painful and, to be honest, a memcpy call is about the fastest you can get in C++.

Using indices is out of the question. I'm quite happy to be able to do:

void loadGame(const std::string& file);

registerScriptFunction(&loadGame, "LoadGame(file)");

And not have to write some sort of ugly wrapper around it (I had to do that at one point and it really sucked). Before I go back to that, I'd rather do something that makes performance worse again.

Refcounting variables might be necessary if I want to go for 100% safety (if, for example, I assume users can always hold on to the “file” variable they get in the loadGame function), but I'm at least OK with requiring users to be somewhat cautious about storing references long-term (only the yielding case is the real challenge). Other than that, the compiler can already mostly see how long variables are actually needed.

This topic is closed to new replies.
