🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Implementing "coroutines" in a custom language

Started by
16 comments, last by Juliean 3 years ago

Ah sorry - terminology dies hard. We call our coroutines “threads” which I'm using interchangeably with “coroutine”; they're not actual hardware threads. The language itself is singly-threaded, but there are many execution contexts alive at once. Some execute to completion immediately, others yield until the following frame or some condition is met, etc.

Advertisement

Valakor said:
Ah sorry - terminology dies hard. We call our coroutines “threads” which I'm using interchangeably with “coroutine”; they're not actual hardware threads. The language itself is singly-threaded, but there are many execution contexts alive at once. Some execute to completion immediately, others yield until the following frame or some condition is met, etc.

Ah yeah - then thats conceptually were close to what I'm trying to achieve, or what I had in my old backend. So if I could get away with using a small stack, then I might end up doing something pretty similar - its only that I'm pretty certainly not able to resize the stack, otherwise I wouldn't be hesistant to try it out.

Could you make your stack out of a linked list of fixed-sized pages? You could allocate/free them very quickly and addresses into each page would always be stable. There'd be a little bit more overhead when computing the address of a stack variable but maybe that's acceptable? Could also maybe simplify things by ensuring a single function's stack space is always on one contiguous page rather than spanning multiple pages.

Valakor said:
Could you make your stack out of a linked list of fixed-sized pages? You could allocate/free them very quickly and addresses into each page would always be stable. There'd be a little bit more overhead when computing the address of a stack variable but maybe that's acceptable? Could also maybe simplify things by ensuring a single function's stack space is always on one contiguous page rather than spanning multiple pages.

Perhaps something like this could work - I'm not 100% sure right now, I'd have to probably solve a few things. For example, since I don't have registers, arguments and return-values are placed right before the current stack-frame and are addressed with negative offsets. Arguments/input would work relatively easy, but for return-values I'd probably need to eigther copy all return values between both pages, or store references to all return-values in the called functions frame (which is a bit suboptimal for primitive types as well).

I think the bottom line is that I'll need to have some benchmakrs to see how the different approaches could perform (or at least how much slower whichever approach I end up doing performs for non-coroutine functions). Performance has been a major factor of why I pretty much had to the whole bytecode-thing in the first place, so I'm a bit over-sensitive to the topic.

What I came accross when investigating Coroutines (C#/Unity) and Tasks (C#/async-await) in research for writing our engine's work scheduling is that those 'concepts' both rely on compiler magic. The C# compiler creates a class when debugging and a struct in release code which acts like a state machine. So means that everything reused, like your local variables, is a field in the class/struct which is allocated when the function is invoked.

At the point in time when Microsoft added the yield keyword to it's language, which was around .NET 3.5, the initial intention was to make writing iterator code more easy. That's because those functions' ‘break-points’ always have to return an iterator object, which controlls the state of the state machine. The function itself is then split into it's instructions and translated into a switch-case statement where all cases are related to one state of the state machine. When the function then re-enters execution, the state of the iterator is taken and feeded into the switch statement to jump right back to the case statement executed next.

Unity used that tech to write their coroutines, which act the same and also have to return an IEnumerable, the C# Iterator Interface. Unless plain .NET, Unity has a call to start a coroutine which in fact just executes the iterator until certain point when the first yield is reached and then adds those to a queue which is processed on certain time each frame, related to the Unity specific iterator returned.

Tasks, which were introduced in .NET 4, work somehow similar. An async function which yields an await instead of a return of an iterator, is the same way split into a state machine and executed one by one. The difference here is that Tasks are not IEnumerable and so the compiler has to create different objects to store local variables, but a class e.g. a struct is generated as well. And Tasks are run by a scheduler instead of simply calling the IEnumerable interface method ‘MoveNext’.

There is an interesting article about Tasks and how they work under the hood here:

https://devblogs.microsoft.com/premier-developer/dissecting-the-async-methods-in-c/

Btw. this priciple is also implemented in C++ async/await

https://en.cppreference.com/w/cpp/thread/async

But as far as we don't want to write our own compiler and don't rely on Microsofts' state machine solution, one thing I came accross was an implementation acording to a GDC Talk from Naughty Dog about cooperative multitasking on the PS4. And this is what we use on the basics.

This solution allocates some memory pages from the OS in order to create stack-frames on them. Those stack-frames are then initialized from current thread on the CPU. When you want to start something as Task, you just switch-load the memory page into the CPU registers, set the stack pointer and the function is executed as if it were called from somewhere in your current program. Switching back to the calling method works on the same way but must be done from within the Task.

What we ended up with is a fixed amount of pre-allocated threads running our scheduler which then bootstraps the switch into the Task and manages some things like ensuring to return the thread into the scheduler after the Task has finished, maintain the state of the Task (in order an ‘await’ was called) and context bound variables.

We allocate a bunch of memory pages and return them to a stack whenever a Task has finished. The good about memory pages is that the OS manages them and they can be allocated but are initialized on first use only, so keeping a bunch of them in a list doesn't increase the memory consumption of the program in general.

Maybe you want to implement some hybrid solution, as you say your language can break at every point in time, maintaining an amount of stacks in the background and perform a switch-over might be the way to go here. This way you can execute your scripts up to a certain point on the primary stack and copy everything necessary to a temporary stack when breaking. After your script is executed again, you can have your runtime check for stacks associated with a state object you need to keep somewhere in order to be able to re-join the execution and if some exists, load everything from that stack and continue

Shaarigan said:
Maybe you want to implement some hybrid solution, as you say your language can break at every point in time, maintaining an amount of stacks in the background and perform a switch-over might be the way to go here. This way you can execute your scripts up to a certain point on the primary stack and copy everything necessary to a temporary stack when breaking. After your script is executed again, you can have your runtime check for stacks associated with a state object you need to keep somewhere in order to be able to re-join the execution and if some exists, load everything from that stack and continue

Thanks for the detailed reply! Yeah, thats about what I had originally in mind with my idea #1. There's only a few issues with it (but after discussing everything, it seems there exists no optimal solution unless I'm willing to require explicit marking, or at least explicit disabling of coroutines).

The main issue with copying the stack is, that references(=addresses) of values on the stack can be taken, both from the scripting-runtime as well as the native backend. Lets just say at that point, that I can ensure that no stale references exist during the yield (which I can to a certain degree, though its also not 100% safe). That would still mean that I would need to copy everything back to exactly the exact location of the stack on “resume” (which I could also do). However, I would pretty much need to do this every frame - since a yielding-native call could take a “string” stored on the stack, like in my example, the stack would need to be in its original state during each check. Even if I did work around this, there are still cases where the excetion is resumed every frame (think a while(true) yield return WaitForEndOfFrame - kind of thing). Thats were I'm afraid the constant cost of copying back and forth might be too much.

Now, admittetly, since this option is the one that requires the least amount of work, I think I'll just go ahead and start with it and see how much overhead there is. Everything else seems to require a lot more engineering with coming up with different data-structures, requiring more instructions to be able to address things from different places etc… I think that once I've got everything running and am able to benchmark my actual game, I should get a clearer picture if this approach is good enough or if something more scalable is required.

The alternative is to fetch a stack on every execution and keep it for as long as the script is running. Variables should be secured, maybe by refcounting them, to not leave scope until the sctipt is done with them. You could, which I also did for my DOM data structures used by JSON for example, not use plain pointers in your code rather than indices. This way you are not relating to the exact memory address rather than the ‘location’ of your data. So copying stuff around would be of much less pain and being honest, a memcpy call is fastest you can get in C++.

Anyways, maybe you want to have a look at Rust and how they manage their different pointer types (the language, not the game perhaps)

Shaarigan said:
The alternative is to fetch a stack on every execution and keep it for as long as the script is running. Variables should be secured, maybe by refcounting them, to not leave scope until the sctipt is done with them. You could, which I also did for my DOM data structures used by JSON for example, not use plain pointers in your code rather than indices. This way you are not relating to the exact memory address rather than the ‘location’ of your data. So copying stuff around would be of much less pain and being honest, a memcpy call is fastest you can get in C++.

Using indices is out of the question. I'm quite happy to be able to do:

void loadGame(const std::string& file);

registerScriptFunction(&loadGame, "LoadGame(file)");

And not have to write some sort of ugly wrapper around it (I had to do that at one point and it really sucked). Before I do that, I just do something that makes performance worse again.

Refcouting variables might be neccessary if I want to go for 100% safety (if for example I assume users to be able to always hold on the the “file” variable that they get in the loadGame-function), but I'm at least ok to enfore users to be somewhat cautions with storing references long-term (only during yielding is the real challenge). Other than that, the compiler can already see mostly how long variables are actually needed.

This topic is closed to new replies.

Advertisement