Sunday 25 January 2009

A Note on Distributed Computing

Link: A Note on Distributed Computing

The paper isn't the most straightforward read, but it makes some good points that apply as much to the game programming I do today as they did back in 1994 when it was written.

...it is a mistake to attempt to construct a system that is “objects all the way down” if one understands the goal as a distributed system constructed of the same kind of objects all the way down.
Absolutely.

When you are programming server code and you use an object which is not necessarily located on the server your code is running on, you know this; you are using it precisely because of this. The same applies when you are programming client code: when you use objects which are located on the server, you do so because you intend to and need to. If your code invokes other code which does something remotely and you're not aware of it, then either you don't fully understand the code you are using, or you know it doesn't matter in the context of what you are doing.

Now, if you are using microthreads, and Stackless Python in particular, you could in theory decide that you don't want to allow remote calls, and simply disable the ability of anything you call to block.

Like this:

def some_function(self):
    # ...
    # Trap any attempt by code we call to block this tasklet.
    old_block_trap = stackless.getcurrent().block_trap
    stackless.getcurrent().block_trap = True
    try:
        some_other_function()
    finally:
        # Restore the previous setting, whatever it was.
        stackless.getcurrent().block_trap = old_block_trap
Doing this when you have no idea where some_other_function is going to lead is a bad idea; who knows what operation will error out when it blocks the tasklet to perform some IO. But doing it when you are the operation being performed, and unexpected blocking would interfere with you, is very useful.

One case where I do this is when I run unit tests. There are two reasons to do so, and a sketch of the test setup follows the list below.
  • Stackless is a framework which schedules running microthreads. Unit testing isn't the only thing going on, and if we let those other things slice in and do their own thing, there's a chance they'll get clobbered by changes to their environment. For instance, mock objects may be sitting in place of the modules or resources they are trying to access.
  • There is no reason for a unit test to actually block; the blocking resources used should be mocked out. If the blocking actually needs to happen for the testing to be done, that is an indication that too much is being tested, or that functional testing is needed instead.
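As a rough sketch of what that test setup looks like (the test case and the function under test are made up for illustration, but block_trap is the real tasklet attribute), the trap can be set in setUp and restored in tearDown:

import unittest
import stackless

def some_function_under_test():
    # Stand-in for real game logic; anything it calls that blocks on a
    # stackless.channel will trip the block trap set up below.
    return 42

class BlockTrapTestCase(unittest.TestCase):
    def setUp(self):
        # Any attempt by tested code to block on a channel now raises a
        # RuntimeError instead of silently scheduling other tasklets.
        self.old_block_trap = stackless.getcurrent().block_trap
        stackless.getcurrent().block_trap = True

    def tearDown(self):
        # Restore whatever the previous setting was.
        stackless.getcurrent().block_trap = self.old_block_trap

    def test_does_not_block(self):
        self.assertEqual(some_function_under_test(), 42)

if __name__ == "__main__":
    unittest.main()

If the code under test reaches for a resource that should have been mocked out and blocks on it, the test errors out immediately rather than stalling the scheduler.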
This is a very interesting aspect of RPC to work on.
Differences in latency, memory access, partial failure, and concurrency make merging of the computational models of local and distributed computing both unwise to attempt and unable to succeed.
The paper and I both agree that pretending remote objects are local, or can be used in the same way as local ones, is impractical and ends up being done badly. It is interesting to go over the different reasons the paper gives for why this is the case.
The most obvious difference between a local object invocation and the invocation of an operation on a remote (or possibly remote) object has to do with the latency of the two calls.
There's nothing to add here. If you are doing a remote call, you are doing it intentionally: you know which of the objects you are using involve it, and you factor in the latency it causes.
A more fundamental (but still obvious) difference between local and remote computing concerns the access to memory in the two cases—specifically in the use of pointers.
The paper refers to this in the context of lower-level, more general programming. I am talking in terms of game programming with a dynamic scripting language like Python, so my thoughts on this aren't directly related to the paper.

In general, for the purposes of actual game programming it is best to just disallow remote attribute access. It is too slow for general game use, and it is easier to lock down remotely callable functions and keep in mind a model of who can access what than it is to also try and lock down remote attribute access. But if the code running isn't tied to gameplay and security concerns don't come into play, it can be very useful functionality.
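To make that concrete, here is a minimal sketch of the "lock down remotely callable functions" approach; the decorator, the dispatcher and the Door class are all hypothetical, not part of any particular RPC library:

def remotely_callable(func):
    # Hypothetical marker the dispatch layer looks for.
    func.remotely_callable = True
    return func

class RemoteDispatcher:
    # Dispatches incoming remote requests onto a local object, allowing
    # only explicitly exposed methods and no attribute access at all.
    def __init__(self, obj):
        self.obj = obj

    def dispatch(self, method_name, *args, **kwargs):
        method = getattr(self.obj, method_name, None)
        if method is None or not getattr(method, "remotely_callable", False):
            raise RuntimeError("not remotely callable: %s" % method_name)
        return method(*args, **kwargs)

class Door:
    @remotely_callable
    def open(self):
        return "opened"

    def set_owner(self, player_id):
        # Not marked, so the dispatcher will refuse to call it remotely.
        self.owner = player_id

The model to keep in your head is then just the set of decorated methods per class, rather than every attribute of every object.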

One use case is doing functional testing of a client running against a game server. Here, arbitrary attribute access on any object, local or remote, is a very powerful tool for invoking functionality and checking state. I could obtain an object which represents an object on the server from the server itself. Or I can obtain an object which represents an object on the client which represents that object on the server. That is, just going through my RPC layer to the server, as compared to going through my RPC layer to the client then through the client's RPC layer to the server. This allows me to test the server object from two different perspectives, local usage and remote usage.
Partial failure requires that programs deal with indeterminacy. When a local component fails, it is possible to know the state of the system that caused the failure and the state of the system after the failure. No such determination can be made in the case of a distributed system.
Basically, when a remote operation fails, part of it might have completed and you might not know how much. Any attempt at recovery might then proceed in a way which doesn't reflect what actually happened, making things worse.

In Python, when an error happens, you get a traceback. If these errors are happening in the client, someone will eventually report one and be able to provide it to you. If the errors happen in your server, you can catch them, abstract and record them, and raise a warning if they are happening too often. The point is that you can know for sure when partial failure is happening and can react to it with due expediency.
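A minimal sketch of the server-side half of that, with made-up names and an arbitrary threshold: catch the error, record an abstracted form of it, and log a warning if the same failure keeps recurring.

import logging
import traceback
from collections import defaultdict

error_counts = defaultdict(int)
ERROR_WARNING_THRESHOLD = 10  # Arbitrary value for illustration.

def call_service_function(func, *args, **kwargs):
    # Hypothetical wrapper around server-side service calls: record an
    # abstracted form of each failure rather than the raw traceback alone.
    try:
        return func(*args, **kwargs)
    except Exception as e:
        key = (func.__name__, type(e).__name__)
        error_counts[key] += 1
        logging.error("error in %s:\n%s", func.__name__, traceback.format_exc())
        if error_counts[key] >= ERROR_WARNING_THRESHOLD:
            logging.warning("%s has now failed %d times with %s",
                            func.__name__, error_counts[key], type(e).__name__)
        raise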

The server should be authoritative. The client won't make any decisions which matter unless the server has given the thumbs up first. The server keeps the definitive state in the database, and makes important decisions by performing an atomic operation which has to reconcile with that definitive state. If the database doesn't perform the operation, the game logic doesn't proceed with the action.
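As a sketch of that flow, with an entirely hypothetical schema and helper names: the game logic only proceeds with the action if the database accepted the state change atomically.

import sqlite3

def try_spend_currency(conn, player_id, amount):
    # The UPDATE only succeeds if the player actually has enough currency,
    # and the connection context manager commits or rolls back atomically.
    with conn:
        cursor = conn.execute(
            "UPDATE players SET currency = currency - ? "
            "WHERE id = ? AND currency >= ?",
            (amount, player_id, amount))
        return cursor.rowcount == 1

def grant_item(player_id, item):
    # Stand-in for the real game logic that hands the item over.
    pass

def handle_purchase_request(conn, player_id, item, cost):
    # The client asked to buy something; nothing changes in the game world
    # unless the database operation went through.
    if not try_spend_currency(conn, player_id, cost):
        return False  # the client is told "no" and must discard its guess
    grant_item(player_id, item)
    return True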

An error at the Python level will in general only affect the objects involved in the call stack, on both the called side and the remote calling side. In the worst case, where all use of some functionality is broken, you can warn people away from using it, since hitting it may otherwise leave them as sitting ducks for players and NPCs who encounter them while they are in this state.

However, there may be errors which affect the whole server, making it unplayable for the clients connected to it. Game entities which are local to each other are more than likely managed on the same server. This means that if the server goes down for one player, it goes down for all the other players who might otherwise be able to interact with them face to face in the game. And of course for the NPCs, which might decide to engage in aggression against them.
... the distinction between local and distributed objects as we are using the terms is not exhaustive. In particular, there is a third category of objects made up of those that are in different address spaces but are guaranteed to be on the same machine. ...

... Parameter and result passing can be done via shared memory if it is known that the objects communicating are on the same machine. At the very least, marshalling of parameters and the unmarshalling of results can be avoided.
This is definitely something worth keeping in mind. It is all too easy to abstract these same-machine objects away behind socket communication, sending them through the same layer of marshalling and networking as objects that are actually remote.
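A sketch of the kind of short-circuit this suggests, with hypothetical names, and simplified here to targets in the same process: if the reference knows its target is reachable locally, it calls it directly; only genuinely remote targets go through marshalling and the socket layer.

import pickle

class ObjectReference:
    # Hypothetical reference which knows where its target lives.
    def __init__(self, target=None, address=None):
        self.target = target      # set when the object is in this process
        self.address = address    # set when it is somewhere else

    def call(self, method_name, *args):
        if self.target is not None:
            # Local: no marshalling, no socket, just a direct call.
            return getattr(self.target, method_name)(*args)
        # Actually remote: marshal the request and send it over the wire.
        payload = pickle.dumps((method_name, args))
        return self._send(self.address, payload)

    def _send(self, address, payload):
        # The networking layer is not part of this sketch.
        raise NotImplementedError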

As with many other elements of a game engine, there is only so much time to spend on any given aspect, and you just have to work with what you can actually get done. It is easy to lose track of things like this.
