Virtual machine state transfer

To run user-created dynamic content on the InWorldz grid, we designed a script engine called Phlox. Phlox executes scripts written in the LSL scripting language. The script engine consists of a bytecode compiler, a virtual machine, and a runtime environment that gives scripts fast execution and access to functions that manipulate the state of the world simulation.

Some questions have come up recently about how running scripts attached to live objects on the InWorldz grid move between simulators. An object on the InWorldz grid is not static. It can move around freely using physics or kinematic movement, within the confines of the permissions assigned by region owners. This means that objects may cross between regions either under their own control or under the control of an avatar driver. Load sharding on the InWorldz grid at the simulation level is done via spatial partitioning. Each region is 256 x 256 meters, and all the scripts in a region run in the same process and address space. This means that to let an object move from one partition to another, which may be running on a completely separate server, we need a way to pack up the script no matter where it happens to be in its execution. We can’t wait for safe points, since not every script has one. Take the following LSL fragment as an example. If you don’t know LSL, don’t worry; it is very similar to other C-based languages:

default
{
  state_entry()
  {
  }

  touch_start(integer total_number)
  {
    integer a = 0;
    integer b = 100;
    for(; a < b; ++a)
    {
      llOwnerSay("Hello Avatar! " + (string)a);
      llSleep(1.0);
    }
  }
}

When touched, this script will count from 0 to 99 and output the current value to the owner of the object. The script will continue executing inside the touch_start() event until the count completes, in about 100 seconds. This is fine in an environment that stays running and where you won’t have to worry about being moved around, but what happens when an avatar is wearing this script in an object on their wrist, as some kind of watch, and wishes to move to another region? As mentioned previously, regions are spatially partitioned. The region they want to go to may not even be on the same server. How can we pack up a script like this, or another that may be deep inside nested function calls with no end in sight? We run into the same problem when we simply want to save an object to storage. The script is constantly changing and could be anywhere in its execution when we come by to persist the object data. We can’t simply wait and hope that the script will eventually exit a loop or be generally well behaved.

We need access to the full call stack as well as all global and local variables involved in the call chain. We need to be able to pack that data up so that when we return to the call on a different machine we can unwind the stack the same way as each call returns. We need access to this data NOW, not in 1 second, not in 10 seconds, and not in 10 minutes.

Since we knew we would need to support full runtime data collection, Phlox was designed to provide this data quickly and on demand at just about any point in a script’s execution. Let’s take a look at some key data structures that govern and track an executing script.

Once compiled and associated with a running virtual machine, a script gets its very own instance of a RuntimeState class. This tracks the current state of the script as it executes inside the Phlox VM. A partial snippet of the structure is listed below:

[Image: partial listing of the RuntimeState class]

The components shown are:

  • IP – The instruction pointer. This variable tracks the memory address of the instruction currently executing on the Phlox virtual machine. It is incremented as each instruction in a script is executed, and it can change abruptly on jumps and function calls.
  • LSLState – This is the current run state of the script. If the script is disabled, sleeping, etc., it will be indicated here.
  • Globals – A script’s global variable values.
  • Operands – The operand stack. This is used for intermediate values when we’re doing long strings of calculations like x = 1 + 2 * 3 + 4.
  • Calls – The call stack: a set of stack frame objects that gives us a current view of all function calls in progress, including nested and recursive ones, and from which the call chain can be reconstructed.
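
To make the list above concrete, here is a minimal sketch of what such a structure might look like. It is written in Python purely for illustration and is not the actual Phlox class: the field names, the RunState values, and the binary_op helper are all assumptions. It also walks through x = 1 + 2 * 3 + 4 to show the kind of intermediate values the operand stack holds partway through an expression.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, List


class RunState(Enum):
    # Hypothetical values; the article only says "disabled, sleeping, etc."
    RUNNING = auto()
    SLEEPING = auto()
    DISABLED = auto()


@dataclass
class RuntimeState:
    ip: int = 0                                           # instruction pointer
    lsl_state: RunState = RunState.RUNNING                # current run state
    global_vars: List[Any] = field(default_factory=list)  # global variable values
    operands: List[Any] = field(default_factory=list)     # operand stack (LIFO)
    calls: List[Any] = field(default_factory=list)        # call stack frames


def binary_op(state: RuntimeState, op) -> None:
    """Pop two operands, apply op, and push the result back."""
    right = state.operands.pop()
    left = state.operands.pop()
    state.operands.append(op(left, right))


# Evaluate x = 1 + 2 * 3 + 4 the way a stack machine might.
state = RuntimeState()
for value in (1, 2, 3):
    state.operands.append(value)
binary_op(state, lambda a, b: a * b)   # 2 * 3
midway = list(state.operands)          # intermediate values a pause must keep
binary_op(state, lambda a, b: a + b)   # 1 + 6
state.operands.append(4)
binary_op(state, lambda a, b: a + b)   # 7 + 4
x = state.operands.pop()
```

If the script were paused at the point where midway is recorded, the operand stack would hold the two partial results, and both would have to travel with the script for the expression to finish correctly elsewhere.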

Given access to this information and a LIFO stack of StackFrame objects, we now have everything that we need to reconstruct a script. Each stack frame represents a function call and its structure is shown below.

[Image: partial listing of the StackFrame class]

The components of the StackFrame are:

  • FunctionInfo – Information about a called function, such as the number of parameters it takes, its memory address, and its name.
  • ReturnAddress – When this function returns, this member contains the address of the instruction in the script we should return to.
  • Locals – The list of variables that were passed to this function.
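
A rough sketch of how frames like these could be pushed and popped is shown below. This is illustrative Python, not Phlox code: the call and ret helpers, the addresses, and the choice to return to the instruction right after the call site are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class FunctionInfo:
    name: str          # function name
    address: int       # address of the function's first instruction
    param_count: int   # number of parameters it takes


@dataclass
class StackFrame:
    function_info: FunctionInfo
    return_address: int                                  # where to resume on return
    local_vars: List[Any] = field(default_factory=list)  # variables for this call


def call(calls: List[StackFrame], ip: int, fn: FunctionInfo, args: List[Any]) -> int:
    """Push a frame for fn and return the new instruction pointer."""
    # Assumption: execution resumes at the instruction after the call site.
    calls.append(StackFrame(fn, return_address=ip + 1, local_vars=list(args)))
    return fn.address


def ret(calls: List[StackFrame]) -> int:
    """Pop the top frame and return the address to resume at."""
    return calls.pop().return_address


# Hypothetical functions and addresses, chosen only for the walkthrough.
outer = FunctionInfo("outer", address=100, param_count=1)
inner = FunctionInfo("inner", address=200, param_count=2)

calls: List[StackFrame] = []
ip = call(calls, 10, outer, [1])       # event code at address 10 calls outer
ip = call(calls, 105, inner, [2, 3])   # outer, at address 105, calls inner
depth = len(calls)                     # two frames deep at this point
after_inner = ret(calls)               # back in outer
after_outer = ret(calls)               # back in the event code
```

Because the frames are popped in the reverse order they were pushed, saving the list in order is enough to unwind the same way on another machine.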

With all of this information, a simplified procedure to capture the full state information becomes pretty straightforward:

  1. Stop the script from executing any more instructions on this Phlox VM.
  2. Save the IP to identify where in the script we are executing.
  3. Save the current global variable values for the script.
  4. Save the stack frame list to get a full view of where we are in the execution of our script.
    1. Along with the stack frames, we now have a full view of all functions involved in the current call chain and all their local variables.
  5. Save the operand stack so that even if we’re in the middle of executing a single line of script, we can still pause the script and resume its execution without error.
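
The steps above can be sketched roughly as follows. ScriptVM here is a hypothetical stand-in with just enough fields to show the idea, and capture_state is an illustrative name rather than a Phlox API. Frames are shown as plain dictionaries to keep the example short.

```python
import copy


class ScriptVM:
    """Hypothetical stand-in for a VM that is executing one script."""

    def __init__(self):
        self.running = True
        self.ip = 0
        self.global_vars = []
        self.calls = []      # list of frames, innermost call last
        self.operands = []   # operand stack


def capture_state(vm: ScriptVM) -> dict:
    """Stop the script and copy out everything needed to resume it."""
    vm.running = False                                # 1. stop executing
    return {
        "ip": vm.ip,                                  # 2. where we are
        "globals": copy.deepcopy(vm.global_vars),     # 3. global values
        "calls": copy.deepcopy(vm.calls),             # 4. frames and their locals
        "operands": copy.deepcopy(vm.operands),       # 5. intermediate values
    }


# Pretend the counting script was interrupted inside its loop.
vm = ScriptVM()
vm.ip = 37                                            # made-up address
vm.calls.append({"function": "touch_start", "return_address": 0, "locals": [42, 100]})
vm.operands.append("Hello Avatar! ")

snapshot = capture_state(vm)
vm.calls[0]["locals"][0] = 43                         # later changes do not leak in
```

The deep copies matter in this sketch: the snapshot has to be a frozen view of the script, independent of whatever happens to the original VM afterwards.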

We then serialize this into a compact format, transport it to a new machine and inject it into a running VM. It fires up right where it left off milliseconds later, faster than a single simulation frame, and the script continues to do whatever it was doing on a new server. All thanks to the power of a machine that is completely defined in software.
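
As a rough illustration of that round trip, the sketch below serializes a captured snapshot, treats the resulting bytes as what would be sent to the other server, and loads them into a fresh VM. JSON is used only because it is convenient for a short example; the actual wire format is not described here, and TargetVM, serialize, and restore are hypothetical names.

```python
import json


def serialize(snapshot: dict) -> bytes:
    """Encode a snapshot as compact JSON bytes (format chosen for illustration)."""
    return json.dumps(snapshot, separators=(",", ":")).encode("utf-8")


def restore(payload: bytes) -> dict:
    """Decode bytes received from another machine back into a snapshot."""
    return json.loads(payload.decode("utf-8"))


class TargetVM:
    """Hypothetical VM on the destination server."""

    def __init__(self):
        self.running = False
        self.ip = 0
        self.global_vars = []
        self.calls = []
        self.operands = []

    def inject(self, snapshot: dict) -> None:
        """Load the captured state and resume where the script left off."""
        self.ip = snapshot["ip"]
        self.global_vars = snapshot["globals"]
        self.calls = snapshot["calls"]
        self.operands = snapshot["operands"]
        self.running = True


# A made-up snapshot of the counting script, interrupted with a == 42.
snapshot = {
    "ip": 37,
    "globals": [],
    "calls": [{"function": "touch_start", "return_address": 0, "locals": [42, 100]}],
    "operands": [],
}

payload = serialize(snapshot)      # bytes that would cross the network
target = TargetVM()
target.inject(restore(payload))    # script picks up at the same instruction
```

A real format would also need to handle LSL types that JSON has no direct equivalent for, which this sketch ignores.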

Something like this happens, on a far more massive and exciting scale, each time you migrate a running virtual machine to new hardware on your favorite virtual machine platform. See Hyper-V Live Migration FAQ and Chapter 21. Xen live migration for some really interesting information on how the same techniques are used in big VMs.
