The further consequences of free and cheap

I touched a little bit on this subject on the InWorldz blog in a post titled The hard thing about hosting things, but lately this has been coming up more and more, and I’m starting to get annoyed.

There are some very talented people in the OpenSim space. People with a wide range of skills from C++ development, C#, web development in PHP, node.js and other languages. The thing that all of them seem to have in common is that they have a passion for working on 3d software and simulations, but unfortunately we’ve found more and more that this passion and drive is being taken advantage of, and just like hosting prices in OpenSim, a lowest bidder mentality has emerged.

As a business owner and employee of small businesses, I learned a long time ago that when people don’t have to worry about paying their bills, they are more likely to be creative and solve problems in effective ways. They are loyal to your vision, and want to help see it through. They don’t have to seek out work elsewhere to make ends meet and they are appreciative of the sacrifices that you make for them.

Unfortunately the stories I’m hearing from people getting paid for OpenSim based work and from OpenSim grids paint a really disturbing picture of the way people are being used and led astray with promises.

The most recent I’ve heard is from an extremely talented software developer that is also serving as a systems administrator, a devops person, and someone that is constantly on call anytime something breaks. This person tells me that he calculated his effective hourly rate with this schedule and came up with a figure of $0.70/hr. $0.70/hr to be a software developer and systems administrator on call 24/7. No thanks, I’ll pass no matter what vague promises about the future you make to me.

Really? Is this the best that we can do? No wonder we have a hard time getting anything done and this space is considered a joke to many outsiders. I don’t think this particular joke is funny at all.

On a related note, I was just recently made aware of a comment that claimed an $800 bid to implement export, which requires BOTH viewer side and server side changes, was “not great”. As if this bid to do custom software development on two separate platforms were easy and should pay minimum wage? What?

If you think paying $800 for software implementation is too much, maybe your business should be charging more to free up real money for the software development that it depends on. People’s time is not a charity to be exploited.

Software development charges from companies and independent contractors typically land in the $70 – $120/hr range. In my experience these are fair numbers, and when you pay a much much lower price to try to get the best deal, you end up getting what you pay for, and many times it’ll have to be completely redone later on. When you have to do something two and three times, the cheap rates end up not meaning very much.

So my final message is pretty simple. As we free up more cash to pay more people for their time, expect that the contractors you’re used to paying pennies to may end up finding their way over here. I won’t make any vague promises. I’ll ask them flat out what kind of compensation they think they need to complete a project and pay accordingly. I won’t put anyone on call 24/7 unless they’re being paid real wages. If they go over budget due to a bad estimate, we’ll work to make it right.

I actually like people. I want to see them succeed and be happy. Only when I see happy people working for us and alongside us do I know that we’ve truly succeeded.

My history with InWorldz

[The following was given to the organizers at the InWorldz 6th birthday celebration. I’m pasting it here for those that haven’t been able to read it]

In 2009 I lost my father at the early age of 60. I had a pretty rough job where I was going 12 hrs a day on average on business software. It was mostly web development and database work which wasn’t really my cup of tea. I was in a pretty bad state and I really was looking for a way to dig myself out of the emotional hole I was in.

I found Second Life as an outlet for my feelings. It was great to be able to express myself artistically. I found that I did a decent job at 3d modeling and that my programming experience translated to scripting.

I created a bunch of 3d stuff that I always gave away, and even owned part of a region for a time. I’ve always enjoyed sharing happiness and experiences with others where I could, and I considered renting a full region to try my hand at really reaching out to people who may have been feeling the same pain of loss and letting them know that they were not alone. The biggest problem I found was that it was very difficult to justify paying what amounted to a car payment for entertainment.

I did a lot of work in the IBM sandboxes which were a great place to get quiet work done. They were always kept clean and free of drama by PatriciaAnne Daviau, a wonderful person who would become a great friend in my real life as well as my virtual life. Patty knew a Scotsgreymouser Janus who in turn knew Elenia Llewellyn.

Elenia Llewellyn and her business partner Legion Heinrichs were looking for a developer to work on server side code for a piece of 3d simulation software called OpenSim. It looked like a good opportunity for me to segue more into the games and visualization side of programming which I had wanted to learn more about anyways.

I agreed to work on the software and I got my own region there to build on. I ended up creating a place that really helped dispel the emptiness that I felt. To this day “Tranquillity’s Pad” remains mostly untouched. It is my vision of an afterlife where we transcend physical boundaries and can visit far away places as part of a better, more peaceful existence.

Of course, InWorldz work wasn’t all fun and games. Just about every script I brought in crashed the simulator and I ended up having to do massive amounts of work just to try to get the engine stable for any real work. Eventually, Phlox was born out of my frustration and the frustration of our (unexpected) influx of customers. Phlox completely replaced the legacy script engine and runtime, and brought the first of many giant leaps in stability to InWorldz. I was brought on as a founder and owner of the company, and the rest is history.

I and the rest of the InWorldz staff continue working hard to provide an experience that can be as transformational to others as it was for me. InWorldz is about self discovery, and I hope it can continue to provide others with a refuge when they don’t know where to turn.

Reality check on opensim stats

As I suspected would happen, a recent article on Hypergrid Business has people speculating that the opensim metaverse is crowding out and outgrowing commercial grids and will now all of a sudden take off without them.

So I did some fact finding and found something very interesting.

There has indeed been growth in opensim grids over the past 3 years, but it has been dismally small. Until people come to terms with this and start advertising their grids (be they free or commercial) this trend will continue. Of the growth that has occurred since 2011, the majority of it has happened on the back of InWorldz.

Using the statistics provided to me by Maria Korolov, I tracked total active user growth for all known opensim grids between December 2011 and December 2014. Are you ready for this?

Total opensim active users growth 2011-2014:

  • 5,043 users (yuck. For comparison, over 7,700 people have purchased Minecraft in the past 24 hours)

Total InWorldz active users growth 2011-2014:

  • 2,803 or about 56% of the total for all new active users

If I go back to our peak a few months ago of over 8300 active users, we account for an even larger percentage.

I understand there are many out there that for one reason or another don’t like InWorldz, but between our continuing small marketing pushes on Facebook, and our largest advertising campaign ever coming in the next few weeks, we contribute to a great degree to the inflow of new users to the opensim platform. Our servers and staff have taken care of over 100,000 registered users and we have a lot of lessons learned because of it, which we’re sharing as we can. Starting more infighting over dismal numbers isn’t going to grow the opensim VR space.

Relay for Life on InShape!

On November 1st 2014, something profound is going to happen, and I’m super excited to be a part of it.

For the first time, avatars and the people behind them are going to run the American Cancer Society Relay for Life, for real, virtually.

What in the world does that mean?

Using the InWorldz InShape system, beta testers from the virtual world will get together on a virtual track that has been designed by Relay For Life of InWorldz volunteers. They will log into the virtual world, and by using acceleration data supplied by the cell phone in their pockets, they will transmit walking, biking, and running forces from their real life into their avatar. That data will be used to move them through the Relay for Life track.

From exercise bikes, to rowers, to treadmills, the harder they run in real life, the faster their avatar will go. Their dedication will show through from the real to the virtual. Their avatar will become an extension of the good we all want to see in the world.

We will gather from around the world. We will run the same track for the same great cause, and we will help to give hope where hope is needed.

This is what virtual worlds are all about. Fighting for good causes and using virtual worlds technology to augment the human experience. If you are an InShape beta tester, please join me on November 1st to help kick cancer right in the backside. Let’s get together and sweat to draw attention to a disease that affects us all, and needs to be eradicated before it can claim any more lives.

Run for life.

(Watch the livestream here starting at 9 AM PDT)

Virtual worlds – It’s how you use them

Virtual worlds..

They’ve been tried
They didn’t catch on
They will remain a niche

I don’t think so, but why haven’t they caught on yet?

The problem isn’t “cartoon class” visuals. Heck, there are iPhone and Android games that aren’t even 3d, let alone providing cinema class visuals. That hasn’t stopped them from becoming insanely popular among millions of people. People are willing to overlook visual fidelity when something is fun and engaging. Shoveling the same boring experiences with fresh graphics might help the situation for a while, but in the end, people still need a reason to want to be inside a virtual environment.

The problem isn’t technical. Though there are a fair share of technical issues in the MMO/3d space, people are still willing to keep trying and keep coming back as long as the platform is fun and engaging. We should do our best to remove the technical hurdles, continue to fix problems, and try to provide a high quality product. But we must also realize that more importantly, the product must continue to offer new and compelling reasons for people to come back to really make an impact on the world.

In a long run of virtual world platforms, most companies have never really tried to answer the question “what can we use this for?”. Rather they’ve always left the actual activities of the world completely up to the users. This is great to a point, and we want to remain hands off as much as possible, but people also look to those running the virtual worlds for leadership and ideas. They want to know ways they can develop compelling environments that will bring in visitors. They want to show off the awesome stuff they’re building.

I think there has been too much hands off, and too little direction given to people to say “Hey, we have this really cool platform, and now we want to try this idea, will you help us?”

That is the direction we’re moving in now.

An example of this is InShape. This smartphone software combined with virtual worlds will allow people from all over the globe to start having fun during their boring exercise routines. We’re hooking your treadmill, elliptical, and exercise bike motions up to a virtual environment and making your avatar run and bike her way around long beautiful trails inside the InWorldz virtual world. On those really crappy snowy winter days when you can’t even get out of the house, InShape will let you run along familiar beaches with your friends from all over the planet.

But what about the current residents and customers of InWorldz? What about the people in the virtual world that aren’t keen on exercising?

Perhaps this can be a motivator to get residents moving, but more importantly, they can still be a huge part of making InShape a success! We need people who are dreamers to create the best running and biking trails in any world. We need dreamers to create the fitness accessories that people are going to want to put on their avatars while they’re working out. We need dreamers to create exercise equipment that uses the InShape data to provide new and exciting experiences and virtual exercise equipment that hasn’t even been invented yet!

The more we utilize the virtual to enhance the real, the more that the power of virtual worlds becomes clear. I can’t wait to start attending regular exercise classes with all my real, and virtually real friends!

This is only the beginning. Let’s work hard to show people all the awesome things that can be done with virtual worlds!

See more on the InShape beta test forum.

Distributed messaging fault tolerance

In a previous post, I was left with a question of how to ensure that messages were not dropped in the face of a specific type of failure. The problem I faced at the end of that article was the case where the consumer loses its link to the common node, the one node in its read quorum that the producer is also writing to.

cli_common_link

In this scenario, the producer continues pumping out messages to a quorum of nodes, unaware of the link failure on the client side. The client does not receive them until it makes a new connection to achieve a read quorum. Without adapting this design, messages will be lost while the client reconnects.

cli_common_link_recover

I have chosen to go with a design that allows the caller to specify a TTL on messages even if they’re not going to be sent to persistent storage. I will then implement an in-memory queue for messages that are received on a node and not claimed. Any message that is not claimed by an active subscriber will go into this queue for a configurable amount of time (I’m thinking 5-10 seconds max), which will give the consumer time to recover from a temporary failure like the one above without losing messages. Though this wasn’t very important for my use case, it felt wrong to lose messages during ANY kind of failure on what is supposed to be a fault tolerant system.
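A minimal sketch of what such a time and space limited holding queue could look like. This is an illustration under stated assumptions, not sopmq code: the `HoldingQueue` class, its method names, and the `max_messages` cap are all hypothetical, and a fake clock is injected so the example is deterministic.

```python
import time
from collections import deque


class HoldingQueue:
    """Holds unclaimed messages for a bounded time and a bounded count."""

    def __init__(self, max_messages=1000, clock=time.monotonic):
        self._items = deque()   # (expires_at, message), oldest first
        self._max = max_messages
        self._clock = clock

    def hold(self, message, ttl_seconds):
        """Park a message nobody claimed; evict the oldest if over capacity."""
        self._expire()
        if len(self._items) >= self._max:
            self._items.popleft()
        self._items.append((self._clock() + ttl_seconds, message))

    def claim_all(self):
        """Hand every still-live message to a consumer that just reconnected."""
        self._expire()
        live = [message for _, message in self._items]
        self._items.clear()
        return live

    def _expire(self):
        # Drop anything whose TTL has passed.
        now = self._clock()
        self._items = deque(item for item in self._items if item[0] > now)


now = [0.0]                      # fake clock, in seconds
queue = HoldingQueue(max_messages=100, clock=lambda: now[0])
queue.hold("first", ttl_seconds=5)
now[0] = 3.0
queue.hold("second", ttl_seconds=5)
now[0] = 6.0                     # "first" expired at t=5, "second" lives until t=8
recovered = queue.claim_all()    # the consumer reconnects at t=6
```

Here a consumer that reconnects at t=6 still receives "second", while "first" has already aged out. The count cap is one way to bound the memory cost; a byte budget would work just as well.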

The other option that I had entertained in the previous article was to just commit messages to all nodes that are up that handle the range, and only return success if the producer was able to contact at least a quorum of them. This would handle the above client side link failure case, but I threw an additional monkey wrench into the mix.

I want to borrow the idea of the coordinator node from Apache Cassandra. That is, any client connects to a single node and doesn’t have to worry about ring topology to begin passing messages with the system. I want to keep the design of the client as simple as I can so that it is easy to port to multiple languages.

coordinator

In this configuration, even if a producer were to write to all the nodes that handle a range on the ring, we could still lose messages if the coordinator node went down. Writing to a quorum of nodes and holding unpiped messages temporarily in a time and space limited queue allows the coordinator to die without a loss of messages, as long as the recovery happens within the TTL specified by the producer of the message.

Disadvantages of this idea are additional latency and additional memory usage. Because I have to wait for the given TTL to expire before pronouncing a message claimed or not, when a client isn’t available to read a message there will be a minimum of TTL seconds in latency before I can respond back to the coordinator node that a message could not be piped. Because I have to keep a queue of messages for the given TTL, there will be memory used for that data.

However, one piece of good news is that because sopmq is optionally persistent, if a message is flagged to be persisted and it can’t be piped, I can store it to the Cassandra backend and return immediately once a quorum of nodes responds that the message could not be piped. In the persistent case, the latency will be minimal, and for us this is the case that is most important because it means we can quickly tell the user that their message was not delivered but was saved.

We’ll see in time how these decisions impact the design and development of the project.

Distributed messaging failure modes

[This is a brainstorming document; implementations are subject to change, and most likely will change, during development.]

I have a good understanding of a few distributed storage systems that use consistent hashing and quorum reads/writes to load balance, scale out, and provide fault tolerance, and wanted to apply some of these ideas to my own projects. In that light, I’ve decided to dip my toe in the water and design a distributed messaging system that will eventually be used to replace the frail/legacy communications systems present in our virtual world platform.  The messaging system as well as connectivity components are all being designed as open source projects as part of our virtual world future initiative.

After starting the project I found a distributed message queue that meets my requirements, namely Apache Kafka. However, since I can always use more distributed systems experience, I’ve decided to push forward with a distributed messaging system that is designed more closely to our requirements.

One of the most difficult challenges is deciding what guarantees I want to make in the face of failure. Our requirements for this messaging system are... well, simple compared to some. We need scalability and fault tolerance. We need two modes of operation: message piping, and persistent storage with a single consumer who will claim the messages when they make their first connection after being offline.

The “message pipe”

For the majority of the time the system spends running, it will pass messages between online users involved in private IMs as well as group conversations. As an example, each simulator will subscribe to queues for instant messages belonging to the users it is managing. When a new group message comes in, consistent hashing on the queue identifier will route the message to the appropriate nodes for processing. The message will then either be passed to consumers connected to the queue and waiting for messages, or be dropped.
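To make the routing step concrete, here is a rough sketch of consistent hashing on a queue identifier. It is only an illustration: the `HashRing` class, the node names, the `im:group:1234` queue id, the use of SHA-256, and the virtual node count are all assumptions, not details of the actual design.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping a queue id to its replica nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node sits on the ring at several pseudo-random points
        # (virtual nodes) so that ranges are spread more evenly.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def nodes_for(self, queue_id, replicas=3):
        """Walk clockwise from the queue's hash, collecting distinct nodes."""
        start = bisect.bisect(self._points, self._hash(queue_id))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = HashRing(nodes)
owners = ring.nodes_for("im:group:1234")   # the three nodes handling this queue
```

The useful property is that the same queue id always lands on the same three nodes, and taking a node that is not one of those three out of the ring leaves the assignment for this queue unchanged.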

If there is a group IM session happening and only one person is subscribed to the group chat at that time, all messages posted to the queue will be dropped without being piped and we’ll return a status value indicating as such. We’ll configure this specific queue not to store messages that aren’t immediately consumed.

Since I don’t plan on including any kind of on-disk or in memory queues besides what will be required to forward the messages, this type of queue has the most “interesting” properties during a failure. Obviously plans may change as I determine exactly what guarantees I want to provide.

Message pipe failure modes

Let’s check out a few of the interesting things that can happen to this distributed system when running in production. We’re going to use quorum style reads and writes to provide fault tolerance. This means that every message published will be sent to at least two of three nodes before the send is confirmed, and that every consumer to a message will listen to two of three nodes for incoming messages.
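The reason two of three works is that any two node pairs drawn from three nodes must share at least one node. A small check of that claim, using node names borrowed from the diagram descriptions (top, bottom left, bottom right):

```python
from itertools import combinations

nodes = ["top", "bottom-left", "bottom-right"]
quorum = 2

write_sets = [set(c) for c in combinations(nodes, quorum)]
read_sets = [set(c) for c in combinations(nodes, quorum)]

# Every possible write quorum shares at least one node with every read quorum.
always_overlap = all(w & r for w in write_sets for r in read_sets)

# With a quorum of one, a producer and consumer can miss each other entirely.
single_sets = [set(c) for c in combinations(nodes, 1)]
single_overlap = all(w & r for w in single_sets for r in single_sets)
```

In general the overlap is guaranteed whenever the write quorum plus the read quorum is larger than the number of nodes, which holds here since 2 + 2 > 3.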

(In all the following diagrams P represents the producer and C represents the consumer. Sorry ahead of time for the horrible diagrams)

Common node failure

Let’s start with an easy one. A situation arises where a single node honestly and truly goes down. Maybe someone tripped over the network cable, or the machine decided it would be a good day to start on fire. This node is seen as down to both the consumer and the producer processes. The node with the X through it below is now down.

common_node_fail

 

We have a problem here. Though the consumer is listening to two different nodes, the producer is no longer writing to a node that the consumer has a link to. If the producer were to continue this way, the consumer would never get any more messages during this session.

Luckily, we’ve chosen a strategy that will prevent the producer from considering a write successful until it can write to at least two nodes. When the next message is ready to be sent, the producer will block its delivery until it establishes a quorum write with the remaining nodes. As you’ll see below, this behavior allows the message to be delivered to the consumer.

common_node_fail_recovery

Once the producer establishes a new link it is able to send the message which will reach the consumer through the already established connection in black on the bottom left. We have fault tolerance for this scenario. When the dead node comes back up, more producers will continue writing to a quorum of nodes, and no messages need be lost.
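The recovery described above can be sketched roughly as follows. This is a simplified illustration rather than the real producer: the `Node` class and `quorum_write` function are hypothetical, nodes are simulated in memory, and the function raises instead of blocking and retrying the way a real producer would.

```python
class Node:
    """Stand-in for a broker node; `up` simulates whether it is reachable."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.received = []

    def deliver(self, message):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.received.append(message)


def quorum_write(message, replicas, quorum=2):
    """Try replicas in order until `quorum` of them accept the message.

    Returns the names of the nodes that acknowledged it, or raises if a
    quorum could not be reached.
    """
    acked = []
    for node in replicas:
        try:
            node.deliver(message)
        except ConnectionError:
            continue  # fall through to the next replica for this range
        acked.append(node.name)
        if len(acked) == quorum:
            return acked
    raise RuntimeError(f"only {len(acked)} of {quorum} required nodes acknowledged")


top, left, right = Node("top"), Node("bottom-left"), Node("bottom-right")
top.up = False                                   # the common node dies
acked = quorum_write("hello", [top, right, left])
```

With the top node down, the write only succeeds after reaching both remaining nodes, so the message lands on the bottom left node that the consumer is still listening to. Note that a failed attempt in this sketch can leave the message on a single node; deciding how to treat such partial writes is left open.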

Producer side common link failure

A situation arises where the producer can not contact a common node in the quorum, but the consumer node can. From the point of view of the producer, the top node is down.

producer_common_link_failure

The recovery from this is similar to the common node going down.

producer_common_link_failure_recovery

The producer will fail to obtain a quorum for writes, and will establish a connection to the bottom left node to repair the condition. At this point it can continue to deliver messages as demonstrated in the common node failure scenario.

Consumer side common link failure

This is the one that has me wondering what the “best” answer to the problem is. In this scenario, the consumer loses connection to the top node, severing it and preventing it from receiving messages from the producer. The top node is still up as far as the producer is concerned and can still write and satisfy a quorum.

consumer_common_link_failure

 

However, when the producer writes to the quorum before the client reestablishes a connection, the consumer will not get the messages, and the producer will see the messages as not consumed. We’ll see a temporary disruption in messaging to this consumer until it is able to reconnect to the bottom right node.

consumer_common_link_recovery

Once this new connection is established, messages will resume flowing to the consumer. But in the meantime, some messages weren’t delivered.

This isn’t a huge deal for our use case, because all the important messages will be stored in this case and forwarded when the consumer reestablishes with the new node. However, I’d still love to determine a good way to solve the problem.

One thought is that I could keep a limited (time, space, or both) queue for messages hitting nodes that don’t yet have a consumer attached. The big problem with this solution is that I will end up eating a ton of memory for messages that will never be consumed.

Another thought is that all writes should go out to all nodes handling the range, but only be considered successful if the write makes it to at least a quorum of nodes. That way, this failure scenario will only happen if we’re already in a degraded state.

Definitely interested in hearing ideas from people who have already designed systems like this. Feel free to leave comments!