[Guide] Technical Issues and Server Stability

14 replies
Posts:
791
Stars:
+1,091
Admins
Caretaker
This was created as one of Elkhan's ideas!

With so many new players coming to the game, a lot of people are growing frustrated with server stability issues. Some of the more technical people can understand why it's happening. Our goal here is to put together an explanatory primer, as it were, to post on the main WS forums as well, explaining what's going on so new players have a better understanding of what's causing the problems.

So if any of our more technically minded people want to throw their 2 cents in, please do so!

Note: this is really meant to be about technicals and mechanics, not to be used as a source of complaints and frustration. Please keep any posts on topic!
Posted Sep 30, 15 · OP
Posts:
599
Stars:
+321
As I understand it, WildStar uses a 'Megaserver' system, involving instances, that can (normally) handle a fairly large volume of people concurrently. The live servers were working without major performance incidents before the F2P (Free to Play) launch, as was the test server.

I believe that an unexpectedly large number of people are returning to the game, or checking it out for the first time, and even that system is showing signs of strain. As I tend to call it, this is a 'good problem' because they are getting wonderful attention, but hopefully the experience ends up being a net positive for people rather than an exercise in coping with technical hurdles.

To give a better understanding of the context, the problem, and what we believe is being done to resolve it, I'd welcome anyone with knowledge to share it.

Like Pepper says, I'd like others to explain the systems in place, tell us more about what a Megaserver entails, what exactly instancing is, and which bottlenecks might be causing trouble (I assume login is one of them).

I'd also like to know what approaches Carbine is taking to deal with this, if we can get any sourced info. We know there are problems. We know the symptoms. This is about what the system is, not complaining. Understanding. Maybe even suggestions from some of our brightest to help, rather than simply criticize.
Posted Sep 30, 15
Posts:
495
Stars:
+497
Caretaker
The game development trade organizations sometimes publish postmortems by developers for their released games, which are high level synopses of "what worked, what didn't" in their development journey. I haven't found one specific to scalability of server architecture in game development, but as a general backgrounder the ones on pixel prospector and gamasutra give you a taste of the complexity that goes into making games.

The pattern you'll see: (1) developer makes their best assumptions about a challenge (2) implements a solution that might work and then (3) learns something surprising so (4) original assumptions are amended and the process starts over. There are an UNBELIEVABLE number of things large and small that go into a game, many of them surprisingly time-consuming, and it's difficult to anticipate and test everything. What is easy to describe is often difficult to implement, because "adding more" doesn't usually work unless the task is really simple.

I'm not an MMORPG network engineer, but the difficulty of adding "more" probably applies here. Every time someone clicks on something or moves in WildStar, some data is generated and must be communicated to the server, which then responds so the client (what you think of as the game) can update. Every one of these actions takes a finite amount of time and uses up a finite amount of network bandwidth. When the available capacity of a server's internet connection and its ability to respond to requests hit the wall, things back up. That's essentially lag. Server population limits attempt to balance this by keeping the number of connections low enough that the server can't be overwhelmed, while still allowing as many people as possible to log in at the same time.
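To make the "things back up" idea concrete, here is a toy sketch in Python. It is not how WildStar's servers work; the function name and all the numbers are made up for illustration. It only assumes a server that can answer a fixed number of requests per tick, with anything left over waiting for the next tick.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return how many requests are still waiting after each tick.

    Hypothetical model: the server answers at most `capacity_per_tick`
    requests per tick, and unanswered requests carry over.
    """
    backlog = 0
    history = []
    for arriving in arrivals_per_tick:
        backlog += arriving                      # new requests join the queue
        served = min(backlog, capacity_per_tick)  # server works at full speed
        backlog -= served
        history.append(backlog)
    return history


# A quiet evening: the server keeps up, nothing waits.
quiet = simulate_backlog([80, 90, 100, 90], capacity_per_tick=100)
print(quiet)  # [0, 0, 0, 0]

# A launch rush: three busy ticks, then traffic drops back below capacity.
rush = simulate_backlog([80, 150, 150, 150, 80, 80], capacity_per_tick=100)
print(rush)  # [0, 50, 100, 150, 130, 110]
```

Notice that in the rush case the queue keeps people waiting well after the traffic itself has dropped back to normal, because the server first has to work through what piled up.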

It might be helpful to imagine this as a fast food restaurant with multiple ordering lanes and one kitchen. During the lunch rush, a whole flood of people come in. Each ordering line can handle some number of orders, on average, per hour. "SO WHY NOT ADD MORE CHECKOUT LINES?" you ask? You could reduce the ordering delay in this way, but if the kitchen can't produce as quickly as the orders come in, the delay in getting your food will increase. The additional volume of orders might also overwhelm the kitchen staff, increasing the number of order errors, which leads to an increased number of customer service issues, which ties up the ordering lines and slows things down even more as each customer is dealt with. "WELL, ADD ANOTHER KITCHEN and ADD MORE WORKERS!" You could do that, but then for most of the day this expensive extra kitchen isn't being used, and those extra workers get paid to do nothing. That's kind of the general idea here. To deal with it, you tune the processes and hopefully fix it down the line. I'm not sure what Carbine does for its servers, but if they are invested in physical hardware tied into a datacenter, they may be limited in how quickly they can provision and test new capacity. It's been what, 48 hours since launch?
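The restaurant picture can be boiled down to a couple of lines of Python. This is only a sketch of the analogy, with invented numbers and a hypothetical function name: finished orders per hour are capped by whichever stage is slowest, so extra ordering lanes stop helping once the kitchen is the limit.

```python
def orders_completed_per_hour(lanes, orders_per_lane, kitchen_capacity):
    """Throughput of a two-stage pipeline: ordering lanes feeding one kitchen.

    All parameters are illustrative; the result is capped by the slower stage.
    """
    orders_taken = lanes * orders_per_lane
    return min(orders_taken, kitchen_capacity)


print(orders_completed_per_hour(2, 30, 90))  # 60 (the lanes are the limit)
print(orders_completed_per_hour(3, 30, 90))  # 90 (lanes and kitchen match)
print(orders_completed_per_hour(6, 30, 90))  # 90 (more lanes, same output)
```

Going from three lanes to six doubles the rate at which orders are taken but completes no extra meals; the additional orders just wait on the kitchen.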

Adding extra capacity for a complex network service involves coordination between a lot of data sources too. This postmortem about Foursquare (edit: that Foursquare article link is a pretty great overview resource!) melting down is an interesting overview of cascading failure between load-sharing databases, which may be analogous to (though probably not the same as) the kind of surprises WildStar is facing. All of our character data, ideally, is synchronized frequently enough that it doesn't appear anything is missing. A database server has a finite capacity to handle transactions, and there are multiple database servers for login, character, zones, who knows what. Some responses have to be very fast, like the ones related to combat. Others can be relatively slow because they are not time critical (costume changes). There used to be an AddOn that automated costume changes such that you could create an animated costume effect, and soon after that the servers started having problems because costume changes are apparently quite "expensive" in terms of server resources. That's why we can change costumes only every 15 seconds or something like that now; this was an unanticipated use of what was assumed to be a relatively infrequent operation. There are surprise bottlenecks, race conditions, and dependencies that don't reveal themselves until pushed to the absolute limit. There are lots of things happening now with the new content and the new crush of people that they couldn't test until now. They are watching where things are slowing down and bottlenecking, writing patches to their code, and redeploying frequently to see if it's fixed. As the days go on, I'm sure we'll see things smooth out. It's not like you can buy another HP Pavilion and throw it at the devs and say, "HEY I BOUGHT YOU MORE CAPACITY, HOOK THAT BABY UP TO THE MATRIX, YO!"
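For the curious, the costume cooldown mentioned above is an example of a very common guard against expensive operations. Here is a minimal Python sketch of a per-player cooldown. The class, the method names, and the exact 15 seconds are assumptions for illustration; I have no idea how Carbine actually implemented theirs.

```python
class Cooldown:
    """Allow each player one expensive action per `seconds` (hypothetical)."""

    def __init__(self, seconds):
        self.seconds = seconds
        self.last_used = {}  # player id -> time of the last accepted request

    def try_use(self, player_id, now):
        """Return True and start the cooldown if the action is allowed."""
        last = self.last_used.get(player_id)
        if last is not None and now - last < self.seconds:
            return False  # still cooling down; the request is refused
        self.last_used[player_id] = now
        return True


costume_change = Cooldown(seconds=15)
print(costume_change.try_use("sri", now=0))    # True
print(costume_change.try_use("sri", now=5))    # False
print(costume_change.try_use("sri", now=15))   # True
print(costume_change.try_use("jian", now=16))  # True (separate player)
```

An AddOn spamming requests every fraction of a second would get almost all of them refused cheaply, before the costly database work ever starts.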

This is just a general overview of the kind of problems that a game like WildStar might experience, though I don't know anything about Carbine's inner workings. I did work as a game developer for a while and have been part of the development process for a couple of recognizable studios, albeit not at the technical director level.
SRI NUTMOON @ ENTITY | HAY ENTREPRENEUR and ERSTWHILE JOURNALIST
ALTS: SRILANA NUTMOON (ic-leveling alt) and JIAN NUTMOON (no relation)
Posted Sep 30, 15 · Last edited Sep 30, 15
Posts:
7
Stars:
+4
Whew, that's a lot to read, but yeah, if they want to change anything about this "megaserver problem" it's going to take a few weeks, and that's if they have a big enough team to do so. Even then they need to hire staff slowly and not just have a big boom and end up with people standing around, like has been said.
I think they may need to move away from the megaserver idea at some point. I know such things take time, and I'm willing to wait since I'm having so much fun.
Posted Sep 30, 15
Posts:
495
Stars:
+497
Caretaker
We really don't know what a "megaserver" really is. It's unlikely to be a single machine chugging away by itself; more likely it's a cluster of machines on really fast network interfaces, or possibly some fancy thing with a bunch of single-board computers on a high-speed backplane, or a hybrid setup involving cloud computing. But yes, to your point, adding another realm could be a last-ditch option for them! That has its own set of headaches, I'm sure, and I'm not qualified to weigh the operational pros and cons.

Even if they were able to add staff for this, I bet it would be at least a month (more like 3-6 months) before the hires would be familiar enough with the system to make appreciable changes on their own. So I'd guess that they're toughing it out with who they have available! We're seeing improvements every day, so I am optimistic!

Some of the spikes we're getting now are possibly one-time operations that will die down. Existing players with gobs of items that need to be converted might suck up a lot of resources... I recall the first time that I logged into F2P, there was a five or ten minute wait on the loading screen when nothing seemed to be happening. It's possible this was related to item conversion? Remember how long it used to take to do a /ptrcopy? It wasn't quick moving that character data even for the smaller number of PTR testers. Perhaps Carbine didn't pre-emptively convert all account data beforehand, and did it on the fly? Just one of many possibilities!
SRI NUTMOON @ ENTITY | HAY ENTREPRENEUR and ERSTWHILE JOURNALIST
ALTS: SRILANA NUTMOON (ic-leveling alt) and JIAN NUTMOON (no relation)
Posted Sep 30, 15 · Last edited Sep 30, 15
Remarus Locke Benefactor
Posts:
112
Stars:
+167
Perhaps Carbine didn't pre-emptively convert all account data beforehand, and did it on the fly? Just one of many possibilities!

They actually said this when they let the first 1000 in, that converting the characters is very resource intensive. It was an offhand comment, but it was said.
Remarus Locke @ Entity - Explorer, Owner of the Skyglade Lodge.
Member of Idle Drifters, an RP/PVE guild.
Posted Oct 1, 15
Posts:
3
Stars:
+1
Alright, I'm certainly curious about one issue in particular, beyond the common lag spikes/DCs that often occur because of the huge mass of players.

Sometimes the game just decides not to log in: you get stuck on the loading screen after you choose your toon and then get thrown back to the character list after a few seconds, without any error being reported or any reason given. The lack of feedback is quite annoying, since it makes it impossible to track what the current issue is.

With this in mind, what do people here believe it is?
Posted Oct 1, 15
It was confirmed to be part of the character copy process via the @WildStarOps twitter - with the large number of new and returning players, some of whom still had very old data, there's a lot of resource use. The same thing that lags you out while moving is also lagging you out while zoning in: a lack of server resources to handle a queued series of events.
Posted Oct 1, 15
Soey Benefactor
Posts:
155
Stars:
+117
Also, one thing I would like to point out about the issue of scaling: Many people seem to believe this works kinda like a bottle of water. You put water in a bottle, until it's full. Then you need to make this bottle bigger to keep filling it, and everything will be fine. If it isn't fine, you just didn't make the bottle big enough, right? Wrong.

A gameserver is not one bottle, it's hundreds of bottles, each connected through little pipes. Not only do you need to scale all of these bottles, but also the pipes connecting them, or some of those bottles will overflow no matter how big they are. Also, you need to make sure that all the water is spread across the bottles. Sadly, this is weird water and not regular water; it does not automatically balance itself out over all those bottles by means of gravity. You need to pump it around to even it out yourself. Also, at different times of the day, you might need differently sized or even differently shaped bottles, and the requirements change quickly. And then there's not only clear water, but differently colored water, that prefers to be mixed with its own kind. That stuff is crazy and unpredictable devil-water!
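The "you need to pump it around yourself" part can be sketched in a few lines of Python. This is just a toy version of one common balancing approach (always send the next player to the least-loaded instance); the function name and numbers are hypothetical, and it ignores the pipes, the changing bottle sizes, and the colored water entirely.

```python
import heapq


def assign_players(player_count, instance_count):
    """Place each new player on whichever instance currently has the fewest.

    Returns the final player count per instance, in instance order.
    """
    # Min-heap of (current_load, instance_id); the least-loaded pops first.
    heap = [(0, instance_id) for instance_id in range(instance_count)]
    heapq.heapify(heap)
    for _ in range(player_count):
        load, instance_id = heapq.heappop(heap)
        heapq.heappush(heap, (load + 1, instance_id))
    return [load for load, _ in sorted(heap, key=lambda entry: entry[1])]


print(assign_players(10, 3))  # [4, 3, 3]
```

Even this simple version needs something tracking every bottle's fill level at all times, and that tracker is itself one more piece that has to scale and can break.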

And then suddenly a bottle breaks. You were prepared, very well so, but there are some things you can't predict. Some things you only ever learn after they don't work out. So you start to apply bandaids while you work on more permanent solutions. Those bandaids aren't perfect, and they might cause other things to break. Suddenly, everyone wants you to explain in detail what is going on, what's broken, and what you are going to do about it, but at the same time, they want you to actually do something about it in the first place. You can't talk and work at the same time. Oftentimes you can't talk at all, because you're still in the process of figuring out what the hell is even going on. There's a bottle leaking but the water runs along several other bottles before it drops down, so finding the source is a painful process of tracing that water back to the point of leakage. That takes time. Sometimes a bottle needs to be completely replaced. There's work to be done.

...long story short: stuff's complicated. give them time to figure it out. don't "demand" they do anything, not even explain right away. let them do their job. They hopefully know what they're doing, and even if they don't, there's nothing you can do about it anyway, except be patient :)
Soey Flamepaw | Aurin, Stalker, Cuddle Consort, Pillow Enthusiast, Hugvertising Expert | Jabbit (EU)
Drakaar Flamebearer | Draken, Warrior, Son of Razak the Oxian, Lord of Stormskull Clan | Jabbit (EU)
Tumblr: OOC | Soey Flamepaw
Posted Oct 1, 15
Posts:
28
Stars:
+23
Something to add: Even if they do need better servers/tech, server blades and other components often get stuck in customs, which takes time.

I feel so bad for their netadmins.
Posted Oct 1, 15