Hi all,
On 24 September we attempted to run the first live beta for the World Restructuring system, a long-awaited moment! As most of you are probably already aware, it didn’t go exactly as we had planned. We’re back today to share our postmortem of the event. For those who are unfamiliar with the term, a postmortem is a process by which a development team reflects on an event (release, outage, etc.) and identifies what went well and what could have gone better. The goal is to identify process improvements that can be applied to future efforts. We’re taking some creative liberty with the format to make it a bit more accessible to readers, but we hope you enjoy the look behind the curtain.
In this post we’ll summarize the beta attempt, give an overview of the issues we ran into, and share what we learned. At the very end, we’ll discuss our plans for the next beta.
Before we jump in, we do want to clarify that the brief login outage that impacted EU servers just prior to the beta attempt was unrelated. That’s at least one fan theory debunked!
Overview
At 5:00 p.m. UTC (10:00 a.m. Pacific Time) on Friday, 24 September, we enabled the World Restructuring feature during the weekly European World vs. World reset. Reset is when players are removed from WvW, matchups are reconfigured, and scores are reset. Switching to World Restructuring at reset ensured a clean slate for the start of the beta.
World Restructuring was successfully enabled for Europe on the live game servers (our first hurdle), but almost immediately we started receiving reports that some players were being placed on the incorrect teams. In the movies, this is when the alarms start blaring and the red lights start flashing. The root issue was that some players had stale (bad) shard data that wasn’t updated before they attempted to join WvW. When an impacted player tried to enter WvW, their game client would request to enter WvW using that bad shard data, causing them to be placed on the wrong map and even assigned the wrong team color. This, coupled with an issue in which the UI was not displaying the correct team names, created mass confusion.
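To illustrate the stale-data problem described above, here is a minimal sketch of one common guard: tagging each assignment with a version number and refusing to trust a cached copy that is older than the server's. This is not the actual game code. The `ShardAssignment` type, the `epoch` field, and `resolve_assignment` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardAssignment:
    """Hypothetical record of the team a player has been placed on."""
    team_id: int
    epoch: int  # assumed counter, bumped each time matchups are reconfigured


def resolve_assignment(cached: ShardAssignment,
                       authoritative: ShardAssignment) -> ShardAssignment:
    """Pick the assignment to use when a player asks to enter WvW.

    A cached copy from an older epoch is treated as stale, and the
    authoritative (server-side) copy is used instead.
    """
    if cached.epoch < authoritative.epoch:
        return authoritative
    return cached


# A client still holding data from before reset gets the fresh assignment.
before_reset = ShardAssignment(team_id=3, epoch=41)
after_reset = ShardAssignment(team_id=7, epoch=42)
chosen = resolve_assignment(before_reset, after_reset)
print(chosen.team_id)  # 7
```

Without a check like this, the request would go through with `team_id=3`, which matches the symptom of players landing on the wrong map with the wrong team color.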
Enter issue number two.
There was no doubt a lot of excitement for the beta test. That plus the Friday reset meant that turnout was incredibly high, resulting in some deep map queues. Naturally, players began entering the overflow map, Edge of the Mists, to kill some time while waiting for the queue. While we were busy troubleshooting the first issue mentioned above, we received a ping from our lead server engineer, Robert Neckorcuk (of “Inside ArenaNet: Live Game Outage Analysis” fame). He alerted us to the fact that an abnormally large amount of server resources was being used for Edge of the Mists map instances, something like thirty times the norm.
However, this wasn’t because of the increased player activity—there was something more insidious at play. We were seeing new Edge of the Mists instances being created despite there being plenty of remaining player capacity in previously created map instances. In other words, there were way too many map instances being created compared to the number of active players, which was consuming a large portion of our server resources. Running out of server capacity is a bad thing—trust us on this one. This resulted in a bit of a runaway effect that needed to be manually monitored and addressed. The fact that we didn’t know why the instances were being created was worrisome and definitely chipped away at our confidence in being able to resolve the beta issues on the fly. That’s about the time that another one of our engineers reached out about a third issue. Good times!
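For readers curious what "too many instances for the number of players" means in practice, the sketch below shows the expected behavior under simple assumptions: fill existing instances first and create a new one only when every instance is full. The `MapInstance` class, `place_player` function, and the capacity of 100 are hypothetical and chosen only for illustration. They do not describe the real allocator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MapInstance:
    """Hypothetical map instance with a fixed player capacity."""
    capacity: int
    players: int = 0

    def has_room(self) -> bool:
        return self.players < self.capacity


def place_player(instances: List[MapInstance], capacity: int) -> MapInstance:
    """Place one player in the first instance with room.

    A new instance is created only when all existing ones are full.
    """
    for inst in instances:
        if inst.has_room():
            inst.players += 1
            return inst
    new_instance = MapInstance(capacity=capacity, players=1)
    instances.append(new_instance)
    return new_instance


instances: List[MapInstance] = []
for _ in range(250):
    place_player(instances, capacity=100)

print(len(instances))                  # 3
print([i.players for i in instances])  # [100, 100, 50]
```

With this policy, 250 players need only three instances. The behavior we saw during the beta was the opposite: new instances appearing while earlier ones still had room.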
When releasing new content or features, or running beta events, we try to add extra logging. These logs are records of events occurring within a system. They can be used to ensure the system is working as designed, provide insight into any issues we encounter, and help identify eventual fixes. We have entire servers dedicated to managing all of the logs being generated across the game. Well, on this day, those servers met their match. An error was being thrown by BattleSrv (the new World Restructuring server) at such a high rate that it started to stress our logging infrastructure. These log servers are critical to operating Guild Wars 2, so this was very unsettling to say the least.
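One general way to keep a single repeating error from overwhelming log servers is to throttle it at the source. The sketch below is an illustration of that idea only, not the fix we shipped: it forwards a limited number of copies of each message per time window and counts the rest. `ThrottledLogger`, its parameters, and the error text are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ThrottledLogger:
    """Forward at most `limit` copies of each message per window; count the rest."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.emitted: List[Tuple[float, str]] = []           # what reaches the log servers
        self.suppressed: Dict[str, int] = defaultdict(int)   # what was dropped, per message
        self._window_start: Dict[str, float] = {}
        self._count: Dict[str, int] = defaultdict(int)

    def log(self, message: str, now: float) -> bool:
        """Record `message` at time `now`; return True if it was forwarded."""
        start = self._window_start.get(message)
        if start is None or now - start >= self.window:
            # Start a fresh window for this message.
            self._window_start[message] = now
            self._count[message] = 0
        if self._count[message] < self.limit:
            self._count[message] += 1
            self.emitted.append((now, message))
            return True
        self.suppressed[message] += 1
        return False


msg = "BattleSrv: hypothetical error"
logger = ThrottledLogger(limit=5, window_seconds=1.0)
for i in range(1000):
    logger.log(msg, now=i * 0.001)  # 1,000 errors within one second

print(len(logger.emitted), logger.suppressed[msg])  # 5 995
```

The time is passed in explicitly so the example is deterministic. A real logger would read a clock and periodically report the suppressed counts.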
It’s hard to believe, but the four brief paragraphs of text above describe roughly two hours of events, and all the while the play experience in WvW in Europe had seriously degraded. Our team huddled together and mapped out two possible paths forward.
Option A was to implement a handful of short-term mitigations in an attempt to keep the beta running (aka, the long shot). This would have included deploying a “blank build” to the live game, which would require players to log out and download a minor client update. This would force a data refresh on any player impacted by the stale data issue we mentioned earlier, which would then in turn assign them to the correct team. We should note that we believe this would have addressed the issue, but we would have no way of testing it before attempting it. Additionally, there was no guarantee that the other issues wouldn’t continue to manifest over the course of the beta week as new players logged in to participate in the beta. We’d also likely need to perform the same “blank build” update after NA reset just a few hours later. In this scenario we would also have to disable the Edge of the Mists map for the entire beta week to mitigate the server usage issue. We found a fix for the logging issue, but it would have taken some time to roll the change out (on a Friday night no less), during which our logging servers would continue to get pummeled.
Option B was to end the beta early. This, of course, was the option we ultimately decided on. There was simply too much risk and too many unknowns associated with Option A. Once we made the call to end the beta, we were able to seamlessly revert to World Linking after a quick cycling of the WvW maps.
Later that day…
After a bit of a breather, the team reconvened to conduct our postmortem and create an action plan for the issues we experienced. Here are the biggest wins of the day:
World Restructuring technology worked at scale, even if it didn’t seem like it! This was a huge milestone and gave us a lot of confidence in the technology behind World Restructuring. For every player that was misassigned to a team, there were plenty more that were properly matchmade. Because we were able to successfully enable World Restructuring, we were able to collect enough data during the two-hour beta attempt to help us narrow in on where these issues exist in code. Seeing how these issues manifested on live also helped us better understand how we can improve our internal testing processes to better catch World Restructuring issues in development. Large-scale server features, especially for a system as bespoke as World Restructuring, can be tricky to test completely in the development environment.
Next, the matchmaking algorithm used for the first beta created an incredibly even distribution of players and guilds across teams. In the World Linking system, we can see up to a 50% difference in player activity between worlds. In the first beta, the largest difference between two teams was just 2%—a huge improvement. Creating balanced matches is the primary goal of the World Restructuring system, so this was quite exciting to see.
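As a rough illustration of the comparison above, the snippet below computes the gap between the most and least active team as a fraction of the most active one. This definition of "difference" is an assumption made for the example, and the activity numbers are made up so that they reproduce the 50% and 2% figures. They are not real beta data.

```python
from typing import Sequence


def largest_gap(activity: Sequence[float]) -> float:
    """Gap between the most and least active team, relative to the most active."""
    high, low = max(activity), min(activity)
    return (high - low) / high


# Made-up activity totals, chosen only to match the percentages in the text.
linked_worlds = [1000, 800, 500]
restructured_teams = [1000, 990, 980]

print(f"{largest_gap(linked_worlds):.0%}")       # 50%
print(f"{largest_gap(restructured_teams):.0%}")  # 2%
```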
Finally, we were able to successfully roll back to World Linking with few issues once we made the decision to end the beta. The credit here goes to preparation: the team wrote a detailed runbook in advance for rolling the system back. We tested this process multiple times on our dev servers, but never at the scale seen on the live game. Practice makes perfect.
What’s Next?
There are ten bugs we’ve identified as “must-fix” before we can try another beta, mostly related to addressing the stale data issue detailed above and team names displaying improperly. We’ve made significant progress on these, but there’s still quite a bit of work to be done. As of right now, we’re targeting the 9 November release for these fixes, with our next beta week kicking off on 12 November.
We’ll be making some quality-of-life changes to World vs. World in the 9 November release. Your feedback is clear: skirmish reward tracks take entirely too long to complete, especially for new players or players who are playing on the third-place team in a matchup. To address this, we’ll be increasing the number of skirmish pips earned for match placement from 3/4/5 to 4/5/6. We’ll also be adding a new +1 bonus skirmish pip for players with a WvW rank between 1 and 149. Existing rank-based pip bonuses will also increase by +1 (so a total of +2 for Bronze, +3 for Silver, and so on).
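To make the arithmetic concrete, the sketch below adds up only the two pip sources discussed above, placement and rank bonus, before and after the change. It assumes the 3/4/5 and 4/5/6 values map to third, second, and first place, and that the old Bronze and Silver bonuses were +1 and +2. Any other pip sources are ignored, and the function and table names are made up for the example.

```python
# Placement pips keyed by finishing position (1 = first place).
# Assumes 3/4/5 and 4/5/6 are listed as third/second/first.
OLD_PLACEMENT = {1: 5, 2: 4, 3: 3}
NEW_PLACEMENT = {1: 6, 2: 5, 3: 4}

# Rank-based bonus pips for the tiers named in the post.
OLD_RANK_BONUS = {"Rank 1-149": 0, "Bronze": 1, "Silver": 2}
NEW_RANK_BONUS = {"Rank 1-149": 1, "Bronze": 2, "Silver": 3}


def pips(placement: int, tier: str, new_rules: bool) -> int:
    """Placement pips plus rank bonus pips; other pip sources are ignored."""
    placement_table = NEW_PLACEMENT if new_rules else OLD_PLACEMENT
    bonus_table = NEW_RANK_BONUS if new_rules else OLD_RANK_BONUS
    return placement_table[placement] + bonus_table[tier]


# A new player on the third-place team.
print(pips(3, "Rank 1-149", new_rules=False), "->", pips(3, "Rank 1-149", new_rules=True))  # 3 -> 5

# A Silver player on the first-place team.
print(pips(1, "Silver", new_rules=False), "->", pips(1, "Silver", new_rules=True))  # 7 -> 9
```

Under these assumptions, every player gains two pips from these two sources, which is proportionally the biggest boost for new players on a third-place team.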
We’ve also been taking a look at components of the skirmish track and participation system that aren’t achieving their intended design goals. In the short term, we will be experimenting with removing incentives from some unintended gameplay, and we’ll be removing the “participation grace time” granted by repairing structures in WvW. We’ll also be removing the outnumbered pip bonus, though the outnumbered stat enhancement will remain.
That’s it for today’s update! Thanks for reading.
See you soon,
–The Guild Wars 2 Team