Dolphin MEGA Progress Report: April and May 2021

After finishing up the macOS M1 article the blog staff took a little break. Then they saw the date.

Oh shi-

Upon looking at the actual changelog, however, something became readily apparent: this wasn't going to be just a Progress Report; this was going to be a MEGA Progress Report. The long-rumored era of developers merging everything at once had finally come to pass. We have graphical fixes for Super Mario Galaxy and Luigi's Mansion, crash fixes for Star Wars: Rogue Squadron III, Xenoblade Chronicles, Ultimate Spider-Man, The Legend of Zelda: Skyward Sword (AArch64), and new features that make playing games more pleasant! And about AArch64, there is a litany of optimizations and fixes that will change things across most of the library.

And we could go on: Bounding Box, Interpreter, GBA to GCN connectivity, GPU Syncing, Mouse Locking, and still more! There's even a lengthy dev diary at the end for good measure explaining how the great mystery of Pokemon Box was finally solved. The only way to do it justice is to do it right. So buckle up and get ready for the April and May MEGA Progress Report.

Notable Changes

5.0-14295 - Apple M1 Support for macOS by Skyler

This change was big enough for it to get its own article. Check it out if you haven't!

Long story short: Dolphin has been ported to run natively on M1 hardware by taking advantage of our AArch64 JIT. The M1 has proven to be a rather powerful device that can outrun like-class x86 devices, and has proven that ARM and x86-64 computers can netplay together in some games! But there was one lingering question from the article: is the M1 that special? Because of Android's (well earned) reputation, many of the improvements to Dolphin's AArch64 JIT have flown under the radar. It's hard to see performance improvements vividly on phones and tablets with weak processors, aggressive governors, and obnoxious driver bugs getting in the way. So in order to right that wrong, we've brought in the best Windows on ARM device: The Surface Pro X!

Note: We have the Surface Pro X 2019 available for testing. Technically, the Surface Pro X 2020 is the "best" Windows on ARM device, however that machine uses the exact same chip (8cx) just with a slight overclock. In single-core benchmarks it performs roughly 9% faster than the 2019 version, so just add 9% to these results for a rough estimate of how the 2020 model would perform.



So how does it perform? Pretty damn well. The Surface Pro X trades blows with our high end 2018 Intel MacBook Pro, which is something we never expected we'd say of an ARM device just two years ago. However, the 8cx in the Surface Pro X just doesn't have the raw power of Apple Silicon, usually getting around half the performance of the M1. Still, if you're looking to run Dolphin on an AArch64 device and prefer Windows, you'll find that the Surface Pro X is a very capable little emulation machine. As a bonus, the Pro X uses D3D12 (though only D3D12) which supports features like geometry shaders that are missing in MoltenVK, meaning that games like Mega Man Network Transmission will render correctly.

Either way, it is an exciting time to be in computing and emulation. Thanks to the considerable improvements in recent ARM devices, we're now finally able to see the AArch64 JIT chew through games like we always hoped that it would. And on that note, let's just say that our AArch64 JIT has been getting much better at its job recently.

5.0-14066 - JitArm64: Greatly Improve Rounding and Conversion Accuracy by JosJuice

Now that we have high-performance AArch64 devices, expectations for our AArch64 JIT have been raised. While users still tend to first prioritize performance, there's been increased pressure on the AArch64 JIT to give the same level of accuracy as desktop builds. To that end, JosJuice has taken up the project of bringing the AArch64 JIT to parity with the x86-64 JIT and this change is a huge step toward closing the gap in compatibility.

You see, the PowerPC CPU in the GameCube and Wii allows software to configure the rounding mode used for floating point calculations, and whether denormals (very very small numbers) should be flushed (rounded to zero) automatically. Our AArch64 JIT just straight up ignored this feature while our x86-64 JIT respected it, so JosJuice decided to implement it. However, this revealed another issue. Sometimes the JIT needs to be able to roundtrip (reverse) a single-to-double floating point conversion - a floating point number started in single precision (32-bit), was previously converted to double precision (64-bit), and now needs to be converted back to single precision. Our AArch64 JIT was doing this very simply by just emitting an instruction that converts between the two precision levels. However, now that the AArch64 JIT was respecting that games can choose to round very small numbers to zero, this instruction could no longer reverse the precision change of the denormals. That data was now zero.

So JosJuice borrowed the x86-64 JIT's solution to this problem and had the AArch64 JIT perform the floating point precision conversion manually with a bunch of integer operations. Not only does this fix the problems introduced by respecting a game's rounding mode and denormal settings, it also made our floating point rounding more accurate than it has ever been before on AArch64. Difficult rounding quirks like the slowly rising platforms in Super Mario 64 are now working properly in our AArch64 JIT with this change!
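To give a feel for what "a bunch of integer operations" means here, below is a minimal Python sketch of a single-to-double conversion and its reverse done purely on the bit patterns. This is not Dolphin's JIT code; the function names are made up for illustration, and the narrowing function assumes the double holds a value that started life as a single. Because no floating point instruction touches the value, a flush-to-zero setting has no chance to wipe out a denormal.

```python
import struct


def single_bits_to_double_bits(s: int) -> int:
    """Widen IEEE-754 single bits to double bits using only integer math."""
    sign = (s >> 31) & 1
    exp = (s >> 23) & 0xFF
    frac = s & 0x7FFFFF
    if exp == 0xFF:                        # infinity or NaN
        d_exp, d_frac = 0x7FF, frac << 29
    elif exp != 0:                         # normal number: rebias the exponent
        d_exp, d_frac = exp - 127 + 1023, frac << 29
    elif frac == 0:                        # signed zero
        d_exp, d_frac = 0, 0
    else:                                  # single denormal: normalize it by hand
        shift = 24 - frac.bit_length()     # moves the leading 1 up to bit 23
        d_exp = 1 - shift - 127 + 1023
        d_frac = ((frac << shift) & 0x7FFFFF) << 29
    return (sign << 63) | (d_exp << 52) | d_frac


def double_bits_to_single_bits(d: int) -> int:
    """Narrow double bits that hold an exactly representable single."""
    sign = (d >> 63) & 1
    exp = (d >> 52) & 0x7FF
    frac = d & ((1 << 52) - 1)
    if exp == 0x7FF:                       # infinity or NaN
        return (sign << 31) | (0xFF << 23) | (frac >> 29)
    if exp == 0:                           # zero (double denormals are out of range)
        return sign << 31
    e = exp - 1023 + 127
    if e >= 1:                             # still a normal single
        return (sign << 31) | (e << 23) | (frac >> 29)
    mantissa = (1 << 23) | (frac >> 29)    # restore the implicit leading 1
    return (sign << 31) | (mantissa >> (1 - e))


smallest_denormal = 0x00000001
widened = single_bits_to_double_bits(smallest_denormal)
print(hex(widened), hex(double_bits_to_single_bits(widened)))
```

The denormal branch is the part a flushing FPU would get wrong: the integer version renormalizes the value on the way up and denormalizes it again on the way back down, so the original bits survive the roundtrip.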

Will the platform slowly rise over time as it should?
It does! This quirk is now properly emulated in AArch64! Painfully observed over the course of five hours on a Pixel 3.

As for the initial fix of respecting rounding mode and denormal flushing, it fixes many crashing, hanging, and physics bugs only present in the AArch64 JIT.

This was the last thing users saw... BEFORE THE VOID CLAIMED THEM.

These changes do come with a small performance penalty, but they were necessary for compatibility purposes. Rather than trying to implement this without a performance hit, JosJuice knew that there was much more performance to be gained elsewhere that wouldn't cost accuracy.

5.0-14128 - JitArm64: Implement FPRF Updates by JosJuice

Floating Point Result Flag (fprf) is a feature of the GameCube that lets games check a flag to see information on the previous floating point result. With this functionality, a game has access to whether the float was negative, positive, positive infinity, negative infinity, etc. Very few games rely on this behavior, so the feature wasn't actually implemented in the AArch64 JIT. Any title that did use it would have to fall back to interpreter on many floating point instructions for a rather hefty performance penalty.
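As a rough illustration of what updating fprf involves, here is a small Python sketch that classifies a double into the five-bit result codes defined by the PowerPC architecture. The function name is hypothetical and this is not how the JIT does it (the JIT has to emit equivalent machine code after every affected instruction), but it shows the amount of bookkeeping each floating point result needs.

```python
import struct

# FPRF encodings from the PowerPC architecture (bits: C, <, >, =, ?).
FPRF_QNAN = 0x11
FPRF_NEG_INF = 0x09
FPRF_NEG_NORMAL = 0x08
FPRF_NEG_DENORMAL = 0x18
FPRF_NEG_ZERO = 0x12
FPRF_POS_ZERO = 0x02
FPRF_POS_DENORMAL = 0x14
FPRF_POS_NORMAL = 0x04
FPRF_POS_INF = 0x05


def classify_double(value: float) -> int:
    """Return the FPRF code describing a floating point result."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    negative = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        if fraction:
            return FPRF_QNAN
        return FPRF_NEG_INF if negative else FPRF_POS_INF
    if exponent == 0:
        if fraction == 0:
            return FPRF_NEG_ZERO if negative else FPRF_POS_ZERO
        return FPRF_NEG_DENORMAL if negative else FPRF_POS_DENORMAL
    return FPRF_NEG_NORMAL if negative else FPRF_POS_NORMAL


print(hex(classify_double(-0.0)), hex(classify_double(1.5)))
```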

Implementing this in AArch64 increases expected CPU performance in many Sega developed games, including F-Zero GX, Super Monkey Ball, and Super Monkey Ball 2.

Bring a controller, though. Super Monkey Ball 2 on a touchscreen is something we'd wish on no one...

Because of the way fprf works, getting solid performance numbers that look pretty on a graph isn't really all that easy. In F-Zero GX, Android devices were already GPU limited before this optimization and don't gain performance. Even on the M1 the performance difference was only recorded in races where the framerate would vary wildly and randomly with 29 AI cars. However, testers did report that the game had less lag overall throughout a Grand Prix.

5.0-14233 - JitArm64 - Implement Floating Reciprocal Estimate Single (fres) and Floating Reciprocal Square Root Estimate (frsqrte) by JosJuice

Not content with just making changes that affect a few games, JosJuice decided to implement some of the more difficult instructions in AArch64. Floating Reciprocal Estimate Single (fres) and Floating Reciprocal Square Root Estimate (frsqrte) are common instructions used in popular games and sometimes are relied on for physics calculations. If you used Dolphin a decade ago, you probably ran into bugs in games such as Super Mario Sunshine, Super Mario Galaxy, and The Legend of Zelda: The Wind Waker thanks to our shoddy implementations back then. These instructions are everywhere! In AArch64, rather than creating crude implementations of these instructions, they simply weren't implemented at all and fell back to Interpreter.

Implementing fres and frsqrte improves CPU performance across most games.

5.0-14239 - JitArm64: Fix Floating Multiply (fmul) Rounding Errors by JosJuice

Perhaps more excitingly, JosJuice also fixed up another common instruction used in physics called Floating Multiply (fmul). This instruction was implemented inaccurately in the AArch64 JIT, meaning that results weren't quite right. Unfortunately, this instruction is also used in game physics all over the place. A few popular examples we know about are Mario Kart: Double Dash!!, Mario Kart Wii, Super Monkey Ball, Super Monkey Ball 2, F-Zero GX, and Donkey Kong Country Returns.

We know that these games were affected by the rounding differences because they all feature some kind of "input replay" system that allows them to save inputs and replay them back later. Because these replays are saved to a memory card or shipped with the game, we can see the replays on both Dolphin and console! This essentially allows for hardware verification of physics! Beyond just breaking replays, fmul errors could give you headaches when trying to play games too. If you played Mario Kart Wii online via the backup servers, these inaccuracies would result in excessive rubberbanding. It also meant you couldn't race against the Staff Ghosts in any of these games and in some rare cases, would make it so well known console strategies wouldn't work. And we're sure there are many more errors caused by bad fmul emulation in other games that we will never be able to verify because they don't have replays.

Physics errors can cause all kinds of problems in Mario Kart Wii.

And with that, we're finally done with the AArch64 changes...

5.0-14254 - Interpreter: Fix Floating Convert to Integer Word (fctiwx) Rounding by JosJuice

Or are we?! So technically this is a fix to Dolphin's interpreter... but since the AArch64 JIT doesn't implement this instruction, it uses the interpreter fallback for this instruction!

The main reason this went unnoticed is that this is a rather rare instruction. Floating Convert to Integer Word (fctiwx) converts a floating point number to an integer number. This is normally a simple task... except that we have to emulate the behavior of the PowerPC version of this instruction. Thankfully, Dolphin already implemented this instruction to an accurate enough degree during the sweeping revolution of x86-64 JIT fixes that brought about the era of replay verification in Dolphin. Everything is fine, right?

We were so blissfully unaware of our mistake.

But there was a bit of an oversight in our grand success. While all of the games that we tested were working in the x86-64 JIT, the only game we tested every CPU backend in was Mario Kart Wii due to easy access to ghosts. Remember how we mentioned that fctiwx was rare? Well, Mario Kart Wii doesn't use that instruction so it was never tested on the interpreter whatsoever!

Fast forward 7 years. JosJuice is going through all of the AArch64 JIT deficiencies and trying to get it to parity with the x86-64 JIT. One of the last known issues is that the physics in F-Zero GX do not sync up in certain replays. So we dug up the original replays used in the ancient video for verification. While it still synced up fine on the x86-64 JIT, no matter what JosJuice did they could not get it to sync up on the AArch64 JIT. Even falling back every single instruction to interpreter wasn't enough.

Upon reviewing the old testing data, testers realized that they never actually tested F-Zero GX's replays in the interpreter. In order to make things right that once went wrong, the test was at long last run under the interpreter.

Oh yeah, that's wrong.

Mystery solved - it wasn't a JIT bug, but an interpreter bug that was hidden by the x86-64 JIT implementation of fctiwx. JosJuice dug into the issue over the course of several days and slowly figured out why it was broken. The x86-64 and PowerPC actually behaved similarly enough with the fctiwx instruction to not need serious adjustments, but the interpreter needed a custom algorithm to try to get an exact replication of what happens regardless of architecture. Unfortunately, the custom code had flaws that caused incorrect rounding that could bleed over into game behavior differences. After numerous failed attempts, JosJuice finally found the flaw and fixed the interpreter implementation. Now you can finally verify your replays on your favorite AArch64 devices or, if you're insane, you can use the pure interpreter and waste half of your day.

As for fctiwx on the AArch64 JIT, JosJuice hasn't implemented this instruction yet as it would be rather complex. The AArch64 architecture has not one but five different equivalents of fctiwx to pick between depending on the rounding mode needed. Our interpreter can handle this instruction on ARM for now.
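For the curious, here is a minimal Python sketch of what a float to 32-bit integer conversion looks like once rounding modes and saturation are taken into account. It follows our reading of the PowerPC rules (round using the current mode, saturate out-of-range values, turn NaN into 0x80000000) and is not the interpreter's actual code; the function and mode names are made up for illustration.

```python
import math

INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000

# One rounding function per PowerPC rounding mode (names are illustrative).
ROUNDERS = {
    "nearest": round,                 # Python's round() breaks ties to even
    "toward_zero": math.trunc,
    "toward_positive": math.ceil,
    "toward_negative": math.floor,
}


def fctiw(value: float, rounding_mode: str = "nearest") -> int:
    """Convert a double to a 32-bit integer word, returned as an unsigned bit pattern."""
    if math.isnan(value):
        return 0x80000000
    if math.isinf(value):
        return INT32_MAX if value > 0 else 0x80000000
    rounded = ROUNDERS[rounding_mode](value)
    clamped = max(INT32_MIN, min(INT32_MAX, rounded))   # saturate instead of wrapping
    return clamped & 0xFFFFFFFF


print(fctiw(2.5), fctiw(3.5), hex(fctiw(-1.5, "toward_zero")))
```

Every one of those branches is a place where a custom algorithm can round slightly differently from hardware, which is all it takes to knock a replay out of sync.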

5.0-13988 - GameINI: Add Ability to Disable icache Per-Game by Pokechu22

Okay, now we're really done talking about the JITs. I promise. This change has almost nothing to do with them other than the fact that Dolphin doesn't target emulating the CPU pipeline perfectly. We've gone into detail about this in the past, but emulating GameCube/Wii processor cache behaviors is a huge performance problem that we're not sure can be handled fast enough to keep games playable. On top of that, the number of unknowns with it means there would need to be tons of hardware testing and likely refactoring of how Dolphin works. Unless an ingenious solution were devised, it's not likely Dolphin's JITs will ever get more than rudimentary icache emulation at a reasonable speed. There's a reason pure interpreter is so slow, and more accurate icache emulation is part of it.

Our current way of handling the rare cases where icache/dcache matter is by either patching the game, or massaging things into working well enough to avoid user-facing bugs. To help with the cause, Pokechu22 added support to disable Dolphin's icache emulation on a per-game basis. This isn't really a performance measure as it doesn't affect performance, but it does affect the behavior of several games. Turning off the JIT version of Dolphin's icache emulation fixes hangs in several games.

If you believe you have a game suffering from icache issues, you can disable icache emulation in any title with the new DisableICache option. Simply open the game properties page and add the option under the [Core] section of the Game INI. However, this disables icache emulation entirely, with no interpreter fallback or anything, so anything that requires icache emulation will just fail. One example is the GameCube Main Menu. As such, we've disabled booting GameCube titles that use this option from the GameCube Main Menu.
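As a sketch, a per-game override would look something like the fragment below. The option name and section come from the change itself; the exact "= True" value syntax is our assumption based on how other boolean Game INI settings are usually written.

```ini
[Core]
DisableICache = True
```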

The top box is the default INI showing this game needs this feature and has enabled it. The bottom box would be how you can enable it in games that do not have it enabled by default.

5.0-14007 - DriverDetails: Fix broken vector bitwise AND on Mali drivers by sspacelynx

The Mali drivers are a mystery to mankind. They do things in ways that mere mortals cannot and should not comprehend. Thankfully, newcomer to the Progress Report sspacelynx may actually be a lynx from space that is privy to such foreign concepts.

Dropping the grandeur for a moment, the Mali drivers have slowly been improving, but still have a lot of serious problems. Though this doesn't affect Dolphin in most games, probably the most damning thing they've done is implement glCopyImageSubData... on the CPU. That should give you an idea of the kinds of challenges that we're up against when trying to get good performance on Android.

Ignoring the many performance challenges, there are also major bugs with Mali's GLSL shader compiler. While we're not quite sure of the details of why and how, it doesn't seem to be able to handle bitwise AND operations between two integer vectors when just one of them is a non-constant. Once sspacelynx knew what was wrong, it wasn't all that hard to fix. The commendable part was them doing all the debugging required to actually narrow it down. And for their efforts, they've fixed a ton of graphical errors on Mali.

This is the kind of stuff Mali users have come to expect.


But this next Play Store update, they're in for a welcome surprise.

5.0-14203 - Android: Fix Controller Inputs Not Saving Properly by JosJuice

Dolphin on Android sometimes has to jump through hoops in order to work properly. One of the rather problematic things is that configurations are split up in really odd and annoying ways. For the past couple of months, there have been reports that Dolphin wouldn't save controller configurations unless the user added it to the per-game INI. This issue confused and confounded JosJuice for quite some time until the behavior was finally reproduced by JMC47.

To reproduce the bug, one had to have mapped a controller or another device, and then try to map something else over it. Dolphin would say that it overwrote the settings but in reality it didn't. Because controller settings are stored in two different places on Android, the GUI settings would show that everything configured fine. Unfortunately, the actual settings that the emulator reads from weren't getting updated. A bisect revealed a rather curious cause - Settings.SECTION_INI_ANDROID and Settings.SECTION_BINDINGS both have the value "Android". Because they shared the same value, Dolphin thought that Settings.SECTION_BINDINGS was a part of the new-to-Android layered config system when it hasn't actually been ported over yet. As such, it was unable to overwrite the old values.
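The collision is easier to see in miniature. The Python sketch below is a simplified model, not the Android code: only the two constant names and their shared value come from the bug itself, while the set of layered sections and the save_target helper are hypothetical stand-ins for the real save path.

```python
SECTION_INI_ANDROID = "Android"
SECTION_BINDINGS = "Android"      # same string as above: the root of the bug

# Hypothetical stand-in for "sections already ported to the layered config system".
LAYERED_CONFIG_SECTIONS = {SECTION_INI_ANDROID}


def save_target(section: str) -> str:
    """Decide which backend a setting in this section is written to."""
    if section in LAYERED_CONFIG_SECTIONS:
        return "layered_config"
    return "legacy_ini"


# Bindings were never ported, yet they get routed as if they had been,
# so the INI the emulator actually reads is never updated.
print(save_target(SECTION_BINDINGS))
```

Since the check can only see the string, it has no way of telling the two sections apart. Any fix has to make them distinguishable again in one way or another.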

JosJuice quickly fixed the issue, and now users will be able to configure controllers and save their configuration correctly in the latest Play Store release!

5.0-14201 - Android: Adjust Touchscreen Opacity by MayImilae

If you don't use controllers on Android, you may have noticed that Dolphin's touch controls look a little bit nicer now. MayImilae noticed that 5.0-13545 added opacity controls, giving the user more freedom toward configuring the touch interface. However, when she originally designed the onscreen buttons, Dolphin on Android had no feature to actually adjust the opacity. She had to choose a default that would look nice in most cases and made the opacity a sensible 50%. However, 5.0-13545 also set opacity to 50%, meaning that the base images were at 50% opacity and then Dolphin was making them even more transparent on top of that. This made the buttons harder to see by default.
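The arithmetic behind the regression is short enough to spell out. This sketch assumes the two opacities combine multiplicatively, which is the usual way alpha stacks; the function name is made up for illustration.

```python
def effective_opacity(image_alpha: float, setting_alpha: float) -> float:
    """Opacity the user actually sees when both alphas are applied."""
    return image_alpha * setting_alpha


# Buttons drawn at 50% opacity, shown with a 50% opacity setting on top.
print(effective_opacity(0.5, 0.5))
```

Under that assumption, two "sensible" 50% values stack into buttons that are only a quarter opaque.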

In order to fix things, MayImilae went back and optimized the buttons to make them friendlier to opacity changes, with a much higher base opacity. She also adjusted the default opacity to match how the buttons looked prior to the unintentional regression.

After the touchscreen button opacity adjustments were merged, they were rather faint by default.
These changes and the new default setting make them much easier to see.
Now users can fully enjoy the freedom to configure the touchscreen buttons however they like!

5.0-14019 - Fifo: Run/Sync with the GPU on Command Processor Register Access by Stenzek

This particular change has been frozen in carbonite for two years. Way back then, Stenzek found a way to fix up Dolphin enough to run Star Wars Rogue Squadron III: Rebel Strike without crashing. Now you may remember that the game was working for the Dolphin 5.0 release, but shortly afterwards Dolphin's GPU timings got slightly better and the game stopped working. Stenzek's solution to fix the regression was simple and made sense. Only problem? It required more GPU syncing. He wasn't comfortable merging it because of the potential for performance regressions and sought to find another solution.

He never found one, so the change gathered dust and fell into disrepair as the emulator evolved without it. Yet, for those close to the project, there was a hole in our heart. Rogue Squadron III hadn't been able to run in development builds since 5.0-583 and that was just unacceptable.

There's always a bigger bug.

While Stenzek wasn't willing to merge such a risky change, JMC47 asked Stenzek to rebase the change so it could be tested on modern builds. There wasn't much optimism to get it merged, but Stenzek figured it was harmless and rebased it, not realizing that the shroud of the dark side had fallen.

Unfrozen into the modern era, the pull request was tested by a litany of players looking for performance regressions and accuracy benefits, and they found Rogue Squadron III to be quite operational when they arrived. Not only did it still make Rebel Strike run without crashes, it also eliminated random crashes in Xenoblade Chronicles and Star Fox Adventures during extended play sessions. It even fixed the notoriously crashy targeting computer in Star Wars Rogue Squadron II: Rogue Leader. What kind of mystical energy field was behind this change that it could fix so much?

To understand how this makes things more accurate, you also have to understand how Dolphin runs the CPU and GPU in Single Core and SyncGPU modes. Before 5.0-583, Dolphin's Single Core mode had an infinitely fast emulated GPU. It would instantly finish all of its tasks. This was nice for keeping things in sync, but caused problems when games would expect things to take time. There are some games that will push out as many frames as the GPU can, so having an infinitely fast GPU in Single Core wasn't workable. That's not saying our GPU timings are good (they're not) but they're better than nothing.

The problem is that Rebel Strike was working because of this behavior. By not letting the GPU fall behind or take time, we were working around deficiencies in Dolphin's emulation. Once the timings were in place, Rogue Squadron III no longer ran. Now the GPU took time to run and the timings changed what did what when. This made it possible for Dolphin to give stale information to the emulated CPU because it was reading the registers that the GPU hadn't yet run far enough to update. Rebel Strike would then think that the GPU had locked up. It would either then try to reset the GPU or just crash with no backup plan. Maclunkey!

To address this, Stenzek's change added an extra Sync/Run point when Command Processor registers were accessed. This means that if the CPU goes to access those registers, Dolphin stops the CPU thread and runs the GPU thread until it catches up. Once it has, Dolphin then swaps back over to the CPU thread and it reads the now fresh registers and continues onward. But as users know, more syncing is slower than less syncing, hence Stenzek's hesitation.
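Here is the idea boiled down to a toy Python model. None of these names exist in Dolphin, and a real GPU does far more than add up numbers, but it shows why a register read without a sync point can hand the emulated CPU stale information.

```python
class EmulatedGpu:
    """Toy GPU that only runs queued work when it is told to."""

    def __init__(self):
        self.queue = []           # work the CPU has submitted but the GPU has not run yet
        self.bytes_consumed = 0   # stand-in for a Command Processor register

    def submit(self, size: int) -> None:
        self.queue.append(size)

    def run_until_caught_up(self) -> None:
        while self.queue:
            self.bytes_consumed += self.queue.pop(0)


def read_cp_register(gpu: EmulatedGpu, sync_on_access: bool) -> int:
    """Read the register, optionally letting the GPU catch up first."""
    if sync_on_access:
        gpu.run_until_caught_up()
    return gpu.bytes_consumed


gpu = EmulatedGpu()
gpu.submit(32)
gpu.submit(32)
print(read_cp_register(gpu, sync_on_access=False))  # stale: the GPU hasn't run yet
print(read_cp_register(gpu, sync_on_access=True))   # fresh: the GPU caught up first
```

The cost is also visible in the model: every synced read forces the GPU work to happen right then, instead of whenever it would have been convenient.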

Once this change was in the hands of testers, one of the first things they did was perform performance testing. The early numbers looked extremely discouraging, with a 7% decrease observed in very basic tests. However, when moving the testing onto real games, the performance differences weren't as bleak. For lightweight areas of games such as the main menu of F-Zero GX the performance hit had already dropped to 2%. Fears were that this performance hit would be global, but further testing revealed a more interesting outcome. Most games weren't actually affected at all, and if they were the performance differences would only be noticed in places like menus where performance was already incredibly high. Stenzek still wasn't willing to merge the change, but JMC47 absolutely was.

Stenzek: Merging this? But my lord, is that... legal?
JMC47: I WILL MAKE IT LEGAL.

Before everyone gets their blasters out, there's one major note we left out about this entire change. It does not affect Dual Core! We treat Dual Core very differently than Single Core; for Single Core we try to make things as stable as possible. Dual Core on the other hand we try to make work as well as possible without sacrificing performance. As such, this change was not implemented at all in Dual Core due to worries over the potential performance hit. We're also well aware that our Android users would rather kiss a Wookiee than sacrifice even a minute percentage of performance. Still, we did performance tests on Android using Single Core to see if it made a difference. Turns out the shoddy drivers become the bottleneck far before anything in this change could make a difference.

There were plans to put up a Single Core performance graph in this section, but it honestly would be pointless. Most games would be identical, with the occasional "470 FPS before" vs "465 FPS after" in a homebrew test making the rest of the graph hard to see. So instead, here's a screenshot of Star Wars: Rogue Squadron III looking badass in the latest development builds.

Textures that best most PS3 games, on the GameCube? Factor 5 merely needed to invent a form of texture streaming to make it happen.

Despite the risks, the decision ended up being all too easy. Users that use Single Core expect stability. Dual Core, on the other hand, is for performance and works well enough in a majority of the games. The real kicker is that the dreaded performance regressions that caused the two year long delay ended up being basically nothing in real situations. If only we had given ourselves to the Dark Side a little sooner.

5.0-14092 - Implement EFB Peeks for Compressed z16 Formats by phire

Another mystery from years past finally tackled. Once Star Wars Rogue Squadron III: Rebel Strike was working again, phire decided that they wanted to finally end an annoying bug. The problem was that during the various cinematic cutscenes, Dolphin was unable to emulate the occlusion tests that the game was doing to determine when to stop drawing the glow around engines.

Dolphin had no idea how to tell if the engines were blocked or visible during cutscenes.
This is especially apparent when the X-wing is fully occluded by the door.

The reason was simple: Rogue Squadron III uses the compressed z16 format during cutscenes in order to use the GameCube's 3x multi-sample anti-aliasing. Dolphin doesn't emulate this whatsoever, as it would just make things look worse with Dolphin's already powerful enhancements. However, by not emulating it, it also means that the game wasn't getting the expected values when it was reading the screen. While we still don't emulate 16-bit depth, phire improved Dolphin so that it would do the proper conversions for EFB peeks in order to fix the engine glow occlusion.

And now things finally work correctly. We'd show the other image working too but... it's just a door.

Sometimes just randomly checking out a game can lead to rather amusing realizations. JMC47 had accidentally started up F-Zero X instead of F-Zero GX while searching for performance regressions in 5.0-14019 while it was still just a pull request. Rather than closing the game and opening the correct one, he decided to roll with it. N64 Virtual Console games are interesting workloads for Dolphin and are always worth testing. His day was ruined when a massive seven second freeze halted his testing.

The delay was so long, you might think the game had crashed!

Leoetlino just happened to be present in the wrong place at the wrong time and was roped into debugging the issue. JMC47 was tasked with bisecting the issue while Leoetlino put together a flame graph of what functions were being called during the freeze.

In this CPU time (horizontal axis) graph, we want lots of tall and thin bars. The long horizontal red bar shows that std::_Rb_tree_increment was doing something very bad.

JMC47's bisect came up strange, with two at-fault builds both related to JIT Branching. That, combined with the flame graph, made it obvious why Dolphin was freezing. F-Zero X was invalidating a ton of code during the transitions. This caused Dolphin to destroy 20,000 blocks at once. This is problematic because these blocks were tracked with std::multimap. For those that don't know, std::multimap is typically implemented with a red-black tree structure, which means that every time Dolphin had to destroy a block, it would have to traverse the tree, find the block, and destroy it. It would do this over and over again... 20,000 times.

Realizing how ridiculously inefficient this was, Leoetlino swapped Dolphin over to use std::unordered_map. Instead of a tree, this was a kind of hash table that didn't require traversing through each value to get to the next one. While it couldn't completely remove the slowdown, it was able to restore Dolphin to pre-regression performance.
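To show the shape of the hash table approach, here is a small Python sketch in which a dict (Python's built-in hash table) stands in for std::unordered_map. The BlockCache class and its methods are made up for illustration and only model the bookkeeping, not the real JIT cache.

```python
from collections import defaultdict


class BlockCache:
    """Tracks JIT blocks by the guest address they were compiled from."""

    def __init__(self):
        # Hash table: guest address -> set of block ids compiled from it.
        self.blocks_by_address = defaultdict(set)

    def add(self, address: int, block_id: int) -> None:
        self.blocks_by_address[address].add(block_id)

    def invalidate(self, address: int) -> int:
        """Destroy every block at this address and return how many were removed."""
        return len(self.blocks_by_address.pop(address, ()))


cache = BlockCache()
for i in range(20_000):
    cache.add(0x80000000 + i * 4, i)

destroyed = sum(cache.invalidate(0x80000000 + i * 4) for i in range(20_000))
print(destroyed, len(cache.blocks_by_address))
```

Each invalidation hashes the address and jumps straight to the matching entry. Nothing has to walk from node to node to find the next block, which is the work that dominated the flame graph.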

While there is still a delay, it's nowhere near as bad.

Even outside of F-Zero X this is a slight CPU performance gain across the board as JIT blocks get destroyed when games invalidate code. This also makes JIT Cache Clears faster in the cases where games still generate so much code that it overflows.

Note: For those wondering why we used std::unordered_map/set when much faster hash table implementations are available, we actually tried Swiss tables but the performance improvement still wasn't enough to eliminate the stutters entirely.

5.0-14041 - Scissor Offset Fix for Super Mario Galaxy Roar Effects by ezio1900

The moment this pull request was originally submitted, it drew immediate sighs. Dolphin couldn't be screwing up something that simple, could it...?

Super Mario Galaxy uses a unique zoom blur effect during roars in specific scenes. It's simply a bit of extra flavor and you probably wouldn't realize something was missing if you never played on console. However, developers were acutely aware of it, and wanted to find a way to emulate this bugbear once and for all.

ezio1900 stepped up and dove into the world of Super Mario Galaxy with the mission of solving the mysterious missing roar effects. Using the power of RenderDoc, they took a close look at what the game was doing when the distortion effect was supposed to show up.

RenderDoc allows you to record and examine how things render in incredible detail.

It was long thought that the roar had to be quite the convoluted effect. After all, no one else had been able to fix it over many many years of Dolphin development. But the sad truth was that the problem was caused by a stupid assumption in Dolphin that no one noticed. For some reason, Dolphin assumed that ScissorOffset couldn't be negative and even hardcoded ScissorOffset in the Software Renderer! ezio1900's debugging in RenderDoc undoubtedly proved ScissorOffset shouldn't be hardcoded and could definitely be negative. By implementing ScissorOffset correctly in both hardware and software backends, the bosses in Super Mario Galaxy and Super Mario Galaxy 2 finally let loose their true ferocity!
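A tiny Python sketch shows why the sign matters. It assumes the offset is subtracted from the scissor box's screen coordinates, and both the function and the numbers are made up for illustration rather than taken from the game or from Dolphin.

```python
def scissor_rect(left, top, right, bottom, offset_x, offset_y):
    """Shift a scissor box by a signed offset (subtracted from screen coordinates)."""
    return (left - offset_x, top - offset_y, right - offset_x, bottom - offset_y)


box = (0, 0, 320, 264)
print(scissor_rect(*box, 0, 0))       # no offset: the box stays where it is
print(scissor_rect(*box, -32, -32))   # negative offset: the box moves right and down
```

If the offset can never go below zero, the second case simply cannot be expressed, and anything drawn through that scissor box ends up in the wrong place or gets cut away.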

What a cute piranha plant.
Oh no, a scary monster!

5.0-14120 - Fix Out of Bounds Texture Coordinate Behavior by pokechu22

And now we go from the last mainline Mario game on the Wii to the first Mario game on the GameCube. Luigi's Mansion is a well beloved launch title for the GameCube and helped bring players into the era of Next Generation graphics. It's also home to one of the oldest known graphical bugs in Dolphin: The Portrait Blur Effect.

It almost looks as though Mario was sitting behind an open window.
On console, it's much more clear he's behind something.

As all mysteries must eventually be solved, flacs dove in to see what was wrong. That part was actually easy - they found out that Luigi's Mansion was relying on an undefined behavior with setnumtexgens in order to render the effect. setnumtexgens sets the number of texture coordinates, but it's awkward to handle because the number can come up multiple times during the pipeline and can change. This awkwardness is probably the reason why developers didn't realize they were using more texture coordinates than were actually set, which doesn't have any kind of documented behavior. Without hardware testing, flacs was able to get the blur rendering, but wasn't confident that the fix was correct and left it for someone else to pick up later.

Fresh off hardware testing several other issues, pokechu22 took on the role of actually figuring out what was going on with the painting in Luigi's Mansion. The base effect is actually a simple blur achieved with indirect textures. This is not unlike how other blurs and offsets are done in games like The Legend of Zelda: The Wind Waker. The developers used a 128x128 texture generated from EFB copies that shows Mario, and another 64x64 texture that contains random grayscale noise. They also separated the texture coordinates for both of them and used texture scale to create a blur effect.

One problem: like we said above, developers forgot to set the number of tex gens to two, so only texture coordinate zero is actually valid. That's why the blur was never rendered in Dolphin. More surprisingly, the originally intended effect is never shown on console either! Players have instead been seeing a different kind of blur effect created by this undefined behavior. In all likelihood, this bug went unnoticed by developers because the effect they achieved looked pretty good and they simply didn't realize that anything was wrong. While implementing the undefined behavior in Dolphin, pokechu22 also fixed Nintendo's bug so that we could see what could have been.

Dolphin didn't emulate the effect at all, resulting in a clear painting.
Dolphin now emulates the effect correctly.
Here's how the effect looks with the game bug patched out.

With that, another mystery meets its timely end.

Bonus Undefined Behavior

Okay, so hardware testing is supposed to make things more accurate without the risk of regressions. But at this point, pokechu22 might actually be cursed. Every time they hardware verify a behavior they seem to just discover more problems. This one was simple, right? There was no way it could happen again.

Whelp.

Right after the fix for Luigi's Mansion was merged, reports came in that blur effects in Viewtiful Joe were now broken and offset. pokechu22 was pressed back into action and found another undefined behavior used in Viewtiful Joe. That isn't to say pokechu22's hardware testing and implementation of the undefined behavior was incorrect. It just had unintended consequences elsewhere with emulation.

After some investigation, pokechu22 discovered that the blur effect is created by rendering an EFB copy, making it transparent, and then drawing it over the screen three times with slightly different offsets. One problem. During the draw, developers turned off the functionality for indirect textures and then proceeded to attempt to use indirect textures. This is rather problematic, as this undefined behavior seems to rely on the order in which pixels are drawn. This is not possible to emulate on Dolphin's hardware backends due to how modern rendering works. Thankfully, the main component of the blur did not use this bug, and pokechu22 was able to restore the behavior to how it was before by using an offset of zero in this case.

5.0-14122 - Update TextureCache Logic for Finding Oversized XFBs by iwubcode

This is a rather annoying concept to understand but thankfully has a very simple effect for users. 5.0-10000 may be an oddly satisfying revision number, but it was also home to a host of regressions. It was causing Dolphin's TextureCache to not recognize identical textures in some cases, making it throw out the "texture" copy and fall back to the RAM copy. This resulted in some pretty awkward issues and made some games unplayable unless you were willing to lock yourself down to Store XFB Copies to Texture and RAM and 1x Native Internal Resolution.

Skies of Arcadia the way it was never meant to look...

5.0-10000 updated the code so that now the XFB's stride was passed down from VideoInterface (VI). Unfortunately, this added an assumption that XFB copies would be contiguous, which isn't true for oversized XFB copies. Because they don't have a stride that matches block width * bytes per block, they take up multiple rows of memory. This means they can't be hashed in a contiguous chunk. iwubcode quickly fixed the assumption so that Dolphin could check hashes on non-contiguous XFB copies.
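
To make the contiguity problem concrete, here is a minimal sketch of hashing a copy row by row using its stride. Everything in it (the function names, the use of SHA-1, the toy 8-byte rows with a 16-byte stride) is an illustrative assumption and is not taken from Dolphin's TextureCache.

```python
import hashlib

def hash_strided_copy(memory, base, row_bytes, stride, rows):
    """Hash only the bytes that belong to the copy, one row at a time."""
    h = hashlib.sha1()
    for row in range(rows):
        start = base + row * stride
        h.update(memory[start:start + row_bytes])
    return h.hexdigest()

def hash_contiguous(memory, base, row_bytes, rows):
    """The faulty assumption: treat the copy as one contiguous chunk."""
    return hashlib.sha1(memory[base:base + row_bytes * rows]).hexdigest()

def make_memory(pad_byte):
    # 4 rows: 8 bytes of pixel data followed by 8 bytes of unrelated padding.
    row = bytes(range(8)) + bytes([pad_byte]) * 8
    return row * 4

mem_a = make_memory(0x00)
mem_b = make_memory(0xFF)  # same pixels, different bytes between rows

strided_match = (hash_strided_copy(mem_a, 0, 8, 16, 4)
                 == hash_strided_copy(mem_b, 0, 8, 16, 4))
contiguous_match = (hash_contiguous(mem_a, 0, 8, 4)
                    == hash_contiguous(mem_b, 0, 8, 4))
print(strided_match)     # True
print(contiguous_match)  # False
```

With a stride larger than the row size, the contiguous hash picks up bytes that are not part of the copy, so two identical copies can fail to match.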

With the bug, you were locked to 1x Internal Resolution or this.
Now things are back to normal.

5.0-14257 - Bounding Box: Account for Pixel Quads by Techjar

This change can also be called 5.0-14257 - Developers Learn Not to F!#* with Bounding Box Part 1 with the other parts coming up next. It started out so innocently, with a simple issue report claiming that Ultimate Spider-Man was crashing randomly during a certain stage. Usually reports like this in less common games are simply a setting issue or Dual Core being Dual Core. So, JMC47 leapt into action assuming this would be a fairly conventional bug. While he had Ultimate Spider-Man, he hadn't really played it outside of making sure it ran in Dolphin. With his patented setting to fix 99% of the crashes in Dolphin (Dual Core disabled) he ran the game using the savefile the user provided expecting nothing bad to happen.

It crashed.

This segment of the game loved to freeze or crash the entire emulator.

A game crashing on Single Core is like the Bat Signal to developers that something is actually wrong. But JMC47 wasn't convinced that this was a legitimate bug yet. He tried using interpreter, software renderer, and even wondered if the savefile itself had put the game into an illegal state. The only way to confirm that was to transfer the savefile to his console and test it from there. It wouldn't be the first time something like that has happened: Tales of Symphonia has a game crashing bug with the Pink Pearl Ring Quest if you do things in an order the game doesn't expect.

The results on console weren't quite as interesting. The game didn't crash on repeated playthroughs, even as he tried to replicate the crash conditions perfectly. In fact, without having to deal with the rampant crashing, JMC47 started playing deeper into the level, learning the game mechanics, and got up to the mission boss! After confirming the game definitely doesn't crash on console, he returned to Dolphin and tried again there. The game didn't crash. He was able to play through the entire level multiple times and die on the boss. He guessed that some setting he had changed must have fixed it and he didn't notice before moving over to test on console. However, even returning everything to their original settings didn't affect the situation.

Pictured: JMC47 reporting to developers that a wild bug was on the loose.

With no crash, it became harder to figure out what was going on. So, he kept playing the game until he got to the mission boss in Dolphin and even defeated the boss this time around. Thinking that maybe he would never be able to solve this bug, he set down the controller to let the cutscene play out... and the game crashed. Relief and anger hit all at once, but at least there was a guaranteed crash that could be investigated. But what had caused the earlier crashes to disappear?

It turns out it wasn't a setting or anything in Dolphin, it was the player. By getting better at the game, JMC47 spent less time in each zone of the city. He had inadvertently been playing well enough that the crash didn't have a chance to manifest. By pure happenstance, he left the game running at the start of the mission and it eventually crashed on its own when he turned the camera. Considering that was much easier than playing through the level each time, that became the new test case for the crash. A bisect was in order.

About six hours after the original bug report, he had found a culprit. 5.0-9892, a Bounding Box change. It has been assumed for many years that Ultimate Spider-Man has been using Bounding Box in order to achieve the stylish multiview comicbook styled cutscenes that it features prominently throughout the game. Bounding Box was forced to On for this title, so Techjar was brought in to look at the issue and reluctantly dug out old hardware tests to check. In a simple test, he found out that Dolphin was off by one when calculating Bounding Box bounds.


Console:  240, 293, 134, 165
Dolphin:  241, 293, 135, 164


No one could quite figure out why Dolphin's numbers were incorrect until Extrems pointed out that Bounding Box is not calculated by pixels, but pixel quads. Dolphin's numbers were too precise. Once Techjar was aware of this quirk, he adjusted Dolphin to take into account pixel quads and the numbers started to behave correctly. From there, all that was left to do was verify things in Ultimate Spider-Man.
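
The numbers above are consistent with a simple rule: minimum edges snap down to the start of their 2x2 pixel quad and maximum edges snap up to the end of it. The sketch below is only an illustration of that rule. It assumes the four values are listed as left, right, top, bottom, and the function name is hypothetical rather than anything from Dolphin's source.

```python
def round_to_pixel_quads(left, right, top, bottom):
    """Expand a pixel-precise box so every edge lies on a 2x2 quad boundary."""
    # Minimum edges round down to an even coordinate (first pixel of a quad);
    # maximum edges round up to an odd coordinate (last pixel of a quad).
    return (left & ~1, right | 1, top & ~1, bottom | 1)

dolphin_values = (241, 293, 135, 164)  # from the comparison above
console_values = (240, 293, 134, 165)

print(round_to_pixel_quads(*dolphin_values))  # (240, 293, 134, 165)
```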

Normally our story would end here... but...

5.0-14316 - Bounding Box: Account for Pixel Quads... Better by Techjar

The world of Ultimate Spider-Man was at peace. But a new horror reemerged from the depths of the fixed bug reports to ravage an innocent user once more. Not even days after the Ultimate Spider-Man fix was merged, Paper Mario: The Thousand-Year Door began having save corruption issues in Chapter 5.

This is a very scary issue and one we had to take seriously. In the traincar, there's a reflection effect that, if emulated incorrectly, can corrupt game data leaving you unable to save properly. The last time we ran into this bug was with fractional Internal Resolutions. The fact that this bug was so damaging and dangerous led us to eventually remove the feature. We couldn't risk an enhancement causing people to lose their saves.

These reflections are pretty cool, but don't you dare emulate them wrong.

This user had their savedata corrupted, but thankfully if you have a hex editor, you can fix the GameID and repair the file. We did that, hoping that it was unrelated to the recent adjustment to Bounding Box. Unfortunately, they went back through the area and it corrupted again. Eventually, a bisect confirmed our fears: 5.0-14257 - Bounding Box: Account for Pixel Quads was at fault. But that was hardware verified to be the correct behavior... what exactly were we doing wrong?

It turns out that it was correct... almost all of the time. The Pixel Quad behavior is true when the game is reading back Bounding Box values, but it isn't correct to round the default values. Techjar was rounding the values directly, meaning that even the default values were getting rounded, which was enough to cause the game to go out of bounds and create a chain reaction that puts the game into a state where it can no longer save properly. In order to fix this, Techjar moved the rounding into the shader code so that only the result is rounded, rather than the actual values. This time around, we carefully tested every trouble spot to make sure there wouldn't be another regression before finally merging it.

At this point, Dolphin's Hardware Bounding Box emulation is tested to work in every retail Bounding Box game and the latest round of fixes should make Dolphin more resistant to small imperfections.

5.0-14326 - Bounding Box: Add Fallback For When Bounding Box is Unsupported or Disabled by Techjar

After investigating Paper Mario: The Thousand-Year Door, Techjar discovered that at the start of every draw call, an SDK-related function appeared to be writing default values to the Bounding Box registers. These values made a lot more sense to use than simply zeroing everything out when Bounding Box was disabled, so he hardcoded the registers to match these new values. Surprisingly, it made it so that Ultimate Spider-Man would run without crashing even if Bounding Box wasn't supported or was disabled.

Realizing the value of this, he cleaned up the implementation to use the default values as they were provided instead of hardcoding them. As a side-effect, this allowed testers to run through Ultimate Spider-Man without Bounding Box to see what kind of visual differences would crop up! Unfortunately, they didn't really find anything. The stylized cutscenes looked the same, all of the major effects looked to be working. The only thing that made them sure that Bounding Box was even being used was that performance was a bit higher when Bounding Box was disabled. Everything suggested that the game might be writing to the Bounding Box registers without actually using them.

Then, JMC47 noticed some flickering in the distance as he mashed Bounding Box on and off in the level that was crashing the game.

Without Bounding Box, something clearly seems to be missing...
It's missing almost the whole footprint!
In motion, we can clearly see it is more than just a texture.

It turns out that the game was using Bounding Box to do some kind of occlusion effect! It paints 3D decals onto floors or walls, and then uses Bounding Box to remove the original geometry that would be covering it up. With this, they are able to dynamically punch craters into the ground or walls wherever they want! Seeing this cleared up literally all the mysteries about why the game was crashing and lined up with all of the listed test cases on the Dolphin Wiki.

Do note that games that rely on Bounding Box calculations may still hang even with this improved fallback. Also please don't expect this to make Paper Mario games work without Bounding Box; they rely on Bounding Box for game logic so they will hang during certain scenes with it disabled.

5.0-14311 - D3D11/OpenGL - Cache Bounding Box Reads Between Registers by Stenzek

While making Bounding Box more accurate is nice, what if we could make it much faster? That's what Stenzek did when he implemented a missing optimization in the OpenGL and D3D11 backends. The way Bounding Box works is that it uses its four registers to draw a rectangle over the screen. Previously, Dolphin was reading one, stalling the GPU to sync, reading the next one, stalling the GPU... etc. Having one point of a rectangle isn't very useful, so it's very unlikely a game would read only one value and then immediately use it.

Noticing this long ago, Stenzek added a caching mechanism into the D3D12 and Vulkan backends that makes Dolphin do multiple reads at once as long as the GPU doesn't do any draw calls between them. This way, if the game is reading all four Bounding Box registers without drawing, Dolphin only stalls the GPU once instead of four times. By now implementing this change in OpenGL and D3D11, these backends see up to a massive 30% performance boost in Bounding Box limited scenarios.
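
The caching idea can be sketched in a few lines. This is a toy model under stated assumptions: the class name, the `stalls` counter, and the way a draw call hands over new values are all hypothetical stand-ins, not Dolphin's backend code.

```python
class BoundingBoxCache:
    """Toy model: one GPU sync fetches all four registers until the next draw."""

    def __init__(self, gpu_values):
        self._gpu_values = list(gpu_values)  # stand-in for GPU-side registers
        self._cached = None                  # CPU-side copy, valid until a draw
        self.stalls = 0                      # how many times we had to sync

    def on_draw(self, new_values):
        # A draw call may change the registers, so the cached copy is stale.
        self._gpu_values = list(new_values)
        self._cached = None

    def read(self, index):
        if self._cached is None:
            self.stalls += 1                 # one stall reads back everything
            self._cached = list(self._gpu_values)
        return self._cached[index]

bbox = BoundingBoxCache([240, 293, 134, 165])
values = [bbox.read(i) for i in range(4)]
print(values, bbox.stalls)   # [240, 293, 134, 165] 1

bbox.on_draw([0, 639, 0, 479])
print(bbox.read(1), bbox.stalls)  # 639 2
```

Without the cache, the four reads in the first step would each have paid for their own sync.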


Note that Ultimate Spider-Man is also a Full MMU title, so the performance gains are limited by other bottlenecks. These improvements are most noteworthy during Store EFB Copies to Texture and RAM effects in these titles and do not affect other areas as much. This is not really an issue as the effects this change does optimize were among the most demanding moments in the game.

You also might have noticed the missing OpenGL data with AMD graphics cards. Well, that's intentional. Stenzek investigated a long reported crash with Bounding Box operations on OpenGL with AMD GPUs and determined it was a driver bug and came up with a workaround. Speaking of that...

5.0-14330 - OpenGL: Work-Around Windows AMD Driver Bounding Box Issue by Pokechu22 and Techjar

This has been an annoying thorn in our side for quite some time. AMD's Windows drivers would crash in Bounding Box titles. All the attention on Bounding Box led to some investigation on the issue from Stenzek and a WIP implementation of a proposed fix from Techjar. That fix was then iterated upon by Pokechu22 into what became the finalized version of the work-around.

We've actually known about this bug for quite some time. For some unknown reason AMD drivers on Windows only write to the first field of a Shader Storage Buffer Object (SSBO) binding in a shader when using atomics. Our Bounding Box implementation uses four field SSBOs, one for each point of the bounding box rectangle, so this bug made it so only the first Bounding Box coordinate was updated. With three out of four coordinates as garbage, Bounding Box operations would crash.

This was easy enough to hack around, but due to AMD's OpenGL performance being rather slow on Windows, we were reluctant to add a hack just for them. After all, most AMD users would be using D3D11, D3D12, or Vulkan for superior performance, right? Well, it turned out that a lot of AMD users do stick with our default backend, OpenGL, and we've seen a lot of reports about broken Bounding Box. In order to make things simpler for our users and reduce the strain on our support staff, we decided that a small hack would be more than worth the maintenance cost.

Pokechu22 made it so that Dolphin would use an int4 instead of an array, which bypasses the issue. Unfortunately, int4 with atomics is not supported under Metal/MoltenVK, so both methods have to be maintained. Even if the fix is a little messy, it'd make things easier on support staff and simpler on users instead of having to be told to avoid a backend. With all of this, we could test some of the issues we were having with OpenGL on a non-NVIDIA card.

5.0-14318 - OpenGL - Force Memory Barrier When Reading Back Bounding Box Values by Stenzek

And now it all comes together with one final Bounding Box change. Remember that crash in Paper Mario: The Thousand-Year Door? Well, because we weren't able to access that area while testing Bounding Box Caching, we didn't realize that Stenzek's optimization broke OpenGL Bounding Box in the room right after the crash.

OpenGL wasn't working quite right.

Except maybe not! Though it wasn't merged yet, Pokechu22 used the workaround above to confirm that Bounding Box on OpenGL was working fine... on AMD. It turns out that only NVIDIA was affected by this regression, which neither Stenzek nor Pokechu22 were using at the time. Techjar started grabbing the Bounding Box values off of his NVIDIA GPU while Stenzek switched over to test on an NVIDIA graphics card.

After investigating, it turns out that on NVIDIA, the OpenGL code path Dolphin is using to write Bounding Box values to memory buffers is not coherent. This is an extremely complex topic, but to put it simply, coherency means that when two different pieces of hardware read a memory region associated with a bit of data, both devices will read the same value regardless of where the data actually came from. In the case of Bounding Box, the CPU and GPU need to share and edit the Bounding Box values across system memory and VRAM, and they both need to see the same values for it to work correctly. However, because on NVIDIA Dolphin is writing Bounding Box values in an incoherent manner, the CPU could be using old data. There's our bug. Why didn't these issues appear before the optimization? ┐(´-`)┌

AMD cards are using a different code path so they aren't affected by this regression.

We could rebuild this OpenGL code path to be coherent, but it would have some potential performance implications. Fortunately, OpenGL already has a solution to this. According to OpenGL spec, a memory barrier can be used to make this codepath coherent. So Stenzek did exactly that.

The memory barrier allowed us to keep the performance boost while fixing the bug.

After tampering with Bounding Box emulation for far too long, everything was finally working again with higher performance and compatibility to boot! Now there won't be any more regressions. ...Ah damn.

5.0-14339 - Bounding Box: Please just work now by Techjar and Stenzek

The date is June 5th, 2021. The Progress Report is being buttoned up and readied for launch. Only one problem: Paper Mario: The Thousand-Year Door and Super Paper Mario are still both broken despite Bounding Box emulation having become more accurate than ever. In fact, Dolphin's software renderer now works in 100% of Bounding Box games without issues. Yet, D3D11, D3D12, Vulkan, and OpenGL all fail at critical junctures to the point where both games are now unplayable. The culprit? The hardware verified Pixel Quad behavior.

Both of the broken games had the same symptom. Paper Mario: The Thousand-Year Door and Super Paper Mario do a special effect at points where they will uncover parts of the screen through Bounding Box tests. The games are very particular about the Bounding Box values when testing this, and getting them wrong can result in the game hanging because it will be waiting for the effect to finish... forever.

In Software the effect appears and the game can continue.
Oh come on.

So, what's going on? We were able to narrow down the issue relatively fast since the Software Renderer wasn't affected. Bounding Box Register 1 was rounding differently in some cases.


Software Before: 608
Software After:  607
Hardware Before: 608
Hardware After:  609


The true value was actually 607.5, but Bounding Box uses integers so it must be rounded to an integer. Originally it rounded to 608, which is technically off by one but it was close enough. But now it is rounding to 609 because of the way Bounding Box is calculated in pixel quads. There's one tiny issue with this: Coordinate 609 is outside of the render area, causing the game to fall into an undefined state and freeze. The real kicker was that the Software Renderer handled things essentially the same, so no one had any clue why there were rounding differences at all. Thankfully, Stenzek came in with the information we needed to push us in the right direction.


When used in a shader, SV_Position describes the pixel location. Available in all shaders to get the pixel center with a 0.5 offset.


Essentially, we were rounding the pixel center in a case where the pixel center was already 0.5, throwing off all rounding thereafter. Even though the previous changes increased accuracy in other areas, because of a flaw somewhere else these improvements broke everything. Techjar removed the extra rounding and the games started working again. And now we know to never touch Bounding Box ever aga-

Wait what was that. Zoom, enhance!
@!#?@!

Well, we aren't done with Bounding Box quite yet. Thankfully the issue above appears to be limited to Vulkan and OpenGL, and possibly a result of driver bugs. At this point, all of the crashes and hangs that we know of are fixed, including some in Disney's Magical Mirror (pictured above) and Disney's Hide & Sneak. It's the lesser of two evils to deal with some visual bugs while we do more investigation into Bounding Box to try to get things perfect. But that's all the Bounding Box coverage we're going to have in this Progress Report. Seriously, if something else breaks between now and publishing it's going into the next Report.

5.0-14304 - Windows: Implement Mouse Lock by Filoppi

Mouse lock has been one of the most requested features for Dolphin. Users often use their mouse in order to control the pointer in Wii games, which means that if you're not running Dolphin in full screen or have multiple monitors, it's very easy for you to lose track of your mouse and click off the window. All the while, the game keeps on playing while you're not in control!

Filoppi has been doing a lot of work to Dolphin's input backend in order to clean things up and bring in new, exciting features. Mouse lock is one of them. The one thing to remember is that this is an implementation for Windows. Because mouse locking has to be extremely precise with its timings so as not to let the cursor leak out, each operating system will need its own implementation. For now, the option can only be enabled on the Windows version of Dolphin.

A long awaited feature finally arrives.

Filoppi also implemented a dedicated hotkey to unlock the mouse cursor at will, but using window swapping hotkeys like ALT-TAB will work as well.

5.0-14272 - Serial Interface: Fix NOREP and COMERR by endrift and Bonta

Sometimes in the darkest recesses of Dolphin, you find some of the most troubling behaviors. endrift was investigating connectivity issues between mGBA and Dolphin, and stumbled upon a rather odd behavior. On console, when there is a timeout over Serial Interface (SI) the SI hardware will send NOREP (no reply) to the SDK's error handler. The SDK error handler will then write NO_RESPONSE into the SI Buffer, which the game will read and then cancel the connection. Dolphin however was bypassing this error handling entirely, so when the SI Hardware was supposed to send NOREP, Dolphin simply wrote NO_RESPONSE into the SI buffer instead. This is fine, as long as the game doesn't pay attention to what's going on. And apparently Four Swords+ does.

The game would freeze here, thankfully after saving.

Now Dolphin properly follows all the steps, sending NOREP to the error handler instead of bypassing it. With that, Four Swords+ works correctly.

Bonta further hardware tested another issue with the response communication error (comerr) where Dolphin was also misbehaving. Because both of these changes were touching similar code, they were rolled into one pull request. With both of these issues fixed, Final Fantasy Crystal Chronicles and Four Swords+ Navi Trackers Mode no longer hang in cases where they are trying to reset the GBA emulator.

5.0-14306 - GameINI: Patch Crystal Chronicles Race Condition with GBA by Bonta

One of the most annoying parts of playing Final Fantasy Crystal Chronicles is the incessant waiting for connection it does between every single map change. This happened on real hardware as well, but not to the same mind-numbingly long degree. Usually it'd be two to three seconds of waiting on a GameCube, but up to two minutes in Dolphin.

This could go on forever...

Dolphin and mGBA both seemed to be doing everything reasonably. So in order to try to solve this issue, Bonta wrote a litany of hardware tests to try and figure out the exact timing. What they discovered is that the waiting was seemingly random, usually two to three seconds but sometimes up to ten seconds of the game struggling to connect over and over. Things were finally starting to make sense. The problem wasn't with the emulators, but with the game itself.

Looking deeper, Bonta found a rather nasty race condition. During the handshake process while the games connect, Final Fantasy Crystal Chronicles could try to send data to the GBA. That data would be pushed into a queue, and when the connection was established, the transfers wouldn't resume and the game would time out. After ten frames, it would then try the process all over again. It would continue to do this until the timings worked out that it did not send data during the connection process.

This happens on real hardware too, it's just that Dolphin was much more likely to lose the race repeatedly due to timing differences. In order to fix this game bug, Bonta took the unprecedented step of patching the GBA ROM within Final Fantasy Crystal Chronicles that is sent to the attached GBA. Even funnier? This patch can be used to reduce waiting times on a real console. Because this race condition is extremely annoying and impacts the playability of the main play mode of the game, it has been enabled by default. Connecting Final Fantasy Crystal Chronicles to any GBA emulator is now painless.


Bonus Development Diary - The Great Mystery of Pokemon Box

5.0-14002 - MMU: Fix SDR Updates Being Silently Dropped in Some Cases by leoetlino

With all of the attention toward GBA <-> GCN communication in recent months, users have gone through many older issues to see if they've been fixed. However, one particular issue still remained a mystery. Within Pokemon Box is an Adventure Mode that lets you play your copy of Pokemon Ruby and Sapphire on your GameCube without a Game Boy Player. How does it do it? Well, emulation. It's also home to a rather obscure Dolphin bug.

Rather than copying a full GBA game from the Game Boy Advance over the GameCube – Game Boy Advance link cable, which would take several minutes, Pokemon Box actually comes preloaded with a GBA emulator and every version of Ruby and Sapphire for a particular region and loads it directly from the disc. When you save in the emulated copy, it updates the savefile on your cartridge. What's special is that this emulator can generate legitimate Pokemon with certain IV patterns not possible on a real GBA. These Pokemon are valid and can be traded all the way up to Pokemon Sword and Shield, so being able to run Adventure Mode is of real value to Pokemaniacs.

The problem is accessibility. This feature does not work in Nintendont, Freeloader, or any other method of making your Wii region free. However, it did work in Dolphin ever since GBA <-> GCN support was added with VBA-M. This comes with a caveat - it only worked for the NTSC version of the game. The PAL and JP versions of the game would crash... if you had Full MMU enabled. More curiously, if you had Full MMU disabled, it would load the GBA emulator with no game. This was notable because it's the same failure state as the backup loaders. While this issue has been known for some time, copies of Pokemon Box routinely go for over $1,500 on eBay, so we couldn't look into it due to the extreme cost of getting multiple versions of this exceedingly rare game. But thankfully, an enthusiast with both the PAL and JP versions of the game showed up to provide testing and debug data.

The NTSC version of the game works fine...
However, JP and PAL crash or load no GBA ROM depending on your MMU setting.

On top of the cooperation from Pokemon Box enthusiasts, this wouldn't have been possible without efforts from endrift and Bonta for fixing up GBA <-> GCN to the point where we could consistently debug the issue. After years of waiting, work could finally begin.

Over the course of several days, developers broke down exactly what the game was doing by comparing the various versions. Our initial thought was that it had to do with shoddy GBA <-> GCN connectivity, but the connection procedure worked fine on all versions of the game!

Connecting GBAs for trading worked, leaving adventure mode as the only broken feature.

With at least part of the game working, we thought that maybe this was something simple. That thought quickly went out the window when we checked what Pokemon Box JP was doing. The internal GBA emulator was attempting to load a GBA ROM from 0x90000000. Now, that may sound wrong to you if you've been around this blog before. The GameCube's base memory is usually mapped at 0x80000000, and there is 24MB of memory. This means a game typically will have access to 0x80000000 to 0x817FFFFF. How exactly would it find anything in 0x90000000? Well, on the JP and PAL versions of the game, it wasn't. So, we checked on the NTSC version of Pokemon Box and...
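
As a quick sanity check on why 0x90000000 looks out of place, the snippet below tests whether an address falls inside main memory. It assumes 24MB means 24 × 2^20 bytes mapped from 0x80000000, and the helper name is purely illustrative.

```python
MAIN_RAM_BASE = 0x80000000
MAIN_RAM_SIZE = 24 * 1024 * 1024  # assumption: 24MB = 24 * 2**20 bytes

def in_main_ram(address):
    """True if the address lies within the normally mapped main memory."""
    return MAIN_RAM_BASE <= address < MAIN_RAM_BASE + MAIN_RAM_SIZE

print(in_main_ram(0x80003100))  # True
print(in_main_ram(0x90000000))  # False
```

Anything at 0x90000000 therefore has to come from an extra mapping the game sets up itself.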

The ROM is there on the NTSC version of the game!
However, JP and PAL have nothing.

The first shoe had dropped. All three versions of the game were doing the exact same thing. The only difference was that it actually worked in the NTSC version. We wondered if they had maybe fixed a bug between releases, but that didn't exactly seem feasible since PAL was released last and NTSC was the middle release. What could be going wrong?

Looking into the process of how the game was mapping the ROM to 0x90000000 was the key. What we observed on the NTSC version of the game was that it would load the ROM into main memory while faking the connection animation with the GBA. Then, after the correct ROM is loaded, it would create a page table that mapped 8 or 16MB of memory at 0x90000000 to the beginning of the ROM depending on the version of the game. Once this is done, the game will say it successfully connected to the GBA and it's ready to go. If everything worked as expected, their GBA emulator will initialize and load the ROM from 0x90000000. However, on the JP and PAL versions of the game, it never mapped the memory! We thought that maybe the fact that the JP GBA ROMs were 8MB was a potential reason for the difference in behavior, but it turned out the ROMs on the PAL version were 16MB, just like the NTSC version. Dolphin seemingly forgetting to map a page table seemed impossible, but that was the reality we were facing with no explanation why.

booto and leoetlino came together in order to finally break this issue open. booto is one of the foremost experts on the GameCube Memory Management Unit and was a part of Dolphin's original implementation of the MMU. leoetlino is an expert with IOS and reverse engineering games, and was brought in originally because we didn't know why Wii backup loaders couldn't boot the game. He quickly re-discovered the 0x90000000 behavior and theorized that backup loaders were likely broken because they were running the game in Wii mode with MEM2 mapped at the same address, but continued analyzing the game regardless.

Eventually he found the code that mapped the 0x90000000 region in the NTSC and JP versions of the game. The initialization code was almost exactly the same in both versions of the game, so it seemed likely to be an issue with Dolphin's MMU emulation. With booto helping to explain some of the more treacherous parts of the MMU, they eventually stumbled upon a rather foreboding line in the PowerPC Microprocessor Family: The Programming Environments manual (also known as 6xx_pem.pdf).


The HTABORG field must have the same number of lower-order bits equal to 0 as the HTABMASK field has lower-order bits equal to 1.


To understand what this means, you need to know that page tables are configured with a 32-bit register called SDR1 (Storage Description Register 1), which contains two fields: HTABORG (Real address of Page Table Origin) and HTABMASK (Encoded size of Page Table). The former is the upper 16 bits of the base address of the page table and the latter determines the table size. The requirement exists because it's simply more efficient given how the actual hardware works.
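
To make the register layout a bit more concrete, here is a minimal sketch of splitting an SDR1 value into its two fields. The type and function names are made up for illustration and are not Dolphin's. The bit positions (HTABORG in the upper 16 bits, HTABMASK in the lower 9) and the 64 KiB minimum table size follow our reading of the 32-bit manual. The value in the usage note is an arbitrary example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; field names follow the manual, not Dolphin's source.
struct SDR1Fields
{
  uint32_t htaborg;   // upper 16 bits of SDR1
  uint32_t htabmask;  // lower 9 bits of SDR1
};

inline SDR1Fields SplitSDR1(uint32_t sdr1)
{
  return {sdr1 >> 16, sdr1 & 0x1ffu};
}

// HTABORG supplies the upper 16 bits of the page table's physical base address.
inline uint32_t PageTableBase(const SDR1Fields& fields)
{
  return fields.htaborg << 16;
}

// The smallest table (HTABMASK = 0) is 64 KiB; each extra mask bit doubles it.
inline uint32_t PageTableSizeBytes(const SDR1Fields& fields)
{
  // HTABMASK is expected to be a run of low-order 1 bits (0, 1, 3, 7, ...).
  assert((fields.htabmask & (fields.htabmask + 1)) == 0);
  return (fields.htabmask + 1) << 16;
}
```

For an example SDR1 value of 0x00100003, this yields HTABORG = 0x10, HTABMASK = 0x3, a base address of 0x100000, and a table size of 0x40000 bytes.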

On the JP version of Pokemon Box the page table is found at 0xbf0000 and the mask is set to 0x1. Since HTABMASK has exactly one trailing one and HTABORG = 0xbf = 0b10111111 has no trailing zeros, there's a mismatch. The game was creating a misaligned page table... and Dolphin was ignoring it entirely.
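
The mismatch can be checked mechanically. The sketch below is our own illustration of the manual's wording, with hypothetical helper names. It counts the trailing ones in HTABMASK and the trailing zeros in HTABORG and compares them. It reads "the same number" as "at least as many", which is an assumption on our part.

```cpp
#include <cassert>
#include <cstdint>

// Count consecutive 1 bits starting from bit 0.
inline int TrailingOnes(uint32_t value)
{
  int count = 0;
  while (value & 1)
  {
    ++count;
    value >>= 1;
  }
  return count;
}

// Count consecutive 0 bits starting from bit 0, looking at no more than `width` bits.
inline int TrailingZeros(uint32_t value, int width)
{
  int count = 0;
  while (count < width && !(value & 1))
  {
    ++count;
    value >>= 1;
  }
  return count;
}

// The manual's rule as we read it: HTABORG needs at least as many low-order
// zero bits as HTABMASK has low-order one bits.
inline bool IsAlignedPerManual(uint32_t htaborg, uint32_t htabmask)
{
  assert(htaborg <= 0xffffu);  // HTABORG is a 16-bit field
  assert(htabmask <= 0x1ffu);  // HTABMASK is a 9-bit field
  return TrailingZeros(htaborg, 16) >= TrailingOnes(htabmask);
}
```

With the JP values, `IsAlignedPerManual(0xbf, 0x1)` returns false: HTABMASK has one trailing one, but 0xbf has zero trailing zeros. A hypothetical even HTABORG such as 0xbe would pass.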

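// Old behavior: ignore any SDR1 write where HTABORG has bits set under HTABMASK.
if (htaborg & htabmask)
  return;
// HTABORG holds the upper 16 bits of the page table's base address.
PowerPC::ppcState.pagetable_base = htaborg << 16;
// HTABMASK extends the hash mask above its 10 always-used low bits.
PowerPC::ppcState.pagetable_hashmask = ((htabmask << 10) | 0x3ff);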
Dolphin dropped invalid SDR1 updates silently.

According to the manual, Dolphin was following the rules correctly. The game was clearly violating the alignment requirement. But Pokemon Box worked on console, so does that mean the manual was incorrect?

Looking at the actual processor diagram told us that what the game was doing would actually work despite the "you must do this" wording. Because real hardware calculates page table entry addresses with a bitwise OR rather than an addition, the second half of the page table is aliased to the first half, and things still work even though half of the table is essentially wasted.

A diagram in 6xx_pem.pdf showing how Page Table Entries (PTE) addresses are generated.

This also matched what we were seeing in the game's very own page table initialization code, which also performs an OR:

The PTE being generated in code.
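
The aliasing is easier to see with numbers. The sketch below is an illustration, not Dolphin's actual lookup code. It derives a base and hash mask the same way as the snippet above, then forms a PTE group (PTEG) address once with OR and once with addition. The struct and function names are hypothetical. The shift by 6 assumes 64-byte PTEGs, as described in the manual.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two values derived from SDR1.
struct PageTableConfig
{
  uint32_t base;      // HTABORG << 16
  uint32_t hashmask;  // (HTABMASK << 10) | 0x3ff
};

inline PageTableConfig MakeConfig(uint32_t htaborg, uint32_t htabmask)
{
  assert(htaborg <= 0xffffu && htabmask <= 0x1ffu);  // 16-bit and 9-bit fields
  return {htaborg << 16, (htabmask << 10) | 0x3ffu};
}

// What the hardware diagram shows: base and hashed offset are combined with OR.
inline uint32_t PtegAddressWithOr(const PageTableConfig& cfg, uint32_t hash)
{
  return cfg.base | ((hash & cfg.hashmask) << 6);
}

// What one might naively expect instead: base plus hashed offset.
inline uint32_t PtegAddressWithAdd(const PageTableConfig& cfg, uint32_t hash)
{
  return cfg.base + ((hash & cfg.hashmask) << 6);
}
```

With the JP values (HTABORG = 0xbf, HTABMASK = 0x1), hashes 0x000 and 0x400 both resolve to 0xbf0000 under OR, because bit 16 is already set in the base. Under addition, hash 0x400 would land at 0xc00000 instead. That overlap is the "second half aliased to the first half" behavior.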

However, Dolphin was throwing out the whole thing, meaning the ROM was never mapped and the emulator couldn't load it. We removed the excessively strict check to match what was actually happening on console and...

It works!

"But wait, couldn't the hardware be masking HTABORG to fix the alignment?" That is actually a very reasonable question considering this is how some DSP and VI registers behave.

In this case, however, we are certain that there is no implicit masking. If HTABORG were masked, then the JP version would be trying to write page table entries not to 0xbf0000, but to 0xbe0000. That would cause the 0x10000 byte region preceding the page table buffer to be completely clobbered. Theoretically, the game could have just managed to survive a buffer overflow; that wouldn't be the first time a game corrupted its own memory and still ran properly on console by a stroke of luck.

In practice, when we changed Dolphin to mask HTABORG, Pokemon Box instantly crashed because some important file structures had been overwritten. Since the game works on console, we know that real hardware cannot possibly be masking the HTABORG value.
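
For the curious, the address arithmetic behind that argument is tiny. This is a sketch of the hypothetical masking behavior only, with made-up function names. It is not something real hardware or current Dolphin does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: the base address if hardware cleared the HTABORG bits covered by HTABMASK.
inline uint32_t MaskedPageTableBase(uint32_t htaborg, uint32_t htabmask)
{
  assert(htaborg <= 0xffffu && htabmask <= 0x1ffu);  // 16-bit and 9-bit fields
  return (htaborg & ~htabmask) << 16;
}

// The base address when HTABORG is used exactly as written.
inline uint32_t UnmaskedPageTableBase(uint32_t htaborg)
{
  assert(htaborg <= 0xffffu);
  return htaborg << 16;
}
```

For the JP values, `MaskedPageTableBase(0xbf, 0x1)` gives 0xbe0000 while `UnmaskedPageTableBase(0xbf)` gives 0xbf0000. The difference is the 0x10000 bytes that would have been clobbered.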

That left one last mystery: why was the NTSC version of the game working? The same faulty code existed there, too! Well, everything just so happened to line up correctly. The page table buffer is heap allocated, and on the NTSC (US) version it happens to be located at 0xbc0000. HTABMASK is set to 1 in all versions of the game, but HTABORG is equal to 0xbc (an even number) in the US build, so the alignment check passes and the ROM is mapped correctly.
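
Put differently, the old check shown earlier only rejects an SDR1 write when HTABORG has a bit set under HTABMASK. With HTABMASK fixed at 1, that comes down to whether the heap happened to hand out an even or odd 64 KiB slot. A minimal restatement of that check, with a function name of our own invention:

```cpp
#include <cassert>
#include <cstdint>

// Restates the old condition from the snippet above: accept only if no
// HTABORG bit overlaps HTABMASK.
inline bool OldCheckAcceptsSDR1(uint32_t htaborg, uint32_t htabmask)
{
  assert(htaborg <= 0xffffu && htabmask <= 0x1ffu);  // 16-bit and 9-bit fields
  return (htaborg & htabmask) == 0;
}
```

`OldCheckAcceptsSDR1(0xbc, 0x1)` is true for the NTSC placement, while `OldCheckAcceptsSDR1(0xbf, 0x1)` is false for the JP one.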

With everything finally understood about Pokemon Box, the fix was merged and now all three versions of the game work in Adventure Mode.

Last Month's Contributors...

Special thanks to all of the contributors that incremented Dolphin from 5.0-13956 through to 5.0-14344!

You can continue the discussion in the forum thread of this article.
