How to find an elusive crash?

Recently my team has been exploring our options for doing a 64bit build of our project, now that Unity 4.5 is available with the fix for the crash involving ray casts hitting colliders.

After finally getting most (all?) of our thirdparty libraries/plugins/etc updated or removed so we can successfully build/run our project, we started testing to see if there are any non-DLL based problems with the project before merging our 64bit migration changes into our master branch.

The one thing we have left, and that has been eluding us for days now, is this mysterious crash:

The crash we get manifests as an instantaneous stop of gameplay, with a popup stating that the game has crashed, and that crash information has been put into a folder named with the current date/time.

The crash happens inconsistently (maybe 1 in 8 launches?), so it is sometimes difficult to reproduce. When it DOES occur, however, it’s always within a few frames of transitioning from having only a menu (and 3D background) to having the whole game world rendering (note that this transition does not involve loading a new scene; it just changes the existing one). Also note that this crash only occurs in 64bit (or at least, we haven’t been able to make it happen in 32bit yet), so we always have to do a build, as it won’t occur in the editor. It seems to occur more frequently when we select a higher quality setting from the dropdown in the launch screen, but because of the randomness of the crashes, we could be misinterpreting this. It doesn’t seem to matter whether we build a “development build” or enable “script debugging”, so for our tests we include both options.

In the cases where this crash isn’t triggered in the first few frames of the world rendering, it seems that we can fully play the game, with all features, for as long as we would normally expect.

Looking at the output_log.txt that’s generated by Unity after a crash doesn’t seem to provide us with any useful information:

  • All previous messages in the log file, as far as we can tell, are to be expected with normal operation of the game.

  • The stack trace shown at the end of the log file is different pretty much every time. There are a few areas of code (not specific lines, more like “modules”, or sometimes certain thirdparty libraries) that show up frequently, and what they seem to have in common is that they deal with serialization, sometimes for the network and sometimes for the HDD. Since serialization can be slow (especially to the HDD), we’re somewhat inclined to chalk this frequency up to the fact that more milliseconds are spent in these modules than in the others that show up in the stack trace much less often. We’re not ruling out a problem related to serialization and/or these modules, though. Something else to note is that the stack trace sometimes shows a full stack that is completely outside of our own (and thirdparty) code, like this:

    (0x0000000004CC844F) (Mono JIT code): (filename not available): (wrapper runtime-invoke) object:runtime_invoke_void__this__ (object,intptr,intptr,intptr) + 0xdf (0000000004CC8370 0000000004CC855C) [0000000003D26D48 - Unity Root Domain] + 0x0
    (0x000007FEECCD3662) (mono): (filename not available): mono_set_defaults + 0x2b8e
    (0x00000000FFFFFFFF) ((module-name not available)): (filename not available): (function-name not available) + 0x0
    (0x0000000003D26D48) ((module-name not available)): (filename not available): (function-name not available) + 0x0
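
Since the trace differs on every crash, one way to check whether the serialization modules really show up more often is to tally module names across several saved logs. The following is only a minimal sketch in Python: the regex is an assumption based on the frame layout of the four lines above, and `tally_modules` is a hypothetical helper, not anything Unity provides.

```python
import re
from collections import Counter

# Assumed frame layout, taken from the trace above:
#   (0xADDRESS) (module): (filename ...): function + 0xOFFSET
FRAME = re.compile(r"^\s*\(0x[0-9A-Fa-f]+\)\s+\((.*?)\):")

def tally_modules(log_text):
    """Count how often each module name appears in stack frame lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The four frames quoted above, used here as sample input.
sample = """\
    (0x0000000004CC844F) (Mono JIT code): (filename not available): (wrapper runtime-invoke) object:runtime_invoke_void__this__ (object,intptr,intptr,intptr) + 0xdf (0000000004CC8370 0000000004CC855C) [0000000003D26D48 - Unity Root Domain] + 0x0
    (0x000007FEECCD3662) (mono): (filename not available): mono_set_defaults + 0x2b8e
    (0x00000000FFFFFFFF) ((module-name not available)): (filename not available): (function-name not available) + 0x0
    (0x0000000003D26D48) ((module-name not available)): (filename not available): (function-name not available) + 0x0
"""

for module, count in tally_modules(sample).most_common():
    print(module, count)
```

Feeding it the concatenated text of many crash logs would give a rough picture of which modules dominate, although a high count can still simply mean that more time is spent there.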

Normally we use Visual Studio and UnityVS to code/debug, however that combination seems spotty at best for attaching to a running Unity standalone process, so we’ve tried attaching MonoDevelop to the process instead. As a sanity check, we experimented with breakpoints and exceptions to make sure MonoDevelop properly detects/breaks when they’re hit, and it does. However, when we manage to trigger this crash, MonoDevelop doesn’t detect anything, and just detaches from the process.

The seeming randomness of the stack traces is leading us to believe that this may be some sort of race condition to do with multithreading. In those cases, however, we’d normally expect the stack trace to show where the offending thread is sitting when it causes the crash, so we’re not fully confident in that assessment.

One thought we had was that maybe Unity 4.5 had introduced some problem, so we tried doing a 32bit build with it, but it doesn’t seem to ever exhibit this crash, so we’re relatively certain it’s tied to 64bit. Unfortunately, since we’re relying on Unity 4.5 for the 64bit fix to the crash involving a ray cast and a collider, we can’t try a previous version of Unity to build in 64bit and test with that. (At least, not easily- we have a huge project, and it will probably take multiple days, at best, to find/remove any code so we don’t encounter the raycast/collider crash)

We’ve taken to stripping out various modules of our project one at a time, rebuilding, and then repeatedly launching the game up to the point where the crash might occur. We repeat this until we either get a crash or feel we’ve spent too much time trying (and thus are relatively confident there is no crash), so we can try to determine where the crash is coming from. Needless to say, this is taking a long time, and the method can only tell us definitively that the crash is STILL THERE, never that it is GONE, which makes these tests less than ideal.
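
To put a rough number on "spent too much time trying", the rough 1-in-8 estimate from earlier can be turned into a count of consecutive clean launches. This is only a sketch in Python, and it assumes each launch is independent and that the crash rate would still be about 1/8 if the bug were present; the function names are made up for illustration.

```python
import math

def clean_runs_needed(crash_rate, confidence):
    """Consecutive crash-free launches needed before the chance that the
    crash is still present but hid by luck drops below (1 - confidence)."""
    if not 0 < crash_rate < 1 or not 0 < confidence < 1:
        raise ValueError("crash_rate and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - crash_rate))

def chance_of_missing(crash_rate, runs):
    """Probability of seeing no crash in `runs` launches if the bug remains."""
    return (1 - crash_rate) ** runs

print(clean_runs_needed(1 / 8, 0.95))  # 23
print(clean_runs_needed(1 / 8, 0.99))  # 35
```

So under those assumptions, roughly 23 clean launches in a row gives about 95% confidence that a stripped module contained the crash. If removing a module only lowers the crash rate instead of eliminating it, more launches would be needed.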

I think there are other things we’ve tried, but I can’t think of them off the top of my head – I’ll add more as I think of them.

Does anybody have any other thoughts on how we might try going about fixing this problem?

Yes, but you won’t like them.

When faced with Mono misbehavior I’ve basically had to binary chop my way through my app, eliminating scripts until I found the one that caused it.
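
The binary chop could be sketched roughly like this in Python. It assumes there is exactly one culprit and that scripts can be disabled independently of each other; `crashes_with` is a hypothetical callback you would have to supply (build with that subset included, run it, report whether it crashed), and the script names are invented.

```python
def find_culprit(scripts, crashes_with):
    """Narrow a list of scripts down to the single one that triggers the crash.

    crashes_with(subset) is a hypothetical callback: it should return True
    if a build that includes `subset` (and leaves out the other half under
    test) still crashes. Assumes exactly one culprit exists.
    """
    candidates = list(scripts)
    if not candidates:
        raise ValueError("need at least one script to test")
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the crash.
        candidates = first if crashes_with(first) else second
    return candidates[0]

# Invented example: pretend "SaveGame" is the bad script.
names = ["Menu", "Network", "SaveGame", "Terrain", "Audio"]
print(find_culprit(names, lambda subset: "SaveGame" in subset))  # SaveGame
```

With an intermittent crash, the callback would need to launch each build many times before answering False, otherwise the chop can go down the wrong half.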

Somewhere out there I once saw a script line that would halt compilation. Basically it just forced an error on compile to stop the compile at that point.

This isn’t really an answer to the question I posed, but it’s the answer to the crash I was getting, and it’s so obscure/impossible to detect that I’m hoping my marking this as an answer will save somebody else the 2 weeks it took me to track it down.

Ultimately this all came down to substances. The solution to our crash was just to delete all of our individual substances and databases. We had been planning to remove them all anyway, so we just blindly removed them all from our project. What we think the actual problem is (unconfirmed!) is that earlier in the project we had been using the “old format” of substances (from before they were integrated directly into Unity), and we suspect some of them were still sticking around and not being friendly with our 64bit build.