Jump to content

Math cos/sin.


OvermindDL1

Recommended Posts

On the MCF board of http://www.minecraftforum.net/topic/544429-110100fishtacos-mods-better-caves-fps-height-mod/ there is a mod called fps++, he 'claims' that:

This mod switches notch's sin and cos methods in MathHelper with Math's sin and cos, it seems to increase FPS by 33%, and doesn't decrease the quality of the game at all.

 

PLEASE NOTE: This will only increase FPS if CPU is what's limiting minecraft. It will also increase performance when there are many entities in the world.

 

He has two versions, one that does the above, and another that does:

This version reinstates the lookup table, but downsizes it to make it stay in the l2 cache more consistently. In theory, this makes sin and cos less precise, but in practice, everything feels right, leave a post about your experience with it. If it's acting up, you can install the other one right over it.

 

Does anyone have experience with this?  Is it true for CPU bound code (33% seems *way* excessive)?  I do not in any way have a low end CPU so I have not been cpu constrained, but does anyone have experience with this mod?  Would it be worth having something in Forge use the asm library or so to change those calls on class loading if a mod is able to do that?

Link to comment
Share on other sites

Owh, you shouldn't post these questions. They make me curious, and make me look into the shit on a very deep level .. not trusting how it appears in FPS and stuff, no, benchmarking these functions and see what they do in Java Bytecode. The results? Both surprising (to me) and funny.

 

 

DISCLAIMER: I am a C/C++ programmer who recently took the interest in Minecraft modding; I do know Java for years, but never did much in it. I might have made stupid mistakes in my benchmarks or related code. That said, I doubt I did ;)

 

 

First off, I am not the kind of person who believes in claims like '33% more FPS'. Those numbers are used by PR people, and are more than often complete bullshit. Who's FPS? Under what conditions? And how much will I notice of this? So .. I hope I will give you some results of a more precise order, and I hope it helps you a bit.

 

Let's start off by cutting the bullshit away. The part about the l2 cache, I can be very short in that: nobody can ever prove that claim, and it is just that: a random unfounded claim. On modern desktops there are various of cores which might or might not have shared caches. You run various of applications, which might or might not be doing data intense operations. You, as programmer of an application, have no control what so ever of what goes into the cache and what not. This is a good thing. There are things at work that are much smarter than a programmer can be, in regards to caching or not. So if a solution like making an array shorter improves performance, in a consistent matter, over multiple machines, it is very unlikely you can name L2 cache as reason. And if it would, it would only work on a very small subset of all machines running Minecraft. But of course he does notice a performance gain. So what is really going on?

 

 

What I first tried, was to benchmark the Mathhelper.sin and Math.sin. ( added http://pastebin.com/MeskiB1C in MathHelper.java ; mind you, I changed 'float' to 'double' in MathHelper.sin; this was purely for benchmarking purposes (to compare Math.sin and MathHelper.sin fairly)) The performance difference? 10 to 1. Math.sin is 10 times faster than MathHelper.sin. That is a huge difference. So why did Notch make such caching system, if it doesn't help performance? I am sure he would have benchmarked it. That immediately hints to me there is more at play then the results of my machine and my JRE. So let's dig on.

 

 

These days, all x86 FPUs have hardware support for sin/cos (fsin/fcos, added in 80387, 1987). One thing you can be sure of: it is a damn fast operation. From my understanding only, Sun's JRE 1.5 didn't really use this instruction. Information is a bit fuzzy here, but from what I understand it is not accurate enough for Java's standard outside the range of -1/4 PI - 1/4 PI.

 

In Sun's JRE 1.6 this got solved. As you might understand now, on certain systems with JRE 1.5 you get a piss poor performance of your Math.sin. Caching these values is very wise and give you a very good performance boost. In JRE 1.6? Not so much. In fact, caching values makes the performance suck balls.

 

 

This made me wonder, as I come from a C background: how can a table cache be slower than a FPU call? FPU calls are in general considered expensive, so .. what is going on? The answer is right in your face: table cache and more specifically: array bounds checking. In Java it is very hard to request an element from an array without the system checking if the value is out of bounds. Even if you know 100% sure it never will, it checks 'just in case'. This is one of the principals of Java: never allow buffer overflows. It is great. But not for performance. In result, the FPU call is in many cases much quicker than a table call. Which to me is silly, as in C it would have been the other way around.

 

So .. can you disable array bounds checking in Java? And would that make MathHelper.sin faster? It seems that it is not possible to disable the checking in any clean fashion. You can, however, access memory directly in Java via 'Unsafe' stuff. It is very hard to get this to work, and very ugly, but possible nevertheless. I did some benchmarks on this, cheating along the way to get it to work, as I wanted to know what the difference would be between Math.sin and a cache like MathHelper.sin if it wouldn't do the array bounds checking. The results:

 

Current MathHelper.sin: 500 'arbitrary units'.

Current Math.sin: 44 'arbitrary units'.

MathHelper.sin without bounds checking: 40 'arbitrary units'.

 

It is a tiny bit faster (10%, call it tiny), but it is not worth the hassle and ugly code that comes with it. Using Math.sin should be more than sufficient, and as a bonus increases the accuracy too ;)

 

 

Which brings me to the conclusion. Why did Notch implement a cache? I can think of 3 reasons: he was either using Sun's JRE 1.5 and noticed a huge performance loss there, or he was reading articles from 1985 (where it was common to cache your sin/cos for performance gain), or he was thinking in a C-like fashion. How ever it might be, all benchmarks (and all threads about it on the web) clearly indicate that these days it is better to use Math.sin over any custom made version. Something often called: premature optimizations. Trust your compiler to do the right thing, in many cases it will.

 

 

 

To close up: all the benchmarks I ran on the MathHelper.sin function clearly shows that using Math.sin with Sun's (Oracle's?) JRE 1.6 increases the speed of that function dramatically. So the author of that quote seemed to be spot-on, although an increase of 33% in FPS sounds far fetched. But I cannot disprove it really :)

 

However, such change does require some extensive testing on other machines / OSes before making such changes in for example MCForge; Notch is not an idiot, so he most likely did it for a good reason. Maybe ask him? :)

 

 

tl;dr: the author of the FPS++ seems to be on to something; with Sun's JRE 1.6, Math.sin is a lot quicker.

 

 

PS: I could find no evidence that changing the size of the cache table helps / changes anything. It did nothing for me on this level of benchmarking.

  • Like 1

The only thing necessary for the triumph of evil is for good men to do nothing.

Link to comment
Share on other sites

Owh, you shouldn't post these questions. They make me curious, and make me look into the shit on a very deep level .. not trusting how it appears in FPS and stuff, no, benchmarking these functions and see what they do in Java Bytecode. The results? Both surprising (to me) and funny.

 

 

DISCLAIMER: I am a C/C++ programmer who recently took the interest in Minecraft modding; I do know Java for years, but never did much in it. I might have made stupid mistakes in my benchmarks or related code. That said, I doubt I did ;)

Made me curious as well, but I did not have the time to delve in to it.  ;)

 

Also do C/C++ for a few decades here as well, I have some very nice statistical testing scaffolds that would be perfect for testing code such as this (it tests jitter of memory access, loading of various parts of memory and other things, then tests the functions based on what it found out of the cpu/memory/etc access to within a statistical accuracy, it can take a *very* long time to run, but is very accurate, found it in the boost sandbox repository years ago, made a few minor edits since).

 

Thank you very much for the responses and your research!

Link to comment
Share on other sites

Alright, Well first off I would like to thank you for what appears to be the most intelligent and well thought out/researched post to ever come across this forum. Makes me happy to see that at least some people know what they are talking about.

 

A few months ago when someone first came to me with this I did basically the same bench marking you did, got similar results, but I didn't bother to look into the JVM side of it so had no real explanation for it.

So I was hesitant to do anything and then well, it just slipped through the cracks.

 

Mojang has to care about widespread compatibility, they target Minecraft for the 1.5 JVM due to old macs. We at Forge don't care about them, 1.5JVM makes up a whopping 2% of the Minecraft userbase. And 0% of the modding userbase. So we target 1.6 JVM. So if it's a compatibility issue with 1.5, and it's such a increase, I see no reason not to added it in Forge.

 

I would be really interested in a large dataset that we could see if this is something that would be worth adding to Forge, so if your curiosity is piqued, think you would be up to it?

I propose that you create a small application to do these benchmarks and I can figure a way to distribute it to Forge users and hopefully get a large dataset to work with.

 

If Overmind would be willing i'd like the two of you to get together and whip something up. Overmind is the one in control of everything related to this site. He would be the one to speak to regarding setting up the server side of any stats gathering.

 

Whats the researchers mantra? 'Data, Data, Data!'

 

 

I do Forge for free, however the servers to run it arn't free, so anything is appreciated.
Consider supporting the team on Patreon

Link to comment
Share on other sites

Hmm, that for sure would be one way to establish if it would be faster or slower for most of the users. I am not against doing that, so I will drop by on IRC soon and we will see what we can do.

More important, I was puzzled why there was so little documentation on something that impacts performance as much as this. So .. some more digging. (this reply is mostly intended for the curious, like myself, and for documentation purposes)

 

 

Most documentation on this comes from the bug tracker, funny enough. An interesting read is:

 

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857011

 

Here they basically explain this:

 

In JRE 1.3 Math.sin was quick, as it uses x87 fsin. But fsin has limitations. It is only accurate (by some standard) between -1/4 PI and 1/4 PI, and only accepts inputs ranging from - 2^63 .. 2^63. Java specifications define a much large range, and more accuracy. So, in 1.4(.1) they 'fixed' this bug by removing the fsin call, and calling StrictMath.sin instead.

 

StrictMath is the Java implementation of most of the Math stuff. It uses the FPU, but avoids calls like fsin, ftan, ... Even in latest Java, you can still call this if you like, and you can be sure that results get from these functions carry correctly over all platforms and are very precise.

 

So, how wrong is fsin() outside his range of accuracy? Well, somewhere along the 6th digit it starts to differ. Nothing huge, but .. enough for applications who care to notice. Unacceptable by Java standard, so instead of trying to fix it nicely, they just bypassed it and went for a lot slower solution, but always accurate.

 

In result, 1.4 (and 1.5) use this dead-slow approach. If you read the Java source for 1.6 (and 1.7), you still see that Math.sin calls StrictMath.sin. But .. this is just a fallback. What came along is JNI (Java Native Implementation), in the form of Hotspot. It overwrites Math.sin for a native implementation. For x86 this calls fsin when you are in the -1/4 PI - 1/4 PI range, and falls back to the native case when you are not (the StrictMath.sin).

 

Sadly, Hotspot has a lot of assembly, and is therefor useless for platforms like Linux. Along comes Icetead, a Hotspot alternative without any assembly, which restores this correctly. I haven't checked the source of Icetead (I did of Hotspot), but it should also implement Math.sin in a similar way.

 

What just surprises me, and underlines my personal dislike for Java, is that these kind of changes mostly went undocumented, while it is of huge importance if you care about performance in any way. But, end good all good, I am happy recent JRE + JNIs solve this correctly. :)

 

 

So, for a benchmark, several things should be tested for / harvested:

 

JRE version

JNI version (if possible?)

Platform (Windows, Linux, ...)

CPU set (x86, ...)

Speed of Math.sin

Speed of StrictMath.sin

Speed of MathHelper.sin (for reference, I guess)

 

The difference between Math.sin and StrictMath.sin should indicate if that JRE/JNI/Platform/CPU is accelerated by the x87 instruction fsin.

 

 

PS: sorry for derailing this topic from a normal survey to this :P

The only thing necessary for the triumph of evil is for good men to do nothing.

Link to comment
Share on other sites

You just listed many reasons I do not care for the JVM.  ;)

 

If you can make a testing library for those, along with a standalone running wrapper (just a main that calls the library), along with source of course, then I can set up something on this server for your program to submit the results to.  You might also want to reports what version of SSE is supported as well since it is possible that that JIT could use SSE calls to change the outcome and/or precision.  Is there anything else that should be reported?

 

Oh no derailing at all, this was actually my best hope of an outcome, proper research.  :)

 

On an unrelated note, I love your posts!  I usually post large and detailed posts as well, but people complained to me long ago in the past so I have tried to hold back for a few years now (not always successfully), but I love the posts as they give lots of information.  So no worry about holding back here, others can shorten it if necessary.  :)

 

EDIT1:  I should probably see if I can dig up that statistical testing library in C++, if it can be ported to Java (it is not 'that' large) then it would be able to detect and bypass the jitter introduced by jitting and other things such as the cache warming up and so forth.  Remind me later.  And I am generally always in IRC, ping me and if I am around then I will respond, else I will respond later.  Actually there might already be such a testing library for java, might try searching for one.

 

EDIT2:  Actually I think I did find one for java with a quick search, seems very similar to the C++ version I used, with a few missing features but those should not really be needed:  http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html

The code and such is at http://www.ellipticgroup.com/html/benchmarkingArticle.html but read the article first, and he stresses hard:  the only way to do pristine benchmarks is to do but a single benchmark per JVM session.

 

And also we should have some unique identifier per computer so as if someone runs one multiple times under different load we can calculate the results from that one computer better without one person overwhelming the results.

 

EDIT3:  Also, if we get a good testing framework and reporting system set up then we could use it to benchmark other things as well, like Lex's new event system compared to MC's seemingly much worse one.

Link to comment
Share on other sites

Haha, I know those comments about too large post far too well :) But tnx, nice to know some people do appreciate them :) I indeed also found the IBM benchmark tool, and boy, it is a good one.

 

The latest results: I have been benchmarking this wrong all along. It is very sad, but I have to conclude that FPS++ doesn't really help. Or if it does, it is very local. To explain:

 

 

I did my benchmarks in the static initializer of a class. Nobody told me you had to be careful with them, and no documentation suggest you should. I am unsure, but it seems that if you call a function in the class of the static initializer, it imports the class/function again. Anyway, time goes up non-linear with it, and test results are very wrong. Worst of all: the JIT never kicks in. So most of my benchmarks earlier can be put in the bin!

 

So, let's benchmarks outside the static initializer. This give completely different results. When you do just 1000 loops or so, Math.sin is faster. But when you go over it, MathHelper.sin starts to get faster. A lot in fact. In fact, so much faster that you should want to use MathHelper.sin. On OverMindDL1's request, I made a .jar which includes the IBM benchmark, prepared with 3 tests:

 

MathHelper.sin, with a table size of 65536.

MathHelper.sin, with a table size of 1024.

Math.sin.

 

It is available here: https://dl.dropbox.com/s/be86maw8ltqg6ed/MathSin.jar (source is included, takes 5+ minutes to run! Includes the IBM Benchmark Framework, I hope they don't mind 'distributing' it like this :P Awesome framework btw! Really impressive ...)

 

The result:

 

MathHelper65536.sin(): first = 19.566 ms, mean = 38.262 us (CI deltas: -84.409 ns, +106.498 ns), sd = 67.597 us (CI deltas: -13.994 us, +18.349 us) WARNING: EXECUTION TIMES HAVE EXTREME OUTLIERS, SD VALUES MAY BE INACCURATE
MathHelper1024.sin(): first = 1.947 ms, mean = 36.361 us (CI deltas: -53.842 ns, +55.031 ns), sd = 38.650 us (CI deltas: -7.022 us, +10.210 us) WARNING: execution times have mild outliers, SD VALUES MAY BE INACCURATE
Math.sin(): first = 1.173 ms, mean = 684.942 us (CI deltas: -1.077 us, +1.178 us), sd = 201.183 us (CI deltas: -33.190 us, +59.616 us) WARNING: execution times have mild outliers, SD VALUES MAY BE INACCURATE

 

As you can, the second is fastest, closely followed by the first, and leaving the last far behind. Why? Well, as it turns out the JIT, the part that converts Java bytecode to more native code, only kicks in after a block has been called 1500 times with '-client', and 10,000 times with '-server'. I knew this was already in place, but I never knew it would hit this hard. So, when you do some specific benchmarking, Math.sin is faster for the times the code is not under JIT, and slower when it is. These results all tested under JRE 1.6.

 

It doesn't dismiss the issues earlier. JRE 1.4 and JRE 1.5 did much worse in Math.sin, so there benchmarking would have shown a win for MathHelper.sin immediately. Now it was hiding under the control of the JIT. I really don't like language where I am not in control .. but so be it.

 

 

In conclusion: MathHelper.sin is in fact faster, and I cannot find any statistical evidence FPS++ helps. The only reason I can see it helps, is when the JIT cannot execute, or on cold system. Also decreasing the table size seems not to get any statistical support.

 

 

PS: this is under the assumption the JIT will kick in sooner or later for this function, which is very very likely.

The only thing necessary for the triumph of evil is for good men to do nothing.

Link to comment
Share on other sites

I fully agree that my benchmark might work on the cache too much. And possibly other optimizations Java does. So, I changed the benchmark a lot to avoid most of that, I hope.

 

First, I used reflection to call sin() on the various of test cases. As far as I know Java can't optimize reflection, so it should not be possible to optimize the array calls to something more direct, avoiding the function call or something. At least that is my theory :D

 

Second, every time I call sin(), next thing I do is trash the cache. I did this in a way I would in C-world, so I hope it carries enough in Java: I read a random array of 1024 * 1024 * 2 Integers (this should be between the 8 and 16 MiB or RAM). This should purge out anything in the cache that was there (as most L2 caches are only 2 MiB). Again, at least, my theory.

 

Then running the benchmark gives indeed other results, but still with the same message. MathHelper is faster when the JIT kicks in.

 

I updated the benchmark tool above. Please let me know what you think, or if you have any other ideas / suggestions. The results of this benchmark (on my machine, JRE 1.6):

 

 

class MathHelper65536.sin(): first = 2.731 ms, mean = 74.238 us (CI deltas: -110.049 ns, +137.682 ns), sd = 62.098 us (CI deltas: -13.988 us, +22.066 us) WARNING: EXECUTION TIMES HAVE EXTREME OUTLIERS, SD VALUES MAY BE INACCURATE
class MathHelper1024.sin(): first = 989.552 us, mean = 73.697 us (CI deltas: -260.063 ns, +368.889 ns), sd = 155.644 us (CI deltas: -34.896 us, +72.242 us) WARNING: execution times have mild outliers, SD VALUES MAY BE INACCURATE
class java.lang.Math.sin(): first = 702.065 us, mean = 187.960 us (CI deltas: -1.187 us, +2.028 us), sd = 544.618 us (CI deltas: -183.904 us, +252.559 us) WARNING: EXECUTION TIMES HAVE EXTREME OUTLIERS,
execution times may have serial correlation, SD VALUES MAY BE INACCURATE

 

As you can see, MathHelper is now only twice as fast.

 

PS: please do not compare these results with earlier results. The range of the test changed, and therefor the average execution time did.

The only thing necessary for the triumph of evil is for good men to do nothing.

Link to comment
Share on other sites

I wonder how actually changing it in MC would do, would it affect FPS at all...  Should be able to change it in only one place at least...

 

The JIT definitely has a high computation cost, those 'first' readings are high.  I am curious how MathHelper is faster on these now, what else is in effect.  Anyone have any ideas?

Link to comment
Share on other sites

My bet is that Notch did have JRE 1.5 in mind when making the whole MathHelper class.

 

On Minecraft linking against 1.5, right now the only platforms that would be running 1.5 are very outdated servers or PowerPC Macs running Tiger, so linking against 1.6 leaves behind an insignificant amount of people. Bukkit also links against 1.5, but not the MCPC builds.

 

Getting back on topic, I would like it if Forge included a simple patch to redirect MathHelper to the regular Math calls.

Link to comment
Share on other sites

From the sounds of what Over and True are getting, rewriting it to use Math would actually be detrimental as after JIT kicks in everything outperforms in MathHelper.

Interesting, JIT is quite annoying in some cases :P

 

I do Forge for free, however the servers to run it arn't free, so anything is appreciated.
Consider supporting the team on Patreon

Link to comment
Share on other sites

if mathhelper is better after JIT kickes in, is there not a way to force JIT to kick in immediately? (either by running some stuff behind the scenes during game-load or by somehow manually enabling it?)

Being noobish since 96, being dumb since birth!

Link to comment
Share on other sites

Hi.

 

"How ever it might be, all benchmarks (and all threads about it on the web) clearly indicate that these days it is better to use Math.sin over any custom made version. Something often called: premature optimizations." (TrueBrain)

 

It might not be the case if you use this: sourceforge.net/projects/jafama

(has tables but they are small, so memory fetching might not be too large of an overhead)

 

Remarks about MathHelper sin code:

SIN_TABLE[(int)(par0 * 1024 / Math.PI / 2.0f) & 1023]

 

1) There are two divisions. Divisions are expensive, but here you can remove them.

-> SIN_TABLE[(int)(par0 * 1024 * (1.0/(2*Math.PI))) & 1023]

 

2) Operations are done from left to right (if same priority),

    and you have a constant on the right, so you can add parentheses

    to end up with only one operation (HotSpot is smart enough to

    figure out constants).

-> SIN_TABLE[(int)(par0 * (1024*(1.0/(2*Math.PI)))) & 1023]

-> SIN_TABLE[(int)(par0 * (1024/(2.0*Math.PI))) & 1023]

->This should go way faster.

 

3) It would also be fine to only add parentheses since then the divisions

    will only occur once (also, 2.0f will be turned into a double,

    since it divides a double):

-> SIN_TABLE[(int)(par0 * (1024 / Math.PI / 2.0f)) & 1023]

 

Link to comment
Share on other sites

He ended up testing that with some benchmarking software with a library he got from IBM.  The MathHelper is faster in the 'general' case, and different table sizes did not actually make any real difference.

 

The conversion from float to double in the table would probably vanish due to the jit, and even then it would not matter due to the table being generated at class loading time and not runtime.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Unfortunately, your content contains terms that we do not allow. Please edit your content to remove the highlighted words below.
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.



×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.