tekio

Great explanations of AMD vs Intel Architecture

6 posts in this topic

Got bored and was wondering more about why Intel's single-thread performance so greatly spanks AMD's. Found this thread on Reddit that does a pretty good job of explaining it:

https://www.reddit.com/r/AdvancedMicroDevices/comments/3ecrsj/eli5_why_does_amd_single_thread_performance_suffer/

 

Cheers! Now we know what "AMD optimized" and "Intel optimized" really mean! :-)


Well, Zen is pretty similar to Intel's design, whereas Bulldozer (and its children) was totally different: like a David core next to Intel's Goliath core. A Zen core is functionally about 1.5 of Bulldozer's integer cores plus the module's entire FPU, beefed up and changed a bit, along with some kind of cheap SMT. It's going to be much faster single-threaded, and on the back end it should still perform well through higher utilization, switching in the other queued work any time there's a stall, a fetch, or whatever.

 

Bulldozer was only good if you loaded every thread to the max so that it would use all the units. In any kind of mixed workload that went single-threaded, Intel would pull ahead (say, the 3770K vs. the 8350), but in highly parallel workloads like transcoding, ray tracing, encryption/decryption, and file compression/decompression, AMD throws down.
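
That trade-off is basically Amdahl's law. Here is a minimal Python sketch of it. The core counts and per-core speeds are made-up numbers chosen for illustration, not benchmark results for any real chip:

```python
def run_time(parallel_fraction, cores, per_core_speed):
    """Relative time to finish one unit of work (Amdahl's law).

    parallel_fraction: share of the work that can be spread over all cores
    cores:             number of cores available
    per_core_speed:    single-thread speed relative to a baseline core
    """
    serial = (1.0 - parallel_fraction) / per_core_speed
    parallel = parallel_fraction / (per_core_speed * cores)
    return serial + parallel


# Hypothetical chips: "wide" has many slower cores, "fast" has fewer quicker ones.
WIDE = {"cores": 8, "per_core_speed": 1.0}
FAST = {"cores": 4, "per_core_speed": 1.5}

for p in (0.0, 0.5, 0.9, 1.0):
    wide = run_time(p, **WIDE)
    fast = run_time(p, **FAST)
    winner = "wide" if wide < fast else "fast"
    print(f"parallel={p:.0%}  wide={wide:.3f}  fast={fast:.3f}  -> {winner}")
```

With these invented figures the wide chip only pulls ahead once a bit under 90% of the work runs in parallel, which is the "only good if you load everything" behaviour described above.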

 

It sits somewhere between a regular Intel core (or a Phenom/Athlon II) and a Xeon Phi: kind of a weird compromise.


Yes. At work, I'm using an 8350 running Linux natively, with one or two copies of Windows in VMs. It actually does really great. There is always a Windows VM running.

 

At the same time I'm running about 4 instances of Chrome in Linux (use it for all personal stuff at work like browsing porn, reading the news, and ordering pizza). GIMP and an Ubuntu server running OpenStack for PHP stuff are always running too. Average uptimes are in the weeks.

 

It really does run smooth. Even when I need to load vSphere in a VM for quick test scenarios.  Was gonna get one at home, but decided to wait for the Zen architecture to come out first. 

 

EDIT: also built stations for the Graphics Department since we started posting corp. video on YouTube. Most of our designers are interns and use a Mac at school. They love the AMD stations with Win10. With the money I saved over iMacs/Mac Pros I was able to get them two monitors and beefy hardware. So the 8350 is a workhorse for editing video as well... Apparently, Adobe makes good use of multiple cores.

 

 

EDIT2: AMD still suffers on some multi-core operations unless optimized.  We did have this discussion before.

 

Example:

 

Besides having only one floating point processor per 2 cores, it only has one L2 cache pipeline per 4 cores. So... if the application is not designed with AMD in mind, setting affinity to the correct cores for, say, dual threads, it can still suffer the same CPU stall it gets in single-threaded operations. So with AMD the developer needs to "optimize" to use cores efficiently:

 

Cores 0, 1, 2, 3 == share L2 cache
Cores 4, 5, 6, 7 == share L2 cache

 

So... when "optimizing" for AMD, care would need to be taken to use cores 0-3 and 4-7 differently from an Intel 8-core CPU. Otherwise the cache pipeline (or lack of one) can cause CPU stalls while pulling data from memory.
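
The grouping above is from memory, and AMD's published specs for the FX-8350 list the L2 as shared per two-core module, so it is worth checking on real hardware. On Linux the kernel exposes which CPUs share each cache under sysfs. A rough Python sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cache/indexM/` layout (the helper names `parse_cpu_list` and `cache_groups` are made up for this example):

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cache_groups(level, base="/sys/devices/system/cpu"):
    """Return the distinct sets of CPUs that share a cache at `level`.

    Reads Linux sysfs; returns an empty list where those files are missing.
    """
    groups = set()
    pattern = os.path.join(base, "cpu[0-9]*", "cache", "index*")
    for index_dir in glob.glob(pattern):
        try:
            with open(os.path.join(index_dir, "level")) as f:
                if int(f.read()) != level:
                    continue
            with open(os.path.join(index_dir, "shared_cpu_list")) as f:
                groups.add(tuple(parse_cpu_list(f.read())))
        except (OSError, ValueError):
            continue
    return sorted(groups)


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(f"L{level}:", cache_groups(level))
```

Each tuple printed for L2 is one group of logical CPUs that actually share an L2 on the machine it runs on.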

 

I guess the best way to describe it is: Intel does less work efficiently, while the 8350 does more work faster. The end outcome depends on the work to be performed. Like a sports sedan versus a pickup: it depends on the work that needs to be done daily.

Edited by tekio

Correction to my post above: a developer cannot choose how threads spread across the CPU. That is left up to the operating system for the most part. 
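
"For the most part" is the key bit: the scheduler picks the core by default, but on Linux a process can at least restrict which cores it is allowed to run on. A minimal sketch using Python's `os.sched_setaffinity`, which is only available on some platforms (the wrapper name `pin_current_process` is invented for this example):

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to `cpus` and return the resulting set.

    Returns None on platforms without os.sched_setaffinity
    (for example Windows and macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpus))   # pid 0 means "this process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        print("currently allowed CPUs:", sorted(os.sched_getaffinity(0)))
```

The OS still decides where each thread runs inside the allowed set; this only narrows the choices.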


Well, the L2 cache was massive (in comparison, 2 MB vs. 512 KB), but that was just trying to minimize the performance hits from cache misses, mispredictions, etc. It has really high latency because it is more complicated. The later revisions, Steamroller/Excavator, made definite improvements in scheduling/utilization of both threads per module, but those never made it to the AM3+ platform, supposedly because the 28nm process wasn't as good for that as the 32nm SOI.

 

Not sure how the FPU works exactly, but for older stuff (non-AVX/XOP, basically) a core should be able to use just a single one of the 2 FMACs, whereas the Phenom II just couldn't do AVX at all, only the 128-bit and earlier operations.

 

But depending on the pricing, Ryzen could be pretty cool, given that an i7-990 (Nehalem refresh/die shrink, 6 cores) can render faster than an i7-6700K.

 


But 4 cores are sharing the same L2, which cannot be split up per core. So that's why AMD increased the clock speed to 4 GHz and even faster in some cases. It is not pulling data through the pipeline efficiently. If each core had its own L2, data could be flowing through the cache pipeline as each EX core processes data. Instead, it needs to wait for the cache pipeline to empty. This is what people mean when they say, "Intel does more work per clock cycle".
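
To put "more work per clock cycle" in numbers: single-thread throughput is roughly instructions per cycle (IPC) times clock speed. A tiny Python sketch with invented IPC figures, only to show the arithmetic:

```python
def instructions_per_second(ipc, clock_ghz):
    """Single-thread throughput = instructions per cycle * cycles per second."""
    return ipc * clock_ghz * 1e9


def clock_needed_ghz(target_ipc, target_clock_ghz, own_ipc):
    """Clock a lower-IPC core needs to match a higher-IPC core's throughput."""
    return target_ipc * target_clock_ghz / own_ipc


# Illustrative, invented figures: a core retiring 2.0 instructions per cycle
# at 3.5 GHz versus a core retiring 1.4 instructions per cycle.
print(clock_needed_ghz(2.0, 3.5, 1.4))
```

With those made-up values the lower-IPC core would need about 5 GHz to keep up, which is why raising the clock only goes so far.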

 

I think you may be a little confused about how L2 Cache works:

disk->Memory->L2 Cache->L1 Cache-> Execution Cores.

 

So with a non-shared L2 cache per EX core, data flows through that pipeline unfettered by the other cores' activity. The core is not waiting to pull from memory or looking for cache hits that will not exist because another core is using the cache.
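
The usual way to reason about that chain is average memory access time: hit latency plus miss rate times the cost of going one level down. A minimal Python sketch, where every latency and hit rate is an invented illustration value, and the lower hit rate for a contended L2 is an assumption rather than a measurement:

```python
def average_access_time(levels, memory_latency):
    """Average memory access time for a cache hierarchy, in cycles.

    levels:         list of (hit_latency, hit_rate) tuples, ordered L1 first
    memory_latency: cost of going all the way to main memory
    """
    # Work backwards: the cost of a miss at one level is the average
    # cost of the level below it.
    cost = memory_latency
    for hit_latency, hit_rate in reversed(levels):
        cost = hit_latency + (1.0 - hit_rate) * cost
    return cost


# Invented latencies (cycles) and hit rates, purely for illustration.
l1 = (4, 0.95)
l2_private = (12, 0.80)
l2_contended = (12, 0.50)   # assumption: sharing lowers the L2 hit rate

print(average_access_time([l1, l2_private], memory_latency=200))
print(average_access_time([l1, l2_contended], memory_latency=200))
```

The second figure comes out noticeably higher, which is the stall effect being argued here, though how much contention really costs depends on the workload.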

 

Do you see how optimizing for this architecture can help?  It does have an L3 cache shared by all cores, but that does not stop CPU Stall:

AMD L2 Cache Architecture

Disk -> Memory -> L3 Cache
    L2 Cache ->
        L1 Cache -> Core 00 (ALU)  |
        L1 Cache -> Core 01 (ALU)  |  FPU
        L1 Cache -> Core 02 (ALU)  |
        L1 Cache -> Core 03 (ALU)  |

    L2 Cache ->
        L1 Cache -> Core 04 (ALU)  |
        L1 Cache -> Core 05 (ALU)  |
        L1 Cache -> Core 06 (ALU)  |  FPU
        L1 Cache -> Core 07 (ALU)  |

 

 

So... one EX core is processing a long algorithm and is at 100%. CPU stall hits: three of the 4 cores are stalled because one EX core is processing and piping data through the cache pipeline, and the others need the same L2 cache to get data from memory (remember, only so much can pipeline per clock cycle on a shared L2 cache). To make this worse, AMD does not report threads through CPUID the way Intel does. Thus Windows doesn't efficiently place threads on each core, further compounding the problem.

 

Intel L2 Cache Architecture:

Disk -> Memory -> L3 Cache
    L2 Cache -> L1 Cache -> Core 00 (FPU and ALU) x2 threads
    L2 Cache -> L1 Cache -> Core 01 (FPU and ALU) x2 threads
    L2 Cache -> L1 Cache -> Core 02 (FPU and ALU) x2 threads
    L2 Cache -> L1 Cache -> Core 03 (FPU and ALU) x2 threads

 

Intel: more cache hits, since each core is simultaneously executing two threads, and no CPU stall. Each core does not depend on another finishing before it can pipeline more data. Also, most operating systems schedule threads efficiently for this layout.

 

 

Throwing gobs of shared L2 cache and higher clock speeds at the problem looks good on paper, and is good in some respects, but there is a reason most servers use Intel Xeons. I do like AMD myself, for the record. :-)

 

AMD simply tried ramping up the clock speed to make CPU stall less noticeable.

 

Does this make sense, bro? 

 

EDIT:

 

AMD is like the P4, which was kind of a fail: higher clock speeds and relying on cache hits. But those are seldom going to happen with 4 cores sharing L2. There is a higher probability of CPU stall, orders of magnitude higher, and it shows in benchmarking a lot.

 

Intel was still behind AMD when the Pentium D came out; it was a multi-core P4. Then the Core 2 Duo came out, which adopted AMD's L3 cache, and Intel finally moved to the current pipeline, which more closely resembles the PIII or the P4 before Prescott. Prescott was really when Intel fell behind AMD for a bit. I cannot remember the specifics of the older Pentiums, but the 3 and the first 4 were nice. Then it got killed with Prescott. Then Intel finally broke out with the Core 2 Duo architecture. :-)

 

 

 

 

Edited by tekio
