Caching In: P4 Extreme Edition
Written by Nigel Woodford
Monday, 03 November 2003
Page 5 of 6
Figuring Out Where Everything Is

When each memory address can map to only one location in cache, this is called direct mapping. Things move much faster when the processor doesn’t have to spend time figuring out where the data it needs can be found in cache. With direct mapping, the data will either be at location A, or it isn’t cached yet and will be loaded into location A. This allows the cache to be extremely fast because no cycles are wasted searching for the data. The trade-off is that the cache isn’t flexible in any way: two pieces of data that map to the same location keep evicting each other, even if the rest of the cache is empty.

When data can be put anywhere, the cache is called “fully associative”. This technique allows cache to be extremely flexible because it can make the best use of free space. However, it takes more time for the processor to find what it’s looking for, since every location has to be checked. The compromise solution is for the cache to be “set-associative”, where cache is divided up into small sets that are each fully associative. An address maps directly to one set, but its data can sit in any slot (or “way”) within that set. This way, cache can be flexible in finding free locations but also fast, because the processor only has to search one small set.

The Pentium 4 has an extremely fast L1 cache. Moving data from L1 into registers takes only a few clock cycles, which is achieved by having an extremely small L1 data cache (only 8KB, with instructions held separately in a trace cache of decoded micro-ops) and dividing it into a 4-way set-associative configuration. This allows flexibility, and thanks to the small size and the low number of ways in each set, a lookup is very close to being as fast as in a direct-mapped cache. The Athlon 64 FX, on the other hand, has a whopping 64KB each for instructions and data, 2-way set-associative. This cache is also exclusive, meaning data held in L1 is not duplicated in L2, and the memory controller built into the processor is responsible for memory access, with HyperTransport links handling communication with other chips. It’s been speculated that this large cache size has to do with the processor’s Opteron heritage. Because the Opteron architecture uses the processor itself to achieve multi-processor communication rather than a dedicated controller, it would require some extra breathing room for its cache.
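To make the difference between the mapping schemes concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the 64-byte line size, the least-recently-used replacement policy, and the names `SetAssociativeCache` and `count_hits` are all illustrative choices, not a description of how either processor is actually built. Setting `ways=1` gives a direct-mapped cache, and setting `ways` to the total number of lines gives a fully associative one.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache model (hypothetical, for illustration only)."""

    def __init__(self, cache_bytes, ways, line_bytes=64):
        # line_bytes=64 is an assumption, not a figure from the article.
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = cache_bytes // line_bytes // ways
        # One LRU-ordered collection of tags per set (LRU is an assumption).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def locate(self, address):
        """Return (tag, set_index): the set is fixed by the address."""
        line_number = address // self.line_bytes
        return line_number // self.num_sets, line_number % self.num_sets

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, set_index = self.locate(address)
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # set is full: evict least recently used
        lines[tag] = True
        return False


def count_hits(cache, addresses):
    """Run an access pattern through the cache and count the hits."""
    return sum(cache.access(a) for a in addresses)


# Two addresses exactly one cache size (8 KB) apart, touched alternately.
pattern = [0x1234, 0x1234 + 8192] * 4

direct = SetAssociativeCache(8192, ways=1)
four_way = SetAssociativeCache(8192, ways=4)

print(count_hits(direct, pattern))    # 0: the two lines keep evicting each other
print(count_hits(four_way, pattern))  # 6: both lines fit in the same 4-way set
```

In both caches the two addresses land in the same set; the only difference is that the 4-way set has room for both of them, which is the flexibility the set-associative compromise buys.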
Both the L2 and L3 cache on the P4 are 8-way set-associative. The larger sets at each level increase latency when something needs to be fetched. So while the L1 cache has very little latency, going to L2 takes a little more time and L3 takes even longer. The Athlon 64 FX, on the other hand, has an incredible 16-way L2 cache that is 1 megabyte in size. Because it has twice as much L2 cache as the P4 and twice as many ways in each set, each way works out to the same 64KB on both chips (1MB / 16 = 512KB / 8). The resulting configuration should be about as quick to search, allowing it to have a fast and flexible L2 cache despite its huge size.

What does this huge cache disparity mean for performance? A large cache size allows a larger chunk of data to be moved into it at once, at the cost of flexibility. If the processor is cycling through a large data structure that sits in one block of memory, it can bring that large data structure in all at once. However, it is limited in the number of these chunks that can remain available and easily accessed. A smaller block size allows one to bring data into cache from any location without sacrificing speed. If the proper data can be trickled to a small L1 or L2 cache, the processor can access what it needs immediately and the cache can keep itself flexible by providing choices.
Remember, though, that data must continually be pulled from slow
memory and through the different levels of cache to keep the
instructions and data flowing. Because the Athlon 64 can operate on
64-bit wide data structures, it needs a larger L1 cache to hold these
wider chunks of memory. The AMD architecture can then load chunks all
at once into cache and process them, but can’t access them as quickly
as the P4 can. It makes up for this by adding a huge amount of L2 cache
chunked up into small pieces to make finding what it needs easier.
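
To put numbers on the cache configurations compared on this page, the short sketch below works out how many sets each cache has and how large each way is. The sizes and associativity are the ones quoted in the article (with the P4's L2 taken as half of the Athlon 64 FX's 1MB); the 64-byte line size and the helper name `geometry` are assumptions for illustration, and real line sizes may differ between the chips.

```python
KB = 1024


def geometry(cache_bytes, ways, line_bytes=64):
    """Return the number of sets and the size of one way, in bytes.

    line_bytes=64 is an assumed value, not a figure from the article.
    """
    return {
        "sets": cache_bytes // (ways * line_bytes),
        "way_bytes": cache_bytes // ways,
    }


# (total size, ways) as quoted in the article.
caches = {
    "P4 L1 data": (8 * KB, 4),
    "P4 L2": (512 * KB, 8),
    "Athlon 64 FX L1": (64 * KB, 2),
    "Athlon 64 FX L2": (1024 * KB, 16),
}

for name, (size, ways) in caches.items():
    g = geometry(size, ways)
    print(f"{name}: {g['sets']} sets, {g['way_bytes'] // KB} KB per way")
```

Under these assumptions the two L2 caches come out with identical set counts and way sizes, while the Athlon 64 FX's L1 has far larger ways than the P4's, which matches the article's picture of a big, slower L1 on one side and a tiny, quick one on the other.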