Caching In: P4 Extreme Edition

Under the Hood

All this talk of cache structure is fine and good, but what about the way in which it’s filled? Left alone, the processor would have to wait for each fill, but thanks to pre-fetch logic the cache can be filled before the processor needs the data. Intel’s NetBurst microarchitecture defines how the Pentium 4 handles processor and cache logic. Intel calls the i-cache the “trace cache”, and it holds micro-ops: instructions that have already been decoded. The essential gain is that these pre-decoded instructions can be fed directly to the execution units without passing through the decoder again. If everything goes well, the cache will always be filled with the micro-ops the processor needs next, and the processor will be that much more efficient. But this doesn’t happen as often as we would like.
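To see why holding decoded micro-ops pays off, here is a toy model in Python. It is only a sketch, not Intel’s design: the `decode` function, the micro-op names, and the unlimited cache size are all assumptions made for illustration. It counts how often the decoder runs when a short loop executes over and over.

```python
# Toy model of a trace cache: decoded micro-ops are stored by address,
# so an instruction that is executed again skips the decoder.
# All names and the micro-op format are hypothetical.

decode_count = 0


def decode(instruction):
    """Stand-in for the instruction decoder: one instruction -> micro-ops."""
    global decode_count
    decode_count += 1
    return [f"uop_{instruction}_{i}" for i in range(2)]


def run(program, iterations, use_trace_cache):
    """Execute `program` repeatedly and return how many decodes were needed."""
    global decode_count
    decode_count = 0
    trace_cache = {}  # address -> list of already decoded micro-ops
    for _ in range(iterations):
        for address, instruction in enumerate(program):
            if use_trace_cache and address in trace_cache:
                uops = trace_cache[address]   # hit: no decode needed
            else:
                uops = decode(instruction)    # miss: run the decoder
                if use_trace_cache:
                    trace_cache[address] = uops
            # `uops` would now be handed to the execution units
    return decode_count


loop = ["load", "add", "cmp", "jne"]
without_cache = run(loop, iterations=100, use_trace_cache=False)
with_cache = run(loop, iterations=100, use_trace_cache=True)
```

In this model the four-instruction loop is decoded on every pass without the cache, but only once per instruction with it.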

If the processor needs something that isn’t in the cache, it’s called a cache miss, and there’s some work to do. For starters, we need to get the correct instructions. We hope to find them in L2 cache and, failing that, in L3. The P4 has plenty of bandwidth between the L2 and the L1 trace cache, but instructions first need to pass through the instruction decoder to be turned into micro-ops. This can be a little cumbersome, especially when it comes to string processing. Once the instructions have been located, room must be made in the trace cache to fit them. As the new instructions come in to fill the cache, the oldest resident block is evicted to make room.
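The miss path described above can be sketched as a small simulation. Treat it as a rough illustration under stated assumptions: the `TraceCache` class, the dictionaries standing in for L2, L3 and main memory, and the one-instruction block size are all hypothetical, and the evicted block is simply dropped here.

```python
from collections import OrderedDict


def decode(instruction):
    """Hypothetical decoder: turn one instruction into a tuple of micro-ops."""
    return (instruction + ".uop",)


class TraceCache:
    """Tiny trace cache that evicts its oldest block when full."""

    def __init__(self, capacity, l2, l3, memory):
        self.capacity = capacity
        self.blocks = OrderedDict()  # insertion order doubles as block age
        self.l2, self.l3, self.memory = l2, l3, memory
        self.log = []  # (address, where the instruction was found)

    def fetch(self, address):
        if address in self.blocks:
            self.log.append((address, "trace cache"))
            return self.blocks[address]
        # Cache miss: look in L2 first, then L3, then main memory.
        for name, level in (("L2", self.l2), ("L3", self.l3),
                            ("memory", self.memory)):
            if address in level:
                instruction = level[address]
                break
        else:
            raise KeyError(address)
        self.log.append((address, name))
        uops = decode(instruction)           # must be decoded before caching
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # make room: drop the oldest block
        self.blocks[address] = uops
        return uops


memory = {0: "mov", 1: "add", 2: "cmp", 3: "jne"}
l3 = {0: "mov", 1: "add", 2: "cmp"}
l2 = {0: "mov", 1: "add"}
cache = TraceCache(capacity=2, l2=l2, l3=l3, memory=memory)
for address in [0, 1, 0, 2, 0]:
    cache.fetch(address)
sources = [source for _, source in cache.log]
```

Note that the last fetch of address 0 misses again: it was the oldest block when address 2 came in, so it had already been pushed out.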

Intel’s NetBurst wants to keep the trace cache filled with the correct instructions, and it also wants to know where everything in the cache is. It does this with an instruction table and a front-end branch predictor. The branch predictor spots where the processor will have to make a choice and tries to guess the outcome based on previous outcomes. It then tells the instruction table to get the relevant branch of instructions ready and put them, already decoded, into the trace cache. If this is done correctly, instructions are decoded a minimal number of times, the correct branches are chosen, and the trace cache is used effectively.
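Guessing a branch from its previous outcomes can be illustrated with the classic two-bit saturating counter. This is a textbook scheme used here as a stand-in; the article does not say which algorithm NetBurst actually uses, and the class name, starting state, and branch address below are assumptions.

```python
class TwoBitPredictor:
    """Textbook two-bit saturating counter, one counter per branch address.

    Counter values 0-1 predict "not taken", 2-3 predict "taken".
    """

    def __init__(self):
        self.counters = {}  # branch address -> counter in the range 0..3

    def predict(self, address):
        # Unseen branches start at 1 (weakly not taken): an arbitrary choice.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        counter = self.counters.get(address, 1)
        if taken:
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
        self.counters[address] = counter


def accuracy(predictor, address, outcomes):
    """Fraction of `outcomes` (True = taken) that the predictor guessed."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)  # learn from the real outcome
    return correct / len(outcomes)


# A loop branch: taken nine times, then falls through, repeated ten times.
outcomes = ([True] * 9 + [False]) * 10
hit_rate = accuracy(TwoBitPredictor(), 0x40, outcomes)
```

With this pattern the counter guesses wrong once while warming up and once at each loop exit, and gets every other branch right.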

The data cache is a little different and not as complicated as the i-cache. The P4 has a 256-bit-wide channel from L2 to the L1 data cache. Instead of going through an instruction table, data from the cache goes into the processor’s registers for computation. Because there’s not much logic involved with data, apart from its location, a much wider channel can be used between caches without performance losses. L3 cache operates in a similar fashion, which makes it very effective for data processing. It can be used as a big buffer that holds data ready, so the processor doesn’t need to stall waiting for data to be fed from main memory into L2.
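To put the 256-bit channel in perspective, here is a quick back-of-the-envelope calculation. The 3.2 GHz clock and the one-transfer-per-clock rate are assumptions made for the arithmetic, not figures from this article, and the function name is hypothetical.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, clock_hz, transfers_per_clock=1):
    """Theoretical peak bandwidth of a bus in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * transfers_per_clock * clock_hz / 1e9


# 256 bits is 32 bytes moved per transfer.
bytes_per_transfer = 256 // 8

# Assumed: a 3.2 GHz clock and one transfer per clock.
bandwidth = peak_bandwidth_gb_per_s(256, 3.2e9)
```

Under those assumptions the channel tops out at roughly 102 GB/s, a theoretical ceiling rather than a measured figure.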


 
© 2003-2008 Fastsilicon Media. All Rights Reserved