Hammer time

AMD announced the x86-64 specifications back in August 2000, but it wasn’t until April 2003 that the first x86-64 CPU was actually released. Opteron, previously known only as Sledgehammer, is the name of AMD’s eighth-generation enterprise-class processor for servers and high-end workstations. Between the original announcement and the product release, x86-64 was officially renamed AMD64.
The unique selling point of AMD64 is that it does what its name suggests, providing 64-bit operation, combined with the same x86 technology that companies have relied on for many years now. Not only does this allow adopters to preserve their investment in 32-bit technology, but it eases the transition to the more advanced 64-bit model – effectively one computer, powered by Opteron, seamlessly handling both 32- and 64-bit applications.
With Intel launching both the Itanium and Itanium 2 since the AMD64 specification was first aired, you’d be forgiven for wondering when AMD would get its product out the door. Luckily for AMD, the Itanium (Merced) and its successor the Itanium 2 (McKinley) – collectively known as the Itanium Processor Family, or IPF – have both received a cool reception from the industry, with the processor family even earning the title “Itanic” from several wry analysts.
The problem with Itanium is that it wholly replaces x86-32 with a brand new architecture. While it can run in 32-bit compatibility mode, it requires special emulation to do so, and performance ultimately suffers. AMD’s Opteron, launched on 22 April, looks to take full advantage of this, and now that it’s in the hands of consumers, it’s time to prove its worth.
x86’s heritage
Before you can truly understand the benefits of AMD64, it’s important to understand how we got to the current state of play. Unlike many other processor architectures, x86 has a very clearly defined history. Early on there was the 4004, then the 8008, the 8080, and finally the 8086. The 8086 was Intel’s first 16-bit CPU, capable of addressing up to 1MB of memory. The 286 followed, introducing protected mode memory access, and allowing lots more memory to be accessible to new programs without breaking existing ones. Then came the 386, the first 32-bit member of the family. The 486 was the next key release. It was the first to include Level 1 cache on-die, reducing the number of times main memory needed to be accessed. Furthermore, the later models had a maths co-processor (x87) built in as standard, adding 80-bit floating point mathematics capabilities to the CPU which, at the time, made Doom run considerably faster!
The Pentium CPU, so named to allow Intel to trademark the brand, made few changes to the x86 architecture. However, by the time it upgraded to MMX, both AMD and Cyrix were actually creating better CPUs than Intel. The Pentium Pro was the first CPU to be specifically designed for 32-bit code. The 386, 486, and Pentium CPUs could run 32-bit code, but they were designed to run both 16-bit and 32-bit code in equal amounts. The Pentium Pro also included an integrated Level 2 cache, greatly increasing both the speed and cost of the CPU.
The Pentium 4 CPU, despite being much faster than the 386, still bears a great resemblance to its ancestor. The main set of x86 instructions remains the same. They both have the same number of registers, and both use the same style of maths co-processor. While other architectures have eschewed backwards compatibility, x86-32 has seen a history of tweaks and speed hikes. When Intel came to design its own 64-bit CPU, it threw this out and started afresh. Unfortunately, that has stifled adoption of the Itanium architecture, because most companies have hefty hardware and software investments in the x86 CPU architecture.
Why is 64-bit needed?
We’ve had 32 bits for such a long time now that it might seem a bit unnecessary to double things. 32 bits means 32 binary digits, so 32 1s and 0s are used to represent numbers. If we have a 32-bit unsigned integer (that is, it can only be positive), it can hold numbers from 0 to 4,294,967,295 (2^32 - 1). Now, because memory is allocated using bytes, it’s also referenced using bytes. When a program wants to read from a byte, it simply looks up its number. As seen, numbers max out at 4 billion or so, which means that the highest byte of memory that a 32-bit program can reference is number 4,294,967,295. Counting byte 0, this means that the range of memory that can be accessed is 4,294,967,296 bytes (2^32), which is divisible by 1,024. So 4,294,967,296 bytes is 4,194,304 kilobytes, which is 4,096 megabytes, which is 4 gigabytes. What this means is that the maximum amount of memory a 32-bit program can reference is 4GB – sounds like a lot, doesn’t it? Well, consider the topic of databases. Large companies buy very powerful servers designed to hold their entire database in RAM for the fastest access – in situations like this, 4GB soon runs out.
Switching to 64-bit numbers squares the number of values an integer can take, so the largest amount of memory that can be addressed becomes 18,446,744,073,709,551,616 bytes – 16 exabytes! Of course, this is all theoretical. There are other limitations generally involved when things are implemented in reality. With the Opteron, 40 of the 64 bits can be used for physical addresses, which gives 1 terabyte (1,024 gigabytes) of memory. Furthermore, the Opteron has a 48-bit virtual memory address space, allowing 256 terabytes to be accessed.
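The arithmetic in the last two paragraphs can be checked with a few lines of Python. This is only a minimal sketch: the helper names `addressable_bytes` and `human_readable` are made up for illustration, and the units are the binary ones used in the text (each step up is a factor of 1,024).

```python
# Binary units, as used in the article: each step up is a factor of 1,024.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]


def addressable_bytes(address_bits):
    """Number of distinct byte addresses an address of this width can name."""
    return 2 ** address_bits


def human_readable(num_bytes):
    """Divide by 1,024 until the value drops below 1,024, then attach the unit.

    Integer division is exact here because every input is a power of two;
    for other values it would round down.
    """
    value = num_bytes
    unit_index = 0
    while value >= 1024 and unit_index < len(UNITS) - 1:
        value //= 1024
        unit_index += 1
    return f"{value:,} {UNITS[unit_index]}"


if __name__ == "__main__":
    for label, bits in [
        ("32-bit program", 32),
        ("Opteron physical memory", 40),
        ("Opteron virtual memory", 48),
        ("full 64-bit address", 64),
    ]:
        print(f"{label}: {human_readable(addressable_bytes(bits))}")
```

Run as a script, this reports 4 GB, 1 TB, 256 TB and 16 EB for the four address widths, matching the figures quoted above.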
So, lots more RAM is made available – but what else does switching to 64-bit help? One simple advantage is in the field of encryption, where very large numbers need to be processed quickly. 32-bit CPUs can perform 64-bit mathematics (such as the long long integer type in GCC), but each operation needs to be split into two 32-bit halves for calculation, then recombined. 64-bit CPUs can natively perform 64-bit mathematics, and so avoid this speed hit.
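To see where the extra work comes from, here is a rough Python sketch of a 64-bit addition done the way a 32-bit machine has to do it: add the low halves, note the carry, then add the high halves plus that carry. Python integers are not limited to 32 bits, so the masks below only imitate 32-bit registers, and the function name `add64_using_32bit_halves` is an illustrative assumption rather than anything taken from GCC.

```python
MASK32 = 0xFFFFFFFF  # the largest value a 32-bit register can hold


def add64_using_32bit_halves(a, b):
    """Add two unsigned 64-bit numbers using only 32-bit-sized pieces."""
    # Split each operand into a low and a high 32-bit half.
    a_low, a_high = a & MASK32, (a >> 32) & MASK32
    b_low, b_high = b & MASK32, (b >> 32) & MASK32

    # Step 1: add the low halves; anything above bit 31 is the carry.
    low_sum = a_low + b_low
    carry = low_sum >> 32
    low_result = low_sum & MASK32

    # Step 2: add the high halves plus the carry, discarding any overflow
    # past bit 63, just as a fixed-width 64-bit result would.
    high_result = (a_high + b_high + carry) & MASK32

    # Step 3: recombine the two halves into one 64-bit value.
    return (high_result << 32) | low_result


if __name__ == "__main__":
    a = 0x00000001FFFFFFFF
    b = 0x0000000000000001
    print(hex(add64_using_32bit_halves(a, b)))  # prints 0x200000000
```

A 64-bit CPU does the same job with a single add instruction on one register, which is the speed hit the paragraph above refers to.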
Hammer time!
AMD took a different route in the form of a CPU evolution, rather than Intel’s revolution. The Hammer family of CPUs (Sledgehammer for servers and workstations, Clawhammer for desktops) were designed around a new AMD64 architecture, retaining complete compatibility with 32-bit code, whilst adding the option of 64-bit computing. Moreover, they perform 64- and 32-bit operations natively, with no emulation required, thanks to two modes of operation: ‘Long Mode’ and ‘Legacy Mode’. In Legacy Mode, AMD64 CPUs operate as existing x86-32 chips do – they support 16-bit and 32-bit applications and operating systems. Processors that implement AMD64, of which the Opteron is the first, boot into Legacy Mode by default, and are entirely compatible with existing software.
Long Mode is where the magic happens, and can be split into two sub-modes: ‘64-bit Mode’ and ‘Compatibility Mode’. Both these sub-modes require a 64-bit operating system to run. But the difference lies in the applications: 64-bit Mode only supports 64-bit applications running on a 64-bit operating system, whereas Compatibility Mode supports 32-bit applications running on a 64-bit operating system. Users can switch to a 64-bit operating system immediately whilst continuing to use their existing software – the operating system will perform better, which will yield some performance boosts at the application level.
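The mode structure described above boils down to two questions: how many bits is the operating system, and how many bits is the application? The following sketch encodes just that decision. It is a simplification based only on the description in this article; the function `operating_mode` is hypothetical and does not query any real hardware.

```python
def operating_mode(os_bits, app_bits):
    """Return the AMD64 operating mode implied by the OS and application widths."""
    if os_bits not in (16, 32, 64) or app_bits not in (16, 32, 64):
        raise ValueError("widths must be 16, 32 or 64")
    if app_bits > os_bits:
        raise ValueError("an application cannot be wider than its operating system")

    if os_bits < 64:
        # 16- or 32-bit operating system: the chip behaves like an x86-32 CPU.
        return "Legacy Mode"
    if app_bits == 64:
        # 64-bit application on a 64-bit operating system.
        return "Long Mode (64-bit Mode)"
    # Existing (non-64-bit) application on a 64-bit operating system.
    return "Long Mode (Compatibility Mode)"


if __name__ == "__main__":
    for os_bits, app_bits in [(32, 32), (64, 32), (64, 64)]:
        mode = operating_mode(os_bits, app_bits)
        print(f"{os_bits}-bit OS, {app_bits}-bit application: {mode}")
```

The middle case is the interesting one for upgraders: a 64-bit operating system with an unchanged 32-bit application lands in Compatibility Mode.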
One of the most highly anticipated revisions made in AMD64 is the addition of new registers. Registers store small amounts of information directly in the CPU, and provide instant access to that data – it doesn’t need to be moved from cache or main memory. More registers mean more data can be held on the CPU, which means less shuffling around is required to get the results of calculations. AMD has doubled the number of general purpose registers (GPRs) to sixteen, along with the number of XMM registers. Having more registers is key to improving performance, particularly when there were so few to begin with. That said, some are already saying that 16 GPRs aren’t enough – RISC architectures often have registers numbering in the hundreds!
The biggest performance increase will come from the XMM instructions (see Glossary). Thanks to Intel pushing so hard to have them adopted, these are now used extensively in multimedia programs such as 3D Studio and Quake 3. As a result, just by incorporating the XMM instructions, the Opteron will leap ahead in performance when compared to the Athlon chip.
In 64-bit Mode, the Opteron switches to a new flat memory model. Segmented memory was previously used as a method to allow operating systems to isolate programs from each other, thereby increasing reliability, but because most modern OSes do this with paging instead, the segmentation features of the x86-32 architecture largely go to waste. They have been stripped out of 64-bit Mode, which allows new 64-bit operating systems to have much simpler code to handle memory management.
AMD64 and operating systems
Thanks to the hard efforts of its community, Linux has supported AMD64 for some time now. The new 64-bit kernel is based on the existing i386 port, and a lot of effort has been made to ensure that the new features of the Opteron are used to the best possible advantage. SuSE and AMD have been working very closely to develop the new code, which is released under the GPL as part of the 2.5 kernel. Customers of UnitedLinux, such as companies who bought SuSE’s Linux Enterprise Server 8 (SLES 8) product, will be able to benefit from this code already, as the UnitedLinux kernel has many 2.5 backports, including AMD64 support. As such, SLES 8 was the first server operating system available to fully support Opteron’s architecture.
In late 2002, Microsoft delivered a development release of AMD64 Windows to several of its industry partners. It has recently announced that it’s developing native 64-bit versions of its Windows XP and Windows Server 2003 operating systems for the Opteron and Athlon 64 platforms. Specific launch dates are a little vague, with MS saying that beta releases are expected in the middle of 2003. With Windows, Linux and the major BSDs (Berkeley Software Distributions) developing support for Opteron, its adoption should be swift.
Can’t touch this!
Hot topics right now are “how fast can Opteron users expect their new computers to be?” and “will it perform better in 32-bit mode than the Itaniums?” The second question is easier to answer: almost certainly. Because Itanium shares no relation with the x86 architecture, it’s difficult for it to emulate a 32-bit CPU. The Opteron, however, starts in 32-bit mode by default, and effectively becomes a particularly fast Athlon. If you use applications that make use of SSE and SSE2 (pretty much any 3D software/game), you’ll see an immediate and considerable speed boost. Once software is specifically compiled for 64-bit mode, you’ll see a further 10 to 15 per cent speed improvement from the use of the new registers, rising much higher for computationally-intensive code.
One stumbling block could be AMD’s choice of model rating for the new Opteron. Each Opteron will be given a three-digit number to represent its speed and scalability. The first digit will be 1 for uniprocessor boxes, 2 for two-way systems, 4 for four-way, or 8 for eight-way. Digits two and three are a relative performance figure, and will start at 40. So, an Opteron 140 will be the slowest uniprocessor Opteron machine, whereas an Opteron 141 will be faster than the 140 (faster by an unknown amount), and an 899 will presumably be the fastest eight-way Opteron machine available. Confused yet?
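The numbering scheme is perhaps easier to follow as a small decoder. This sketch assumes only the rules given above (the first digit is the number of processors the system scales to, the last two digits are a relative performance figure starting at 40); `decode_opteron_model` is an illustrative name, not an AMD tool.

```python
VALID_WAYS = {1, 2, 4, 8}  # uniprocessor, two-way, four-way, eight-way


def decode_opteron_model(model):
    """Split a three-digit Opteron model number into (ways, relative_speed)."""
    text = str(model)
    if len(text) != 3 or not text.isdigit():
        raise ValueError("model number must have exactly three digits")

    ways = int(text[0])            # first digit: scalability
    relative_speed = int(text[1:])  # digits two and three: relative performance
    if ways not in VALID_WAYS:
        raise ValueError("first digit must be 1, 2, 4 or 8")
    if relative_speed < 40:
        raise ValueError("relative performance figures start at 40")
    return ways, relative_speed


if __name__ == "__main__":
    for model in (140, 244, 899):
        ways, speed = decode_opteron_model(model)
        print(f"Opteron {model}: up to {ways}-way, relative speed {speed}")
```

Note that the second number is only a ranking: the decoder can tell you that a 141 outranks a 140, but not by how much.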
What’s next?
The Opteron is the first chip to be built on the AMD64 architecture, but it won’t be the last. As Opteron is designed to be used in servers and workstations, AMD is working on a new version of its Athlon XP desktop CPU – dubbed Athlon 64/Clawhammer. This will also make use of the AMD64 architecture. However, there are no plans as yet to discontinue production of the x86-32 Athlon XP or Duron chips, despite the fact that a Clawhammer CPU would be able to run the same software as (and outperform) each of them.
In the first half of 2004, AMD is slated to release a 0.09 micron version of the Opteron (Athens), the Athlon 64 (San Diego), and the Mobile Athlon 64 (Odessa), which should allow them to continue increasing clock speeds. Intel meanwhile are still making quite drastic changes to the Itanium. With the Opteron’s fixed x86 heritage behind it, it’s more likely that the interface will stay the same, with some parts being shuffled and tweaked internally to increase performance instead of switching to an Opteron 2.
Conclusion
The Opteron is the biggest step forward AMD has ever made, and consequently also its biggest risk. The one thing threatening Opteron’s success is AMD’s “almost as good as Intel, but cheaper” reputation. While this has been true in the past, it would damage Opteron sales to be seen as a cheap version of Itanium, or “the poor man’s 64-bit CPU”. The server market carries a much higher margin level than the desktop market, which should in theory allow AMD to reach profitability quickly if it can keep its prices at a good level. However, if AMD somehow manages to sink back into its old persona of being billed as the “value option”, this increase in revenue would certainly be jeopardised.
It’s our opinion that AMD64 has a strong proposition for both the consumer and enterprise market. Long term popularity is likely to be dictated by the availability of software to take advantage of it, but that doesn’t seem much of an obstacle in the Linux world at least. Time will tell, and the stakes may be high for AMD, but the Opteron and the AMD64 chips that will follow certainly seem to have a future.
Glossary
Keep on top of your 64-bit processing jargon with our complete guide to the terminology of the modern computing world.
3DNow!: AMD-designed multimedia extensions to x86-32
Athlon: AMD’s x86-32 CPU brand
Athlon 64: AMD’s proposed AMD64 desktop CPU brand
Bit: Binary digit, a 0 or a 1
Byte: Eight binary digits
CISC: Complex Instruction Set Computer, uses complex instructions that are often decoded to RISC instructions
Clawhammer: Codename for AMD’s desktop 64-bit CPU, Athlon 64
CPU: Central Processing Unit, forms the core of your PC’s processing abilities
GPR: General Purpose Register, a register that can be used for general programming
IA-64: Intel’s Itanium architecture
Instruction: Simple binary code that translates to one CPU operation when called
Instruction Scheduling: The act of re-arranging instructions in a non-destructive manner to improve performance
IPF: Itanium Processor Family, includes Itanium and Itanium 2
Itanium: Intel’s IA-64 implementation
Madison: Codename for the third-generation Itanium
McKinley: Codename for Itanium 2
Megabyte: 1,024 kilobytes
Opteron: AMD’s server and workstation 64-bit CPU
Register: A storage area directly on a CPU, able to hold a small amount of data
RISC: Reduced Instruction Set Computer, uses simpler instructions for faster execution at the expense of larger software
Sledgehammer: Codename for AMD’s Opteron
SSE: Streaming SIMD Extensions, more Intel-designed multimedia extensions to x86-32
SSE2: Streaming SIMD Extensions 2, even more Intel-designed multimedia extensions to x86-32
SIMD: Single Instruction, Multiple Data, the technique of allowing one CPU instruction to act on multiple pieces of data at once
Terabyte: 1,024 gigabytes
VLIW: Very-Long Instruction Word, method for allowing compilers better control over execution, while making CPUs simpler
x86: Common name for 386-compatible CPUs
x86-32: 32-bit x86 CPU architecture
x86-64: Original name for AMD64 64-bit architecture
x87: The floating point mathematics architecture used in x86 PCs
XMM: Internal name for SSE and SSE2 combined

