The figure shows a simplified diagram of the floating-point unit (FP) contained in a Bulldozer module. It is organized as an external coprocessor: it receives instructions from outside and reports the results back to the requesting core. The connection between the FP unit and the integer cores is bidirectional. The FP unit receives instructions and data from the integer cores and sends a completion signal back (instruction retirement is still handled by the integer core). Two 64-bit buses link it to the integer cores for instructions that convert integers to FP numbers and vice versa, and it is connected to the load and store units of both integer cores to perform memory operations.
It contains a unified scheduler that can accept up to 4 instructions per clock cycle, alternating between the two integer cores, and issue up to 4 instructions per clock cycle for execution, even mixing those of both threads, at 80/128 bits. The scheduler is fully data-driven: an instruction executes as soon as its operands are available and a suitable execution unit is free, with care taken to remain fair between the two threads.
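The issue policy described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the function name `issue_cycle`, the per-thread queues, the `ready` set and the alternating-priority rule are all hypothetical and are not taken from AMD documentation. It also ignores which kind of execution unit each instruction needs. It only shows the idea of picking ready instructions from both threads, up to 4 per cycle, while keeping the two threads balanced.

```python
from collections import deque

ISSUE_WIDTH = 4  # from the text: up to 4 instructions issued per clock cycle


def issue_cycle(queues, ready, cycle):
    """Pick up to ISSUE_WIDTH ready instructions, alternating between threads.

    queues: dict mapping thread id (0 or 1) to a deque of instruction names,
            oldest first (hypothetical representation)
    ready:  set of instruction names whose operands are already available
    cycle:  cycle number, used to alternate which thread gets first pick
    """
    # Assumed fairness rule: the thread with first pick changes every cycle.
    order = [cycle % 2, (cycle + 1) % 2]
    # Data-driven selection: only ready instructions are candidates,
    # so a younger instruction may issue before an older, stalled one.
    candidates = {t: [i for i in queues[t] if i in ready] for t in order}
    issued = []
    while len(issued) < ISSUE_WIDTH and any(candidates.values()):
        for t in order:
            if candidates[t] and len(issued) < ISSUE_WIDTH:
                instr = candidates[t].pop(0)
                queues[t].remove(instr)  # leaves the scheduler once issued
                issued.append((t, instr))
    return issued


queues = {0: deque(["a0", "a1", "a2"]), 1: deque(["b0", "b1", "b2"])}
ready = {"a0", "a2", "b0", "b1", "b2"}  # "a1" is still waiting for data
issued = issue_cycle(queues, ready, cycle=0)
print(issued)
```

In this run, `a1` stays in the queue because its data is not ready, while the younger `a2` issues ahead of it, and the four issue slots are split evenly between the two threads.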
There are 4 execution units, which can perform 2 FP FMAC operations (Fused Multiply Accumulate, a multiplication fused with an accumulation, i.e. a calculation of the form d = a + b * c) and 2 IMAC operations (Integer Multiply Accumulate, the same as FMAC but on integers) per cycle. x87 operations are handled by the FMAC units, as are divisions and square roots. Some special operations, such as permutes, are handled by the IMAC units; memory operations are handled by the two IMAC pipelines; and register moves are mostly carried out on the fly without occupying an execution unit. Thanks to a technique patented by AMD, the FMAC and IMAC units can also perform plain additions or multiplications with the same circuit, without unnecessary duplication. The units are 128 bits wide and can be paired to perform 256-bit operations.
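To make the FMAC operation concrete, the sketch below emulates d = a + b * c with a single rounding at the end, using exact rational arithmetic. It is a numerical illustration only, not a description of AMD's circuit or of the patent. The helper names `fma`, `add_via_fma` and `mul_via_fma` are hypothetical, and expressing addition as a + b * 1 and multiplication as 0 + b * c is one obvious way a multiply-accumulate datapath could cover both cases, assumed here for illustration.

```python
from fractions import Fraction


def fma(a, b, c):
    """Emulate d = a + b * c with one rounding at the end (finite inputs only)."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)  # exact, no rounding yet
    return float(exact)  # single rounding to double precision


def add_via_fma(a, b):
    """Plain addition as a special case: a + b * 1 (assumed mapping)."""
    return fma(a, b, 1.0)


def mul_via_fma(b, c):
    """Plain multiplication as a special case: 0 + b * c (assumed mapping)."""
    return fma(0.0, b, c)


a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30

unfused = a + b * c   # product is rounded first, then the sum is rounded
fused = fma(a, b, c)  # rounded only once

print(unfused, fused)
print(add_via_fma(1.5, 2.25), mul_via_fma(1.5, 2.0))
```

With these inputs the exact product is 1 - 2^-60, which rounds to 1.0 in double precision, so the separate multiply and add gives 0.0 while the fused version keeps the result -2^-60. The emulation does not preserve the sign of zero and does not handle infinities or NaN, which a real unit must do.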