\documentclass[conference,twocolumn]{IEEEtran} \usepackage[skip=0.2\baselineskip]{caption} \usepackage{longtable} \usepackage{booktabs} \usepackage{graphicx} \title{ECE 570 Final Project Report} \author{Danila Fedorin\\fedorind@oregonstate.edu} \begin{document} \maketitle \section*{Part 1: Address Prediction Benchmarks} In this part, the \emph{Taken}, \emph{Not Taken}, \emph{Bimodal}, \emph{2-Level} and \emph{Combined} branch predictors were run against three benchmarks. The results are recorded in Figure \ref{fig:ap1}. Figure \ref{fig:ap1graph} provides a bar chart of this data. Results are grouped by benchmark to make it easier to compare various branch prediction algorithms. \begin{figure}[h] \begin{tabular}[]{@{}llllll@{}} \toprule Benchkmark & Taken & Not Taken & Bimod & 2 level & Comb \\ \midrule Anagram & .3126 & .3126 & .9613 & .8717 & .9742 \\ Go & .3782 & .3782 & .7822 & .6768 & .7906 \\ GCC & .4049 & .4049 & .8661 & .7668 & .8793 \\ \bottomrule \end{tabular} \caption{Address prediction rates of various predictors} \label{fig:ap1} \end{figure} \begin{figure}[h] \begin{center} \includegraphics[width=\linewidth]{ap1.png} \end{center} \caption{Address prediction rates by benchmark} \label{fig:ap1graph} \end{figure} As expected, the two stateless predictors, \emph{Taken} and \emph{Not Taken}, perform significantly worse than the others. These predictors do not keep track of the behavior of various branches, and thus have limited ability to predict the direction or target of a branch. Out of the stateful predictors, the \emph{2-level} predictor seems to perform the worst. Unsurprisingly, the \emph{Combined} predictor, which is a combination of the other two stateful predictors, performs better than its constituents, since it's able to switch to a better-performing predictor as needed. I was confused why the \emph{Taken} and \emph{Not Taken} predictors had identical address prediction rates. I would have expected the \emph{Taken} predictor to correctly predict more addresses, since structures like loops will typically have more ``taken'' branches than ``not taken`` ones. At first, I thought that this is explained by both stateless predictors having no BTB - functions like \texttt{bpred\_update} do not initialize these tables, and they are not used for prediction. However, this shouldn't entirely account for the identical numbers of address hits - after all, the \emph{Taken} predictor should always return the expected target address, while the \emph{Not Taken} predictor should, in the case of conditional jumps, return \texttt{PC+1}. This seems consistent with the code. However, I think I see what is happening. I looked at the following fragment from \texttt{sim-outorder.c} (which was \textbf{not} added by me): \begin{verbatim} bpred_lookup(pred, /* branch address */fetch_regs_PC, /* target address */ /* FIXME: not computed */0, ... \end{verbatim} It seems as though the target address is always predicted to be zero, because it is not computed at the time of this function call. The text ``FIXME'' indicates that this may be a bug or temporary issue. This prediction, in turn, seems to mean that the \emph{Taken} branch predictor will return \texttt{0} in all cases. I confirmed that this is the case by adding a call to \texttt{printf} to the \texttt{BPredTaken} case of \texttt{bpred\_lookup}. To me, this seems like an issue, because code for other predictors uses \texttt{0} to represent ``not taken''. Consider, for instance, the following snippet from later on in the same function: \begin{verbatim} return ((*(dir_update_ptr->pdir1) >= pred_cutoff) ? /* taken */ pbtb->target : /* not taken */ 0); \end{verbatim} Zero here is clearly used to denote ``not taken''. So, it seems as though all in all, \emph{Taken} always returns ``not taken''. Amusingly, the same will be the case with \emph{Not Taken}. It returns \texttt{PC+1} in the case of conditional jumps (which is equivalent to returning zero, since the code in \texttt{sim-outorder.c} converts zero to \texttt{PC+1}), or, in the case of unconditional jumps, it returns the expected target address (zero), which is \textit{also} \texttt{PC+1}! The fact that the two predictors have the same address prediction rate seems to be due to the ``FIXME'' in the simulator code. By the way, I write \texttt{PC+1} to mean ``the next program counter'', since I do not know the width in bytes of the simulated instructions. \section*{Part 2: IPC Benchmarks} In this section, we present the IPC results from the previously listed predictors. Figure \ref{fig:ipc} contains the collected data, and Figure \ref{fig:ipcgraph} is a bar chart of that data. \begin{figure}[h] \begin{tabular}[]{@{}llllll@{}} \toprule Benchkmark & Taken & Not Taken & Bimod & 2 level & Comb \\ \midrule Anagram & 1.0473 & 1.0396 & 2.1871 & 1.8826 & 2.2487 \\ Go & 0.9512 & 0.9412 & 1.3212 & 1.2035 & 1.3393 \\ GCC & 0.7878 & 0.7722 & 1.2343 & 1.1148 & 1.2598 \\ \bottomrule \end{tabular} \caption{IPC by benchmark} \label{fig:ipc} \end{figure} \begin{figure}[h] \begin{center} \includegraphics[width=\linewidth]{ipc.png} \end{center} \caption{IPC by benchmark} \label{fig:ipcgraph} \end{figure} Once again, the stateless predictors perform significantly worse than the stateful predictors. Also, \emph{Taken} performs better than \emph{Not Taken}. This is likely because most of the given programs have loops, in which the conditional branch is taken many times while the loop is iterating, and then once when the loop terminates. Predicting ``not taken'' in this case would lead to many mispredictions. However, as described above, it seems like \emph{Taken} and \emph{Not Taken} return the same addresses, so I'm not completely sure where the IPC difference is coming from. Once again, the \emph{Bimodal} predictor performs better than the \emph{2-Level} predictor, and both are outperformed by \emph{Combined}, which leverages the two at the same time. \section*{Part 3 - Bimodal Exploration} In this section, the \emph{Bimodal} branch predictor is further analyzed by varying the size of the BTB. BTB sizes range from 256 to 4096. The data collected from this analysis is shown in Figure \ref{fig:ap2}. As usual, the data is shown as a bar graph in Figure \ref{fig:ap2graph}. \begin{figure}[h] \begin{tabular}[]{@{}llllll@{}} \toprule Benchkmark & 256 & 512 & 1024 & 2048 & 4096 \\ \midrule Anagram & .9606 & .9609 & .9612 & .9613 & .9613 \\ Go & .7430 & .7610 & .7731 & .7822 & .7885 \\ GCC & .8158 & .8371 & .8554 & .8661 & .8726 \\ \bottomrule \end{tabular} \caption{Bimodal address prediction rates by benchmark} \label{fig:ap2} \end{figure} \begin{figure}[h] \begin{center} \includegraphics[width=\linewidth]{ap2.png} \end{center} \caption{Bimodal address prediction by benchmark} \label{fig:ap2graph} \end{figure} As expected, increasing the BTB size for the Bimodal predictor seems to improve its performance. Since instructions are assigned slots in the BTB according to their hashes (which can collide), having a larger BTB means that there is a smaller chance of collisions, and, therefore, that branch targets are more accurately predicted. The exception appears to be the Anagram benchmark, where the changes to performance are small enough to be unnoticable in the visualization. This could be because the Anagram benchmark has only a few important branches, which means that increasing the BTB size does not prevent any further collisions. The benchmark also takes less real time to run on my machine, which is an indicator that it is less complex than the Go and GCC benchmarks (which further supports the above theory). \section*{Part 4 - Combined Branch Predictor Explanation} It appears as though the combined branch predictor works by considering the decisions of both a 2-level and a bimodal branch predictor. To decide which predictor to listen to, the combined predictor uses a third predictor, named \texttt{meta} in the code. The \texttt{meta} predictor appears to be another bimodal predictor, but instead of deciding whether a branch is taken or not taken, it decides whether to use the two-level or the bimodal predictor to determine the branch outcome. If the two predictors managed by \texttt{meta} disagree about the direction, then \texttt{meta}'s 2-bit counter is updated to favor the correct predictor. Otherwise, if the two predictors both predict ``taken'' or ``not taken'', \texttt{meta} is unaffected, since neither predictor did better than the other. Because \texttt{meta} is implemented as a 2-bit predictor, it can tolerate at most one use of the wrong branch predictor before switching to the other (if the current predictor is ``strongly'' predicted). Of course, a combined predictor need not specifically consist of a bimodal and a 2-level predictor. This approach can be applied to combine any two predictors. However, in the code for the simulator, it seems like only this specific combination is possible. \section*{Part 5 - 3-Bit Branch Predictor} For this part, I modified the SimpleScalar codebase to add a 3-bit branch predictor. The code will be included with this report, but not in this document. After implementing this predictor, I simulated it with the same BTB sizes as the previous extended simulations of the Bimodal (2-bit) predictor. Figure \ref{fig:ap3} contains this data, and Figure \ref{fig:ap3graph} contains the visualization of that data. \begin{figure}[h] \begin{tabular}[]{@{}llllll@{}} \toprule Benchkmark & 256 & 512 & 1024 & 2048 & 4096 \\ \midrule Anagram & .9610 & .9612 & .9615 & .9616 & .9616 \\ Go & .7507 & .7680 & .7799 & .7897 & .7966 \\ GCC & .8192 & .8385 & .8554 & .8656 & .8728 \\ \bottomrule \end{tabular} \caption{3-Bit address prediction rates} \label{fig:ap3} \end{figure} \begin{figure}[h] \begin{center} \includegraphics[width=\linewidth]{ap3.png} \end{center} \caption{3-Bit address prediction rates} \label{fig:ap3graph} \end{figure} As with the bimodal branch predictor, the 3-bit predictor benefits from larger BTB sizes in the Go and GCC benchmarks, but seems to remain very consistent in the Anagram benchmark. The differences between this predictor and the related bimodal predictor are hard to see in this diagram. To better compare the two predictors, I computed the percent improvement to address prediction rates of the 3-bit branch predictor relative to the bimodal one. Figure \ref{fig:2v3} displays this information. From this figure, it appears as though the 3-bit predictor performs better than the bimodal one in most cases. However, it does perform slightly worse with a 2048-sized BTB in the GCC benchmark. The Go benchmark sees the most improvement (around 1\%). A 3-bit predictor performs better when branches generally follow the same direction, except for occasional groups in the other direction. If the Go benchmark implements the Chinese game of the same name, it's possible that the program behaves very much in this manner. For instance, if the program is scanning the board to find groups of ``dead'' pieces, starting at a recently placed piece, it will likely find pieces nearby, but occasionally run into empty spaces like ``eyes''. If the benchmark implements a Go AI, I'm not sure how it would behave computationally, but perhaps it also follows the same pattern. \begin{figure}[h] \begin{center} \includegraphics[width=\linewidth]{2v3.png} \end{center} \caption{Percent improvement of 3-bit predictor over the bimodal predictor.} \label{fig:2v3} \end{figure} \end{document}