diff --git a/final/report.tex b/final/report.tex index 9447a45..2e3a777 100644 --- a/final/report.tex +++ b/final/report.tex @@ -2,11 +2,23 @@ \usepackage[margin=1in]{geometry} \usepackage{graphicx} \usepackage{amsmath} +\usepackage{hyperref} +\usepackage{xcolor} +\definecolor{link}{HTML}{006275} +\hypersetup{ + colorlinks, + citecolor=black, + filecolor=black, + linkcolor=link, + urlcolor=black +} \title{Final Project Report} \author{Danila Fedorin} \begin{document} \maketitle -\section*{General Design and Considerations} +\tableofcontents +\pagebreak +\section{General Design and Considerations} The goal of this assignment was to create a 256-byte SRAM memory unit. In order to minimize wire delays, I chose to split each bit into \textbf{4 columns of 64 SRAM cells each}. This was motivated by the following factors: @@ -18,7 +30,7 @@ each}. This was motivated by the following factors: the decision to shrink the columns as much as possible. However... \item \emph{Smaller} columns became a routing challenge. Even with a 4-column split, to properly connect each cell of the SRAM column, the SRAM cells themselves need - to accomodate an additional three \textsc{Wl} lines. Due to the pitch requirements + to accommodate an additional three \textsc{Wl} lines. Due to the pitch requirements on metals three and four, this is the upper limit (for reasonably sized cells). Alternatives included splitting the decoder into pieces, but for large numbers of columns, this meant that the decoder signal traveled through significant amounts @@ -53,35 +65,189 @@ capacitance $\frac{C}{4}$. Between each fragment, I added the aforementioned $5\ transistors, as well as 16 always-off $5\lambda$ transistors, which simulated the remaining memory cells. I also placed \textsc{Din}, \textsc{Ad0}, and \textsc{Rwt} behind the default-sized flip-flops attached to the clock to simulate something like a pipeline stage. My overall design is shown -in figure \ref{fig:top-design-sim}. 
+in Figure \ref{fig:top-design-sim}. \pagebreak -\section*{Performance Results} -I made three measurements of my performance. +\begin{figure}[h] + \centering + \includegraphics[width=\linewidth]{toplevel_design.png} + \caption{Top-level design for a single bit.} + \label{fig:top-design} +\end{figure} -\begin{itemize} - \item Without flip-flopping my inputs and outputs, I was able to clock my design around - 950\textit{ps}. - \item With flip-flops on my inputs (but not on my output), I was able to clock my design - around 1.24\textit{ns}. However, at this delay, the output of the gate came in very close to - the falling edge of the clock. - \item With flip-flops on my inputs and my outputs, I was able to clock my design at 2.6\textit{ns}. - This significant delay was to allow enough setup time for the flip flop. -\end{itemize} +My SRAM cell ended up being $30\lambda$ units tall when arrayed. With +a total of 64 cells in a single column, this led to a wire length of $1920\lambda$. +However, since my write block was now included in the column, I added another $300\lambda$ +of length to this number, to a total of roughly $2200\lambda$. + +\begin{figure} + \centering + \includegraphics[width=0.6\linewidth]{toplevel.png} + \caption{Architecture of top-level simulation.} + \label{fig:top-design-sim} +\end{figure} + +\pagebreak +\section{Performance Results} +I was able to clock my design at 1.38ns. There is a caveat to this clock speed: my \textsc{Bt} and +\textsc{Bf} lines are not pulled all the way to \textsc{Gnd} when they are written low. This +doesn't seem to be a problem - it's sufficient to flip the furthest cell in the design in +every situation I've tested. However, from what I hear, this was discouraged during one of +the office hours (which I was unable to attend). With the constraint of pulling the wires +all the way down, my design can operate at around 2.1ns. % Two factors lead to these upper limits. 
% \begin{itemize} \item \textit{Write capacitance} makes it increasingly difficult to overwrite the value - in the cell. Clocking my design any faster than 950\textit{ps} or 1.24\textit{ns} - (depending on the case) leads my cell to \textit{almost} flip, but not resolve correctly. + in the cell. Clocking my design any faster leads my cell to \textit{almost} flip, but not resolve correctly. I have found no way to work around these limits once my wire was properly sized, and my write block was placed in the middle of the column. \item \textit{Flop, decoder, and read delays} are the major limitation when both the inputs - and the outputs of the circuit are connected to flip flops. Even though the output - of the read block is correct, it doesn't arrive fast enough to be captured by the next cycle. - Furthermore, in some cases, the signal to open a memory cell arrives later than the - \textsc{Trig} signal for the senseamp, making it read too early and thus output the incorrect value. + and the outputs of the circuit are connected to flip flops. The most significant + instance of this issue is my write block: both \textsc{Din} and \textsc{Rwt} arrive + around $300\textit{ps}$ into the cycle. This means two things: a) if the previous + operation was ``read'', then the block does not start writing until halfway into + the positive phase of the clock and b) if the data being written is different + from the data in the previous cycle, for half the time, the write block will write + the old data (until the flip flop switches). \end{itemize} +\section{Components} +\subsection{Decoder} +\subsubsection{In My Own Words} +The decoder in this design is the exact same one as we were given in lecture. +It computes all combinations of two consecutive bits using a \textsc{Nand} gate; for +each combination, there are 4 adjacent two-bit combinations, +leading to 4 \textsc{Nor} gates connected to each \textsc{Nand}. 
There are now +16 combinations of 4 adjacent bits; each combination of the lower 4 bits +needs to be compared with each of the 16 combinations of the upper 4 bits, +leading to 16 \textsc{Nand} gates connected to each \textsc{Nor}. This +results in 256 unique \textsc{Wl} wires. Finally, these need to be attached +to the clock, so that cells aren't open randomly. This is done using an \textsc{And} +gate (a \textsc{Nand} followed by an inverter). + +% TODO: Domino logic +% TODO: More inverters? + +\pagebreak +\subsection{Read Block} +\subsubsection{In My Own Words} +The read block uses a \emph{sense amplifier} to detect small changes on the bitlines, +which it then translates into a zero-or-one output. The changes in the wires are below +the threshold of what could be considered digital logic; all the sense amplifier +designs I've come across rely on metastability, a state in which even tiny fluctuations +can significantly alter the outcome\footnote{My favorite analogy is a pencil balanced on its tip. +Technically, it's stable; however, even a small air current -- one you can't feel -- can knock it over.}. +The \textsc{Trigger} signal, which depends on the clock and \textsc{Rwt}, puts the amplifier +into a metastable state. From there, the connected bitlines cause it to resolve one way +or another. Finally, if one of the wires resolves, a value is written into the keeper circuit +at the end, which ensures that the value that was read continues to be expressed until +the next read operation. + +\subsubsection{Details} +For my read block, I used a different sense amplifier design. The design based +on the two \textsc{Nand3} gates was easy to understand and build, but was less +sensitive, and tended to behave strangely under pressure. This led to difficulties +with debugging (the output would, for instance, flip completely at certain +wire widths), and was seemingly random. Instead, I used +an \textbf{improved latch-based sense amplifier design} from . 
% TODO: cite +The design I used is shown in Figure \ref{fig:latch-amp}. +I left it sized at $40\lambda$, since larger amplifiers seem to take longer +to trigger and exit metastability. + +The read block is not a particular bottleneck in this design. The main concern +was to handle the \textbf{``false start'' activation of the write block}. Because the \textsc{Rwt} +input is behind a latch, it takes nearly $300\textit{ps}$ to pull up or down after +the initial clock. Thus, if a write occurred during a previous cycle, the write block will +activate for a short period of time before the read block does. The memory cell +will overpower this initial misfire\footnote{According to my additional simulations, this is true even when the memory cell is close to the write block.}, but in this case, both \textsc{Bt} and \textsc{Bf} +will be below \textsc{Vdd}. The ``improved sense amplifier'' seems to handle this +case better than the one based on two \textsc{Nand} gates. I think that both Reed and +Graham experienced this occurrence -- they seemed to post very similar waveforms +to the community Discord group chat. + +\pagebreak +\subsection{Write Block} +\subsubsection{In My Own Words} +The write block converts a ``data in'', or \textsc{Din}, signal +into a one-hot representation. It does so by pulling one of the bitlines high, and the other +low. Once the memory cell connects to the bitlines, it takes on the charge provided by the +write block, and is therefore overwritten. In my design, two PMOS transistors for each bitline +are used to pull down; one of the transistors is triggered by the \textsc{Din} signal (which wire +we pull down depends on the signal itself!), and the other by a combination of the clock +and \textsc{Rwt} (we don't want to touch the wires when reading!). + +\subsubsection{Details} +My write block was not significantly different from the original design. 
Under the assumption +that data arrives first, I placed the transistors attached to \textsc{Din} and $\overline{\textsc{Din}}$ +close to \textsc{Gnd}, each followed by a transistor attached to the ``write'' signal. +I also configured the write block to only precharge when the clock is low. + +I experimented with making the write block pull wires up when writing (during high clock). However, +I did not find this to be of significant use. Since the wires are initially precharged, +there is no more time spent on charging them up; furthermore, the memory cell being written to +does not have enough ``strength'' to pull the wire down enough. + +A curiosity of this design is that reads didn't seem to work with long clock periods. When enough +time is spent reading the wires, the memory cell in question is able to gradually exhaust the amount +of charge on one of these wires. Since the original, \textsc{Nand}-based sense amplifier required +all inputs to be high to properly function, this led to it eventually ``flipping'' and producing +the wrong output. This was only an issue above $5\textit{ns}$, and only with the original sense amplifier +design, though. + +One thing to note about the write block is that its \textbf{clock input is deliberately delayed} compared +to the ``actual'' clock. This is because of an issue with \textsc{Din}. Since this +input is behind a latch, it takes around $300\textit{ps}$ to arrive after the rising clock +edge. If the previous value of \textsc{Din} was different than its current one, the write +block will start writing the wrong value. This will typically mean that the block cannot properly +perform the write. The delay on the clock input serves to mitigate this issue, by giving more +time for \textsc{Din} to settle before starting to write. To compensate for this delay, I sized +the write block's pull-down transistors quite large ($100\lambda$), so that they can pull +the wire down, even starting $300\textit{ps}$ into the cycle. 
This is why the ``clock'' input +in my diagrams is colored black, unlike every other clocked component. The delay is achieved +by 6 sequenced inverters, two of which are sized 10x larger than the rest. + +\pagebreak +\subsection{Memory Cell} +\subsubsection{In My Own Words} +The memory cell consists of two cross-coupled inverters whose outputs +are disconnected from the bitlines by two additional nMOS transistors. When disconnected, +this cell reliably holds its value; one inverter's output turns off the other, and symmetrically, +the ``off'' output of that other inverter keeps the first one on. However, this cell is pretty +small; all of its transistors have size $5\lambda$, which is the smallest size that can be properly +connected with a standard $2\lambda\times2\lambda$ via. Thus, when the ``write line'' (signal +connected to the gates of the two outside transistors) is asserted, the charge from the +surrounding bitlines can easily overpower the cell, causing it to switch to a different value. + +\subsubsection{Details} +There are a few notable things about my cell design. Even though it was recommended that we only +use metals one and two for the internal wiring, I went up to metal three for cross-connecting +the two internal inverters. This was the only way I found to keep the height of the cell to a +minimum. This limited my routing options somewhat; to compensate, I also used metal three for +the vertical wires, \textsc{Bt} and \textsc{Bf}. This allowed me to use metal four for the +\textsc{Wl} (access) signal. Since this was the only use of metal four, I had enough free +room to route three additional \textsc{Wl} signals to the remaining three columns. + +My general principle for designing the layout was that, in an 8-bit, 4-column design, \textbf{a single +unit of height costs as much as 64 units of width}. Thus, I was fairly liberal with my layout's +width, but made sure to minimize the height of the design. 
The most significant bottleneck +was the gate oxide ``poking out'' of the ends of the design. In total, I was able to achieve +a height of $30\lambda$ when arrayed. + +Other designs with smaller height were possible, but I found them undesirable. For instance, +Reed's now-famous design used a significant amount of high-level metals to achieve its tiny, +almost square area. This, however, makes routing \textsc{Wl} signals fairly complicated. They either +need to go to yet another layer of metal, or the decoder needs to be split into 4 pieces. The former +is undesirable as per the requirements for this assignment; the latter incurs the cost of additional +decoder hardware between columns, thereby significantly increasing the wire length and signal +delays. Since delays incurred by the flip flops and other signals are already becoming +a significant factor in my design, I thought it would be best to avoid such delays. + +Other ideas I am aware of include putting \textit{all} the transistors in a single, horizontal line. +While this certainly succeeds at reducing the height, it incurs all the same issues described +above - it becomes nigh impossible to wire further \textsc{Wl} lines through each column, +unless the decoder is split into bits, in which case the width of the entire assembly drastically increases, +slowing down all signals. + \end{document}