Update report.
parent
6b963c967b
commit
64ee80be63
206
final/report.tex
|
@ -2,11 +2,23 @@
|
|||
\usepackage[margin=1in]{geometry}
|
||||
\usepackage{graphicx}
|
||||
\usepackage{amsmath}
|
||||
\usepackage{hyperref}
|
||||
\usepackage{xcolor}
|
||||
\definecolor{link}{HTML}{006275}
|
||||
\hypersetup{
|
||||
colorlinks,
|
||||
citecolor=black,
|
||||
filecolor=black,
|
||||
linkcolor=link,
|
||||
urlcolor=black
|
||||
}
|
||||
\title{Final Project Report}
|
||||
\author{Danila Fedorin}
|
||||
\begin{document}
|
||||
\maketitle
|
||||
\section*{General Design and Considerations}
|
||||
\tableofcontents
|
||||
\pagebreak
|
||||
\section{General Design and Considerations}
|
||||
The goal of this assignment was to create a 256-byte SRAM memory unit. In order
|
||||
to minimize wire delays, I chose to split each bit into \textbf{4 columns of 64 SRAM cells
|
||||
each}. This was motivated by the following factors:
|
||||
|
@ -18,7 +30,7 @@ each}. This was motivated by the following factors:
|
|||
the decision to shrink the columns as much as possible. However...
|
||||
\item \emph{Smaller} columns became a routing challenge. Even with a 4-column split,
|
||||
to properly connect each cell of the SRAM column, the SRAM cells themselves need
|
||||
to accomodate an additional three \textsc{Wl} lines. Due to the pitch requirements
|
||||
to accommodate an additional three \textsc{Wl} lines. Due to the pitch requirements
|
||||
on metals three and four, this is the upper limit (for reasonably sized cells).
|
||||
Alternatives included splitting the decoder into pieces, but for large numbers
|
||||
of columns, this meant that the decoder signal traveled through significant amounts
|
||||
|
@ -53,35 +65,189 @@ capacitance $\frac{C}{4}$. Between each fragment, I added the aforementioned $5\
|
|||
transistors, as well as 16 always-off $5\lambda$ transistors, which simulated the remaining memory cells.
|
||||
I also placed \textsc{Din}, \textsc{Ad0}, and \textsc{Rwt} behind the default-sized flip-flops
|
||||
attached to the clock to simulate something like a pipeline stage. My overall design is shown
|
||||
in figure \ref{fig:top-design-sim}.
|
||||
in Figure \ref{fig:top-design-sim}.
|
||||
|
||||
\pagebreak
|
||||
\section*{Performance Results}
|
||||
I made three measurements of my performance.
|
||||
\begin{figure}[h]
|
||||
\centering
|
||||
\includegraphics[width=\linewidth]{toplevel_design.png}
|
||||
\caption{Top-level design for a single bit.}
|
||||
\label{fig:top-design}
|
||||
\end{figure}
|
||||
|
||||
\begin{itemize}
|
||||
\item Without flip-flopping my inputs and outputs, I was able to clock my design around
|
||||
950\textit{ps}.
|
||||
\item With flip-flops on my inputs (but not on my output), I was able to clock my design
|
||||
around 1.24\textit{ns}. However, at this delay, the output of the gate came in very close to
|
||||
the falling edge of the clock.
|
||||
\item With flip-flops on my inputs and my outputs, I was able to clock my design at 2.6\textit{ns}.
|
||||
This significant delay was to allow enough setup time for the flip flop.
|
||||
\end{itemize}
|
||||
My SRAM cell ended up being $30\lambda$ units tall when arrayed. With
|
||||
a total of 64 cells in a single column, this led to a wire length of $1920\lambda$.
|
||||
However, since my write block was now included in the column, I added another $300\lambda$
|
||||
of length to this number, to a total of roughly $2200\lambda$.
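
As a quick check of the arithmetic above (the symbol $L_{\text{col}}$ is introduced here only for illustration):
\begin{align*}
L_{\text{col}} &= 64 \times 30\lambda + 300\lambda \\
               &= 1920\lambda + 300\lambda = 2220\lambda \approx 2200\lambda.
\end{align*}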
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
\includegraphics[width=0.6\linewidth]{toplevel.png}
|
||||
\caption{Architecture of top-level simulation.}
|
||||
\label{fig:top-design-sim}
|
||||
\end{figure}
|
||||
|
||||
\pagebreak
|
||||
\section{Performance Results}
|
||||
I was able to clock my design at 1.38ns. There is a caveat to this clock speed: my \textsc{Bt} and
|
||||
\textsc{Bf} lines are not pulled all the way to \textsc{Gnd} when they are written low. This
|
||||
doesn't seem to be a problem -- it's sufficient to flip the furthest cell in the design in
|
||||
every situation I've tested. However, from what I hear, this was discouraged during one of
|
||||
the office hours (which I was unable to attend). With the constraint of pulling the wires
|
||||
all the way down, my design can operate at around 2.1ns.
|
||||
%
|
||||
Two factors lead to these upper limits.
|
||||
%
|
||||
\begin{itemize}
|
||||
\item \textit{Write capacitance} makes it increasingly difficult to overwrite the value
|
||||
in the cell. Clocking my design any faster than 950\textit{ps} or 1.24\textit{ns}
|
||||
(depending on the case) leads my cell to \textit{almost} flip, but not resolve correctly.
|
||||
in the cell. Clocking my design any faster leads my cell to \textit{almost} flip, but not resolve correctly.
|
||||
I have found no way to work around these limits once my wire was properly sized, and my
|
||||
write block was placed in the middle of the column.
|
||||
\item \textit{Flop, decoder, and read delays} are the major limitation when both the inputs
|
||||
and the outputs of the circuit are connected to flip flops. Even though the output
|
||||
of the read block is correct, it doesn't arrive fast enough to be captured by the next cycle.
|
||||
Furthermore, in some cases, the signal to open a memory cell arrives later than the
|
||||
\textsc{Trig} signal for the senseamp, making it read too early and thus output the incorrect value.
|
||||
and the outputs of the circuit are connected to flip flops. The most significant
|
||||
instance of this issue is my write block: both \textsc{Din} and \textsc{Rwt} arrive
|
||||
around $300\textit{ps}$ into the cycle. This means two things: a) if the previous
|
||||
operation was ``read'', then the block does not start writing until halfway into
|
||||
the positive phase of the clock and b) if the data being written is different
|
||||
from the data in the previous cycle, for half the time, the write block will write
|
||||
the old data (until the flip flop switches).
|
||||
\end{itemize}
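
To make the timing budget in the second item concrete, here is a rough estimate. It assumes a 50\% duty-cycle clock (an assumption of this sketch; the duty cycle is not stated above), ignores any deliberate delay on the write block's clock input, and uses the $300\textit{ps}$ arrival time of \textsc{Din} and \textsc{Rwt}. The symbols $T$, $t_{\text{arr}}$, and $t_{\text{write}}$ are introduced only for this illustration.
\begin{align*}
t_{\text{write}} &\approx \frac{T}{2} - t_{\text{arr}} \\
T = 1.38\,\text{ns}: \quad t_{\text{write}} &\approx 690\,\text{ps} - 300\,\text{ps} = 390\,\text{ps} \\
T = 2.1\,\text{ns}: \quad t_{\text{write}} &\approx 1050\,\text{ps} - 300\,\text{ps} = 750\,\text{ps}
\end{align*}
Under these assumptions, at $T = 1.38\,\text{ns}$ the inputs arrive a little under halfway into the positive phase, which is consistent with the ``halfway'' description above.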
|
||||
|
||||
\section{Components}
|
||||
\subsection{Decoder}
|
||||
\subsubsection{In My Own Words}
|
||||
The decoder in this design is the exact same one we were given in lecture.
|
||||
It computes all combinations of two consecutive bits using a \textsc{Nand} gate; for
|
||||
each combination, there are 4 adjacent two-bit combinations,
|
||||
leading to 4 \textsc{Nor} gates connected to each \textsc{Nand}. There are now
|
||||
16 combinations of 4 adjacent bits; each combination of the lower 4 bits
|
||||
needs to be compared with each of the 16 combinations of the upper 4 bits,
|
||||
leading to 16 \textsc{Nand} gates connected to each \textsc{Nor}. This
|
||||
results in 256 unique \textsc{Wl} wires. Finally, these need to be attached
|
||||
to the clock, so that cells aren't open randomly. This is done using an \textsc{And}
|
||||
gate (a \textsc{Nand} followed by an inverter).
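
The counting in this description can be written out as follows. This is a sketch of the arithmetic only; gate polarities and the exact placement of inverters are not modeled here.
\begin{align*}
\text{combinations per 2-bit group} &= 2^2 = 4 \\
\text{combinations per 4-bit group} &= 4 \times 4 = 16 \\
\text{\textsc{Wl} wires} &= 16 \times 16 = 256 = 2^8
\end{align*}
This matches the $4 \times 64 = 256$ cells per bit that follow from the 4-column split.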
|
||||
|
||||
% TODO: Domino logic
|
||||
% TODO: More inverters?
|
||||
|
||||
\pagebreak
|
||||
\subsection{Read Block}
|
||||
\subsubsection{In My Own Words}
|
||||
The read block uses a \emph{sense amplifier} to detect small changes on the bitlines,
|
||||
which it then translates into a zero-or-one output. The changes in the wires are below
|
||||
the threshold of what could be considered digital logic; all the sense amplifier
|
||||
designs I've come across rely on metastability, a state in which even tiny fluctuations
|
||||
can significantly alter the outcome\footnote{My favorite analogy is a pencil balanced on its tip.
|
||||
Technically, it's stable; however, even a small air current -- one you can't feel -- can knock it over.}.
|
||||
The \textsc{Trigger} signal, which depends on the clock and \textsc{Rwt}, puts the amplifier
|
||||
into a metastable state. From there, the connected bitlines cause it to resolve one way
|
||||
or another. Finally, if one of the wires resolves, a value is written into the keeper circuit
|
||||
at the end, which ensures that the value that was read continues to be expressed until
|
||||
the next read operation.
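
A common first-order model of this behavior treats the voltage difference between the two internal nodes of the amplifier as growing exponentially once it is triggered. This is a textbook approximation for a cross-coupled latch, not something derived from the simulations in this report; the symbols $\Delta V_0$, $\tau$, and $V_{\text{logic}}$ are assumptions of this sketch.
\begin{align*}
\Delta V(t) &\approx \Delta V_0 \, e^{t/\tau} \\
t_{\text{resolve}} &\approx \tau \ln\frac{V_{\text{logic}}}{\Delta V_0}
\end{align*}
Here $\Delta V_0$ is the small initial difference supplied by the bitlines, $\tau$ is the regeneration time constant of the latch, and $V_{\text{logic}}$ is the difference needed for a valid digital level. The logarithm suggests why even a very small bitline difference can resolve in a reasonable time, and why a smaller initial difference takes longer.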
|
||||
|
||||
\subsubsection{Details}
|
||||
For my read block, I used a different sense amplifier design. The design based
|
||||
on the two \textsc{Nand3} gates was easy to understand and build, but was less
|
||||
sensitive, and tended to behave strangely under pressure. This led to difficulties
|
||||
with debugging (the output would, for instance, flip completely at certain
|
||||
wire widths), and was seemingly random. Instead, I used
|
||||
an \textbf{improved latch-based sense amplifier design} from . % TODO: cite
|
||||
The design I used is shown in Figure \ref{fig:latch-amp}.
|
||||
I left it sized at $40\lambda$, since larger amplifiers seem to take longer
|
||||
to trigger and exit metastability.
|
||||
|
||||
The read block is not a particular bottleneck in this design. The main concern
|
||||
was to handle the \textbf{``false start'' activation of the write block}. Because the \textsc{Rwt}
|
||||
input is behind a latch, it takes nearly $300\textit{ps}$ to pull up or down after
|
||||
the initial clock. Thus, if a write occurred during a previous cycle, the write block will
|
||||
activate for a short period of time before the read block does. The memory cell
|
||||
will overpower this initial misfire\footnote{According to my additional simulations, this is true even when the memory cell is close to the write block.}, but in this case, both \textsc{Bt} and \textsc{Bf}
|
||||
will be below \textsc{Vdd}. The ``improved sense amplifier'' seems to handle this
|
||||
case better than the one based on two \textsc{Nand} gates. I think that both Reed and
|
||||
Graham experienced this occurrence -- they seemed to post very similar waveforms
|
||||
to the community Discord group chat.
|
||||
|
||||
\pagebreak
|
||||
\subsection{Write Block}
|
||||
\subsubsection{In My Own Words}
|
||||
The write block converts a ``data in'', or \textsc{Din}, signal
|
||||
into a one-hot representation. It does so by pulling one of the bitlines high, and the other
|
||||
low. Once the memory cell connects to the bitlines, it takes on the charge provided by the
|
||||
write block, and is therefore overwritten. In my design, two PMOS transistors for each bitline
|
||||
are used to pull down; one of the transistors is triggered by the \textsc{Din} signal (which wire
|
||||
we pull down depends on the signal itself!), and the other by a combination of the clock
|
||||
and \textsc{Rwt} (we don't want to touch the wires when reading!).
|
||||
|
||||
\subsubsection{Details}
|
||||
My write block was not significantly different from the original design. Under the assumption
|
||||
that data arrives first, I placed the transistors attached to \textsc{Din} and $\overline{\textsc{Din}}$
|
||||
close to \textsc{Gnd}, each followed by a transistor attached to the ``write'' signal.
|
||||
I also configured the write block to only precharge when the clock is low.
|
||||
|
||||
I experimented with making the write block pull wires up when writing (during high clock). However,
|
||||
I did not find this to be of significant use. Since the wires are initially precharged,
|
||||
there is no more time spent on charging them up; furthermore, the memory cell being written to
|
||||
does not have enough ``strength'' to pull the wire down enough.
|
||||
|
||||
A curiosity of this design is that reads didn't seem to work with long clock periods. When enough
|
||||
time is spent reading the wires, the memory cell in question is able to gradually exhaust the amount
|
||||
of charge on one of these wires. Since the original, \textsc{Nand}-based sense amplifier required
|
||||
all inputs to be high to properly function, this led to it eventually ``flipping'' and producing
|
||||
the wrong output. This was only an issue above $5\textit{ns}$, and only with the original sense amplifier
|
||||
design, though.
|
||||
|
||||
One thing to note about the write block is that its \textbf{clock input is deliberately delayed} compared
|
||||
to the ``actual'' clock. This is because of an issue with \textsc{Din}. Since this
|
||||
input is behind a latch, it takes around $300\textit{ps}$ to arrive after the rising clock
|
||||
edge. If the previous value of \textsc{Din} was different from its current one, the write
|
||||
block will start writing the wrong value. This will typically mean that the block cannot properly
|
||||
perform the write. The delay on the clock input serves to mitigate this issue, by giving more
|
||||
time for \textsc{Din} to settle before starting to write. To compensate for this delay, I sized
|
||||
the write block's pull-down transistors quite large ($100\lambda$), so that they can pull
|
||||
the wire down, even starting $300\textit{ps}$ into the cycle. This is why the ``clock'' input
|
||||
in my diagrams is colored black, unlike every other clocked component. The delay is achieved
|
||||
by 6 sequenced inverters, two of which are sized 10x larger than the rest.
|
||||
|
||||
\pagebreak
|
||||
\subsection{Memory Cell}
|
||||
\subsubsection{In My Own Words}
|
||||
The memory cell consists of two cross coupled inverters whose outputs
|
||||
are disconnected from the bitlines by two additional nMOS transistors. When disconnected,
|
||||
this cell reliably holds its value; one inverter's output turns off the other, and symmetrically,
|
||||
the ``off'' output of that other inverter keeps the first one on. However, this cell is pretty
|
||||
small; all of its transistors have size $5\lambda$, which is the smallest size that can be properly
|
||||
connected with a standard $2\lambda\times2\lambda$ via. Thus, when the ``write line'' (signal
|
||||
connected to the gates of the two outside transistors) is asserted, the charge from the
|
||||
surrounding bitlines can easily overpower the cell, causing it to switch to a different value.
|
||||
|
||||
\subsubsection{Details}
|
||||
There are a few notable things about my cell design. Even though it was recommended that we only
|
||||
use metals one and two for the internal wiring, I went up to metal three for cross-connecting
|
||||
the two internal inverters. This was the only way I found to keep the height of the cell to
|
||||
a minimum. This limited my routing options somewhat; to compensate, I also used metal three for
|
||||
the vertical wires, \textsc{Bt} and \textsc{Bf}. This allowed me to use metal four for the
|
||||
\textsc{Wl} (access) signal. Since this was the only use of metal four, I had enough free
|
||||
room to route three additional \textsc{Wl} signals to the remaining three columns.
|
||||
|
||||
My general principle for designing the layout was that, in an 8-bit, 4-column design, \textbf{a single
|
||||
unit of height costs as much as 64 units of width}. Thus, I was fairly liberal with my layout's
|
||||
width, but made sure to minimize the height of the design. The most significant bottleneck
|
||||
was the gate oxide ``poking out'' of the ends of the design. In total, I was able to achieve
|
||||
a height of $30\lambda$ when arrayed.
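
One way to read this rule of thumb is in terms of bitline length: with 64 cells stacked per column, the bitline grows with the cell height $h$. The symbols $h$ and $L_{\text{bit}}$ are introduced here only for illustration, and the length of the write block is left out of this sketch.
\begin{align*}
L_{\text{bit}}(h) &= 64\,h \\
L_{\text{bit}}(h + 1\lambda) - L_{\text{bit}}(h) &= 64\lambda
\end{align*}
Each extra $\lambda$ of cell height therefore adds $64\lambda$ of bitline, whereas extra cell width does not lengthen the bitline at all in this simplified view.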
|
||||
|
||||
Other designs with smaller height were possible, but I found them undesirable. For instance,
|
||||
Reed's now-famous design used a significant amount of high-level metals to achieve its tiny,
|
||||
almost square area. This, however, makes routing \textsc{Wl} signals fairly complicated. They either
|
||||
need to go to yet another layer of metal, or the decoder needs to be split into 4 pieces. The former
|
||||
is undesirable as per the requirements for this assignment; the latter incurs the cost of additional
|
||||
decoder hardware between columns, thereby significantly increasing the wire length and signal
|
||||
delays. Since delays incurred by the flip flops and other signals are already becoming
|
||||
a significant factor in my design, I thought it would be best to avoid such delays.
|
||||
|
||||
Other ideas I am aware of include putting \textit{all} the transistors in a single, horizontal line.
|
||||
While this certainly succeeds at reducing the height, it incurs all the same issues described
|
||||
above -- it becomes nigh impossible to wire further \textsc{Wl} lines through each column,
|
||||
unless the decoder is split into bits, in which case the width of the entire assembly drastically increases,
|
||||
slowing down all signals.
|
||||
|
||||
\end{document}
|
||||
|
|