\documentclass{article}
\usepackage[margin=1in]{geometry}
\usepackage{graphicx}
\usepackage{amsmath}
\title{Final Project Report}
\author{Danila Fedorin}
\begin{document}
\maketitle
\section*{General Design and Considerations}
The goal of this assignment was to create a 256-byte SRAM memory unit. To
minimize wire delays, I chose to split each bit into \textbf{4 columns of 64 SRAM cells
each}. This was motivated by the following factors:
\begin{itemize}
\item \emph{Larger} columns were eliminated due to the high cost of interconnect.
Even large write blocks were not able to charge the ``far ends'' of the wire
at shorter clock periods. Increasing wire width helped little: although resistance
decreased, the capacitance increased, so the net gain was small. Thus, I made
the decision to shrink the columns as much as possible. However...
\item \emph{Smaller} columns became a routing challenge. Even with a 4-column split,
to properly connect each cell of the SRAM column, the SRAM cells themselves need
to accommodate an additional three \textsc{Wl} lines. Due to the pitch requirements
on metals three and four, this is the upper limit (for reasonably sized cells).
Alternatives included splitting the decoder into pieces, but for large numbers
of columns, this meant that the decoder signal traveled through significant amounts
of wire, and was thus slower.
\end{itemize}
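
The small gain from widening the wire can be illustrated with a first-order model.
This is only a sketch under assumptions that were not measured in this project: the symbols
$r_s$ (sheet resistance), $c_a$ (area capacitance per unit area) and $c_f$ (fringe capacitance
per unit length) are hypothetical. For a wire of length $\ell$ and width $w$,
\begin{align*}
R_{\text{wire}} &= r_s \frac{\ell}{w}, &
C_{\text{wire}} &= \ell \left( c_a w + c_f \right), \\
R_{\text{wire}} C_{\text{wire}} &= r_s \ell^2 \left( c_a + \frac{c_f}{w} \right).
\end{align*}
Under these assumptions, a wider wire only reduces the fringe term $c_f / w$. The area term is
unchanged and the product still grows with $\ell^2$, which is consistent with preferring shorter columns
over wider wires.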

For each of the four 64-cell columns, I attached separate read and write blocks. However,
my placement of the write block was unorthodox. I observed that, although the write block
is perfectly capable of quickly manipulating the bitlines close to it, the changes
take too long to propagate to the far end of the wire. I addressed this with two separate
changes:

\begin{itemize}
\item I added \textbf{additional precharge transistors} along the column, a total of 4.
Each was sized at $5\lambda$, much like the SRAM transistors themselves. When the clock
was low, these PMOS transistors turned on and helped precharge the bitlines faster.
Doing so helped avoid hysteresis. However, this did not help with writing while the clock was high,
so...
\item I also \textbf{placed the write block in the middle of the column}. This increased the distance
between my farthest SRAM cell and the read block (since the write block now contributed to wire
length). However, this made it significantly easier to drive the entire length of the wire,
which was my main bottleneck. This was because the maximum distance from the write
block to any cell in the column was halved. Since my read circuit continued to work in this
configuration, I did not place it in the middle of the column, as that would needlessly
increase the length of the wires.
\end{itemize}
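
A rough way to see why the midpoint placement helps is the Elmore approximation for a uniform
distributed RC wire. This is a sketch, not a measured result: it assumes the bitline has total
resistance $R$ and total capacitance $C$ over the full column, and it ignores the driver resistance
and the transistor loads. The delay to the far end of a segment covering a fraction $\alpha$ of the column is
\begin{equation*}
t(\alpha) \approx \frac{(\alpha R)(\alpha C)}{2} = \alpha^2 \frac{RC}{2},
\qquad\text{so}\qquad
t\!\left(\tfrac{1}{2}\right) \approx \frac{RC}{8} = \frac{1}{4}\, t(1).
\end{equation*}
Under these assumptions, halving the maximum distance cuts the wire delay term to about a quarter,
even though the write block still has to charge the same total capacitance across both halves.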

This led to the configuration shown in Figure~\ref{fig:top-design}. To simulate this design, I placed
a memory cell at the very top of my column, which is the farthest spot from both the read and write
circuits. I also split the wire into 4 equally sized fragments, each with resistance $\frac{R}{4}$ and
capacitance $\frac{C}{4}$. Between fragments, I added the aforementioned $5\lambda$ precharge
transistors, as well as 16 always-off $5\lambda$ transistors, which simulated the remaining memory cells.
I also placed \textsc{Din}, \textsc{Ad0}, and \textsc{Rwt} behind the default-sized flip-flops
attached to the clock to simulate something like a pipeline stage. My overall design is shown
in Figure~\ref{fig:top-design-sim}.
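
As a sanity check on the 4-fragment wire model, its Elmore delay can be worked out by hand.
The following is a sketch under simplifying assumptions: each fragment is treated as a series
resistance $\frac{R}{4}$ followed by a capacitance $\frac{C}{4}$ to ground, the wire is driven
from one end by an ideal source, and the precharge and always-off transistor loads are ignored.
The $k$-th capacitor then sees an upstream resistance of $k \cdot \frac{R}{4}$, so
\begin{equation*}
t_{\text{4-seg}} \approx \sum_{k=1}^{4} \left( k \frac{R}{4} \right) \frac{C}{4}
= \frac{RC}{16} \left( 1 + 2 + 3 + 4 \right)
= \frac{5}{8} RC .
\end{equation*}
For $N$ such fragments the same sum gives $\frac{N+1}{2N} RC$, which approaches the distributed-wire
value $\frac{RC}{2}$ as $N$ grows. With $N = 4$ the lumped model is therefore somewhat pessimistic,
which errs on the safe side for a worst-case simulation.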

\pagebreak
\section*{Performance Results}
I measured the performance of my design in three configurations.

\begin{itemize}
\item Without flip-flopping my inputs and outputs, I was able to clock my design with a period of around
950\textit{ps}.
\item With flip-flops on my inputs (but not on my output), I was able to clock my design
with a period of around 1.24\textit{ns}. However, at this period, the output of the gate came in very close to
the falling edge of the clock.
\item With flip-flops on my inputs and my outputs, I was able to clock my design with a period of 2.6\textit{ns}.
This significantly longer period was needed to allow enough setup time for the output flip-flop.
\end{itemize}
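
For reference, and assuming the three figures above are clock periods $T$, the corresponding
clock frequencies $f = 1/T$ are approximately
\begin{align*}
f_{\text{no flops}} &= \frac{1}{950\,\text{ps}} \approx 1.05\,\text{GHz}, \\
f_{\text{input flops}} &= \frac{1}{1.24\,\text{ns}} \approx 806\,\text{MHz}, \\
f_{\text{input and output flops}} &= \frac{1}{2.6\,\text{ns}} \approx 385\,\text{MHz}.
\end{align*}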
%
Two factors lead to these limits on clock speed.
%
\begin{itemize}
\item \textit{Write capacitance} makes it increasingly difficult to overwrite the value
in the cell at shorter clock periods. Clocking my design any faster than 950\textit{ps} or 1.24\textit{ns}
(depending on the case) causes my cell to \textit{almost} flip, but not resolve correctly.
I found no way to work around these limits once my wire was properly sized and my
write block was placed in the middle of the column.
\item \textit{Flop, decoder, and read delays} are the major limitation when both the inputs
and the outputs of the circuit are connected to flip-flops. Even though the output
of the read block is correct, it doesn't arrive early enough to be captured at the next clock edge.
Furthermore, in some cases, the signal to open a memory cell arrives later than the
\textsc{Trig} signal for the sense amplifier, making it read too early and thus output the incorrect value.
\end{itemize}
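
The second limit can be written as the usual flop-to-flop timing constraint. This is only a
sketch: the individual delays were not measured separately in this report, and the symbols
$t_{cq}$ (clock-to-output delay of the input flip-flops), $t_{dec}$ (decoder delay),
$t_{read}$ (bitline and sense amplifier delay), $t_{su}$ (setup time of the output flip-flop),
$t_{\textsc{Wl}}$ (arrival time of the word line), $t_{\textsc{Trig}}$ (arrival time of \textsc{Trig})
and $t_{dev}$ (time for the bitlines to develop a readable difference) are hypothetical names
introduced here.
\begin{align*}
T_{clk} &\geq t_{cq} + t_{dec} + t_{read} + t_{su}, \\
t_{\textsc{Trig}} &\geq t_{\textsc{Wl}} + t_{dev}.
\end{align*}
The first inequality corresponds to the output arriving too late to be captured, and the second
to the sense amplifier firing before the selected cell has driven the bitlines.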

\end{document}