# 1. Memory Hierarchy

Memory is categorized into volatile and nonvolatile memory; volatile memory requires the system to remain powered on to retain its contents.

Furthermore, a typical computer system provides a hierarchy of different types of memory for data storage.

# Different levels of the memory hierarchy

1. Cache (MB): Cache is the fastest accessible memory of a computer system. Its access speed is on the order of a few nanoseconds. It is volatile and expensive, so typical cache sizes are on the order of megabytes.

2. Main memory (GB): Main memory is arguably the most used memory. When discussing computer algorithms such as quick sort, balanced binary search trees, or the fast Fourier transform, one typically assumes that the algorithm operates on data stored in main memory. Main memory is reasonably fast, with access speed around 100 nanoseconds. It also offers larger capacity at a lower cost; typical main memory sizes are on the order of tens of gigabytes. However, main memory is volatile.

3. Secondary storage (TB): Secondary storage refers to nonvolatile data storage units that are external to the computer system. Hard drives and solid state drives are examples of secondary storage. They offer very large storage capacity, on the order of terabytes, at very low cost. Therefore, database servers typically have an array of secondary storage devices, with data stored distributed and replicated across these devices. Despite continuous improvements in hard drive access speed, secondary storage devices are several orders of magnitude slower than main memory. Modern hard drives have access times on the order of a few milliseconds.

4. Tertiary storage (PB): Tertiary storage refers to storage designed for data backup. Examples of tertiary storage devices are tape drives and robot-driven disk arrays. They are capable of petabyte-range storage, but have very slow access, with data access latency measured in seconds or minutes.
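To put these access times in perspective, the following back-of-the-envelope sketch compares adjacent levels of the hierarchy. The 5 ns cache figure and 4 ms disk figure are assumed representative values picked from the ranges above:

```java
public class LatencyGap {
    // Representative access times from the list above; the 5 ns cache
    // figure is an assumed stand-in for "a few nanoseconds".
    static final double CACHE_NS = 5;
    static final double MAIN_MEMORY_NS = 100;
    static final double DISK_NS = 4_000_000; // 4 ms

    // How many accesses of the faster level fit into one access of the slower.
    static double ratio(double slowNs, double fastNs) {
        return slowNs / fastNs;
    }

    public static void main(String[] args) {
        System.out.printf("one disk access ~ %.0f main memory accesses%n",
                ratio(DISK_NS, MAIN_MEMORY_NS));
        System.out.printf("one main memory access ~ %.0f cache accesses%n",
                ratio(MAIN_MEMORY_NS, CACHE_NS));
    }
}
```

With these numbers, a single disk access costs as much as tens of thousands of main-memory accesses, which is why the rest of these notes focus on minimizing disk I/O.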

In this course, we are mainly concerned with the main memory and the secondary storage of the memory hierarchy.

# 2. Data movement

In order to perform computation, data must travel from secondary storage to the CPU. Due to the vast differences in access speed and storage capacity across the hierarchy, the operating system performs a series of sophisticated steps, at both the hardware and software levels, to move data efficiently.

# 2.1. Move data to and from secondary storage

The hard drive has a local data cache. Because hard drive access is so much slower, disk accesses are normally done asynchronously at the hardware level. The hard drive is instructed to fetch or save a data block of a certain size at a physical location. The hard drive controller acknowledges the instruction right away, takes some time to populate the disk cache with the requested data block, and then, at a later time, raises an interrupt to notify the interested party that the data access is complete. This model of interrupt-based data transfer is asynchronous.
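This asynchronous model can be sketched with a future that completes when the controller finishes. The readBlock method and its 5 ms delay below are hypothetical stand-ins for the controller, not a real disk API:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncDiskRead {
    // Hypothetical sketch: readBlock returns a future immediately; its
    // completion stands in for the controller's "done" interrupt.
    static CompletableFuture<byte[]> readBlock(long location, int size) {
        return CompletableFuture.supplyAsync(() -> {
            // Simulate the controller populating the disk cache.
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return new byte[size];
        });
    }

    public static void main(String[] args) {
        CompletableFuture<byte[]> pending = readBlock(0L, 4096);
        // The request was acknowledged right away; the CPU is free
        // to do other work until the data is actually needed.
        System.out.println("request acknowledged, doing other work...");
        byte[] data = pending.join(); // block here until "interrupt" fires
        System.out.println("read complete: " + data.length + " bytes");
    }
}
```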

The data transfer speed (from memory to hard drive or vice versa) depends on:

1. the contiguity of the accessed physical locations on the disk
2. the size of the accessed data per request

# 2.2. Performance characteristics

The mechanical design of the hard drive naturally lends itself to superior sequential access.

A hard drive consists of an array of rotating disks (platters). Each disk has a head, driven by a mechanical arm, which performs bitwise reads and writes. If a sequential data block is accessed, all bits can be read or written during a single rotation.

However, random access to data on the hard drive may require multiple rotations for the head to reach all the regions of interest on the disk.

Seek time measures the average time it takes the head to travel to the region of interest on the disk. Typically, the seek time is a few milliseconds.

| Rotational speed (rpm) | Seek time (ms) |
| --- | --- |
| 15,000 | 2 |
| 10,000 | 3 |
| 7,200 | 4.16 |

Source: Wikipedia
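Notably, these times coincide with the average rotational latency: a random access waits on average half a rotation, i.e. 60000 / (2 × rpm) milliseconds. A quick sketch to verify (the class and method names are mine):

```java
public class RotationalLatency {
    // Average rotational latency: half of a full rotation, in milliseconds.
    static double avgLatencyMs(int rpm) {
        double msPerRotation = 60_000.0 / rpm;
        return msPerRotation / 2.0;
    }

    public static void main(String[] args) {
        for (int rpm : new int[]{15_000, 10_000, 7_200}) {
            System.out.printf("%d rpm -> %.2f ms%n", rpm, avgLatencyMs(rpm));
        }
    }
}
```

Running this reproduces the values in the table: 2 ms, 3 ms, and about 4.17 ms.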

Data transfer rate is the steady-state speed at which the head can read or write data between the disk buffer and the disk. The data transfer rate can reach on the order of 1 gigabit per second.

# 2.3. Programming exercise

We will experiment with the effects of disk buffer and seek time on the efficiency of data transfer from disk to main memory.

We model a program as a sequence of read requests:

$$r_1, r_2, \dots, r_i, ...$$

Each request $r_i$ can be characterized by

• $L(r_i)$: the start location of the data to be read.
• $S(r_i)$: the size of the data to be read.
• $t(r_i)$: the duration of $r_i$.

The overall efficiency is the data transfer rate, which can be defined as: $$R_j = \frac {\mathrm{total\ data\ transferred}} {\mathrm{total\ time}} = \frac {\sum_{i\leq j} S(r_i)} {\sum_{i\leq j} t(r_i)}$$

We are interested in the steady-state data transfer rate: $$R = \lim_{j\to\infty} R_j$$

Let's consider the case where each read request has a constant size: $$\forall i,\ S(r_i) = B$$

It turns out that $R = R(B)$; that is, the transfer rate is determined by the block size of each read request. Typically, $R(B)$ grows with $B$: when data is requested in large enough blocks, the per-request overhead is amortized and we see a drastic improvement in the data transfer rate.
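One way to see why is a simple cost model in which every request pays a fixed average seek time and then streams its block at the drive's peak rate. The 4 ms seek and 100 MB/s peak below are assumed illustrative numbers, not measurements:

```java
public class TransferRateModel {
    // Model assumptions (illustrative numbers): each request pays a fixed
    // average seek time, then streams its B bytes at the drive's peak rate.
    static final double SEEK_SEC = 0.004;           // 4 ms average seek
    static final double PEAK_BYTES_PER_SEC = 100e6; // 100 MB/s streaming

    // Steady-state rate R(B) = B / (seek + B / peak).
    static double rate(double blockBytes) {
        return blockBytes / (SEEK_SEC + blockBytes / PEAK_BYTES_PER_SEC);
    }

    public static void main(String[] args) {
        for (double b : new double[]{4e3, 64e3, 1e6, 16e6}) {
            System.out.printf("B = %8.0f bytes -> R = %6.2f MB/s%n",
                    b, rate(b) / 1e6);
        }
    }
}
```

Under these assumptions, 4 KB requests spend nearly all their time seeking (about 1 MB/s effective), while 16 MB requests approach the peak streaming rate.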

# 3. Buffer Manager

The buffer manager is another technique to further combat disk latency. Given that data should be transferred in blocks, all disk I/O issued by the database is done in units of pages.

A buffer manager allows a section of the main memory, known as the buffer pool, to act as a cache in order to reduce the necessary disk I/O activities.

The buffer pool is partitioned into frames. Each frame is the size of a page. The buffer manager follows a classical cache manager algorithm to perform lazy disk I/O. Namely, all disk read and write requests are first performed as frame reads and writes in the buffer pool. Data is transferred between the buffer pool and disk only when a requested page is not cached by the buffer pool, or new pages need to be cached but the buffer pool is full.

```java
class BufferManager {
    BufferPool pool = makePool();
    DiskManager disk = makeDisk();

    Page read(Address addr) {
        if (pool.contains(addr)) {
            // Cache hit: no disk I/O needed.
            return pool.get(addr);
        } else {
            // Cache miss: load from disk, then cache (evicting if full).
            Page p = disk.read(addr);
            tryEvictPage();
            pool.cache(p);
            return p;
        }
    }

    void write(Page p) {
        if (pool.contains(p.address())) {
            // Update the cached copy in place and mark it dirty.
            Page q = pool.get(p.address());
            q.update(p);
            q.setDirty();
        } else {
            tryEvictPage();
            pool.cache(p);
            p.setDirty();
        }
    }

    void tryEvictPage() {
        if (pool.isFull()) {
            Page q = pool.getNextEviction();
            if (q.isDirty())
                disk.write(q); // flush dirty pages before eviction
            pool.evict(q);
        }
    }
}
```


pool is an object that maintains the frames. It follows an eviction rule to select the next page to be evicted via pool.getNextEviction().
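The eviction rule itself is left abstract here. A common choice is least-recently-used (LRU). As a side note, an LRU pool can be sketched in Java with an access-ordered LinkedHashMap; this LruPool class is illustrative only and is not the BufferPool interface used above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: a LinkedHashMap in access order moves each
// touched entry to the back, so the front is the least recently used.
class LruPool<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruPool(int capacity) {
        super(16, 0.75f, true); // accessOrder = true enables LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the LRU entry when over capacity
    }
}
```

For example, with capacity 2, inserting pages 1 and 2, touching page 1, and then inserting page 3 evicts page 2, the least recently used.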

read loads a disk page by its address. If the page address addr is already in the buffer pool, we simply return the cached page. Otherwise, we must perform a disk read. After the page is loaded, we cache it. Eviction may become necessary if the buffer pool is full. When a page is evicted, we must check whether it is dirty. A page is dirty if its content has been altered since it was loaded from disk. Dirty pages must be written back to disk when evicted.

write writes a page. The write request is deferred: instead of performing a disk write, the buffer manager caches the page (with possible eviction). It is understood that the cached page will eventually be selected by the eviction rule. Since it is dirty, the eventual eviction will write the page to disk.

tryEvictPage removes a page q, chosen by the eviction rule (getNextEviction), from the pool. If q is dirty, it is written to disk before being evicted.