COMP7801 Relational Database

Structure of Relational Databases

Overview

definition
- formally, given sets \(D_1, D_2, ..., D_n\) a relation \(r\) is a subset of \(D_1 \times D_2 \times ... \times D_n\)
- thus a relation is a set of n-tuples (\(a_1, a_2, ..., a_n\)) where each \(a_i \in D_i\)
attribute type
- each attribute of a relation has a name
- the set of allowed values for each attribute is called the domain of the attribute
- examples of simple domain types
  - integer
  - string
  - date
- note
  - in advanced (non-relational) database systems, we may have complex attribute types
  - e.g., in spatial database
    - point
    - polygon
    - poly-line

Reduction of an E-R Schema to Tables

Relation

schema
- \(A_1, A_2, ..., A_n\) are attributes
- \(R = (A_1, A_2, ..., A_n)\) is a relation schema
- \(r(R)\) is a relation on the relation schema \(R\)
  - e.g., customer(Customer-schema)
instance
- the current values (relation instance) of a relation are specified by a table
- an element \(t\) of \(r\) is a tuple, represented by a row in a table

Database

definition
- a database consists of multiple relations which are inter-related

Relational Algebra and SQL

Query languages

definition
- language in which user requests information from the database
categories
- procedural
- non-procedural
pure languages
- relational algebra
- relational calculus
- from underlying basis of query languages that people use

Relational algebra

overview
- procedural language
- 6 basic operators
  - select
  - project
  - union
  - set difference
  - Cartesian product
  - rename
- additional operations
  - natural join
  - division, etc.
- the operators take one or more relations as inputs and give a new relation as a result
- all inputs and outputs are relations - sets of tuples
  - in SQL, the output of an operator is a multiset of tuples
select operation
- notation
  - \(\sigma_p(r)\)
- \(p\) is called the selection predicate
- example
project operation
- notation
  - \(\Pi_{A_1, A_2, ..., A_k}(r)\)
- the result is defined as the relation of k columns obtained by erasing the columns that are not listed
- duplicate rows removed from result, since relations are sets
- example
union operation
- notation
  - \(r\cup s\)
- union must be taken between compatible relations
  - \(r\) and \(s\) must have the same arity
  - attribute domains of \(r\) and \(s\) must be compatible
- example
intersection operation
- notation
  - \(r\cap s\)
- intersection must be taken between compatible relations
  - \(r\) and \(s\) must have the same arity
  - attribute domains of \(r\) and \(s\) must be compatible
- example
difference operation
- notation
  - \(r-s\)
- differences must be taken between compatible relations
  - \(r\) and \(s\) must have the same arity
  - attribute domains of \(r\) and \(s\) must be compatible
- example
Cartesian-product operation
- notation
  - \(r\times s\)
- example
natural join operation
- notation
  - \(r\infty s\)
- example
  - \(R=(A, B, C, D)\)
  - \(S=(E, B, D)\)
  - result schema = \(R=(A, B, C, D, E)\)
  - \(r\infty s\) is defined as
    - \(\Pi_{r.A, r.B, r.C, r.D, s.E}(\sigma_{r.B=s.B \cap r.D=s.D}(r\times s))\)
aggregate functions and operations
- an aggregation function takes a collection of values and returns a single value as a result
  - avg: average value
  - min: minimum value
  - max: maximum value
  - sum: sum of values
  - count: number of values
- aggregate operation in relational algebra
  - notation
    - \(_{G_1, G_2, ..., G_n} g_{F_1(A_1),F_2(A_2), ..., F_n(A_n)}(E)\)
  - \(E\) is any relational-algebra expression
  - \(G_1, G_2, ..., G_n\) is a list of attributes on which to group (can be empty)
  - each \(F_i\) is an aggregate function
  - each \(A_i\) is an attribute name
  - example

SQL

basic structure
- form
  - select \(A_1, A_2, ..., A_n\)
  - from \(r_1, r_2, ..., r_m\)
  - where \(P\)
- equivalent to the relational algebra
  - \(\Pi_{A_1, A_2, ..., A_n}(\sigma_P (r_1\times r_2\times ... \times r_m))\)
- the result of a SQL query is a multiset of tuples
the select clause
- to force the elimination of duplicates, insert the keyword "distinct" after "select"
the where clause
- comparison results can be combined using the logical connectives "and", "or", and "not"
- comparisons can be applied to results of arithmetic expressions
the from clause
- is Cartesian product
- a join can be expressed by a Cartesian product followed by selections on the join attributes
aggregate functions
- group by
  - in the select clause outside of aggregate functions we must have
    - attributes that appear in the "group by" list and/or
    - aggregate functions on attributes of each group
- having
  - predicates in the having clause are applied after the formation of groups whereas predicates in the where clause are applied before forming groups

Query evaluation and optimization

purpose
- many equivalent expressions to the original query can be derived
- the query optimizer uses statistical data and appropriate algorithms to compute an expression of low evaluation cost
example
- \(\Pi_{customer-name}(\sigma_{branch-city = Brooklyn}(branch \infty account \infty depositor))\)

Storage of Relations

Physical storage media

cache
- fastest and most costly form of storage
- volatile
  - you lose power, you lose everything
- managed by the computer system hardware
main memory
- fast access
- generally too small (or too expensive) to store the entire database
- volatile
magnetic disk
- data is stored on spinning disk, and read/written magnetically
- primary medium for the long-term storage of data; typically stores entire database
- data must be moved from disk to main memory for access, and written back for storage
  - much slower access than main memory
- direct-access – possible to read data on disk in any order, unlike magnetic tape
- capacities range up to several TB currently
- survives power failures and system crashes
  - disk failure can destroy data, but is very rare
- magnetic hard disk mechanism
storage and memory hierarchy

Optimization of disk-block access

block
- data is transferred between disk and main memory in blocks
- sizes range from 512 bytes to several kilobytes
  - smaller blocks: more transfers from disk
  - larger blocks: more space wasted due to partially filled blocks
  - typical block sizes today range from 4 to 16 kilobytes
algorithm
- disk-arm-scheduling algorithms order pending accesses to tracks so that disk arm movement is minimized

Storage access

purpose
- in order to minimize the number of block transfers between the disk and memory
- reduce the number of disk accesses by keeping as many blocks as possible in main memory
buffer
- portion of main memory available to store copies of disk blocks
buffer manager
- subsystem responsible for allocating buffer space in main memory
- if the block is already in the buffer
  - the requesting program is given the address of the block in main memory
- if the block is not in the buffer
  - the buffer manager allocates space in the buffer for the block, replacing (throwing out) some other block, if required, to make space for the new block
    - buffer-replacement policies
      - most OS replace the block least recently used (LRU strategy)
      - LRU works well for unpredicted access patterns
      - however, queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user’s query to predict future references
        
        LRU can be a bad strategy for certain access patterns involving repeated scans of data
        
        mixed strategy with hints on replacement strategy provided by the query optimizer is preferable
  - the block that is thrown out is written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk
  - once space is allocated in the buffer, the buffer manager reads the block from the disk to the buffer, and passes the address of the block in main memory to requester
example

File organization

overview
- the database is stored as a collection of files
- each file is a sequence of records
- a record is a sequence of fields
- each record has an address in the file, which is called record pointer or record id (simply rid)
- a simple approach
  - assume record size is fixed
  - each file has records of one particular type only
  - different files are used for different relations
organization of records in files
- heap
  - a record can be placed anywhere in the file where there is space
- sequential
  - store records in sequential order, based on the value of the search key of each record
- hashing
  - a hash function computed on some attribute of each record
  - the result specifies in which block of the file the record should be placed
- other
  - records of each relation may be stored in a separate file
  - in a "clustered file organization" records of several different relations can be stored in the same file
    - motivation: store related records on the same block to minimize I/O
    - however, not good for queries accessing only a few relations
    - in general, this representation is barely used

Questions

Q: Suppose I just want to access your age, but in practice, the main memory read all your information from the specific block. Why?

A: The disk may be slower than main memory, in order to reduce the number of block transfers between the disk and memory, the main memory would read all your informtion.