COMP7801 Relational Database
Structure of Relational Databases
Overview
- definition
- formally, given sets \(D_1, D_2, ..., D_n\) a relation \(r\) is a subset of \(D_1 \times D_2 \times ... \times D_n\)
- thus a relation is a set of n-tuples (\(a_1, a_2, ..., a_n\)) where each \(a_i \in D_i\)
- attribute type
- each attribute of a relation has a name
- the set of allowed values for each attribute is called the domain of the attribute
- examples of simple domain types
- integer
- string
- date
- note
- in advanced (non-relational) database systems, we may have complex attribute types
- e.g., in spatial database
- point
- polygon
- poly-line
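The formal definition above can be made concrete with a small sketch. The two domains and their values are hypothetical, chosen only for illustration; the point is that a relation is any subset of the Cartesian product of its domains.

```python
from itertools import product

# Hypothetical domains (not from the notes), kept tiny on purpose.
D1 = {"Alice", "Bob"}   # a string domain, e.g. names
D2 = {20, 30}           # an integer domain, e.g. ages

# D1 x D2 contains every possible 2-tuple (a1, a2) with a1 in D1, a2 in D2.
all_tuples = set(product(D1, D2))

# A relation r is any subset of that product.
r = {("Alice", 20), ("Bob", 30)}

is_relation = r <= all_tuples   # True when every tuple of r lies in D1 x D2
```

Here `all_tuples` has `len(D1) * len(D2)` elements, and `r` picks two of them.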
Reduction of an E-R Schema to Tables
Relation
- schema
- \(A_1, A_2, ..., A_n\) are attributes
- \(R = (A_1, A_2, ..., A_n)\) is a relation schema
- \(r(R)\) is a relation on the relation schema \(R\)
- e.g., customer(Customer-schema)
- instance
- the current values (relation instance) of a relation are specified by a table
- an element \(t\) of \(r\) is a tuple, represented by a row in a table
Database
- definition
- a database consists of multiple relations which are inter-related
Relational Algebra and SQL
Query languages
- definition
- a language in which a user requests information from the database
- categories
- procedural
- non-procedural
- pure languages
- relational algebra
- relational calculus
- they form the underlying basis of the query languages that people use
Relational algebra
- overview
- procedural language
- 6 basic operators
- select
- project
- union
- set difference
- Cartesian product
- rename
- additional operations
- natural join
- division, etc.
- the operators take one or more relations as inputs and give a new relation as a result
- all inputs and outputs are relations - sets of tuples
- in SQL, the output of an operator is a multiset of tuples
- select operation
- notation
- \(\sigma_p(r)\)
- \(p\) is called the selection predicate
- example
- project operation
- notation
- \(\Pi_{A_1, A_2, ..., A_k}(r)\)
- the result is defined as the relation of k columns obtained by erasing the columns that are not listed
- duplicate rows removed from result, since relations are sets
- example
- union operation
- notation
- \(r\cup s\)
- union must be taken between compatible relations
- \(r\) and \(s\) must have the same arity
- attribute domains of \(r\) and \(s\) must be compatible
- example
- intersection operation
- notation
- \(r\cap s\)
- intersection must be taken between compatible relations
- \(r\) and \(s\) must have the same arity
- attribute domains of \(r\) and \(s\) must be compatible
- example
- difference operation
- notation
- \(r-s\)
- set difference must be taken between compatible relations
- \(r\) and \(s\) must have the same arity
- attribute domains of \(r\) and \(s\) must be compatible
- example
- Cartesian-product operation
- notation
- \(r\times s\)
- example
- natural join operation
- notation
- \(r \bowtie s\)
- example
- \(R=(A, B, C, D)\)
- \(S=(E, B, D)\)
- result schema = \((A, B, C, D, E)\)
- \(r \bowtie s\) is defined as
- \(\Pi_{r.A, r.B, r.C, r.D, s.E}(\sigma_{r.B=s.B \wedge r.D=s.D}(r\times s))\)
- aggregate functions and operations
- an aggregate function takes a collection of values and returns a single value as a result
- avg: average value
- min: minimum value
- max: maximum value
- sum: sum of values
- count: number of values
- aggregate operation in relational algebra
- notation
- \({}_{G_1, G_2, ..., G_n} \mathcal{G}_{F_1(A_1), F_2(A_2), ..., F_n(A_n)}(E)\)
- \(E\) is any relational-algebra expression
- \(G_1, G_2, ..., G_n\) is a list of attributes on which to group (can be empty)
- each \(F_i\) is an aggregate function
- each \(A_i\) is an attribute name
- example
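The operators above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `(schema, rows)` representation, the function names, and the sample data are all assumptions made for the example. Relations are sets, so duplicates disappear automatically, and the natural join follows the definition given above (Cartesian product, selection on the common attributes, projection).

```python
from itertools import product

# Assumed representation: a relation is (schema, rows), where schema is a
# tuple of attribute names and rows is a set of value tuples.

def select(pred, rel):
    """sigma_p(r): keep the tuples that satisfy the predicate."""
    schema, rows = rel
    return schema, {t for t in rows if pred(dict(zip(schema, t)))}

def project(attrs, rel):
    """Pi_attrs(r): keep the listed columns; the set drops duplicate rows."""
    schema, rows = rel
    idx = [schema.index(a) for a in attrs]
    return tuple(attrs), {tuple(t[i] for i in idx) for t in rows}

def union(r, s):
    assert len(r[0]) == len(s[0]), "relations must have the same arity"
    return r[0], r[1] | s[1]

def difference(r, s):
    assert len(r[0]) == len(s[0]), "relations must have the same arity"
    return r[0], r[1] - s[1]

def natural_join(r, s):
    (r_schema, r_rows), (s_schema, s_rows) = r, s
    common = [a for a in r_schema if a in s_schema]
    extra = [a for a in s_schema if a not in r_schema]
    rows = set()
    for t, u in product(r_rows, s_rows):                # Cartesian product
        td, ud = dict(zip(r_schema, t)), dict(zip(s_schema, u))
        if all(td[a] == ud[a] for a in common):         # selection
            rows.add(t + tuple(ud[a] for a in extra))   # projection
    return r_schema + tuple(extra), rows

def aggregate(group_attrs, func, attr, rel):
    """Group by group_attrs (may be empty) and apply func to attr per group."""
    schema, rows = rel
    groups = {}
    for t in rows:
        d = dict(zip(schema, t))
        groups.setdefault(tuple(d[g] for g in group_attrs), []).append(d[attr])
    out_schema = tuple(group_attrs) + (f"{func.__name__}({attr})",)
    return out_schema, {key + (func(vals),) for key, vals in groups.items()}

# Sample relations on R = (A, B, C, D) and S = (E, B, D); the data is made up.
r = (("A", "B", "C", "D"), {(1, "x", 10, "p"), (2, "y", 20, "q"), (3, "x", 30, "p")})
s = (("E", "B", "D"), {("e1", "x", "p"), ("e2", "z", "q")})

selected = select(lambda t: t["C"] > 15, r)
projected = project(("B", "D"), r)
joined = natural_join(r, s)
summed = aggregate(("B",), sum, "C", r)
```

With this data, `joined` has schema `(A, B, C, D, E)` and contains only the two tuples of `r` whose `B` and `D` values match a tuple of `s`.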
SQL
- basic structure
- form
- select \(A_1, A_2, ..., A_n\)
- from \(r_1, r_2, ..., r_m\)
- where \(P\)
- equivalent to the relational algebra
- \(\Pi_{A_1, A_2, ..., A_n}(\sigma_P (r_1\times r_2\times ... \times r_m))\)
- the result of a SQL query is a multiset of tuples
- the select clause
- to force the elimination of duplicates, insert the keyword "distinct" after "select"
- the where clause
- comparison results can be combined using the logical connectives "and", "or", and "not"
- comparisons can be applied to results of arithmetic expressions
- the from clause
- lists the relations to be scanned; it corresponds to the Cartesian product of those relations
- a join can be expressed by a Cartesian product followed by selections on the join attributes
- aggregate functions
- group by
- every item in the select clause must be either
- an attribute that appears in the "group by" list, or
- an aggregate function applied to attributes of each group
- having
- predicates in the having clause are applied after the formation of groups whereas predicates in the where clause are applied before forming groups
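The clauses above can be tried out with Python's built-in `sqlite3` module. The sketch below is only an illustration: the `account` and `depositor` tables borrow their names from the bank example used in these notes, but the column names and the rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table account (account_number text, branch_name text, balance integer);
    create table depositor (customer_name text, account_number text);
    insert into account values
        ('A-101', 'Downtown', 500),
        ('A-102', 'Perryridge', 400),
        ('A-201', 'Downtown', 900);
    insert into depositor values
        ('Hayes', 'A-102'), ('Johnson', 'A-101'), ('Johnson', 'A-201');
""")

# Without "distinct" the result is a multiset: 'Downtown' appears twice.
branches = conn.execute("select branch_name from account").fetchall()
distinct_branches = conn.execute(
    "select distinct branch_name from account").fetchall()

# A join written as a Cartesian product (from) followed by a selection on
# the join attribute (where).
joined = conn.execute("""
    select customer_name, balance
    from depositor, account
    where depositor.account_number = account.account_number
      and balance > 450
    order by balance
""").fetchall()

# "where" filters rows before groups are formed; "having" filters the groups.
grouped = conn.execute("""
    select branch_name, avg(balance)
    from account
    where balance > 100
    group by branch_name
    having avg(balance) > 600
""").fetchall()
```

In the last query only the Downtown group survives the `having` predicate, because its average balance is the only one above 600.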
Query evaluation and optimization
- purpose
- many equivalent expressions to the original query can be derived
- the query optimizer uses statistical data and appropriate algorithms to compute an expression of low evaluation cost
- example
- \(\Pi_{customer-name}(\sigma_{branch-city = \text{"Brooklyn"}}(branch \bowtie account \bowtie depositor))\)
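As an illustration of an equivalent expression, suppose branch-city is an attribute of branch only (an assumption; the schemas are not listed in these notes). The selection can then be applied to branch before the joins, so that fewer tuples take part in them:

\(\Pi_{customer-name}((\sigma_{branch-city = \text{"Brooklyn"}}(branch)) \bowtie account \bowtie depositor)\)

Both expressions return the same customer names; an optimizer would typically prefer the second because the intermediate join results are smaller.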
Storage of Relations
Physical storage media
- cache
- fastest and most costly form of storage
- volatile
- you lose power, you lose everything
- managed by the computer system hardware
- main memory
- fast access
- generally too small (or too expensive) to store the entire database
- volatile
- magnetic disk
- data is stored on spinning disk, and read/written magnetically
- primary medium for the long-term storage of data; typically stores entire database
- data must be moved from disk to main memory for access, and written back for storage
- much slower access than main memory
- direct-access – possible to read data on disk in any order, unlike magnetic tape
- capacities range up to several TB currently
- survives power failures and system crashes
- disk failure can destroy data, but is very rare
- magnetic hard disk mechanism
- storage and memory hierarchy
Optimization of disk-block access
- block
- data is transferred between disk and main memory in blocks
- sizes range from 512 bytes to several kilobytes
- smaller blocks: more transfers from disk
- larger blocks: more space wasted due to partially filled blocks
- typical block sizes today range from 4 to 16 kilobytes
- algorithm
- disk-arm-scheduling algorithms order pending accesses to tracks so that disk arm movement is minimized
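One common disk-arm-scheduling approach is the elevator algorithm, sketched below in the variant that reverses direction at the last pending request. The notes do not name a specific algorithm, so the choice of algorithm, the function names, and the track numbers are all assumptions made for illustration.

```python
def elevator_order(head, pending):
    """Serve tracks at or above the head on the way up, then the rest on the way down."""
    up = sorted(t for t in pending if t >= head)
    down = sorted((t for t in pending if t < head), reverse=True)
    return up + down

def arm_movement(head, order):
    """Total number of tracks the arm crosses when serving requests in this order."""
    total = 0
    for track in order:
        total += abs(track - head)
        head = track
    return total

# Hypothetical pending requests (track numbers) and starting head position.
pending = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53

arrival_cost = arm_movement(head, pending)                         # serve in arrival order
elevator_cost = arm_movement(head, elevator_order(head, pending))  # serve in sweep order
```

For these requests the sweep order moves the arm across far fewer tracks than serving the requests in arrival order.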
Storage access
- purpose
- in order to minimize the number of block transfers between the disk and memory
- reduce the number of disk accesses by keeping as many blocks as possible in main memory
- buffer
- portion of main memory available to store copies of disk blocks
- buffer manager
- subsystem responsible for allocating buffer space in main memory
- if the block is already in the buffer
- the requesting program is given the address of the block in main memory
- if the block is not in the buffer
- the buffer manager allocates space in the buffer for the block, replacing (throwing out) some other block, if required, to make space for the new block
- buffer-replacement policies
- most operating systems replace the least recently used block (LRU strategy)
- LRU suits workloads whose future accesses cannot be predicted, since it uses past references as the predictor of future ones
- however, queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user’s query to predict future references
- LRU can be a bad strategy for certain access patterns involving repeated scans of data
- mixed strategy with hints on replacement strategy provided by the query optimizer is preferable
- the block that is thrown out is written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk
- once space is allocated in the buffer, the buffer manager reads the block from the disk to the buffer, and passes the address of the block in main memory to requester
- example
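The buffer-manager behaviour described above can be sketched as a toy class with LRU replacement. The class and method names are hypothetical, and a plain dictionary stands in for the disk; the sketch only shows the hit/miss logic and the write-back of modified blocks.

```python
from collections import OrderedDict

class BufferManager:
    """Toy buffer manager with LRU replacement (hypothetical interface)."""

    def __init__(self, disk, capacity):
        self.disk = disk             # dict: block id -> contents, stands in for the disk
        self.capacity = capacity     # number of blocks the buffer can hold
        self.frames = OrderedDict()  # buffered blocks, least recently used first
        self.dirty = set()           # ids of blocks modified since they were fetched
        self.reads = 0               # block transfers disk -> memory
        self.writes = 0              # block transfers memory -> disk

    def get(self, block_id):
        if block_id in self.frames:              # already in the buffer: no disk access
            self.frames.move_to_end(block_id)
            return self.frames[block_id]
        if len(self.frames) >= self.capacity:    # make room: throw out the LRU block
            victim, data = self.frames.popitem(last=False)
            if victim in self.dirty:             # write back only if it was modified
                self.disk[victim] = data
                self.dirty.discard(victim)
                self.writes += 1
        self.frames[block_id] = self.disk[block_id]  # read the block from disk
        self.reads += 1
        return self.frames[block_id]

    def update(self, block_id, data):
        self.get(block_id)                       # make sure the block is buffered
        self.frames[block_id] = data
        self.dirty.add(block_id)

disk = {1: "a", 2: "b", 3: "c"}
bm = BufferManager(disk, capacity=2)
bm.get(1)
bm.update(2, "B")
bm.get(1)   # hit: block 1 becomes the most recently used
bm.get(3)   # miss: block 2 is evicted and, being modified, written back

# Repeated sequential scans of 3 blocks through a 2-block buffer: every access misses.
scan = BufferManager({i: str(i) for i in range(3)}, capacity=2)
for _ in range(2):
    for block in range(3):
        scan.get(block)
```

The second part illustrates why LRU can be a poor fit for repeated scans: the block needed next is always the one that was just thrown out.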
File organization
- overview
- the database is stored as a collection of files
- each file is a sequence of records
- a record is a sequence of fields
- each record has an address in the file, called a record pointer or record id (rid)
- a simple approach
- assume record size is fixed
- each file has records of one particular type only
- different files are used for different relations
- organization of records in files
- heap
- a record can be placed anywhere in the file where there is space
- sequential
- store records in sequential order, based on the value of the search key of each record
- hashing
- a hash function is computed on some attribute of each record
- the result specifies in which block of the file the record should be placed
- other
- records of each relation may be stored in a separate file
- in a "clustered file organization" records of several different relations can be stored in the same file
- motivation: store related records on the same block to minimize I/O
- however, this slows down queries that access only one of the relations
- in general, this representation is rarely used
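The simple approach above (fixed-size records, one record type per file) makes it easy to locate a record from its rid. The sketch below assumes a made-up record layout and uses a `bytearray` in place of a real file; the hash function is a simple modulo, chosen only to illustrate the hashing organization.

```python
import struct

# Assumed fixed-length layout: 4-byte integer id, 20-byte name, 4-byte integer age.
RECORD = struct.Struct("<i20si")

def write_record(buf, rid, rec_id, name, age):
    """Store a record at slot rid; with fixed-size records the offset is rid * size."""
    RECORD.pack_into(buf, rid * RECORD.size, rec_id, name.encode(), age)

def read_record(buf, rid):
    rec_id, name, age = RECORD.unpack_from(buf, rid * RECORD.size)
    return rec_id, name.rstrip(b"\x00").decode(), age

def hash_block(search_key, num_blocks):
    """Hashing organization: the hash value picks the block for the record."""
    return search_key % num_blocks

data_file = bytearray(RECORD.size * 4)   # room for 4 records
write_record(data_file, 2, 7, "Ann", 31)
record = read_record(data_file, 2)
block = hash_block(7, 4)
```

A heap file would place the record in any free slot, and a sequential file would keep the slots ordered by search key; only the choice of `rid` changes.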
Questions
Q: Suppose I only want to access your age, yet in practice the whole block holding your record is read into main memory. Why?
A: Data is transferred between disk and main memory in whole blocks, and disk access is much slower than main-memory access. Reading the entire block at once keeps the number of block transfers low.