Introduction to CMU 15-445/645
Carnegie Mellon University's 15-445/645 course, Introduction to Database Systems, is a famous and comprehensive course designed to teach students the fundamentals of database systems. The course covers various topics, from data modeling and relational algebra to SQL, database design, transaction management, and query optimization.
Initially, I started this course to refresh my knowledge before beginning my studies at CMU 15-721. However, as I progressed through the course material, I came to appreciate its value and significance. Consequently, I decided to devote more time to studying it and writing blogs about the topics and lectures that I found most compelling. This approach ensured that I was able to fully absorb and benefit from the course content.
The format of the blogs that I intend to write may vary. Some posts may take the form of simple lecture notes, while others may focus on a particular topic or concept covered in the course and provide a more in-depth analysis. By taking this approach, I can make the most of my time and concentrate on the most essential aspects of the course.
The first lecture in the course is an introductory lecture for the course is split into two sections the first section discusses the course materials and boundaries, also specifies the grading system and the other section (which is important to me) talks about data models, relational models, and Relational Algebra.
Database Management System
A DBMS is software that allows applications to store and analyze information in a database.
A general-purpose DBMS is designed to allow the definition, creation, querying, update, and administration of databases by some data model.
A data model is a collection of concepts for describing the data in the database.
- Examples: relational (most common), NoSQL (key/value, graph), array/matrix/vectors
A schema is a description of a particular collection of data based on a data model.
Before the 1970s constructing and maintaining database applications was a challenging task because of the tight coupling between the logical and physical layers.
The logical layer of a database describes the entities and attributes it contains, while the physical layer pertains to how those entities and attributes are stored. In the past, the physical layer was defined within the application code. Therefore, if we wanted to change the way data was stored, we would need to modify the code for the entire application to align with the new physical layer.
Relational model
In 1970 the relational model was invented to avoid rewriting DBMSs every time they wanted to change the physical layer.
This relational model has three key points:
Store the database in simple data structures (relations or tables).
Access data through high-level language (like SQL).
Physical storage is left up to the implementation.
Three main concepts
To achieve the goal relational data model defines three main concepts:
Structure: The definition of relations and their contents. including the attributes that the relations possess and the values those attributes can hold.
Integrity: Ensure the database’s contents satisfy constraints. An example constraint would be that any value for the year attribute has to be a number.
Manipulation: How to access and modify a database’s contents.
Relations, Tuples, and Keys.
A relation (AKA Table) is an unordered set that contains the relationship of attributes that represent entities.
- NOTE: Since the relationships are unordered, the DBMS can store them in any way it wants, allowing for optimization.
A tuple (AKA record or row) is a set of attribute values (also known as its domain) in the relation.
Originally, values had to be atomic or scalar, but now values can also be lists or nested data structures.
Every attribute can be a special value, NULL, which means for a given tuple the attribute is undefined.
Keys
A relation’s primary key (PK) uniquely identifies a single tuple.
Some DBMSs automatically create an internal primary key if you do not define one.
A lot of DBMSs have support for autogenerated keys so an application does not have to manually increment the keys, but a primary key is still required for some DBMSs.
A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.
Data Manipulation Languages (DMLs)
Methods to store and retrieve information from a database. There are two classes of languages for this:
Procedural: The query specifies the (high-level) strategy the DBMS should use to find the desired result based on sets/bags. (relational algebra)
Non-Procedural (Declarative): The query specifies only what data is wanted and not how to find it. (relational calculus).
Relational Algebra
Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation.
Each operator takes in one or more relations as inputs and outputs a new relation.
To write queries we can “chain” these operators together to create more complex operations.
There are a lot of great articles and websites that explain Relational Algebra you can check them if you want to go further in the details (e.g. Relational Algebra).
There are numerous excellent articles and websites available that provide in-depth explanations for Relational Algebra (e.g. Relational Algebra, Relational Algebra in DBMS: Operations with Examples).
Conclusion
Carnegie Mellon University's 15-445/645 is a comprehensive course on database systems. It covers various topics like data modeling, Indexing, query optimization, transactions, concurrency control, and many more.
In the upcoming blogs, I will share my learning notes and takeaways from the course as I proceed in studying it.