Module 6: Database Design & Structure for Applications
Writing efficient SQL queries is only half the battle. The way you structure your database – the tables, columns, data types, and relationships you define – has a massive impact on your application’s performance, reliability, and maintainability. This module covers the fundamentals of good database design.
Lesson 6.1: Choosing Data Types
When creating a table, you must specify a data type for each column. Choosing the most appropriate data type is crucial for:
- Data Integrity: Ensures only valid data is stored (e.g., you can’t put text in a number column).
- Storage Efficiency: Different types take up different amounts of disk space.
- Performance: Operations on some data types are faster than others.
Here are common SQL data types (specific names might vary slightly across PostgreSQL, MySQL, SQLite, etc.):
- Integer Types: For whole numbers.
INTEGER
orINT
: Standard integer. Good for IDs, counts, quantities. (e.g.,users.user_id
,products.stock_quantity
).SMALLINT
: Uses less storage for smaller numbers (e.g., -32,768 to 32,767).BIGINT
: For very large whole numbers.
- Decimal Types (Fixed-Point): For exact numeric values, especially money.
DECIMAL(precision, scale)
orNUMERIC(precision, scale)
: Stores numbers exactly as specified.precision
is the total number of digits,scale
is the number of digits after the decimal point. Essential for financial calculations to avoid rounding errors. (e.g.,products.price DECIMAL(10, 2)
,orders.total_amount DECIMAL(12, 2)
).
- Floating-Point Types: For approximate numeric values.
REAL
orFLOAT
,DOUBLE PRECISION
: Store approximations of real numbers. Suitable for scientific calculations but generally avoid for money due to potential small inaccuracies.
- Text (String) Types:
VARCHAR(n)
: Variable-length text up to a maximum ofn
characters. Efficient for storing text where the length varies but has a reasonable upper limit. (e.g.,users.first_name VARCHAR(100)
,categories.category_name VARCHAR(255)
).TEXT
: Variable-length text for potentially very long strings (like descriptions, comments, articles). Maximum length is usually very large. (e.g.,products.description
,orders.shipping_address
).CHAR(n)
: Fixed-length text. Always storesn
characters, padding with spaces if needed. Less commonly used now thanVARCHAR
.
- Date/Time Types:
DATE
: Stores only the date (year, month, day).TIME
: Stores only the time of day.TIMESTAMP
orDATETIME
: Stores both date and time. Ideal for recording when events happened. (e.g.,users.registration_date
,orders.order_date
).TIMESTAMP WITH TIME ZONE
(TIMESTAMPTZ
): Stores timestamp information along with time zone data. Crucial for applications dealing with users or events across different time zones.
- Boolean Type:
BOOLEAN
: StoresTRUE
orFALSE
values (some databases useBIT
orTINYINT(1)
). Perfect for flags indicating state (e.g.,is_active
,is_published
,has_paid
).
- Other Common Types (Mention):
UUID
: Stores Universally Unique Identifiers (e.g.,a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11
). Often used as primary keys instead of auto-incrementing integers, especially in distributed systems.JSON
/JSONB
: Stores JSON data directly within a column. Increasingly popular for flexible or unstructured data fields.JSONB
(in PostgreSQL) is usually preferred as it stores data in an optimized binary format.BLOB
(Binary Large Object): Stores raw binary data like images or files (though storing large files directly in the database is often discouraged; usually, you store a file path or URL instead).
Choosing wisely means thinking about the nature of the data: Is it a whole number? Does it need exact precision (money)? Is it short text or potentially long? Is it a simple flag?
Lesson 6.2: Keys Revisited: Primary & Foreign Keys
We’ve used Primary Keys (PKs) and Foreign Keys (FKs) extensively in joins, but let’s solidify their role in database design and integrity.
- Primary Key (PK):
- Purpose: Uniquely identifies each row within a table. Think of it as the row’s permanent, unique address.
- Rules: Must contain unique values; cannot contain NULL values.
- Common Implementation: Often an
INTEGER
orBIGINT
column with auto-incrementing behavior (AUTO_INCREMENT
in MySQL,IDENTITY
in SQL Server, sequence +DEFAULT
orSERIAL
type in PostgreSQL) or aUUID
. - Examples:
users.user_id
,products.product_id
,orders.order_id
.
- Foreign Key (FK):
- Purpose: Establishes and enforces a link between two tables. It’s a column (or columns) in one table that refers to the Primary Key in another table.
- Rules: The value(s) in the FK column(s) must match an existing value in the referenced PK column, or the FK value can be
NULL
(if the column definition allows it). - Benefit: Enforces Referential Integrity – prevents “orphaned” records. You can’t have an
order_item
referencing anorder_id
that doesn’t exist in theorders
table. - Examples:
orders.user_id
referencesusers.user_id
;products.category_id
referencescategories.category_id
.
Foreign Key Constraints (ON DELETE
, ON UPDATE
):
These rules define what happens to the referencing rows (in the table with the FK) when the referenced row (in the table with the PK) is deleted or updated. Our schema uses:
ON DELETE SET NULL
(forproducts.category_id
): If a category is deleted, products in that category will have theircategory_id
set toNULL
.ON DELETE CASCADE
(fororders.user_id
andorder_items.order_id
): If a user is deleted, all their orders are automatically deleted. If an order is deleted, all its line items are automatically deleted. UseCASCADE
with caution!ON DELETE RESTRICT
(fororder_items.product_id
): Prevents deleting a product if it exists in anyorder_items
row. This protects order history. (NO ACTION
is similar, often the default).
Defining clear PKs and FKs is fundamental to creating a reliable relational database structure. They are the backbone of relationships and data consistency.
Lesson 6.3: Basic Normalization (Why We Split Tables)
Normalization is the process of organizing database tables to reduce data redundancy and eliminate undesirable characteristics like Update, Insertion, and Deletion Anomalies (which we discussed briefly in Module 2). While there are formal levels (1NF, 2NF, 3NF, BCNF, etc.), understanding the core goals is most important for practical application design.
Goals of Normalization:
- Minimize Data Redundancy: Store each piece of information only once (where practical). E.g., store user’s email in the
users
table, not repeated in everyorders
row. - Avoid Modification Anomalies: Ensure changes require updating data in only one place. E.g., If a category name changes, you only update it in the
categories
table, not in every product row. - Ensure Data Integrity: Make the database structure less prone to inconsistencies.
Simplified View of Normal Forms (using our E-commerce schema):
- First Normal Form (1NF):
- Rule: Each cell contains a single, atomic value. No repeating groups within a cell (like storing multiple product IDs in a single
orders
column). Each row must be unique (guaranteed by a Primary Key). - Our Schema: Tables like
users
,products
,orders
adhere to this. We created theorder_items
table specifically to handle the many-to-many relationship between orders and products atomically.
- Rule: Each cell contains a single, atomic value. No repeating groups within a cell (like storing multiple product IDs in a single
- Second Normal Form (2NF): (Applies mainly when tables have composite primary keys)
- Rule: Be in 1NF + Every non-key attribute must depend on the entire primary key.
- Example Violation: Imagine if
order_items
(PK:order_item_id
, but conceptually related toorder_id
+product_id
) also storedproduct_name
.product_name
depends only onproduct_id
, not the whole conceptual key of the order item link. Storing it violates 2NF and causes redundancy. We correctly keepproduct_name
only in theproducts
table.
- Third Normal Form (3NF):
- Rule: Be in 2NF + No non-key attribute should depend on another non-key attribute (no transitive dependencies).
- Example:
products
table hascategory_id
(a non-key attribute functionally dependent onproduct_id
). If we also storedcategory_name
in theproducts
table,category_name
would depend oncategory_id
(another non-key attribute). This is a transitive dependency (product_id
->category_id
->category_name
). Changing a category’s name would require updating many product rows. By splitting intoproducts
(withcategory_id
FK) andcategories
(withcategory_id
PK andcategory_name
), we achieve 3NF and avoid this anomaly.
The Trade-off: Normalization leads to more tables, which means queries often require more JOIN
s. However, it significantly improves data integrity, reduces storage space, and makes updates much more reliable. For most application databases (OLTP – Online Transaction Processing), aiming for 3NF is a common and good practice.
Lesson 6.4: Indexes – Speeding Up Your Application
Imagine trying to find a specific topic in a 1000-page book without an index – you’d have to read page by page! A database index serves a similar purpose for your data tables.
What is an Index? An index is a special lookup table (often structured as a B-tree) that the database engine can use to speed up data retrieval operations. It essentially stores copies of data from one or more columns in a sorted order, along with pointers to the actual table rows.
Why Use Indexes? Without an index, finding rows matching a WHERE
clause or performing a JOIN
often requires the database to scan the entire table (a “Full Table Scan”). This is very slow for large tables. Indexes allow the database to quickly locate the relevant rows, dramatically improving SELECT
query performance.
When are Indexes Most Useful?
- Primary Keys: Databases almost always automatically create a unique index on the Primary Key column(s).
- Foreign Keys: Crucial! Always consider creating indexes on Foreign Key columns. Joins are extremely common, and indexing the FK columns significantly speeds them up. (e.g., Index
orders.user_id
,products.category_id
,order_items.order_id
,order_items.product_id
). - Columns Frequently Used in
WHERE
Clauses: If you often filter byemail
,status
,product_name
, etc., an index on that column can provide a significant speed boost. (e.g., Indexusers.email
for login). - Columns Frequently Used in
ORDER BY
Clauses: An index can sometimes help speed up sorting, as the index might already store data in the desired order.
Creating an Index (Syntax varies slightly):
SQL
-- Index naming convention often includes table and column(s)
CREATE INDEX idx_products_category_id ON products (category_id);
CREATE INDEX idx_users_email ON users (email);
-- Index on multiple columns (order matters!)
CREATE INDEX idx_orders_user_date ON orders (user_id, order_date);
The Trade-offs:
- Benefit: Dramatically speeds up data retrieval (
SELECT
,JOIN
,WHERE
, sometimesORDER BY
). - Cost 1: Slower Writes: When you
INSERT
,UPDATE
, orDELETE
data, the database must also update any indexes that include the affected columns. Too many indexes can slow down write operations. - Cost 2: Disk Space: Indexes take up additional storage space.
General Strategy: Start by indexing Primary Keys (automatic) and Foreign Keys. Then, identify slow queries in your application (tools like EXPLAIN
or EXPLAIN ANALYZE
, covered briefly later, help diagnose this) and consider adding indexes to columns used in their WHERE
or JOIN
clauses. Don’t just index everything!
Well-chosen data types, properly defined keys, a normalized structure, and strategic indexing are the cornerstones of a robust and performant application database. Thinking about these design aspects early saves significant headaches later on. Next, we’ll briefly touch on how SQL fits into the broader application context and discuss some common pitfalls.