Module 6 – Analytics, Science and Life

Module 6: Database Design & Structure for Applications

Writing efficient SQL queries is only half the battle. The way you structure your database – the tables, columns, data types, and relationships you define – has a massive impact on your application’s performance, reliability, and maintainability. This module covers the fundamentals of good database design.

Lesson 6.1: Choosing Data Types

When creating a table, you must specify a data type for each column. Choosing the most appropriate data type is crucial for:

Data Integrity: Ensures only valid data is stored (e.g., you can’t put text in a number column).
Storage Efficiency: Different types take up different amounts of disk space.
Performance: Operations on some data types are faster than others.

Here are common SQL data types (specific names might vary slightly across PostgreSQL, MySQL, SQLite, etc.):

Integer Types: For whole numbers.
- INTEGER or INT: Standard integer. Good for IDs, counts, quantities. (e.g., users.user_id, products.stock_quantity).
- SMALLINT: Uses less storage for smaller numbers (e.g., -32,768 to 32,767).
- BIGINT: For very large whole numbers.
Decimal Types (Fixed-Point): For exact numeric values, especially money.
- DECIMAL(precision, scale) or NUMERIC(precision, scale): Stores numbers exactly as specified. precision is the total number of digits, scale is the number of digits after the decimal point. Essential for financial calculations to avoid rounding errors. (e.g., products.price DECIMAL(10, 2), orders.total_amount DECIMAL(12, 2)).
Floating-Point Types: For approximate numeric values.
- REAL or FLOAT, DOUBLE PRECISION: Store approximations of real numbers. Suitable for scientific calculations but generally avoid for money due to potential small inaccuracies.
Text (String) Types:
- VARCHAR(n): Variable-length text up to a maximum of n characters. Efficient for storing text where the length varies but has a reasonable upper limit. (e.g., users.first_name VARCHAR(100), categories.category_name VARCHAR(255)).
- TEXT: Variable-length text for potentially very long strings (like descriptions, comments, articles). Maximum length is usually very large. (e.g., products.description, orders.shipping_address).
- CHAR(n): Fixed-length text. Always stores n characters, padding with spaces if needed. Less commonly used now than VARCHAR.
Date/Time Types:
- DATE: Stores only the date (year, month, day).
- TIME: Stores only the time of day.
- TIMESTAMP or DATETIME: Stores both date and time. Ideal for recording when events happened. (e.g., users.registration_date, orders.order_date).
- TIMESTAMP WITH TIME ZONE (TIMESTAMPTZ): Stores timestamp information along with time zone data. Crucial for applications dealing with users or events across different time zones.
Boolean Type:
- BOOLEAN: Stores TRUE or FALSE values (some databases use BIT or TINYINT(1)). Perfect for flags indicating state (e.g., is_active, is_published, has_paid).
Other Common Types (Mention):
- UUID: Stores Universally Unique Identifiers (e.g., a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11). Often used as primary keys instead of auto-incrementing integers, especially in distributed systems.
- JSON/JSONB: Stores JSON data directly within a column. Increasingly popular for flexible or unstructured data fields. JSONB (in PostgreSQL) is usually preferred as it stores data in an optimized binary format.
- BLOB (Binary Large Object): Stores raw binary data like images or files (though storing large files directly in the database is often discouraged; usually, you store a file path or URL instead).

Choosing wisely means thinking about the nature of the data: Is it a whole number? Does it need exact precision (money)? Is it short text or potentially long? Is it a simple flag?

Lesson 6.2: Keys Revisited: Primary & Foreign Keys

We’ve used Primary Keys (PKs) and Foreign Keys (FKs) extensively in joins, but let’s solidify their role in database design and integrity.

Primary Key (PK):
- Purpose: Uniquely identifies each row within a table. Think of it as the row’s permanent, unique address.
- Rules: Must contain unique values; cannot contain NULL values.
- Common Implementation: Often an INTEGER or BIGINT column with auto-incrementing behavior (AUTO_INCREMENT in MySQL, IDENTITY in SQL Server, sequence + DEFAULT or SERIAL type in PostgreSQL) or a UUID.
- Examples: users.user_id, products.product_id, orders.order_id.
Foreign Key (FK):
- Purpose: Establishes and enforces a link between two tables. It’s a column (or columns) in one table that refers to the Primary Key in another table.
- Rules: The value(s) in the FK column(s) must match an existing value in the referenced PK column, or the FK value can be NULL (if the column definition allows it).
- Benefit: Enforces Referential Integrity – prevents “orphaned” records. You can’t have an order_item referencing an order_id that doesn’t exist in the orders table.
- Examples: orders.user_id references users.user_id; products.category_id references categories.category_id.

Foreign Key Constraints (ON DELETE, ON UPDATE):

These rules define what happens to the referencing rows (in the table with the FK) when the referenced row (in the table with the PK) is deleted or updated. Our schema uses:

ON DELETE SET NULL (for products.category_id): If a category is deleted, products in that category will have their category_id set to NULL.
ON DELETE CASCADE (for orders.user_id and order_items.order_id): If a user is deleted, all their orders are automatically deleted. If an order is deleted, all its line items are automatically deleted. Use CASCADE with caution!
ON DELETE RESTRICT (for order_items.product_id): Prevents deleting a product if it exists in any order_items row. This protects order history. (NO ACTION is similar, often the default).

Defining clear PKs and FKs is fundamental to creating a reliable relational database structure. They are the backbone of relationships and data consistency.

Lesson 6.3: Basic Normalization (Why We Split Tables)

Normalization is the process of organizing database tables to reduce data redundancy and eliminate undesirable characteristics like Update, Insertion, and Deletion Anomalies (which we discussed briefly in Module 2). While there are formal levels (1NF, 2NF, 3NF, BCNF, etc.), understanding the core goals is most important for practical application design.

Goals of Normalization:

Minimize Data Redundancy: Store each piece of information only once (where practical). E.g., store user’s email in the users table, not repeated in every orders row.
Avoid Modification Anomalies: Ensure changes require updating data in only one place. E.g., If a category name changes, you only update it in the categories table, not in every product row.
Ensure Data Integrity: Make the database structure less prone to inconsistencies.

Simplified View of Normal Forms (using our E-commerce schema):

First Normal Form (1NF):
- Rule: Each cell contains a single, atomic value. No repeating groups within a cell (like storing multiple product IDs in a single orders column). Each row must be unique (guaranteed by a Primary Key).
- Our Schema: Tables like users, products, orders adhere to this. We created the order_items table specifically to handle the many-to-many relationship between orders and products atomically.
Second Normal Form (2NF): (Applies mainly when tables have composite primary keys)
- Rule: Be in 1NF + Every non-key attribute must depend on the entire primary key.
- Example Violation: Imagine if order_items (PK: order_item_id, but conceptually related to order_id+product_id) also stored product_name. product_name depends only on product_id, not the whole conceptual key of the order item link. Storing it violates 2NF and causes redundancy. We correctly keep product_name only in the products table.
Third Normal Form (3NF):
- Rule: Be in 2NF + No non-key attribute should depend on another non-key attribute (no transitive dependencies).
- Example: products table has category_id (a non-key attribute functionally dependent on product_id). If we also stored category_name in the products table, category_name would depend on category_id (another non-key attribute). This is a transitive dependency (product_id -> category_id -> category_name). Changing a category’s name would require updating many product rows. By splitting into products (with category_id FK) and categories (with category_id PK and category_name), we achieve 3NF and avoid this anomaly.

The Trade-off: Normalization leads to more tables, which means queries often require more JOINs. However, it significantly improves data integrity, reduces storage space, and makes updates much more reliable. For most application databases (OLTP – Online Transaction Processing), aiming for 3NF is a common and good practice.

Lesson 6.4: Indexes – Speeding Up Your Application

Imagine trying to find a specific topic in a 1000-page book without an index – you’d have to read page by page! A database index serves a similar purpose for your data tables.

What is an Index? An index is a special lookup table (often structured as a B-tree) that the database engine can use to speed up data retrieval operations. It essentially stores copies of data from one or more columns in a sorted order, along with pointers to the actual table rows.

Why Use Indexes? Without an index, finding rows matching a WHERE clause or performing a JOIN often requires the database to scan the entire table (a “Full Table Scan”). This is very slow for large tables. Indexes allow the database to quickly locate the relevant rows, dramatically improving SELECT query performance.

When are Indexes Most Useful?

Primary Keys: Databases almost always automatically create a unique index on the Primary Key column(s).
Foreign Keys: Crucial! Always consider creating indexes on Foreign Key columns. Joins are extremely common, and indexing the FK columns significantly speeds them up. (e.g., Index orders.user_id, products.category_id, order_items.order_id, order_items.product_id).
Columns Frequently Used in WHERE Clauses: If you often filter by email, status, product_name, etc., an index on that column can provide a significant speed boost. (e.g., Index users.email for login).
Columns Frequently Used in ORDER BY Clauses: An index can sometimes help speed up sorting, as the index might already store data in the desired order.

Creating an Index (Syntax varies slightly):

SQL

-- Index naming convention often includes table and column(s)
CREATE INDEX idx_products_category_id ON products (category_id);

CREATE INDEX idx_users_email ON users (email);

-- Index on multiple columns (order matters!)
CREATE INDEX idx_orders_user_date ON orders (user_id, order_date);

The Trade-offs:

Benefit: Dramatically speeds up data retrieval (SELECT, JOIN, WHERE, sometimes ORDER BY).
Cost 1: Slower Writes: When you INSERT, UPDATE, or DELETE data, the database must also update any indexes that include the affected columns. Too many indexes can slow down write operations.
Cost 2: Disk Space: Indexes take up additional storage space.

General Strategy: Start by indexing Primary Keys (automatic) and Foreign Keys. Then, identify slow queries in your application (tools like EXPLAIN or EXPLAIN ANALYZE, covered briefly later, help diagnose this) and consider adding indexes to columns used in their WHERE or JOIN clauses. Don’t just index everything!

Well-chosen data types, properly defined keys, a normalized structure, and strategic indexing are the cornerstones of a robust and performant application database. Thinking about these design aspects early saves significant headaches later on. Next, we’ll briefly touch on how SQL fits into the broader application context and discuss some common pitfalls.