Summary of "what is data warehouse | Lec-1"
Main ideas and lessons
- Purpose of the video: Explains what a data warehouse is, how it looks/works, and how it differs from a database. It also briefly touches on parallel processing, Spark vs. data warehouses, and rules/design principles for data warehouses.
Data warehouse concept (analogy)
- A warehouse is where goods are stored.
- A data warehouse is analogous: it stores data and includes the compute/storage hardware used for processing (e.g., CPU, RAM, SSD/HDD).
Data warehouse vs database (key differences)
- Scale / volume
  - Data warehouse: built for large volumes (terabytes → petabytes → exabytes).
  - Database: typically built for smaller volumes (the video frames most databases as “small-data” systems relative to DWs, even though some databases can scale).
- Workload pattern
  - Data warehouse: read-heavy (frequent queries such as SELECT).
  - Database: oriented more toward write operations (insert/update/delete).
- Latency tolerance
  - Data warehouse: not strongly focused on low latency; multi-second queries can be acceptable.
  - Database: often requires low latency, with payment/UPI-style examples where delays are unacceptable.
- Data model
  - Data warehouse: tends to be more denormalized (allows redundancy).
  - Database: tends to be more normalized (less redundancy).
- Why denormalization is useful
  - Redundancy is often acceptable in a DW because it can help avoid expensive joins.
  - Too many joins in normalized models can slow analytics-style queries.
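The join-avoidance point can be illustrated with a small sketch. This is not from the video: the table names (`customers`, `orders`, `orders_wide`) and the sample rows are hypothetical, and an in-memory SQLite database stands in for a warehouse only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized model: customer attributes live in their own table.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)")
cur.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "North"), (2, "South")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0)],
)

# Denormalized model: region is copied onto every order row (redundant on purpose).
cur.execute(
    "CREATE TABLE orders_wide "
    "(order_id INTEGER PRIMARY KEY, customer_id INTEGER, region TEXT, amount REAL)"
)
cur.execute(
    """
    INSERT INTO orders_wide
    SELECT o.order_id, o.customer_id, c.region, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
    """
)

# The normalized query needs a join; the denormalized one reads a single table.
normalized_sql = """
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region ORDER BY c.region
"""
denormalized_sql = (
    "SELECT region, SUM(amount) FROM orders_wide GROUP BY region ORDER BY region"
)

normalized_result = cur.execute(normalized_sql).fetchall()
denormalized_result = cur.execute(denormalized_sql).fetchall()
```

Both queries return the same per-region totals; the denormalized table trades extra storage for a query with no join.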
- Storage format / access pattern
  - Data warehouse: emphasizes columnar storage (e.g., Parquet/ORC-style “columnar base” idea).
    - Reads only the columns a query needs → faster analytics queries.
  - Database: often uses row-based storage to make inserts/updates practical (updating whole rows).
    - Note: the speaker acknowledges databases can sometimes be columnar too.
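As a rough illustration of the row versus column layouts described here, the following sketch keeps the same records in both shapes using plain Python structures. The field names and values are made up, and real columnar formats add encoding and compression that this sketch does not model.

```python
# Row-oriented layout: one record per row, all fields stored together.
rows = [
    {"order_id": 1, "region": "North", "amount": 100.0},
    {"order_id": 2, "region": "South", "amount": 70.0},
    {"order_id": 3, "region": "North", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: one list per field.
columns = to_columns(rows)

# Row layout: every whole row is visited even though only "amount" is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" list is read; other columns are untouched.
total_from_columns = sum(columns["amount"])
```

An analytics query that aggregates one field touches a single list in the columnar layout, while an update to one order touches one dict in the row layout, which mirrors the trade-off in the summary.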
Parallel processing
- Data warehouses use multiple machines to handle parallel requests.
- The video notes that parallel processing is not the only optimization dimension being discussed.
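A minimal sketch of the partition-then-combine idea behind parallel processing follows. It is an assumption-laden stand-in: worker threads on one machine play the role of the warehouse's multiple machines, and `partial_sum` and `parallel_total` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(partition):
    """Work done independently by one worker on its own slice of the data."""
    return sum(partition)


def parallel_total(values, workers=4):
    """Split values into partitions, aggregate each in parallel, then combine."""
    size = max(1, -(-len(values) // workers))  # ceiling division, at least 1
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, partitions))


total = parallel_total(list(range(1, 101)))
```

Each worker only sees its own partition, and the final step combines the partial results, which is the same shape a distributed aggregate takes.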
OLTP vs OLAP framing
- The video references OLTP (Online Transaction Processing) vs OLAP (Online Analytical Processing).
- Rule of thumb:
  - OLTP-like needs → use databases/transaction systems.
  - OLAP-like needs → use data warehouse solutions.
Examples of data warehouse solutions
- Data warehouses mentioned: Teradata, Snowflake, Amazon Redshift.
- Databases mentioned (as examples/sources or OLTP-style systems): Microsoft SQL Server (MySQL is also implied by context), Oracle, PostgreSQL.
Why not just use Spark instead of a data warehouse?
- The video argues Spark supports:
  - parallel processing
  - columnar storage
- But it also contrasts Spark with DWs:
  - Spark can handle semi-structured and unstructured data, while DWs commonly focus on structured data.
  - DWs may lack built-in features like AI/ML APIs and streaming, whereas Spark can support streaming and AI/ML alongside processing.
  - Spark is described as commodity-based (runs on cheaper servers), while DW solutions can be more expensive.
- Conclusion: real systems often use both; architects decide where each fits (Spark where advantageous; DW where advantageous).
“Rules” / design principles for a data warehouse (detailed bullet points)
- Core rule: Data comes from many transactional/source systems
  - A data warehouse is built by taking source data from multiple transactional systems (or other locations).
  - The source data is copied/ingested into the warehouse for analytics.
  - Sources may include:
    - Databases
    - Files
    - Other system types (general “different sources”)
- Integration
  - Integrate data from multiple source systems into one unified environment.
  - Purpose: enable analysis “from one place.”
  - Example contexts implied:
    - Sales performance (from a sales database)
    - Employee performance (from an employee database)
    - Marketing campaign outcomes (from a marketing database)
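The integration rule can be sketched as copying records from several sources into one structure while recording where each came from. The source names (`sales_db`, `employee_db`, `marketing_db`), the fields, and the `integrate` helper are all hypothetical.

```python
def integrate(sources):
    """Merge records from several sources into one list, tagging each with its origin."""
    unified = []
    for source_name, records in sources.items():
        for record in records:
            unified.append({"source": source_name, **record})
    return unified


# Hypothetical source systems, each with its own shape of record.
sources = {
    "sales_db": [{"employee_id": 1, "revenue": 1200.0}],
    "employee_db": [{"employee_id": 1, "name": "A. Example"}],
    "marketing_db": [{"campaign": "spring", "leads": 40}],
}

warehouse = integrate(sources)
sources_seen = {row["source"] for row in warehouse}
```

Once everything sits in one place, a single query can relate sales, employee, and marketing data without reaching back into each source system.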
- Subject-oriented
  - A data warehouse is designed around business subjects/topics (e.g., losses, sales, marketing).
  - It should answer business questions tied to those subjects.
  - Data mart is introduced:
    - Data mart = highly subject-oriented (focuses on one domain, e.g., marketing only).
    - Data warehouse covers a broader set of subjects/questions.
- Time-variant
  - A data warehouse represents historical data.
  - The designer chooses a retention window in advance (examples mentioned: 1 year, 6 months, 3 months, 2 months, 5 years, 10 years).
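One way to picture a retention window is as a date filter applied at load time. The sketch below assumes each record carries an `event_date` field and that the window is expressed in days; both are illustrative choices, not details from the video.

```python
from datetime import date, timedelta


def within_retention(records, as_of, retention_days):
    """Keep only records whose event_date falls inside the retention window."""
    cutoff = as_of - timedelta(days=retention_days)
    return [r for r in records if cutoff <= r["event_date"] <= as_of]


records = [
    {"order_id": 1, "event_date": date(2024, 1, 15)},
    {"order_id": 2, "event_date": date(2022, 1, 1)},
    {"order_id": 3, "event_date": date(2024, 6, 30)},
]

# Roughly a one-year window ending on 30 June 2024.
kept = within_retention(records, as_of=date(2024, 6, 30), retention_days=365)
kept_ids = [r["order_id"] for r in kept]
```

Changing `retention_days` is all it takes to model the other windows listed, such as roughly 90 days for three months.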
- Non-volatile
  - Once data is loaded and the time period boundaries are established, warehouse data is not meant to change continuously.
  - Contrast with transactional sources:
    - Sources are volatile (data keeps updating over time).
  - Goal:
    - Business reports should give consistent results for a given timeframe (avoid the same KPI changing depending on query time).
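The non-volatile rule can be sketched as loading a frozen copy of each period, so later changes in the source do not alter a report. The `Warehouse` class, its methods, and the period label are hypothetical names used only for illustration.

```python
import copy


class Warehouse:
    """Toy store that keeps an immutable-by-convention snapshot per period."""

    def __init__(self):
        self._periods = {}

    def load_period(self, period, source_rows):
        """Copy the source rows once; reloading the same period is refused."""
        if period in self._periods:
            raise ValueError(f"period {period!r} is already loaded")
        self._periods[period] = copy.deepcopy(source_rows)

    def total(self, period, field):
        """Aggregate one field over the stored snapshot for a period."""
        return sum(row[field] for row in self._periods[period])


source = [{"order_id": 1, "amount": 100.0}, {"order_id": 2, "amount": 50.0}]

wh = Warehouse()
wh.load_period("2024-Q1", source)

# The volatile source keeps changing after the load.
source[0]["amount"] = 999.0

# The report still reflects the data as it was loaded.
report_total = wh.total("2024-Q1", "amount")
```

Because the warehouse holds its own copy, the same KPI query returns the same answer no matter when it runs, while the transactional source is free to keep changing.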
Speaker(s) / sources featured
- Speaker: Manish Kumar
- No other named speakers were featured.
- Mentioned products/companies (examples, not speakers): Teradata, Snowflake, Amazon Redshift, Spark, Microsoft SQL Server, Oracle, PostgreSQL.
Category
Educational