Summary of "Week 3: Storage Services in AWS and Azure"
Main ideas and concepts
- Purpose of the lab: learn to create and manage cloud storage, and perform basic object operations (upload, download, delete) in both AWS and Azure.
- Importance of cloud storage: storage is a foundational cloud capability that can hold structured, semi-structured, and unstructured data (images, CSVs, videos, code, backups, logs). It is essential for databases, analytics, and machine learning workflows.
- Two primary services covered:
  - AWS: Amazon S3 (object storage)
  - Azure: Azure Blob Storage (within an Azure Storage Account)
- Key difference: in Azure you create a Storage Account to access Blob Storage; in AWS you create S3 buckets directly.
- Organizational model: S3 buckets and Blob containers group and organize data (similar to folders or databases). S3 bucket names must be globally unique, while container names must be unique within their storage account; you also choose a region and access controls.
- Analogy: S3/Blob storage behaves like disks and folders on your laptop: there is no fixed schema like SQL tables. For table-like operations you bring the data into analytics or database tools.
AWS-specific concepts and linked services
S3 basics
- Create buckets, pick region, set access controls and management options (public/private, IAM, encryption, versioning).
- Store any file types and any volume of data.
- Use buckets to group data (e.g., one bucket for logs, another for ML data).
Linked services and how they interact with S3
- Amazon Athena: run SQL-like queries directly on files in S3.
- AWS Glue: crawls files (CSV, etc.) in S3 and creates table definitions (Data Catalog) from inferred schemas.
- S3 Select: run SQL-like queries directly against individual objects to extract subsets of data without reading the entire object.
- Amazon Redshift: data warehouse that often loads or queries data stored in S3.
- Amazon EMR: Hadoop/Spark big-data processing.
- Amazon QuickSight: visualization on top of data (can use S3/Redshift).
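To make the Athena path above concrete, here is one way it can look in code, sketched with boto3 (assumed installed and configured with AWS credentials; the database, table, and bucket names are hypothetical):

```python
def csv_table_ddl(table: str, columns: dict, s3_path: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for CSV files in S3.

    `columns` maps column names to Athena types, e.g. {"name": "string"}.
    """
    cols = ", ".join(f"{name} {typ}" for name, typ in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"LOCATION '{s3_path}'"
    )

def run_athena_query(database: str, sql: str, output_s3: str) -> str:
    """Submit a query to Athena and return its execution id."""
    import boto3  # deferred import: the DDL helper works without AWS access
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Athena is asynchronous: `start_query_execution` returns immediately, and results land in the S3 output location once the query finishes.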
Typical AWS workflow (lab)
- Create an S3 bucket (unique name, choose region, configure access).
- Upload CSVs or other files.
- Use Athena / S3 Select / Glue (and optionally EMR or Redshift) to query and process data.
- Perform object operations (upload, download, delete) and use SQL-like queries to simulate CRUD.
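The "simulate CRUD with SQL" step can be rehearsed locally before touching the cloud. A minimal sketch using Python's built-in sqlite3 as a stand-in for the SQL engine (the CSV content is invented for illustration):

```python
import csv
import io
import sqlite3

# A small in-memory "CSV file" standing in for an object uploaded to S3.
CSV_DATA = "name,age\nana,34\nbob,28\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# Create: load the CSV rows into the table.
reader = csv.DictReader(io.StringIO(CSV_DATA))
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(row["name"], int(row["age"])) for row in reader],
)

# Read: a filter like the ones Athena or S3 Select would run.
over_30 = conn.execute("SELECT name FROM people WHERE age > 30").fetchall()

# Update and delete round out the CRUD cycle.
conn.execute("UPDATE people SET age = 29 WHERE name = 'bob'")
conn.execute("DELETE FROM people WHERE name = 'ana'")
remaining = conn.execute("SELECT name, age FROM people").fetchall()
```

The same select/insert/update/delete statements translate directly to Athena or Azure SQL once the CSV lives in a bucket or container.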
Azure-specific concepts and linked services
Storage account and Blob storage
- Create an Azure Storage Account first; then create Blob containers to hold objects.
- Upload CSVs or other files into Blob containers.
Linked services and how they interact with Blob storage
- Azure SQL Database: import CSV from Blob Storage into a relational table and run SQL queries.
- Azure Synapse Analytics: large-scale analytics with SQL-like queries on data from Blob Storage.
- Azure Databricks: Spark / Spark SQL for big-data processing and analytics.
Typical Azure workflow (lab)
- Create a Storage Account → create a Blob container → upload CSV/dataset.
- Load data into Azure SQL, Synapse, or Databricks depending on processing needs (relational SQL vs. big-data Spark).
- Run SQL or SparkSQL queries and perform CRUD operations; perform object operations (upload/download/delete).
Practical lab instructions / methodology
Preparation
- Create or use cloud accounts for AWS and/or Azure.
- Obtain or create sample data (CSV files are preferred for SQL-like work). You can download dummy CSVs or create one locally.
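For the "create one locally" option, a few lines of standard-library Python produce a usable dummy CSV (the filename and columns are arbitrary):

```python
import csv

# Invented sample rows; any columns will do for the lab.
rows = [
    {"id": 1, "product": "laptop", "price": 999.0},
    {"id": 2, "product": "mouse", "price": 19.5},
    {"id": 3, "product": "monitor", "price": 179.0},
]

with open("sample_data.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "product", "price"])
    writer.writeheader()
    writer.writerows(rows)
```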
AWS lab steps
- Create an S3 bucket:
  - Choose a globally unique bucket name.
  - Select a region and configure access controls (public/private, IAM), encryption, and versioning if needed.
- Upload files (CSV, images, videos, code, etc.).
- Explore querying options:
  - Use Amazon Athena to define a schema on CSVs and run SQL-like queries.
  - Use an AWS Glue crawler to infer schemas and create Data Catalog tables.
  - Use S3 Select to query individual objects and extract subsets of data.
  - Optional: use Redshift, EMR, and QuickSight for warehousing, big-data processing, and visualization.
- Perform object operations: upload, download, delete; use SQL queries to simulate CRUD.
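The object operations in these steps can be sketched with boto3 (assumed installed and configured with credentials; the bucket name is hypothetical and must be globally unique):

```python
def object_key(prefix: str, filename: str) -> str:
    """Join a key prefix and a filename into an object key like 'raw/data.csv'."""
    return f"{prefix.strip('/')}/{filename}"

def s3_object_lab(bucket: str, region: str = "eu-west-1") -> None:
    """Create a bucket, then upload, download, and delete one object."""
    import boto3  # deferred so object_key stays usable without AWS access
    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(
        Bucket=bucket,
        # Required outside us-east-1; omit it when creating in us-east-1.
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    key = object_key("raw", "sample_data.csv")
    s3.upload_file("sample_data.csv", bucket, key)   # upload
    s3.download_file(bucket, key, "downloaded.csv")  # download
    s3.delete_object(Bucket=bucket, Key=key)         # delete
```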
Azure lab steps
- Create a Storage Account.
- Create a Blob container and upload CSVs/datasets.
- Depending on goals:
  - Use Azure SQL Database to import CSVs and run SQL (select, insert, update, delete).
  - Use Azure Synapse Analytics for large-scale ingestion and SQL-like analytics.
  - Use Azure Databricks to run Spark SQL for large-scale processing or transformations.
- Perform object operations (upload/download/delete) and run SQL/Spark SQL queries.
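The Azure object operations can be sketched with the azure-storage-blob SDK (assumed installed; the connection string, account, and container names are hypothetical):

```python
def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the blob's URL; handy for verifying uploads in the portal."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob_name}"

def blob_object_lab(conn_str: str, container: str) -> None:
    """Create a container, then upload, download, and delete one blob."""
    from azure.storage.blob import BlobServiceClient  # deferred import
    service = BlobServiceClient.from_connection_string(conn_str)
    client = service.create_container(container)
    with open("sample_data.csv", "rb") as fh:
        client.upload_blob("sample_data.csv", fh)             # upload
    data = client.download_blob("sample_data.csv").readall()  # download
    with open("downloaded.csv", "wb") as out:
        out.write(data)
    client.delete_blob("sample_data.csv")                     # delete
```

Note the extra layer compared with AWS: the connection string belongs to the Storage Account, under which containers and blobs live.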
Tasks to perform and verify
- Upload different file types and sizes to confirm object storage behavior.
- Create tables or catalog entries from CSVs (Glue for AWS; import into Azure SQL or Synapse).
- Run SQL or Spark SQL queries that perform select, insert, update, delete (or equivalents).
- Download objects to verify retrieval; delete objects to verify removal and access controls.
Notes, tips, and caveats
- Regions and availability: not all services/features are available in every region — choose regions carefully.
- Naming and access: follow bucket/container naming best practices and manage access with IAM/Azure RBAC.
- For large datasets, prefer big-data tools (EMR, Databricks, Synapse) over single-node relational engines.
- Monitor costs (storage, queries, data transfer) when using cloud resources.
- Document the services you explore and their behaviors (permissions, performance, region limits).
- Practice end-to-end flows: upload CSV → register schema/catalog → query and modify data → visualize or export results.
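The naming tip can be made concrete: S3 bucket names must be 3–63 characters of lowercase letters, digits, dots, and hyphens, starting and ending with a letter or digit. A small validator covering the common rules (not every edge case, e.g. IP-address-shaped names):

```python
import re

# 3-63 chars, lowercase/digits/dots/hyphens, letter-or-digit at both ends.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    """Check the common S3 bucket naming rules (length, charset, endpoints)."""
    return bool(BUCKET_NAME_RE.fullmatch(name)) and ".." not in name

# Examples:
# is_valid_bucket_name("my-ml-data-2024")  -> True
# is_valid_bucket_name("My_Bucket")        -> False (uppercase and underscore)
```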
Learning objectives and takeaways
- Understand and use object storage on both major clouds (Amazon S3; Azure Blob under a Storage Account).
- Learn how to organize data (buckets/containers) and how to move from raw files to queryable tables.
- Get hands-on experience with AWS Athena, Glue, S3 Select and Azure SQL, Synapse, Databricks for querying and processing stored data.
- Appreciate the role of storage in ML, analytics, and database workflows.
- Remember region and availability constraints; practice CRUD operations and basic data engineering flows.
Speakers / Sources featured
- Unnamed instructor/presenter (video narrator)
- Services mentioned: Amazon S3, Amazon Athena, AWS Glue, S3 Select, Amazon Redshift, Amazon EMR, Amazon QuickSight, Azure Storage Account, Azure Blob Storage, Azure SQL Database, Azure Synapse Analytics, Azure Databricks
Category: Educational