I ended up picking up this modern classic after a lot of trepidation. It is voluminous & the author says it is meant for programmers so what was I thinking! But going by reviews and the sample read on Kindle, it looked like this would be a very useful book for even product leaders & digital business leaders like me: and it is! Understanding the principles behind how data-intensive applications are typically built today is really useful especially when working with the technology team in your organization and while you do not come close to becoming an expert on anything, the perspective you will be able to develop will be invaluable.
This is a long-form summary of the book & hopefully, besides serving as a refresher for me, it encourages more non-programmers to buy and read this book. You can buy it here.
Let’s dive in to the first part of the book which talks about the fundamental design principles behind data intensive applications.
Chapter 1: Reliable, scalable & maintainable applications.
Many applications are data-intensive, instead of being compute-intensive and the key challenges such applications solve are the amount of data to be handled, the complexity of the data, and the speed at which it is changing. They need to support:
- Store data and allow it to be read later.
- Data caching
- Search indexes
- Stream processing
- Batch processing
Today such data systems are no longer built by using a single tool. Instead, a multitude of smaller, general purpose components are stitched together to create a new, special-purpose application and a data system. When designing such systems, three concerns are most important in most cases.
Let’s look at each one of them in some detail.
It generally means ,”continuing to work correctly, even when things go wrong”. It is impossible to reduce the probability of a fault (i.e. one component not working as expected); a well designed system is however, fault tolerant that prevents faults from causing failures (i.e. the system as a whole fails). Sometimes it can make sense to deliberately increase the rate of faults in a system so that it is tested more often for fault tolerance; the Netflix ChaosMonkey is an example.
There are mainly 3 kinds of faults.
E.g. include hard disk crashes, faulty RAM, power blackout. A data center with 10k disks should expect 1 disk failure per day since MTTF (mean time to failure) is between 10 to 50 years.
One way to tackle hardware errors is to add redundancy; Dual power supplies, hot-swappable CPUs etc. As applications have started using large number of machines however, rate of hardware faults have increased too. This has led to adoption of software fault-tolerance techniques in addition to hardware redundancy which provides operational advantages like in the case of rolling upgrade.
Hardware faults are often uncorrelated (one machine failure does not cause failure in another one) or weekly correlated (e.g temperature in a server rack). Software errors, however, can be systematic errors within the system; harder to anticipate but can cause many more system failures. E.g. include software bugs that cause every instance of an application server to crash, a runaway process that uses up some shared resource like CPU time or disk space, service slows down, or cascading failures.
There is no quick solution for eliminating such systematic faults. Thorough testing, thinking about assumptions in the system, allowing processes to crash and restart, measuring, monitoring, analyzing system behavior in production; all such steps help.
Operators who run the systems are humans. And humans are known to be unreliable. How do we make systems designed by them to still be reliable. Some approaches are
- Design systems so as to reduce opportunities for error (e.g well designed APIs or admin interfaces).
- Decouple the place where mistakes happen to where they cause failure. E.g provide fully featured, non-production sandbox for experimentation.
- Test thoroughly at all levels (unit testing, system integration).
- Allow quick recovery from human errors. E.g. allowing easy roll back of configuration changes, rolling upgrades.
- Set up detailed monitoring for performance metrics and error rates.
- Implement good management practices and training.
Scalability is about a system’s ability to cope with increased load. The way load is defined depends on the application. Few e.g. include requests per second to a web server, ratio of reads to writes, number of simultaneously active users in a chat room.
Two main operations on Twitter are
- Posting tweets (average volume is 4.6k/sec which jumps to 12k/sec at peak)
- Reading tweets from people they follow on the home timeline (300k/sec).
The challenge in scaling is not writing 12k tweets per sec to the DB but the fan-out: each tweet on average is delivered to 75 followers and each tweet has to be pushed to the home timeline of its users. Two ways to handle the load are
- When a tweet is posted, insert it into a global collection of tweets. User’s home timeline is created when they request it: tweets are looked up from all those that they follow and merged into the timeline. This approach is fast to write but slow to handle the more critical operation of reading timelines, given the volume involved (300k/sec) is much higher than for writing tweets (12k/sec).
- When a tweet is posted, push the tweet into the home timeline cache of all people who follow them. The home timeline has been pre-computed hence and hence reduces the load on reading tweets, although posting a tweet has now become a much heavier operation.
One downside of the second operation is that the distribution of followers per user varies widely; some celebrities have more than 30 million followers. When they tweet, delivering tweets to all their users quickly becomes a challenge. Twitter, hence uses a hybrid approach. A tweet posted by those with a large number of followers is not merged to the home timeline of the followers in real-time but only when the followers come to read. For others, tweets are fanned out to the caches when the tweets are posted. This approach is able to deliver the optimum results.
In the above case, the distribution of followers per user (maybe weighted by how often they tweet) has become a key load parameter for scalability.
When load increases, two things could happen. Either performance goes down or more resources need to be added to keep performance unchanged. Performance is usually described in numbers, e.g. for a batch processing system such as Hadoop, throughput (number of records per second, or time taken to process n records) is important. For online systems, response time usually matters. [(Note: latency is not the same as response time. Response time = service time (time to process the request) + latency (waiting time due to network delays, queueing delays)].
Response time can vary a lot even for the same service requests due to a variety of reasons (loss of network packet & TCP retransmission, a page fault forcing a read from disk etc.). Due to this, it is best to look at it in percentiles; 95th percentile (95p) response time of 1.5 seconds means that 95 out of 100 responses took less than 1.5 seconds. Its gets more and more expensive as u optimize for higher percentiles.
A client may typically define SLOs (service level objectives) as part of the SLAs (service level agreements) which could look something like this: median response time of 200ms and 99p under 1 sec with service uptime at 99.9%.
When doing performance testing, the response times should always be measured on the client-side. Performance optimization often requires identifying and eliminating slow calls.
Coping with load
Scaling up (i.e. Vertical scaling, moving to a more powerful machine) vs scaling out (horizontal scaling, distributing the load across multiple smaller machines) are two usual choices. The ideal approach may be a hybrid for many applications. Systems may be required to be elastic (i.e. being automatically able to add resources on increase in load. Useful when the load is highly unpredictable) or maybe scaled manually, which are simpler to operate.
The route to scaling is usually very application-specific since the key load parameter may be different (e.g. volume of writes, the volume of reads, the volume of data to store, complexity of data, etc.). For e.g, a system that needs to handle 10k requests per sec, each 1 KB in size is very different from a system that needs to handle 3 requests per minute, each 2GB in size.
3 key areas to focus on for maintainability are
- Operability – making it easy for teams to operate.
- Simplicity – making it easy for engineers to understand the system by simplifying it.
- Evolvability – making it easy to make changes in the future as requirements change.
Operations teams take care of a multitude of tasks like monitoring system health, tracking & fixing system failures / degraded performance, keeping software up to date, capacity planning, establishing standards for deployment & configuration, maintaining system security, and preserving system knowledge.
To make these tasks easy, data systems can do many things; ensure good monitoring of system internals, good automation support, avoiding dependency on individual machines, good documentation, providing good default behaviour (but allowing defaults to be overridden when needed) & exhibiting predictable behaviour.
Simplicity often suffers as projects get larger; tight coupling of modules, inconsistent terminology, performance hacks, etc. This makes maintenance hard & often leads to budgets and schedule overruns.
Abstractions are often a good way to remove accidental complexity; it can hide implementation details behind the clean, easy-to-understand facade, can allow code reuse across different applications leading to higher-quality software. High-level programming languages are an example of abstraction as they hide lower-level machine code behind them.
System requirements change with time; new use cases emerge, business priorities change, new features are requested, new platforms replace old ones, legal changes etc.
Agile patterns provide a framework for adapting to change. Test-driven development (TDD) and refactoring are examples of tools/patterns that the agile community has come up with. Agile techniques don’t need to focus on a small, local scale only but also on larger data levels or across multiple applications.
Chapter 2: Data models & Query languages
Applications are built by stacking layers of data models one on top of another. These layers are built by different groups of people and hence each layer hides a clean data model that hides the complexity of all the layers below it.
Types of data models
We will look at following types of data models for storage
- Relational model,
- Document model &
- Graph-based model.
- & briefly look at some other data models.
Document & graph based models are part of non-relational models, often colloquially referred to as NoSQL.
The relational model has data organized into relations (called tables in SQL) and tuples (rows in SQL). They were originally used for biz data processing, e.g transaction processing (sales/banking transactions, stock keeping in warehouses) & batch processing (payroll, reporting). E.g include SQL, IBM DB2, PostgreSQL.
The reasons behind the emergence of NoSQL included the need for greater scalability, preference for free & open source software, specialized query operations and the need to go beyond the restrictions relational schemas impose. E.g include MongoDB, RethinkDB, CouchDB, Espresso.
All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (assumed on read). Each data model comes with its own query language or framework: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog.
Document model vs. relational model
To understand, when a document model is useful, lets look at the challenges in expressing a linkedin profile in a relational schema. Let’s look at the profile below.
The profile has some fields that appear only once per user (e.g name) but many others which have multiple entries (e.g experience & education). This means that there is a one-to-many relationship from the user to these items and that can be represented in multiple ways.
- The traditional SQL way is to put positions, education & contact information in separate tables, with a foreign key reference to the user’s table, as shown on the right in the image above. Later versions support more ways to store using JSON / XML structured data.
- The alternate is to store the resume, which is a self-contained document using a JSON representation, shown below.
The JSON representation has better locality, i.e all relevant data is in place & one query is sufficient. In the relational example that we discussed before this, you either need to perform multiple queries or perform a messy multi-way join between users table & its subordinate tables.
The one-to-many relationships from the user profile to the user’s positions, education etc imply a tree structure in the data as shown below.
Not all data relationships are one-to-many. There are many-to-many relationships. And also, many-to-one relationships, for e.g those introduced by normalization. [Normalization in databases is about storing repeating data as IDs instead of plain-text strings. E.g geographic regions, industries, institutions, etc. Storing as IDs has many advantages. Changing the value of the ID in one place, updates all occurrences of the data resulting in improved consistency, minimized effort, localization (language) support, better search & reduced write overheads. Normalization hence creates many-to-one relationships (many people work in one particular region)].
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
It’s not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items. For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models are the most natural.
There is some convergence of relational and document databases happening, however. Relational databases have started supporting XML and JSON (e.g PostgreSQL since v9.3, MySQL since v5.7, IBM DB2 since v10.5), allowing functionalities similar to document databases. Similarly, document databases have started supporting relational database like features (e.g RethinkDB supports relational-like joins in its query language). So these database types are becoming more similar over time. A hybrid of both is a good route for databases to take in the future.
Query languages: Declarative vs. imperative.
Query languages are of two types: imperative & declarative. An imperative language tells the computer what to do & how to do it; it asks for specific data by asking for required operations to be performed in a particular order. In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want — what conditions to meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated) — but not how to achieve that goal. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
Declarative languages are concise and easy to work with and also make it possible to introduce performance improvements in the database system without requiring any changes to the queries. Finally, they also support parallel execution (i.e multiple cores & machines) since they only specify the result, not the algorithm used to determine the results.
MapReduce is a model, made famous by Google, for processing vast amounts of data spread across large distributed machines.
When many-to-many relationships dominate, graph models become more appropriate. A graph consists of vertices (nodes / entities) and edges (relationships). E.g of data where graph models include car navigation systems for searching shortest paths between 2 points on a road network. Another example of graph model being used on heterogenous data is facebook using a single graph to maintaining different types of vertices (people, locations, events, checkins, comments) & edges (which people are friends with each other, which checkin happened in which location, who attended which event).
Some of the ways of
- structuring data in graphs are the property graph model (implemented by Neo4j, Titan, and InfiniteGraph) and the triple-store model (implemented by Datomic, AllegroGraph, and others).
- querying data in graphs: 3 of the declarative query languages for graphs are Cypher, SPARQL, and Datalog.
Each vertex consists of
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)
Each edge consists of
- A unique identifier
- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices A collection of properties (key-value pairs)
For better visualisation of property graphs and how to query it, lets use the following example: it shows two people, Lucy from Idaho and Alain from Beaune, France. They are married and living in London.
Graphs offer very high flexibility for data modeling: any vertex can be connected to any other vertex, using edges (e.g people, cities, countries, continents) which can be labeled with different kind of relationships (e.g within, lives in, born in) allowing may kinds of info to be stored in a single graph.
Cypher query language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database. For the above example of Lucy and Alain, below is an example of a Cypher graph query to find out people born in US but who have migrated to Europe.
Given this is a declarative query, execution details of the query are not required. The query optimizer automatically figures out the best strategy to extract the results.
Triple-stores graph model
In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object. The subject of a triple is equivalent to a vertex in a graph.
The SPARQL query language
SPARQL is a query language for triple-stores. The same query as before — finding people who have moved from the US to Europe — is even more concise in SPARQL than it is in Cypher.
Other data models
There are still many data models left unmentioned. To give just a few brief examples:
- Researchers working with genome data often need to perform sequence-similarity searches, which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. Researchers have written specialized genome database software like GenBank.
- Particle physicists have been doing Big-Data style large-scale data analysis for decades, and projects like the Large Hadron Collider (LHC) now work with hundreds of petabytes! At such a scale custom solutions are required to stop the hardware cost from spiraling out of control.
- Full-text search is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject
Chapter 3: Storage & retrieval
A database needs to do two fundamental things: store the data and retrieve the data when queried. In this chapter, we look to study the following areas.
- Two families of storage engines: log-structured & page-oriented engines like B-trees.
- Various kinds of indexes (a data structure used to query the main database).
- In-memory databases (vs on-disk databases).
- The difference between storage engines optimized for transactional workloads and those meant for analytics.
Log-structured engines, append data at the end of the file, allow deleting obsolete files but do not allow updating data once written. On the other hand, update-in-place engines, treat the disk as a set of fixed-sized pages; B-trees are the biggest example. The former, a relatively recent development, deliver good performance when writing data because appending data is generally very efficient. However, reading data is not, because if the number of records double, a lookup takes twice as long.
To query databases efficiently, we hence need another data structure: an index. They involve keeping additional metadata, derived from the primary data, on the side which acts as a signpost to locate the data that is needed to be extracted. However, indexes usually slow down writes since the indexes also need to be updated every time data is written. This is an important trade-off with indexes (they accelerate query responses but slow down writes) and requires making careful choices about them.
Key-value stores are quite similar to dictionary type which are usually implemented as hash map (hash table). If our storage data consists of only appending to a file, one way to implement the index is to keep an in-memory hash map where every key is mapped to a byte offset in the data file — the location at which the value can be found, as illustrated below. Whenever you write new a new record, you also update the hash map. When you want to read a value, use the hash map to find that location, and read the value.
This can offer high-performance reads and writes, if all the keys fit in the available RAM, since the hash map is kept completely in memory. This is how Bitcask (default storage engine in Riak) does.
If we keep appending to a file, we will eventually run out of disk space. To avoid that, we break the log into segments of a certain size by closing a segment file when it reaches a certain size and making subsequent writes to a new segment file. This is then followed by compaction (throwing away duplicate keys in the log, and keeping only the most recent update for each key) and merging several segments together, as shown in the figure below.
The merging and compaction of segments can be done in a background thread, and while it is going on, we can still continue to serve read requests using the old segment files, and write requests to the latest segment file. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments — and then the old segment files can simply be deleted.
Some areas to handle in the above approach include choosing the right file format, deletion of records, crash recovery, partially written records & concurrency control.
Why not overwrite the file in place instead of appending new data? Because an append-only design offers a few critical benefits: much faster writes since it is sequential, simpler concurrency & crash recovery (if a crash happens during overwriting, you could have a file containing old & new data mixed up) and finally, avoiding data fragmentation over time by merging old segments. On the other hand, limitations of hash table index are that it must fit in memory (instead of disk) and range queries are not efficient.
SSTables and LSM-Trees
Unlike log segments with hash tables, SSTable (Sorted String Table) keep key-value pairs sorted by key. These have some advantages over the former. These include making the merging of segments simple & efficient, allowing working with a sparse index (typically, only one key for every few kilobytes of segment file is included in the index) & finally, compression allowing less usage of disk space and reduced I/O bandwidth use.
To get the keys sorted in the first place, well known tree data structures, like red-black trees or AVL trees are used; you insert keys in any order and read them back in sorted order.
This storage engine works by writing out writes to a memtable (in-memory). The memtable is pushed out to the disk as a segment if its size crosses a threshold, while a new memtable instance is created to allow for new writes. Reads look for the key first in the memtable and then in the on-disk segments (recent ones are searched first). On-disk segments are merged and compacted from time to time.
Storage engines described above, based on the principle of merging & compacting sorted files are called LSM storage engines (Log-structured merge tree). Examples include LevelDB & RocksDB. Similar engines are used in Cassandra and HBase. Lucene, the indexing engine for full text search used by Elasticsearch & Solr, too uses this method, though there is much additional complexity involved.
LSM tree algorithms use multiple performance optimizations. One example is the use of Bloom filters to get a faster response when the key being searched for is not there. Bloom filter is a memory efficient data structure for approximating the contents of a set and can tell you quickly if the key is not there.
LSM trees hence can handle datasets even if they are bigger than available memory, can efficiently perform range queries and because disk writes are sequential, they support very high write throughput.
B-trees are the most widely used indexing structures; are standard for most relational DBs and often found in non-relational DBs too.
While B-trees too keep key-value pairs sorted by key like SSTables, they do not break the data into variable-size segments. Instead they use fixed-sized blocks or pages, and read/write one page at a time. These pages reference one another by using an address and typically create a tree of pages consisting of root and child pages. The page containing the individual keys is called the leaf page.
B-trees update data on a page in place instead of appending data as LSM-trees do. To improve reliability, B-trees use WAL, (write-ahead log) which is an append-only file, where every update is written to first, before applying to the page itself. Concurrency control (when multiple threads access the data simultaneously) is improved by protecting the tree’s data structures with latches (lightweight locks).
Comparing B-trees and LSM-trees
B-trees are usually faster for reads (since LSM trees have to check across memtables and multiple segments at different stages of compaction) while LSM-trees usually score higher for writes.
LSM trees have 2 advantages: higher write throughput and higher compression. The higher write throughput is because a) they have lower write amplification (one write to the DB resulting in multiple writes to the disk), something of particular concern on SSD disks which have limited life in terms of overwriting blocks b) they sequentially write compact SSTable files rather than overwriting several pages in the tree.
LSM trees also suffer from 2 disadvantages. Compaction process can sometimes cause slower reads and writes. Also, B-trees store one key in only place, and hence higher support for transactional semantics; transactional isolation is implemented using locks on range of keys and in B-tree index, these locks can be directly attached to the tree.
So far we have discussed key-value indexes which are like primary key indexes, where the key uniquely identifies a row/document/vertex. Secondary indexes are ones, where the indexed values are not necessarily unique. They can easily be constructed from a key-value index and both B-tree and log-structured indexes can be used as secondary indexes.
Secondary indexes might have multiple rows (documents, vertices) under the same index entry; the value can be either an actual row or a reference to a row. In the latter case, sometimes, the extra hop from the index to the heapfile (place, where the rows referred, are stored) can result in a performance penalty, so it can be desirable to store the indexed row directly within an index, which is known as clustered index. A compromise between storing all row data within the index (clustered index) and non-clustered index is known as covering index, which stores some of a table’s columns within the index. Clustered and covering indexes speed up reads (since they duplicate data) but can add to overhead on writes.
If you need to query multiple columns in a table or multiple fields of a document simultaneously, multi-column indexes are the way to go. A standard B-tree or LSM-tree is not able to support this efficiently. Lets look at 2 examples when they would be useful
- Geospatial query: The user is searching on a restaurant-search website within a geographical block. The query needs to constrain both latitude & longitude simultaneously when querying the data. Often, specialized spatial indexes such as R-trees are used.
- A query on weather observations data may need to look at all observations matching a certain date & temperature range.
Full text-search & fuzzy indexes
When instead of exact data, the need is to search for similar keys (e.g misspelled words), fuzzy querying techniques are required. For e.g, full-text engines need to allow a search for one word to be expanded to include typos, grammatical variations, similar words. Lucene, is able to search text for words within a certain edit distance (an edit distance of 1 means that 1 letter has been added, removed or replaced).
In general, memory offers higher performance because disks require laying out data to be laid out carefully to get good performance on reads and writes. However, disks offer higher durability (their contents are not lost if power is turned off) and they have a lower cost per gigabyte than RAM.
With falling costs for RAM, one of the advantages for disks is getting eroded. In-memory databases, hence are getting used, especially if the data set is not too big. Memcached, is a in-memory key-value store, intended for caching only, where it’s acceptable for data to be lost on restarting the machine. But other in-memory databases aim for durability by various means (e.g battery powered RAM, writing backups to the disk or to in-memory DB on other machines).
VoltDB, MemSQL and Oracle TimesTen are in-memory databases with a relational model that offer big performance improvements (as per the vendors). RAMCloud is an open-source, in memory key-value store with durability. Redis and Couchbase provide weak durability as they write to the disk asynchronously.
Besides performance advantage, in-memory databases could find preference over disks in other areas. Redis offers a data model for priorty queues and sets, difficult to implement with disk based indexes. Another example is using the anti-caching approach (evicting least recently used data to disk when memory is full & bringing it back again later when needed) can be used to support datasets larger than available memory.
Transaction vs. Analytics.
Comparing characteristics of transaction processing (called OLTP viz. online transaction processing) vs analytics patterns (OLAP viz. online analytic processing) is best summed up by the following table. Both require very different storage and querying systems.
OLAP systems are now typically run on a separate database called data warehouse.
An enterprise may have dozens of different transaction processing systems: customer-facing website, point of sale systems in physical stores, tracking inventory, administering employees, etc. These OLTP systems usually see a high volume of queries, need to be highly available, and process transactions with low latency (often by using an index) since they are often critical to the operation of the business. Disk seek time is often the bottleneck here. Database administrators are usually reluctant to let business analysts run ad hoc analytic queries on an OLTP since those queries (albeit lower in volume) are often expensive, scanning large parts of the dataset, which can harm the performance of concurrently executing transactions. Disk bandwidth (not seek time) is often the bottleneck.
A data warehouse, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the warehouse is known as Extract–Transform–Load (ETL).
The advantage of having a separate data warehouse is that it can be optimized for analytics queries specifically. OLAP is usually relational and uses SQL to run queries. Many graphical tools generate queries, visualize the results to help analysts make decisions. Most vendors now specialize only in either one of OLTP or OLAP. Commercial licensed systems include Terradata, Vertical, SAP HANA & ParAccel (Amazon RedShift is a hosted version of ParAccel). Emerging open source SQL-on-Hadoop projects include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo & Apache Drill.
Schemas for Analytics: Stars & Snowflakes
Star schema is fairly common data model for analytics. The schema usually consists of fact-tables (real life events like a customer’s purchase of a product or, for a website, a page view or click) surrounded by dimensional tables (they represent the who, where, what, when, how and why of the event). The connections are like the rays of a star.
A variation of this template is known as the snowflake schema, where dimensions are further broken down into subdimensions. They have higher normalization but harder to work with than start schemas.
Fact tables often have trillions of rows and petabytes of data. Querying them efficiently is a challenging problem. This is often solved by using column-oriented storage because most queries only access few of the columns in such huge tables: by storing all the values from each column together (instead of storing all values from one row together like transactional systems do) in a separate file, a query only needs to read and parse those columns that are used in that query, which saves a lot of work. Column storage is used even in non relational data models: Parquet supports document data model, based on Google’s Dremel.
The column-oriented storage layout relies on each column file containing the rows in the same order.
Column compression allows to further reduce the demand on the disk (beyond loading only required columns) by compressing data in a column since it is often highly repetitive and the count of distinctive values is much smaller than the number of rows. Bitmap encoding is one technique used for compression.
Memory bandwidth & vectorized processing.
Column oriented storage not only reduce the bandwidth for getting data from disk into memory by reducing the amount of data to be loaded: they are also good at making efficient use of CPU cycles by leveraging the L1 cache through a technique called vectorized processing.
Sort order in column storage
Sorting is leveraged in 3 ways to make column storage more efficient.
- Use knowledge of common queries. For e.g if queries often target date ranges, making date_key the first sort key is likely to help the most. The query can only scan rows from the last month, making it much faster. A second column can determine the sort order of any wors that have the same value in the first column, e.g product_sk allowing for all sales of the same product to be grouped together.
- Another advantage of sorted order is compression benefits. If the primary sort column does not have many distinct values, it could allow the column to be compressed to a few kilobytes.
- Store the same data sorted in several different ways. One, different queries benefit from different sort orders. Second, data needs to be replicated to different machines anyways (backup/performance), so might as well sort in different ways and use the appropriate sorted version for each query.
Writing to column-oriented storage
The optimizations mentioned above make read-only queries faster which consists of the bulk of the load in data warehouses. However, they also make writes more difficult. A good solution is LSM trees discussed earlier: all writes go to an in-memory store, added to a sorted structure, prepared for addition to the disk & then added to files on disk when enough writes have accumulated.
Aggregation: Data Cubes & Materialized Views.
Materialized aggregates are used for queries that involve an aggregate function, like SUM, COUNT, AVG in SQL. If the same aggregates are used by many different queries, crunching raw data is wasteful every time and a pre-computed cache helps; this is referred to as a materialized view.
A common special case of a materialized view is data cube or OLAP cube: a grid of aggregates grouped by different dimensions.
Such aggregates are typically used only as a performance boost for certain queries.
Chapter 4: Encoding & evolution
Applications change with time (new required features, changed biz circumstances, etc) which in turn require changes to how data is stored. Handing this change in data format or schema requires a change in the application code (e.g if a field is added to a record, the code should be able to start reading and writing that field).
In such situations, evolvability of code is important, for allowing rolling upgrades i.e upgrading different parts of the system independently and not having to change everything at once (e.g in server-side applications with distributed nodes, or with client-side applications, where all clients will unlikely all upgrade at the same time)
Evolvability in turn requires compatibility in both directions. This means that different versions of data can coexist with different versions of code.
- Backward compatibility: New code being able to read data written by old code.
- Forward compatibility: Old code being able to read data written by new code.
Formats for encoding data
Programs work with data in two different representations
- In-memory. Data is structures are optimized for efficient use by the CPU.
- When writing data to a file or sending over the network, data has to be encoded in byte sequences.
Translation is needed between these two. Encoding (also called serialization or marshalling) means translation from in-memory to byte sequence. Reverse is called decoding (parsing, deserialization, unmarshalling).
There are many different libraries and encoding formats available. The details of these encodings affect their efficiency, the architecture of the applications and the options for evolving them.
- Language-specific formats: E.g Java has java.io.Serializable, Python has pickle. But these have deep problems (e.g they work best with that particular language only) and should be avoided.
- Standard encodings: E.g JSON, XML, CSV. These are widely used (especially the first two), somewhat human-readable although each one of these also has some specific issues. However, their high acceptance across different organizations make them very useful when transferring data.
- Binary encoding: These are compact or faster to parse and can be useful when dealing with big datasets within your organization. Schema-less binary encodings include JSON (MessagePack, BSON, BJSON, UBJSON, BISON, and Smile and for XML (WBXML and Fast Infoset). Encodings with schema include Apache Thrift (originally developed by Facebook), Protocol Buffers (originally developed by Google) & Avro. While JSON & XML are widely popular, encodings based on schemas are also viable options because they have some useful properties.
Modes of dataflow
As discussed earlier, encoding & decoding is required when data flows from memory to file/network or vice versa. We will look at 3 common modes of data flows.
A. Dataflow through databases
Writing to a database means you are encoding the data, while reading from a database means you are decoding the data. Backward & forward compatibility is important to support for many reasons.
- The database may still have 5-year-old data even though the new version of the application may have replaced the older version in few minutes (often summed up as data outlives code).
- Different applications maybe are accessing the same code.
- Even the same application may have different versions of code running on different nodes.
- When pushing a snapshot of your database into your data warehouse, you may use the latest schema, even if the original encoding in the source database contains mixtures of various schema versions.
B. Dataflow through services: REST & RPC
Processes need to communicate over a network & the most common way is for a client to connect to a server and make requests to its API (know as service). Web services are used in many contexts.
- Request from one service to another within the same organization (Services oriented or microservices architecture).
- Request from one service to another owned by a different company (e. g OAuth service, credit card transaction).
REST & SOAP
Two popular approaches to web services are REST & SOAP: both are completely different in philosophy.
The REST design philosophy is built on principles of HTTP, emphasizing simple data formats, using URLs for identifying resources, using HTTP features for cache, authentication, and content type negotiation. An API designed for it is called RESTful. SOAP is an XML-based protocol for making network API requests with complex and multiple standards requiring active vendor support. So while SOAP is still used by large organizations, REST has gained popularity especially for cross-organization service integration, and is associated with microservices.
Another set of technologies for making API requests over a network are based on the idea of a remote procedure call, i.e RPC. The RPC model tries to make the network request as if it is calling a local function/method in the programming language with the same process. This process is fundamentally flawed because a network request is very different from a local function call since the former has many more complications to deal with: request/response may be lost due to a network problem, a timeout may happen, network latency may be high, client & service may be implemented in different programming languages.
In spite of the flaw, RPC continues to be in use and various RPC frameworks have been build on top of encodings discussed earlier. These have some useful properties too while REST has other significant advantages. REST is typically used for public APIs while RPC frameworks are used between services owned by the same organization, typically within the same data center.
C. Message-passing Dataflow
In Asynchronous message-passing systems, a client’s request (a message) is delivered to another process (consumers or subscribers) through a message broker (also called a message queue) which stores the message temporarily. Messages are encoded by the sender and decoded by the recipient. Using a message broker has several advantages.
- act as a buffer if recipient is unavailable / overloaded
- automatically redeliver messages if process has crashed thus prevent messages getting lost.
- sender does not need to know the IP & port of the recipient (useful in a cloud deployment)
- one message can be sent to several recipients
- decouples sender from the recipient (sender publishes messages without caring who consumes them).
Open-source examples of message brokers include RabbitMQ, Apache Kafka, ActiveMQ, HornetQ, NATS which have become popular.
Chapter 5 onwards
The above was the summary for only the first part of the book, where Kleppmann focuses on data stored on only one machine. In part 2 of the book, he focuses on data that is distributed across multiple systems. This helps scalability but brings with it other associated issues. Finally, in part 3, the author focuses on derived datasets typically found in heterogeneous systems where applications need to integrate data from multiple datasets, caches, indexes, etc.
Hoping to write the summary for the next two parts soon! And if it gets delayed, just go directly buy the book 🙂