Brewer’s CAP Theorem in Big Data Explained

By | December 20, 2017
 

Let us learn about Brewer’s CAP theorem in Big Data with an explanation of distributed databases and NoSQL databases.

What Is A Distributed Database?

Before we understand CAP theorem in Big Data, it is important to understand the concept of distributed database systems. It is basically a network partitioning scheme.

A distributed database is a database wherein the storage devices are not attached to a common processor.

There could be multiple physical databases stored in different locations interconnected to each other via a network. These distributed physical stores are known as data nodes.

The centralized distributed database management systems manage these data nodes logically as if they’re stored at one location.

What Is Brewer’s CAP Theorem?

The CAP theorem in Big Data was coined by a computer scientist named Eric Brewer and therefore, it is named after him. In theoretical computer science domain, the CAP theorem is, therefore, popularly known as Brewer’s theorem.

The CAP system model is a single read-write register. The Brewer’s CAP theorem primarily consists of the following properties:

  • Consistency
  • Availability
  • Partition Tolerance

The CAP theorem states that at any given point in time, a distributed database (system) does not guarantee the existence of all the components viz., Consistency, Availability and Partition Tolerance.

Brewer's CAP theorem in big data Explained With Example

In other words, the CAP theorem states that for any given distributed database, it is impossible to offer more than the two properties of the theorem.

Simply putting it, the CAP theorem provides the basic requirements that a distributed system must follow.

The PACELC theorem, an extension of CAP theorem, states that even in the absence of partitioning tolerance, another trade-off between consistency and latency to occur.

The data nodes are distributed across a network and there’s a high possibility of network failures creating issues while accessing the data.

Note: The consistency property of the CAP theorem is very different from the consistent property of the ACID properties in SQL.

Properties of NoSQL CAP Theorem in Big Data

What Does Consistency Mean?

The consistency property refers guarantees that all the data nodes in a distributed database systems return the same and the most recently stored data.

Therefore, the end users will get the same data on their systems regardless of the data node that he/she is trying to access.

The consistency property actually also has a strong bond with database linearizability which is a very important factor in concurrent systems.

   

This property also states the fact that the system should always be in a consistent state after a transaction. However, if it goes into an inconsistent state, the transactions will be rolled back.

There are a lot of different consistency models for CAP theorem in big data enlisted below:

  • Processor consistency
  • Cache consistency
  • Sequential consistency
  • Vector-field consistency
  • Local consistency
  • Fork consistency
  • PRAM consistency
  • Delta consistency
  • Serializability
  • One-copy serializability

What Does Availability Mean?

The availability property states that the database system will always respond to a request irrespective of the consistent data. Even if it’s a failure, there should be a response from the system.

However, the availability property requires that the system must always be operational and connected to the network.

Every non-failing data node should, therefore, respond to all the read and write requests in a reasonable amount of time.

What Does Partition Tolerance Mean?

The availability property states the database system continues to be operational as a whole even if there’s a failure with one of the data nodes.

The partition tolerance means that the users are communicating with the data nodes over an asynchronous communication network.

Categories of CAP Theorem in Big Data

Let us now see the different possibilities and combinations of the systems that can occur.

1. Consistency and Availability (CA systems)

The CA systems are consistent and always available but they are unsafe from the network failures. This is often the scenario with single-node database systems.

Usually, there is no such database system which is safe from network failures. However, in the absence of network failure, both availability and consistency properties can be satisfied.

Example: Relational Database Management Systems

  1. PostgreSQL
  2. MySQL
  3. SQL Server

2. Availability and Partition Tolerance (AP systems)

The AP systems are always available and partition tolerant. However, they are not consistent.

When you consider network partitioning and availability, the database system always processes the query.

Even if it cannot guarantee that it is up to date due to network partitioning, it tries to return the most recent available version of the information.

   

Example:

  1. DynamoDB
  2. Riak
  3. Cassandra
  4. CouchDB

3. Consistency and Partition Tolerance (CP systems)

The CP systems are consistent and partition tolerant but they do not offer availability.

When you consider network partitioning and consistency, the database system returns a timeout error or a relevant error if the information cannot be guaranteed to be up to date due to network partitioning.

Example:

  1. MongoDB
  2. Redis
  3. MemcacheDB
  4. HBase

Many NoSQL databases compromise consistency property in the favour of speed, network partition and availability.

If you have any doubts about the Big Data CAP theorem NoSQL or if you have any additional thoughts about it, let us know about it in the comment section.

Let's Discuss