
Sharding Concept In MongoDB

In this article we will learn about some concepts of Sharding in MongoDB.

Gagan Sharma Feb 26, 2016

Storing data records across multiple machines is known as sharding. As the size of the data increases, a single machine may not be able to store all of it or to provide acceptable read and write throughput. To overcome this issue, database systems have two basic approaches: vertical scaling, and sharding (horizontal scaling).
 
Figure: vertical scaling and sharding
Vertical Scaling

In vertical scaling we add more resources to a single machine, such as hard disk capacity, memory and so on, so that it can hold the data. This approach increases the load on a single machine and makes that machine a single point of failure. It is also known as scaling up.

Horizontal Scaling

This is also known as scaling out. In this approach we add new nodes (servers) to the system so that the load is distributed over all the servers. MongoDB scales out in a simple way: a deployment starts with one or a few nodes, and more servers are added as the load grows. For example, if 10,000 new users connect to the application, another server can be added.

Sharding, or horizontal scaling, divides the data set and distributes the data over multiple servers. These servers are known as shards. Each shard is an independent database, and collectively the shards make up a single logical database. Because the data is spread over multiple shards, each shard processes fewer operations as the cluster grows, and as a result the cluster can increase its capacity and throughput.

When selecting a specific record, the application doesn't access the entire database system; it only accesses the shard responsible for that record. Sharding also reduces the amount of data that each shard must store. For example, if a database contains 1 TB of data and we have 4 shards, then each shard will contain approximately 250 GB of data. If we have 100 shards, then each shard contains approximately 10 GB of data.
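The arithmetic above can be sketched in a few lines. This is only an illustration that assumes a perfectly even distribution and takes 1 TB as 1000 GB; the function name `data_per_shard_gb` is made up for this example.

```python
def data_per_shard_gb(total_gb: float, shard_count: int) -> float:
    """Approximate data held by each shard, assuming an even distribution."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return total_gb / shard_count


# The article's figures, with 1 TB taken as 1000 GB.
print(data_per_shard_gb(1000, 4))    # 250.0
print(data_per_shard_gb(1000, 100))  # 10.0
```

In a real cluster the split is rarely this exact, since chunk sizes and chunk counts vary from shard to shard.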

Sharding In MongoDB

MongoDB does sharding using a sharded cluster. A sharded cluster contains 3 components. The following describes these components:
  • Shards - Shards are used to store the data. In production each shard is a replica set, which provides high data consistency and data availability.

  • Config Servers - Config servers are used to store the metadata of the cluster. This metadata contains the mapping of the cluster's data to the shards. Older production clusters have exactly 3 config servers; since MongoDB 3.2 the config servers can be deployed as a replica set instead. Config servers help the query router to select the desired shards for an operation.

  • Query Routers - Query routers are mongos instances that interface with the client applications and direct operations to the appropriate shards. When a query comes to the query router, the router first uses the metadata from the config servers to select one or more shards, performs the desired operations on those shards and then returns the results to the client. A sharded cluster may contain one or multiple query routers depending on the query load. A client sends each request to only one query router. Generally a sharded cluster contains multiple query routers.
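The way a query router uses the config metadata can be sketched as below. This is a toy model and not MongoDB's implementation: `CHUNK_MAP`, `SHARDS`, `target_shards` and `find_range` are hypothetical names, and each shard is modelled as a plain dictionary.

```python
# Hypothetical metadata held by the config servers: (min_key, max_key, shard).
# Each range includes min_key and excludes max_key.
CHUNK_MAP = [
    (0, 100, "shardA"),
    (100, 200, "shardB"),
    (200, 300, "shardA"),
]

# Hypothetical shards, each modelled as a dict of shard key -> document.
SHARDS = {
    "shardA": {42: {"user_id": 42}, 250: {"user_id": 250}},
    "shardB": {150: {"user_id": 150}},
}


def target_shards(low, high):
    """Names of the shards whose chunks overlap the key range [low, high)."""
    return sorted({shard for lo, hi, shard in CHUNK_MAP if lo < high and low < hi})


def find_range(low, high):
    """Send the query only to the relevant shards and merge their results."""
    results = []
    for name in target_shards(low, high):
        for key, doc in SHARDS[name].items():
            if low <= key < high:
                results.append(doc)
    return sorted(results, key=lambda d: d["user_id"])


print(target_shards(150, 151))  # ['shardB']
print(target_shards(50, 160))   # ['shardA', 'shardB']
```

A query for a single key touches one shard, while a wider range is sent to every shard that owns an overlapping chunk.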
Data Partitioning

MongoDB shards data at the collection level, so the data of a single collection may be distributed across several shards. The distribution of a collection's data is determined by a shard key.

Shard Key

A shard key determines the distribution of a collection's documents among the cluster's shards. A shard key is an indexed field or an indexed compound field that is present in every document of the collection. The range of shard key values is divided into chunks, and these chunks are evenly distributed across the shards. MongoDB uses the following two techniques to divide the shard key values into chunks.
  • Range Based Partitions
  • Hash Based Partitions
Range Based Partitions

MongoDB uses range-based partitions for range-based sharding of the collection's data. In range-based sharding MongoDB divides the shard key values into a number of non-overlapping ranges called chunks. Each chunk contains the range of values from its minimum value to its maximum value.
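A minimal sketch of range-based partitioning, assuming an integer shard key. The split points and the helper names `chunk_index` and `chunk_range` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical split points on the shard key. They define four chunks:
# (-inf, 100), [100, 200), [200, 300), [300, +inf).
SPLIT_POINTS = [100, 200, 300]


def chunk_index(key: int) -> int:
    """Index of the chunk whose range contains the given shard key value."""
    return bisect_right(SPLIT_POINTS, key)


def chunk_range(index: int):
    """(min, max) bounds of a chunk; min is inclusive, max is exclusive."""
    lows = [float("-inf")] + SPLIT_POINTS
    highs = SPLIT_POINTS + [float("inf")]
    return lows[index], highs[index]


print(chunk_index(99))   # 0
print(chunk_index(100))  # 1
print(chunk_range(1))    # (100, 200)
```

Because the ranges do not overlap, documents with nearby shard key values land in the same chunk, which suits range queries.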

Hash Based Partitions

MongoDB uses hash-based partitions for hash-based sharding. In a hash-based partition MongoDB computes a hash of the shard key field's value and then uses this hash to create the chunks. A hash-based partition provides a more random distribution of a collection's documents across the cluster.
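The idea can be sketched as follows. This is a simplified illustration, not MongoDB's exact hashing scheme: it assumes a 64-bit value taken from an MD5 digest and splits the hash space into `NUM_CHUNKS` equal ranges, and both `NUM_CHUNKS` and `hashed_chunk` are hypothetical names.

```python
import hashlib
from collections import Counter

NUM_CHUNKS = 4  # hypothetical number of chunks


def hashed_chunk(value) -> int:
    """Map a shard key value to a chunk by the range its hash falls into."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big")  # 64-bit hash value
    # Split the 64-bit hash space into NUM_CHUNKS equal, non-overlapping ranges.
    return hashed * NUM_CHUNKS // 2**64


# Consecutive key values end up scattered over the chunks.
counts = Counter(hashed_chunk(user_id) for user_id in range(1000))
print(sorted(counts.items()))
```

The trade-off is that neighbouring key values no longer share a chunk, so a range query is likely to touch several shards.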

Balanced Data Distribution

In a sharded cluster there is a real possibility that the stored data becomes imbalanced. Data imbalance may occur due to any of the following reasons.
  • Adding or removing data.
  • Adding or removing shards in the cluster.
  • A specific shard contains more chunks than the other shards.
  • The size of a specific chunk is greater than another chunk's size.
MongoDB always tries to keep the data distribution balanced. For this, MongoDB uses the following two approaches:
  • Splitting
  • Balancing
Splitting

Splitting is a background process that keeps chunks from growing too large. When the size of a chunk exceeds the specified chunk size, MongoDB splits the chunk in half. A split does not migrate any data or modify the shards; it is only a metadata change.
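A minimal sketch of the split step, under some loud assumptions: a chunk is a dictionary, its size is measured by the number of keys rather than in megabytes, and the keys are unique. The function `split_chunk` and the field names are hypothetical.

```python
def split_chunk(chunk, max_docs):
    """Split a chunk at its median key if it holds more than max_docs keys."""
    keys = sorted(chunk["keys"])
    if len(keys) <= max_docs:
        return [chunk]
    mid = keys[len(keys) // 2]
    # Both halves stay on the same shard: only the chunk boundaries change.
    left = {"min": chunk["min"], "max": mid, "shard": chunk["shard"],
            "keys": [k for k in keys if k < mid]}
    right = {"min": mid, "max": chunk["max"], "shard": chunk["shard"],
             "keys": [k for k in keys if k >= mid]}
    return [left, right]


big = {"min": 0, "max": 100, "shard": "shardA", "keys": list(range(0, 100, 10))}
for part in split_chunk(big, max_docs=4):
    print(part["min"], part["max"], part["shard"], len(part["keys"]))
```

Both resulting chunks still belong to the original shard, which is why no data has to move during a split.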

Balancing

Balancing is a background process that manages chunk migrations. When a sharded collection is unevenly distributed, the query router runs a balancer process. The balancer migrates chunks from the shard that contains the largest number of chunks to the shard that contains the smallest number of chunks. The balancer process can be initiated from any query router. After a successful chunk migration, the metadata on the config servers regarding the location of the chunks is updated.
 
Figure: chunk distribution before migration (shard A holds twice as many chunks as shard B)
 
In the preceding image we can see that the number of chunks in shard A is twice that of shard B, so the cluster is imbalanced. Chunks will now be migrated from shard A to shard B.
 
Figure: chunk distribution after migration (shard A and shard B are balanced)
 
Now we can see that after migration of chunks from shard A to shard B both shards are in a balanced state.
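The migration described above can be sketched as a simple loop. This is an illustration only: the real balancer's thresholds and scheduling are more involved, and `balance`, `threshold` and the chunk counts are assumptions made for this example.

```python
def balance(chunks_per_shard, threshold=1):
    """Move one chunk at a time from the fullest shard to the emptiest one."""
    counts = dict(chunks_per_shard)
    migrations = []
    while True:
        donor = max(counts, key=counts.get)
        receiver = min(counts, key=counts.get)
        # Stop once the difference is within the allowed threshold.
        if counts[donor] - counts[receiver] <= threshold:
            break
        counts[donor] -= 1
        counts[receiver] += 1
        migrations.append((donor, receiver))
    return counts, migrations


# Shard A starts with twice as many chunks as shard B.
final_counts, moves = balance({"A": 8, "B": 4})
print(final_counts)  # {'A': 6, 'B': 6}
print(len(moves))    # 2
```

Each recorded move stands for one chunk migration, after which the config metadata would be updated with the chunk's new location.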

Advantages of Sharding
  • Sharding distributes the data over multiple shards so it reduces the number of operations for each shard.
  • Removes the dependency from a single server.
  • Protects against the failure of a single machine.
  • Increases capacity and throughput.