Understanding the Data Structures and Algorithms Used to Select the Right Framework for a Scalable System — System Design (Part 6)

Chetan Dwarkani
5 min read · Feb 25, 2022

Really? Data structures in System design?

Well, yeah! We do need to understand certain data structures (DS) before we continue with this course, because a lot of frameworks are built on them.

If you would rather watch a video explanation than read, please check out the link below.

Data structures are an important parameter when deciding which framework to use in system design interviews. For example, say you are asked to pick a database framework for fast search, or a database that can search quickly through text documents. Could you pick one without knowing which data structure each framework uses? You would likely end up selecting the wrong DB for the task, and the whole system architecture (and the company) would suffer once millions of records come into the picture. Every framework uses different data structures, and it’s important to learn them. For example: MySQL uses B+ trees, Postgres uses B-trees, Elasticsearch uses inverted indices, and so on.

Let’s learn about the data structures in depth.

The table below summarizes the data structures used by the different frameworks covered in this article:

Framework        Data structure
MySQL            B+ tree (column indexes)
PostgreSQL       B-tree (default index type)
Elasticsearch    Inverted index
Cassandra        Consistent hashing ring + gossip protocol

B+ trees and B-trees:

These are said to be among the fastest search data structures. MySQL, for instance, builds a B+ tree whenever we create an index on a column, and Postgres builds a B-tree for its indices (though Postgres also offers other index types on a column). A B+ tree keeps all records in its leaf nodes and links the leaves together, so it grows wider rather than deeper compared to a B-tree; this can make it slightly faster for lookups and range scans, at the cost of extra storage for the keys repeated in internal nodes. If you wish to understand this well, it is worth going through a quick video on how these structures work (refer: link for B+ tree, link for B tree)
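To make this concrete, here is a minimal sketch using SQLite from Python’s standard library (SQLite’s indexes are B-trees; the table and index names below are made up for illustration). The same CREATE INDEX statement on MySQL’s InnoDB would build a B+ tree instead:

```python
# A minimal sketch: creating a B-tree index in SQLite (Python stdlib).
# Table and index names are illustrative; MySQL's InnoDB would build
# a B+ tree for the same CREATE INDEX statement.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_users_email ON users (email)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@x.com",), ("b@x.com",), ("c@x.com",)])

# The planner can now answer this via the index instead of a full table scan.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = 'b@x.com'"):
    print(row)  # typically mentions "USING INDEX idx_users_email"
```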

Inverted Indices:

It’s a simple and efficient data structure for searching within documents. Basically, words are extracted from multiple documents, and every word maps to the documents in which it appears; this is why frameworks built on inverted indices are among the most preferred for document search. Learn more about it here: link
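As a rough sketch of the idea (real engines such as Elasticsearch add tokenization, stemming, relevance scoring, and on-disk layouts on top of this), a toy inverted index might look like this:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {"d1": "the quick brown fox", "d2": "the lazy dog", "d3": "quick dog"}
index = build_inverted_index(docs)
print(index["quick"])  # {'d1', 'd3'}
```

A word lookup is now a single dictionary access instead of a scan over every document.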

There are also a few more concepts, or rather DS and algorithms, that should be well known before you build a scalable system design.

1. CAP theorem (Consistency, Availability and Partition tolerance):

Consistency: Data is consistent whenever it is read or written. At any point in time, when a user writes data to the DB the write succeeds, and when they read data they always get the latest version.

Availability: Data is readily available and every request gets a response, even if a server goes down.

Partition tolerance: The system keeps working even when network failures split the nodes into groups that cannot talk to each other; in practice, even if one of the nodes or slave servers goes down, data should not get lost.

The theorem says that no framework offers all three guarantees at once by default; one of them has to be compromised.

For example:
Let’s say we use the MySQL framework and scale horizontally with a master-slave strategy. You will find that you compromise on consistency, because data written to the master might not yet be available on all slave servers at the same instant; if a read request lands on a lagging slave at that instant, the user might get stale data. So it doesn’t follow the C in CAP by default.

Let’s take the example of Cassandra: here we have a problem with partition tolerance. Cassandra stores data using a partitioning strategy by default, so if one node goes down, data loss might happen; it doesn’t follow the P in CAP by default. This can be fixed by increasing the replication factor (RF) in Cassandra so that multiple replicas exist for every piece of data.

2. Quorum:

Quorum is used to improve consistency. Say we have data replicated across multiple servers and a write request is made. The data gets written to a few servers first and propagates to the rest afterwards; if a read request arrives in the middle, the user might get stale data. The quorum strategy solves this by requiring a write to be acknowledged by at least W servers and a read to consult at least R servers; if R + W is greater than the total number of replicas N, every read overlaps with at least one fully written replica, so the data returned is consistent.
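Here is a toy simulation of that overlap rule (the values of N, W, and R and the replica layout are invented for illustration):

```python
# Toy quorum sketch: with N replicas, writes wait for W acks and reads
# consult R replicas; R + W > N guarantees the read set overlaps at
# least one replica that has the latest write.
import random

N, W, R = 5, 3, 3  # R + W = 6 > N = 5, so reads are consistent
replicas = [{"value": None, "version": 0} for _ in range(N)]

def write(value, version):
    # Only W replicas acknowledge immediately; the rest lag behind.
    for node in random.sample(replicas, W):
        node["value"], node["version"] = value, version

def read():
    # Ask R replicas and trust the answer with the highest version.
    return max(random.sample(replicas, R), key=lambda n: n["version"])["value"]

write("v1", version=1)
print(read())  # always "v1": the overlap guarantees a fresh copy is seen
```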

3. Bloom filter:

A Bloom filter is a probabilistic data structure: if it reports that an item is not present, the item is definitely not in our storage; but if it reports that the item is present, there is a small chance it is wrong (a false positive). Bloom filters are found in a lot of places, and they are one of the best solutions when inserting data into millions of records. Say, for example, we have millions of records keyed by a primary key and we want to insert a new key that must be unique; the Bloom filter sorts this out amazingly cheaply by telling us when a key is definitely not there, so we only hit the disk for the rare candidates. Postgres also offers bloom indexes, which can simply be created on a column. Here’s the link if you want to learn about it in depth — link
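A minimal sketch of the idea (the bit-array size, number of hashes, and hashing scheme below are arbitrary illustrative choices, not tuned values):

```python
# Toy Bloom filter: k hash functions set k bits per item; a lookup that
# finds any unset bit proves the item was never added.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False => definitely absent; True => possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user-42")
print(bf.might_contain("user-42"))  # True
print(bf.might_contain("user-99"))  # almost certainly False
```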

4. Trie DS:
The trie is one of the best data structures for prefix search today. A lot of business use cases want fast prefix search (autocomplete, for example). Basically, a trie stores words as paths in a tree, sharing common prefixes, so prefix lookups become fast. Say we have a set of words for which prefix search needs to be fast; the trie builds a tree out of them, as in the following example.

For Dog, Doctor, Dorm, Dell, and Doom, the shared prefixes (“Do” and “De”) become shared branches of the tree.
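Here is a rough sketch of a trie with insert and prefix lookup (the class and method names are illustrative, not from any particular library):

```python
# Toy trie: words share prefix paths, so all words under a prefix sit
# in one subtree and can be collected with a small traversal.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starts_with(self, prefix):
        node = self.root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every complete word in the subtree under the prefix.
        words, stack = [], [(node, prefix.lower())]
        while stack:
            cur, path = stack.pop()
            if cur.is_word:
                words.append(path)
            for ch, child in cur.children.items():
                stack.append((child, path + ch))
        return words

trie = Trie()
for w in ["Dog", "Doctor", "Dorm", "Dell", "Doom"]:
    trie.insert(w)
print(trie.starts_with("do"))  # ['dog', 'doctor', 'dorm', 'doom'] in some order
```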

If you wish to learn about it in depth, one amazing video is here (link)

5. Gossip protocol:

Gossip protocol is basically the exchange of metadata between nodes so that each node knows the status of the others. Various frameworks use this protocol to stay up to date with the recent statuses of other nodes. For instance, Cassandra uses the gossip protocol to learn the status of the nodes in a cluster and decide where a request should go.
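A toy gossip round might look like the following (the node names and heartbeat-counter state are invented for illustration; real implementations such as Cassandra’s track far richer per-node state):

```python
# Toy gossip round: each node bumps its own heartbeat, picks a random
# peer, and the two merge views by keeping the freshest heartbeat seen.
import random

nodes = {name: {name: 0} for name in ["n1", "n2", "n3", "n4"]}

def gossip_round():
    for name, view in nodes.items():
        view[name] += 1  # bump own heartbeat
        peer = random.choice([n for n in nodes if n != name])
        merged = {k: max(view.get(k, 0), nodes[peer].get(k, 0))
                  for k in view.keys() | nodes[peer].keys()}
        view.update(merged)
        nodes[peer].update(merged)

for _ in range(5):
    gossip_round()
print(nodes["n1"])  # n1 has learned heartbeats for the peers it gossiped with
```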

6. Consistent Hashing:

A lot of hashing strategies exist, but this one is notably used by frameworks such as Cassandra to provide fault tolerance and to minimize data movement when a node goes down. Both nodes and keys are hashed onto a ring; to store a piece of data, you place it on the first node whose hash value is greater than or equal to the data’s hash (wrapping around the ring). When a node crashes, its keys simply move to the next node on the ring rather than rehashing all the data in the cluster, so data movement is minimal. One of the videos which gives a very clear explanation on the same is here → click here
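Here is a minimal sketch of a consistent-hash ring (no virtual nodes, and the hash choice and node names are illustrative; production systems such as Cassandra add virtual nodes and replication on top):

```python
# Toy consistent-hash ring: nodes and keys hash onto the same ring,
# and a key lives on the first node clockwise from its position.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def remove_node(self, node):
        self._ring.remove((self._hash(node), node))

    def get_node(self, key):
        # First node clockwise from the key's hash (wrapping to the start).
        idx = bisect.bisect_right(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user-42"))
ring.remove_node("node-b")  # only keys that lived on node-b move elsewhere
print(ring.get_node("user-42"))
```

When a node is removed, only the keys that hashed to it are reassigned to its clockwise successor; everything else stays where it was, which is exactly the minimal-movement property described above.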


Chetan Dwarkani

Tech experience in System Design | JavaScript Frameworks | Android Framework | ReactJS | Java | DB Design | chetandwarkani.com