How big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency
What is Big Data?
The story begins with the term big data, which refers to a vast volume of structured and unstructured data that is large, fast-moving, or complex, with varying degrees of variety, variability, and veracity. The value of big data is determined by what you do with it, not by how much data you have.
Most professionals define big data using three 'V' concepts. If your data stores have the following qualities, your company has big data.
- Volume: Your data is so large that processing, monitoring, and storing it become a burden for your firm. Trends like mobility, the Internet of Things (IoT), social media, and eCommerce generate a lot of data, so practically every company meets this requirement.
- Velocity: Does your company generate data at a high rate, so that you have to respond to fresh data in real time? If so, your data has the velocity that characterizes big data. The vast majority of enterprises active in areas such as social media, the Internet of Things, and eCommerce meet this condition.
- Variety: If your data is kept in a variety of formats, it has the variety that characterizes big data. Typical big data stores include word processing documents, email communications, presentations, photos, videos, and structured data held in RDBMSes (Relational Database Management Systems).
Big data comes from a variety of places, including streaming data (YouTube, Twitch, Netflix), social media (Twitter, Facebook), publicly available data (data from the US government and other organizations), and other sources such as data lakes, cloud data sources, suppliers, and customers.
Data Storage Units
- 1 bit = 1 or 0 (on or off)
- 8 bits = 1 byte
- 1024 bytes = 1 kilobyte
- 1024 kilobytes = 1 megabyte
- 1024 megabytes = 1 gigabyte
- 1024 gigabytes = 1 terabyte
- 1024 terabytes = 1 petabyte
- 1024 petabytes = 1 exabyte
- 1024 exabytes = 1 zettabyte
- 1024 zettabytes = 1 yottabyte
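To make the unit ladder above concrete, here is a minimal Python sketch that converts between raw byte counts and these units. The function names `human_readable` and `to_bytes` are made up for this illustration, and the sketch assumes the 1024-based steps listed above.

```python
# Units from the list above, each one 1024 times larger than the previous.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Format a byte count using the largest unit that keeps the value below 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


def to_bytes(amount: float, unit: str) -> int:
    """Convert an amount in the given unit (for example 20, "PB") to bytes."""
    return int(amount * 1024 ** UNITS.index(unit))
```

For example, `to_bytes(1, "PB")` is `1024 ** 5` bytes, and `human_readable(to_bytes(20, "PB"))` gives `"20.00 PB"`.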
Did you know?
Google has reported processing over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters (a figure published back in 2008, so the volume today is likely far higher).
So you might be asking how Google stores such a massive quantity of data, i.e. big data, even though it does not operate the largest of all data centers, which typically hold petabytes to exabytes of data.
Google is not alone: Facebook, Instagram, and other social media platforms also retain massive amounts of data. To understand how they do it, we need to know a little about distributed storage clusters.
Distributed Storage Cluster
A distributed storage system is an architecture that spreads data across numerous physical servers and, in some cases, multiple data centers. It usually takes the form of a storage cluster with mechanisms for data synchronization and coordination among the cluster nodes.
It is often based on a master-slave topology, in which the slave nodes contribute their hard disks to the cluster and the master node coordinates them so that, together, they can solve the big data storage problem.
So, these massive MNCs manage such huge amounts of data using the distributed storage technique, and Apache Hadoop is one of the products widely used to implement it.
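The master-slave idea can be sketched in a few lines of Python. This is a toy, single-process simulation and not how Hadoop or any real system is implemented: the class name `Master`, the tiny block size, and the round-robin placement are all assumptions made for illustration. The master splits a file into blocks, copies each block to several nodes, and remembers where every block lives.

```python
class Master:
    """Toy master node: tracks which storage nodes hold which block of a file."""

    def __init__(self, node_names, block_size=4, replication=2):
        if replication > len(node_names):
            raise ValueError("replication cannot exceed the number of nodes")
        self.node_names = list(node_names)
        # Stand-in for the disks of the slave nodes: node -> {block id: bytes}.
        self.storage = {name: {} for name in self.node_names}
        # Metadata kept by the master: block id -> nodes holding a copy.
        self.block_map = {}
        self.block_size = block_size
        self.replication = replication
        self._cursor = 0  # round-robin position for block placement

    def write(self, filename, data):
        """Split data into blocks and place each block on `replication` nodes."""
        blocks = [data[i:i + self.block_size]
                  for i in range(0, len(data), self.block_size)]
        for index, block in enumerate(blocks):
            targets = [self.node_names[(self._cursor + r) % len(self.node_names)]
                       for r in range(self.replication)]
            self._cursor += 1
            for node in targets:
                self.storage[node][(filename, index)] = block
            self.block_map[(filename, index)] = targets

    def read(self, filename, failed=()):
        """Reassemble a file, skipping any nodes listed in `failed`."""
        parts = []
        index = 0
        while (filename, index) in self.block_map:
            live = [n for n in self.block_map[(filename, index)] if n not in failed]
            if not live:
                raise OSError(f"block {index} of {filename} is unavailable")
            parts.append(self.storage[live[0]][(filename, index)])
            index += 1
        return b"".join(parts)
```

With three nodes and two copies of each block, `Master(["n1", "n2", "n3"])` can still return the whole file from `read("f", failed={"n1"})` after one node is lost, which is the main reason such systems replicate data.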
What is Apache Hadoop?
Apache Hadoop is a Java-based open-source platform for creating data processing applications that run in a distributed computing environment.
Large data sets are distributed over clusters of commodity machines, and Hadoop applications run on those clusters. A computer cluster is made up of several processing units (storage disk + processor) that are linked together and function as a single system.
[Figure: how Hadoop helps with data management]
Why is Hadoop used for big data?
Cost-effective: Hadoop is relatively cost-effective compared with traditional database management systems, since it runs on clusters of commodity machines.
Fast: Hadoop spreads data across the nodes of a cluster using a distributed file system. Because it can map a job onto the nodes where the data is stored, the data is processed in parallel, which makes processing faster.
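The "map the work across the cluster" idea is easiest to see with the classic word count. The sketch below is plain single-process Python, not Hadoop's actual API: the function names are invented for illustration, and the list of chunks stands in for blocks of a file stored on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of text chunks."""
    mapped = []
    for chunk in chunks:  # on a real cluster, each chunk would be mapped on its own node
        mapped.extend(map_phase(chunk))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Calling `word_count(["big data is big", "data is everywhere"])` counts "big", "data", and "is" twice each and "everywhere" once. Since every chunk is mapped independently, adding more nodes lets more chunks be processed at the same time.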
How does Google manage big data?
- Mesa is a highly scalable analytic data warehousing system used by Google to store vital measurement data for its online advertising business. Mesa is built to meet a diverse set of user and system needs, including near-real-time data ingestion and querying, as well as high availability, reliability, fault tolerance, and scalability for massive data and query volumes. In particular, Mesa handles petabytes of data, millions of row updates per second, and billions of queries every day that fetch trillions of rows. Mesa is geo-replicated across numerous data centers and provides consistent and repeatable query responses with low latency, even when a data center fails entirely.
- Google File System: Google File System is a scalable distributed file system for large, distributed, data-intensive applications. It provides fault tolerance while running on low-cost commodity hardware and delivers high aggregate performance to a large number of clients. Google uses it widely as a storage platform for generating and processing the data behind its services, as well as for research and development projects that require massive data sets.
Google File System was the inspiration for Hadoop's HDFS, which is actively used by many big data tools and databases such as HBase and Spark.
- Bigtable: Bigtable is a distributed storage system for managing structured data that can scale to petabytes of data across thousands of commodity servers. Several Google projects use Bigtable, including web crawling, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in data size (from URLs to web pages to satellite images) and in latency requirements (from backend bulk processing to real-time data serving).
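Google's published description of Bigtable presents it as a sparse, sorted map in which a row key, a column key, and a timestamp together point to a value. The toy class below sketches that data model in Python. The name `TinyTable` and its methods are hypothetical, everything lives in one process, and none of this is Bigtable's real API.

```python
import bisect


class TinyTable:
    """Toy model of a sorted map: (row, column, timestamp) -> value."""

    def __init__(self):
        self._rows = []   # row keys kept in sorted order, so range scans are cheap
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        i = bisect.bisect_left(self._rows, row)
        if i == len(self._rows) or self._rows[i] != row:
            self._rows.insert(i, row)
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the value at a timestamp, or the newest version if none is given."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start_row, end_row):
        """Return the row keys in the half-open range [start_row, end_row)."""
        lo = bisect.bisect_left(self._rows, start_row)
        hi = bisect.bisect_left(self._rows, end_row)
        return self._rows[lo:hi]
```

Keeping row keys sorted is what makes range scans practical: if pages are stored under reversed host names such as `com.cnn.www`, all pages of one domain sit next to each other.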
How does Facebook manage big data?
Facebook is without a doubt one of the largest big data specialists, dealing with petabytes of both historical and real-time data, and that volume will keep growing. While the world comes closer together on this platform, Facebook develops algorithms to track those connections, on and off its own site, in order to deliver the most relevant posts to its users. Whether it's your wall post, your favourite books and movies, or your workplace, Facebook analyses every bit of your data to provide you with better services every time you log in.
The successful execution of this platform is supported by a combined workforce of people and technology. Though the platform is always being improved, the following are the most important technological features:
“Facebook runs the world’s largest Hadoop cluster” says Jay Parikh, Vice President Infrastructure Engineering, Facebook.
In essence, Facebook maintains the world's largest Hadoop cluster, with over 4,000 machines and hundreds of millions of gigabytes of storage. This large cluster gives developers the following capabilities:
- Developers can write MapReduce programs in any language.
- Because much of the data in Hadoop's file system is in table format, SQL has been integrated to process large data sets. As a result, developers can easily use a subset of SQL.
Hadoop provides Facebook with a dependable and efficient shared infrastructure. It powers this social networking platform in many ways, from searching, log processing, recommendation systems, and data warehousing to video and picture analysis. Facebook Messenger was the company's first user-facing application built on Apache HBase, the Hadoop database, whose layered architecture can handle a very large number of messages in a single day.
With a massive volume of unstructured data arriving every day, Facebook gradually realized that it needed a platform to speed up the entire analysis process. That's when it came up with Scuba, a tool that lets Hadoop engineers dig into enormous data sets and run real-time ad-hoc analyses.
Facebook was not designed to operate across numerous data centers, and a single outage could bring the entire platform down. Scuba lets developers keep large amounts of data in memory, which speeds up data analysis. Small software agents collect data from numerous data centers and compress it into a log data format. Scuba then loads this compressed log data into in-memory systems where it is instantly accessible.
According to Jay Parikh, “Scuba gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting.”
After Yahoo implemented Hadoop for its search engine, Facebook looked for a way to let its data scientists store more data than its Oracle data warehouse could hold. Hive was born as a result. This tool, which uses a subset of SQL to improve Hadoop's query capabilities, quickly gained favor in the world of unstructured data. Thousands of jobs run on this system every day to process data quickly for a variety of applications.
How does Twitter manage big data?
Every day, hundreds of millions of Tweets are sent. They're analysed, processed, stored, cached, and served. Twitter needs a matching infrastructure to handle such vast content. Storage and messaging account for 45 percent of Twitter's infrastructure.
The following services are provided by the storage and messaging teams:
- Hadoop clusters running both compute and HDFS
- Manhattan clusters for all low-latency key-value stores
- Graph stores on sharded MySQL clusters
- Blobstore clusters for all large objects (videos, pictures, binary files…)
- Cache clusters
- Messaging clusters
- Relational stores (MySQL, PostgreSQL and Vertica)
There are no surprises in Twitter's infrastructure, but some of the interesting bits are as follows:
- Hadoop: Twitter has numerous Hadoop clusters with a total storage capacity of over 500 PB, organised into four groups (real time, processing, data warehouse, and cold storage). The largest cluster has over ten thousand nodes. Every day, these clusters run 150k applications and launch 130 million containers.
- Manhattan (the backend for Tweets, Direct Messages, Twitter accounts, and more): Twitter runs several clusters for different use cases, such as large multi-tenant clusters, smaller clusters for less common use cases, read-only clusters, and read/write clusters for heavy-write/heavy-read traffic patterns. A read-only cluster can handle tens of millions of QPS, while a read/write cluster can handle millions. The highest-performing cluster is the observability cluster, which ingests data in every data center.
- Graph: Twitter's sharded cluster, based on Gizzard/MySQL, stores its graphs. The social graph, Flock, can handle peaks of tens of millions of QPS, with the MySQL servers averaging 30k to 45k QPS each.
- Blobstore: It is Twitter’s image, video, and huge file storage service, with hundreds of billions of objects stored.
- Cache: Twitter’s Redis and Memcache clusters cache their users, timelines, and tweets, among other things.
- SQL: This group includes MySQL, PostgreSQL, and Vertica. MySQL/PostgreSQL is used where strong consistency is required, such as managing ad campaigns, ad exchanges, and internal tools. Vertica is a column store that is frequently used as the backend for Tableau to support sales and user organisations.
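The word "sharded" in the list above means that the data is split across many database servers, with each record living on exactly one of them. Here is a minimal Python sketch of hash-based sharding for a follower graph. It is a generic illustration, not how Gizzard or Flock actually partition data; the class `ShardedStore` and its methods are hypothetical, and plain dictionaries stand in for the MySQL servers.

```python
import hashlib


class ShardedStore:
    """Toy follower graph split across several shards by hashing the user name."""

    def __init__(self, shard_count):
        # Each dict stands in for one database server.
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        """Pick a shard deterministically from a hash of the key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def add_follow(self, follower, followee):
        """Record that `follower` follows `followee` on the follower's shard."""
        shard = self.shards[self._shard_for(follower)]
        shard.setdefault(follower, set()).add(followee)

    def following(self, user):
        """Return the set of accounts that `user` follows."""
        return self.shards[self._shard_for(user)].get(user, set())
```

Because the shard is computed from the key, any server can work out where a user's data lives without asking a central index, and each shard only has to handle its own slice of the traffic.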
Hadoop/HDFS is also the backend for their Scribe-based log pipeline, but they’re in the final stages of testing the switch to Apache Flume as a replacement to address issues like rate limiting/throttling of selective clients to aggregators, lack of delivery guarantee for categories, and memory corruption. Every day, Twitter processes over a trillion messages, which are subsequently sorted into over 500 categories, consolidated, and then selectively copied across all of their clusters.
So these are some of the secrets behind how big MNCs store and manage huge amounts of data, i.e. big data.
Thanks for reading!!:)