Aug. 13, 2024
Data backup helps safeguard important data against corruption or loss, and organizations keep searching for better ways to back up their data. Some backup methods provide only one backup level; schemes that offer more, up to three backup cycles, multiply that protection. One such approach is the grandfather-father-son (GFS) backup scheme, which supports a 3-2-1 strategy. It backs data up on three levels: monthly, weekly, and daily, to protect against loss and ensure that lost data can be restored safely.
This article explores the GFS backup scheme and how it works, including its advantages and disadvantages.
The grandfather-father-son (GFS) backup scheme is a data retention strategy designed to protect critical data through a hierarchical backup method. It keeps three versions of the data, the grandfather, father, and son, which are backed up monthly, weekly, and daily, respectively.
Because this 3-2-1-style strategy lets you restore lost data from older backups, not just the most recent snapshot, it ensures a safer data environment.
GFS backups happen cyclically, with each backup recurring at its scheduled time. Here's a breakdown of how it works:
The grandfather backup is the oldest backup in the GFS scheme. It occurs monthly and is usually stored off-site. For example, if an organization schedules this backup for the last day of the month, the grandfather backup takes place on the last day of each month, creating a monthly cycle.
The father backup cycle occurs at the end of every week. At that point, the system stores the weekly (father) backup in local storage or a hot cloud where it can be accessed easily.
The son backup represents the most recent backup. It usually occurs daily, providing the current copy of the data. Unlike the grandfather and father backups, which are full backups, the son backup may not be a full backup; it may be incremental or differential, depending on the business's needs.
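As a rough illustration of the rotation, here is a minimal Python sketch that classifies a scheduled run into one of the three GFS tiers. The specific choices, last day of the month for the grandfather and Sunday for the father, are assumptions for this example; real backup tools let you configure them.

```python
from datetime import date, timedelta

def backup_tier(run_date: date) -> str:
    """Classify a scheduled backup run under a simple GFS rotation.

    Illustrative policy (adjust to your own schedule):
    - grandfather: monthly full backup on the last day of the month
    - father:      weekly full backup on the last day of the week (Sunday here)
    - son:         daily incremental or differential backup on every other day
    """
    # Last day of the current month: jump into next month, then step back one day.
    next_month = (run_date.replace(day=28) + timedelta(days=4)).replace(day=1)
    last_day_of_month = next_month - timedelta(days=1)

    if run_date == last_day_of_month:
        return "grandfather (monthly full, stored off-site)"
    if run_date.weekday() == 6:  # Sunday
        return "father (weekly full, local or hot-cloud storage)"
    return "son (daily incremental or differential)"

# Example: classify one week of scheduled runs.
for offset in range(7):
    d = date(2024, 8, 26) + timedelta(days=offset)
    print(d, "->", backup_tier(d))
```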
There are four available backup types. While full backups consume a lot of space, you can opt for the other types to reduce storage space and bandwidth. The four methods, illustrated with a short sketch after this list, are:
1. Full Backups: A full backup copies all the data from the source device to the backup destination, creating a complete, standalone snapshot. Full backups can easily be stored on-site or off-site, making them ideal for data recovery. However, they consume the most bandwidth and storage space.
2. Incremental Backups: Incremental backups do not capture the complete data set every time. Instead, each one captures only the changes made since the last backup of any type, which reduces storage usage and backup time.
3. Differential Backup: A differential backup also captures only changes, but unlike an incremental backup, it always captures everything changed since the last full backup. Hence, it requires more space than incremental backups, though not as much as a full backup, and restoring data from it is faster.
4. Synthetic Backup: A synthetic backup combines full and incremental backups. It creates a synthetic full backup by merging the full backup stored in the cloud with the subsequent incremental backups, so it can shorten the backup window while offering the benefits of a full backup.
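To make the incremental-versus-differential distinction concrete, here is a small Python sketch that selects files by modification time. The /data path and the timestamp variables are placeholders; production tools track changes far more robustly, for example with block-level change tracking.

```python
import os
import time

def files_changed_since(root: str, since_ts: float) -> list[str]:
    """Return paths under `root` whose contents changed after `since_ts`."""
    changed = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > since_ts:
                changed.append(path)
    return changed

# Placeholder timestamps of earlier backups for this sketch.
last_full_backup_ts = time.time() - 7 * 24 * 3600  # last weekly full backup
last_backup_ts = time.time() - 24 * 3600           # yesterday's backup of any type

# Incremental: only files changed since the most recent backup of any type.
incremental_set = files_changed_since("/data", last_backup_ts)

# Differential: every file changed since the last *full* backup, so each
# differential grows until the next full backup resets the baseline.
differential_set = files_changed_since("/data", last_full_backup_ts)
```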
Below are some of the benefits of the GFS backup scheme: it is simple to operate, it is reliable because data exists at multiple restore points, and it uses storage efficiently by combining full backups with incremental or differential ones.
The GFS backup strategy is no doubt very effective. However, it still has its downsides: full backups demand large amounts of storage and network bandwidth, restoring from older backups can be slow, and maintaining three backup cycles can be expensive.
Storware Backup and Recovery supports GFS backup rotation by allowing users to define backup policies and retention rules based on the GFS scheme. Users can configure the frequency of backups (Son), the retention period for intermediate backups (Father), and the retention period for long-term archival backups (Grandfather). This allows organizations to create a backup strategy that aligns with their specific retention and recovery requirements.
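As an illustration only, a GFS-style policy could be expressed along the lines below. The field names are hypothetical and are not Storware's actual configuration schema; consult the product documentation for the real settings.

```python
# Hypothetical GFS retention policy; field names are illustrative only.
gfs_policy = {
    "son":         {"frequency": "daily",   "type": "incremental", "retain": 14},  # keep 14 daily copies
    "father":      {"frequency": "weekly",  "type": "full",        "retain": 5},   # keep 5 weekly fulls
    "grandfather": {"frequency": "monthly", "type": "full",        "retain": 12,   # keep 12 monthly fulls
                    "storage": "offsite"},
}
```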
Additionally, Storware Backup and Recovery provides features such as scheduling, incremental backups, deduplication, and compression to optimize backup processes and reduce storage requirements. It also offers options for backup replication, offsite storage, and disaster recovery, ensuring data protection and business continuity.
The grandfather-father-son backup scheme is an effective way to back up data and prevent data loss. It uses three backup cycles, called the grandfather, father, and son backups, which occur monthly, weekly, and daily, respectively. The GFS backup system offers simplicity, reliability, and efficient storage use.
However, it can require massive storage and wide network bandwidth, recovery from older backups can be slow, and it can be expensive to run. Before choosing this backup method, it's crucial to weigh the pros and cons and decide whether it is effective for you.
As technology advances and data continues to explode, traditional disk file systems have revealed their limitations. To address the growing storage demands, distributed file systems have emerged as dynamic and scalable solutions. In this article, we explore the design principles, innovations, and challenges addressed by three representative distributed file systems: Google File System (GFS), Tectonic, and JuiceFS.
By exploring the architectures of these three systems, you will gain valuable insights into designing distributed file systems, and this understanding can guide enterprises in choosing suitable file systems. We aim to give professionals and researchers in big data, distributed system design, and cloud-native technologies the knowledge to optimize data storage, stay informed about industry trends, and explore practical applications.
The table below shows a variety of widely-used distributed file systems, both open-source and proprietary.
Widely-used distributed file systems
As shown in the table, a large number of distributed systems emerged around the year 2000. Before this period, shared storage, parallel file systems, and distributed file systems existed, but they often relied on specialized and expensive hardware.
The "POSIX-compatible" column in the table represents the compatibility of the distributed file system with the Portable Operating System Interface (POSIX), a set of standards for operating system implementations, including file system-related standards. A POSIX-compatible file system must meet all the features defined in the standard, rather than just a few.
For example, GFS is not a POSIX-compatible file system. Google made several trade-offs when it designed GFS. It discarded many disk file system features and retained some distributed storage requirements needed for Google's search engine at that time.
In the following sections, we'll focus on the architecture design of GFS, Tectonic, and JuiceFS. Let's explore the contributions of each system and how they have transformed the way we handle data.
In 2003, Google published the GFS paper. It demonstrated that we can use cost-effective commodity computers to build a powerful, scalable, and reliable distributed storage system, entirely based on software, without relying on proprietary or expensive hardware resources.
GFS significantly reduced the barrier to entry for distributed file systems, and its influence can be seen in varying degrees in many subsequent systems. HDFS, an open-source distributed file system developed at Yahoo, is heavily influenced by the design principles and ideas presented in the GFS paper and has become one of the most popular storage systems in the big data domain. Although the GFS paper was published in 2003, its design is still relevant and widely used today.
The following figure shows the GFS architecture:
GFS architecture (Source: The Google File System)
A GFS cluster consists of a single master, multiple chunkservers, and multiple clients.
The master and chunkservers communicate over the network, forming a distributed file system in which chunkservers can be scaled horizontally as data grows. All components are interconnected: when a client initiates a request, it first retrieves the file's metadata from the master, then communicates with the relevant chunkservers to obtain the data. GFS stores files in fixed-size chunks, usually 64 MB, with multiple replicas to ensure data reliability, so reading a single file may require communication with several chunkservers. This replica mechanism is a classic design of distributed file systems, and many open-source distributed system implementations today are influenced by GFS.
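The read path can be sketched in a few lines of Python. The master.lookup and chunkserver.read_chunk calls are hypothetical stand-ins for the RPCs described in the GFS paper, and the example only handles a read that falls within a single chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described in the GFS paper

def gfs_read(master, filename: str, offset: int, length: int) -> bytes:
    """Simplified single-chunk read path modeled on the GFS paper."""
    chunk_index = offset // CHUNK_SIZE                       # which chunk holds this offset
    handle, replicas = master.lookup(filename, chunk_index)  # metadata comes from the master
    chunkserver = replicas[0]                                # pick any replica, e.g. the closest one
    return chunkserver.read_chunk(handle, offset % CHUNK_SIZE, length)  # data comes from a chunkserver
```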
While GFS was groundbreaking in its own right, it had limitations in terms of scalability. To address these issues, Google developed Colossus as an improved version of GFS. Colossus provides storage for various Google products and serves as the underlying storage platform for Google Cloud services, making it publicly available. With enhanced scalability and availability, Colossus is designed to handle modern applications' rapidly growing data demands.
Tectonic is the largest distributed file system used at Meta (formerly Facebook). The project, originally called Warm Storage, began around 2014, but its complete architecture was not publicly released until 2021.
Prior to developing Tectonic, Meta primarily used HDFS, Haystack, and f4 for data storage: HDFS for data warehouse workloads, Haystack for frequently accessed (hot) blobs such as newly uploaded photos, and f4 for less frequently accessed (warm) blobs.
Tectonic was designed to support these three storage scenarios in a single cluster.
The figure below shows the Tectonic architecture:
Tectonic architecture (Source: Facebook's Tectonic Filesystem: Efficiency from Exascale)
Tectonic consists of three components: the Client Library, the Metadata Store, and the Chunk Store.
Tectonic abstracts the metadata of the distributed file system into a simple key-value (KV) model. This allows for excellent horizontal scaling and load balancing, and effectively prevents hotspots in data access. Tectonic also introduces a hierarchical approach to metadata, setting it apart from traditional distributed file systems. The Metadata Store is divided into three layers, which correspond to data structures in the underlying KV storage: the Name layer (directory entries), the File layer (the blocks that make up each file), and the Block layer (the location of each block's chunks on storage nodes).
The figure below summarizes the key-value mapping of the three layers:
Layer mapping in Tectonic (Source: Facebook's Tectonic Filesystem: Efficiency from Exascale)
This layered design addresses the scalability and performance demands of Tectonic, especially in Meta's scenarios, where handling exabyte-scale data is required.
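To make the mapping concrete, the sketch below mimics the three layers with plain Python dictionaries. The key encodings are invented for illustration; only the overall structure, directory entries, file-to-block lists, and block-to-chunk locations, follows the paper's description.

```python
# Illustrative key-value shapes for Tectonic's three metadata layers.
name_layer = {                     # Name layer: (parent dir id, entry name) -> object id
    ("dir:1", "photos"): "dir:42",
    ("dir:42", "cat.jpg"): "file:1001",
}
file_layer = {                     # File layer: file id -> ordered list of block ids
    "file:1001": ["blk:7", "blk:8"],
}
block_layer = {                    # Block layer: block id -> chunks and the disks holding them
    "blk:7": [{"chunk": "chk:7a", "disk": "disk:d1"}, {"chunk": "chk:7b", "disk": "disk:d9"}],
}
```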
The three metadata layers are stateless and can be horizontally scaled based on workloads. They communicate with the Key-Value Store, a stateful storage in the Metadata Store, through the network.
The Key-Value Store is not solely developed by the Tectonic team; instead, they use ZippyDB, a distributed KV storage system within Meta. ZippyDB is built on RocksDB and the Paxos consensus algorithm. Tectonic relies on ZippyDB's KV storage and its transactions to ensure the consistency and atomicity of the file system's metadata.
Transactional functionality plays a vital role in implementing a large-scale distributed file system. It's essential to scale the Metadata Store horizontally to meet the demands of such a system, but horizontal scaling introduces the challenge of data sharding. Maintaining strong consistency is a critical requirement in file system design, especially when performing operations like renaming directories with multiple subdirectories; ensuring both efficiency and consistency throughout the renaming process is a significant and widely recognized challenge in distributed file system design.
To address this challenge, Tectonic uses ZippyDB's transactional features. When handling metadata operations within a single shard, Tectonic guarantees both transactional behavior and strong consistency.
However, ZippyDB does not support cross-shard transactions. This limits Tectonic's ability to ensure atomicity when it processes metadata requests that span multiple directories, such as moving files between directories.
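A minimal sketch of what single-shard transactions buy: the rename below assumes a hypothetical kv.transaction() API standing in for ZippyDB's, and the delete and insert commit atomically only because both keys live in the same shard.

```python
def rename_within_shard(kv, parent_id: str, old_name: str, new_name: str) -> None:
    """Rename a directory entry inside one metadata shard (illustrative only).

    `kv.transaction()` is a hypothetical single-shard transaction API; the point
    is that readers never observe a half-renamed entry.
    """
    with kv.transaction() as txn:
        value = txn.get((parent_id, old_name))   # read the existing entry
        txn.delete((parent_id, old_name))        # remove the old name
        txn.put((parent_id, new_name), value)    # insert the new name atomically with the delete

# A move between two directories may touch two shards; without cross-shard
# transactions, that operation cannot be made atomic in the same way.
```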
As previously mentioned, GFS ensures data reliability and security through multiple replicas, but this approach comes with high storage costs. For example, storing just 1 TB of data typically requires three replicas, resulting in at least 3 TB of storage space. This cost increases significantly for systems that, like Meta's, operate at the exabyte level.
To solve this problem, Meta implements erasure coding (EC) in the Chunk Store, which achieves data reliability and security with reduced redundancy, typically around 1.2 to 1.5 times the original data size. This approach offers substantial cost savings compared to the traditional three-replica method. Tectonic's EC design provides flexibility, allowing configuration on a per-chunk basis.
While EC effectively ensures data reliability with minimal storage space, it does have some drawbacks. Specifically, reconstructing lost or corrupted data incurs high computational and I/O resource requirements.
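A quick back-of-the-envelope comparison shows where the savings come from. The Reed-Solomon (10, 4) layout below is an illustrative choice, not necessarily Tectonic's actual configuration.

```python
def ec_raw_capacity(data_tb: float, data_shards: int, parity_shards: int) -> float:
    """Raw capacity needed to store `data_tb` with Reed-Solomon style erasure coding."""
    return data_tb * (data_shards + parity_shards) / data_shards

data_tb = 1.0
print("3-way replication:", data_tb * 3, "TB")                             # 3.0 TB of raw capacity
print("RS(10, 4) erasure coding:", ec_raw_capacity(data_tb, 10, 4), "TB")  # 1.4 TB of raw capacity
```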
According to the Tectonic research paper, the largest Tectonic cluster in Meta comprises approximately 4,000 storage nodes, with a total capacity of about 1,590 petabytes and roughly 10 billion files. This scale is substantial for a distributed file system and generally fulfills the requirements of the majority of current use cases.
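For a sense of scale, a quick calculation on those published figures (treating 1 PB as 1,000 TB):

```python
total_capacity_pb = 1_590      # published capacity of Meta's largest Tectonic cluster
storage_nodes = 4_000
files = 10_000_000_000

print(total_capacity_pb * 1_000 / storage_nodes, "TB of raw capacity per node")  # ~397.5 TB
# Upper bound on average file size if the cluster were completely full.
print(total_capacity_pb * 1e15 / files / 1e6, "MB per file at most")             # ~159 MB
```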
JuiceFS was born in 2017, a time when the external landscape had changed significantly since GFS and Tectonic emerged: cloud computing had become mainstream, and mature object storage services were widely available.
Moreover, GFS and Tectonic were in-house systems serving specific company operations, operating at a large scale but with a narrow focus. In contrast, JuiceFS is designed to cater to a wide range of public-facing users and to meet diverse use case requirements. As a result, the architecture of JuiceFS differs significantly from the other two file systems.
Taking these changes and distinctions into account, let's look at the JuiceFS architecture as shown in the figure below:
JuiceFS architecture (Source: JuiceFS Architecture)
JuiceFS consists of three components: the JuiceFS client, the data storage (typically object storage), and the metadata engine.
While JuiceFS shares a similar overall framework with the aforementioned systems, it distinguishes itself through various design aspects.
Unlike GFS and Tectonic, which rely on proprietary data storage, JuiceFS follows the trend of the cloud-native era by using object storage. As previously mentioned, Meta's largest Tectonic cluster uses about 4,000 servers to handle exabyte-scale data, which inevitably leads to significant operational costs for managing such a large storage cluster.
For regular users, object storage has several advantages: it is ready to use without managing any storage hardware, it scales elastically with demand, it is billed on a pay-as-you-go basis, and the provider is responsible for durability and availability.
However, object storage has limitations, including poor performance for metadata operations such as listing and renaming, the lack of atomic renames and random in-place writes, and generally weak POSIX compatibility.
To tackle these challenges, JuiceFS adopts the following strategies in its architectural design: it keeps metadata out of the object store in an independent metadata engine, it splits file data into fixed-size blocks that are stored as immutable objects, and it caches data and metadata on the client side to hide object storage latency.
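The block-splitting idea can be sketched as follows. This is a simplified illustration: the 4 MiB block size matches JuiceFS's default, but the object key naming here is invented, and the real client additionally organizes blocks into chunks and slices.

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, JuiceFS's default block size for objects

def split_into_object_keys(inode: int, size: int) -> list[tuple[str, int, int]]:
    """Map a file's bytes onto fixed-size blocks stored as immutable objects.

    Returns (object_key, offset, length) tuples; the metadata engine, not the
    object store, records how these blocks assemble back into the file.
    """
    keys = []
    for index, offset in enumerate(range(0, size, BLOCK_SIZE)):
        length = min(BLOCK_SIZE, size - offset)
        keys.append((f"chunks/{inode}/{index}", offset, length))
    return keys

print(split_into_object_keys(inode=42, size=10 * 1024 * 1024))  # 4 MiB + 4 MiB + 2 MiB blocks
```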
JuiceFS supports various open-source databases as the underlying storage for its metadata. This is similar to Tectonic, but JuiceFS goes a step further by supporting not only distributed KV stores but also Redis, relational databases, and other storage engines. This design has these advantages: users can choose an engine that matches their scale, performance, and operational experience, and they can start with a familiar single-node database and move to a distributed engine as their data grows.
Tectonic achieves strong metadata consistency by using ZippyDB, a transactional KV store, but its transactionality is limited to metadata operations within a single shard. In contrast, JuiceFS has stricter requirements for transactionality and demands global strong consistency across shards. Therefore, every database integrated as a metadata engine must support transactions. With a horizontally scalable metadata engine like TiKV, JuiceFS can now store more than 20 billion files in a single file system, meeting the storage needs of enterprises with massive amounts of data.
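As a minimal sketch of what transactional metadata buys, here is a rename implemented against an in-memory SQLite database, a relational engine being one of the supported options. The schema is invented for illustration and is not JuiceFS's actual table layout.

```python
import sqlite3

# Toy metadata table: one row per directory entry.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edges (parent INTEGER, name TEXT, inode INTEGER, UNIQUE(parent, name))")
db.execute("INSERT INTO edges VALUES (1, 'a.txt', 100)")

def rename(parent: int, old: str, new_parent: int, new: str) -> None:
    """Move a directory entry; the database transaction keeps the operation
    atomic even when source and destination are different directories."""
    with db:  # sqlite3 commits on success and rolls back on any exception
        (inode,) = db.execute(
            "SELECT inode FROM edges WHERE parent=? AND name=?", (parent, old)
        ).fetchone()
        db.execute("DELETE FROM edges WHERE parent=? AND name=?", (parent, old))
        db.execute("INSERT INTO edges VALUES (?, ?, ?)", (new_parent, new, inode))

rename(1, "a.txt", 2, "b.txt")
print(db.execute("SELECT * FROM edges").fetchall())  # [(2, 'b.txt', 100)]
```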
The main differences between the JuiceFS client and the clients of the other two systems are as follows: it is a single, full-featured client that exposes multiple access interfaces (POSIX through FUSE, a Hadoop-compatible SDK, and an S3-compatible gateway), and it performs data splitting, uploading to object storage, and caching itself rather than delegating that work to separate in-house server-side services.
Distributed file systems have transformed data storage, and three notable systems stand out in this domain: GFS, Tectonic, and JuiceFS.
Distributed file systems overcome traditional disk limitations, providing flexibility, reliability, and efficiency for managing large data volumes. As technology advances and data grows exponentially, their ongoing evolution reflects the industry's commitment to efficient data management. With diverse architectures and innovative features, distributed file systems drive innovation across industries.
Note: This article was originally published on InfoQ on June 26, 2023.
Changjian Gao
Changjian Gao is a seasoned technical expert at Juicedata, specializing in distributed systems, big data, and AI. With over ten years of experience in the IT industry, Changjian has served as an architect at renowned Chinese companies such as Zhihu, Jike, and Xiaohongshu. He is a key contributor to the JuiceFS open-source community and is highly regarded for his deep understanding of industry trends and innovative solutions. With a passion for driving advancements in distributed file systems and cloud-native technologies, Changjian brings valuable insights and expertise to the field.