This article describes the technicalities of the full load data replication strategy in a Medallion data architecture.

Understanding Full Load Data Replication

Full load data replication is a simple strategy which involves copying the entire dataset from a source to a target system. This strategy is particularly useful in the following cases:

This strategy can be used for source tables either with or without a primary key.

Full load replication does place load on the source system, as queries are executed against it to fetch the data. The load on the source system is heavier than with an incremental data replication strategy

Full Load Replication Data Architecture

In full load data replication, data is first extracted from source and loaded into the bronze area of the target database. At this point the data is still in its original, raw, system-specific format.

After the extract and load, the data is transformed into its generic, system-neutral format in the silver layer. The transformations can be simple if the raw data is already similar to the desired target. Data transformations are likely to include datatype changes and naming standards.

Updates to the silver layer table can be made in two different ways:

Image ofFull Load Data Replication Strategy in a Medallion data architecture
Full Load Data Replication Strategy in a Medallion data architecture

Note the naming alterations in the silver layer table, and the addition of a primary key constraint. For more information on the notation please refer to this article on Data Modeling Techniques.

Storing data in the silver layer does not necessarily make it any easier to read. So a second layer of data transformation - from silver to gold - is often used to render the data in a more consumable way. The output is usually a star schema, or completely denormalized (“flat”) tables - sometimes known as One Big Table or “OBT”’s.

From a DataOps perspective it is helpful to perform this second layer of transformation using database views.

Advantages of using a silver layer ODS table

I’ll take the example of an “Account” table. Using the ODS table as an intermediate has several advantages:

In other words, separate stages of data transformation are used for integration and for consumability. Distinguishing between these two aspects is one of the hallmarks of a robust, production-grade data pipeline architecture.

Full vs Incremental loading

Full load data replication has the advantage of simplicity. Every load guarantees complete synchronization. The approach is more suitable for smaller datasets. It can be relatively resource-intensive for large datasets, with longer processing times.

In contrast, an incremental load strategy - characterized by tracking changes and filtering the extraction - is faster and more efficient, especially for larger datasets. Less data transfer means reduced resource utilization, and consequently faster processing times. Whether implemented logically (using a high water mark) or physically (using database change logs or change data capture), incremental loading increases the complexity of implementation.

Hard vs Soft deletes

The difference between hard and soft deletes lies in how records are handled when they are marked for removal. The difference can have an important impact on data replication.

Hard deletes ensure cleaner and more efficient storage. They may be required for operational reasons, for example data privacy laws. Hard deletes result in a loss of historical context, and can pose challenges if audit trails or historical analysis are needed. A soft delete allows both the retention of historical information and ongoing referential integrity. But consumers of the data are expected to use additional logic to filter out deleted records during queries. One well known example of soft deletes occurs in a slowly changing dimension.

Hard deletes are handled by full load data replication provided you use the truncate-and-insert option to update the silver layer table. The DataOps risk to be aware of is that you may end up with records in other silver layer tables that reference deleted records.

Author: Ian