DATA BRANCH DIFF

Description

The DATA BRANCH DIFF statement is used to compare data differences between two tables. This feature is similar to Git's diff command and can display insert, delete, and update operations between two data branches.

The system automatically identifies the Lowest Common Ancestor (LCA) between two tables and calculates differences based on it. The difference results include:

INSERT: Rows that exist in the target table but not in the base table
DELETE: Rows that exist in the base table but not in the target table
UPDATE: Rows with the same primary key in both tables but different values in other columns

Syntax

DATA BRANCH DIFF target_table [{ SNAPSHOT = 'snapshot_name' }] 
    AGAINST base_table [{ SNAPSHOT = 'snapshot_name' }] 
    [OUTPUT output_option]

Output Options

output_option:
    COUNT                           -- Return only the count of different rows
  | LIMIT number                    -- Limit the number of returned difference rows
  | FILE 'directory_path'           -- Export differences as SQL file
  | AS table_name                   -- Save differences to a table (not yet supported)

Arguments

Parameter Description

Parameter	Description
`target_table`	Target table (the table to compare)
`base_table`	Base table (the table used as comparison baseline)
`SNAPSHOT = 'snapshot_name'`	Optional parameter to specify using data at a snapshot point for comparison
`OUTPUT COUNT`	Return only the count of differences
`OUTPUT LIMIT number`	Limit the number of returned difference rows
`OUTPUT FILE 'path'`	Export differences as SQL file to specified directory, supports local path or Stage path (e.g., `stage://stage_name/`)

Output Column Description

Default output includes the following columns:

Column Name	Description
`diff target against base`	Shows the table names being compared
`flag`	Difference type: INSERT, DELETE, or UPDATE
Other columns	All visible columns of the table

Usage Notes

LCA (Lowest Common Ancestor)

The system automatically detects the branch relationship between two tables:

No LCA: Two tables have no common ancestor, directly compare all data
Has LCA: Two tables have a common ancestor, calculate incremental differences based on ancestor
Self as LCA: One table is the ancestor of the other

Supported Table Types

Tables with primary key (recommended)
Tables with composite primary key
Tables without primary key (using hidden fake primary key)

Examples

Example 1: Basic Difference Comparison

Compare two tables without a common ancestor:

-- Expected-Rows: 0
CREATE DATABASE test;
-- Expected-Rows: 0
USE test;

-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b VARCHAR(10));
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (1, '1'), (2, '2'), (3, '3');

-- Expected-Rows: 0
CREATE TABLE test.t2 (a INT PRIMARY KEY, b VARCHAR(10));
-- Expected-Rows: 0
INSERT INTO test.t2 VALUES (1, '1'), (2, '2'), (4, '4');

-- Expected-Rows: 2
DATA BRANCH DIFF test.t2 AGAINST test.t1;
+-------------------+--------+------+------+
| diff t2 against t1 | flag   | a    | b    |
+-------------------+--------+------+------+
| t2                | INSERT |    4 | 4    |
| t1                | INSERT |    3 | 3    |
+-------------------+--------+------+------+

-- Expected-Rows: 2
DATA BRANCH DIFF test.t1 AGAINST test.t2;
+-------------------+--------+------+------+
| diff t1 against t2 | flag   | a    | b    |
+-------------------+--------+------+------+
| t1                | INSERT |    3 | 3    |
| t2                | INSERT |    4 | 4    |
+-------------------+--------+------+------+

-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;

Example 2: Compare Branch Tables (With Common Ancestor)

-- Expected-Rows: 0
CREATE TABLE test.t0 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t0 VALUES (1, 1), (2, 2), (3, 3);

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t1 FROM test.t0;
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (4, 4);

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t0;
-- Expected-Rows: 0
INSERT INTO test.t2 VALUES (5, 5);

-- Expected-Rows: 2
DATA BRANCH DIFF test.t2 AGAINST test.t1;
+-------------------+--------+------+------+
| diff t2 against t1 | flag   | a    | b    |
+-------------------+--------+------+------+
| t2                | INSERT |    5 |    5 |
| t1                | INSERT |    4 |    4 |
+-------------------+--------+------+------+

-- Expected-Rows: 0
DROP TABLE test.t0;
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;

Example 3: Compare Using Snapshots

-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (1, 1), (2, 2);

-- Expected-Rows: 0
CREATE SNAPSHOT sp1 FOR TABLE test t1;

-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (3, 3);
-- Expected-Rows: 1
UPDATE test.t1 SET b = 10 WHERE a = 1;

-- Expected-Rows: 0
CREATE SNAPSHOT sp2 FOR TABLE test t1;

-- Expected-Rows: 1
DATA BRANCH DIFF test.t1{SNAPSHOT='sp2'} AGAINST test.t1{SNAPSHOT='sp1'};
+--------------------+--------+------+------+
| diff t1 against t1 | flag   | a    | b    |
+--------------------+--------+------+------+
| t1                 | INSERT |    3 |    3 |
+--------------------+--------+------+------+

-- Expected-Rows: 0
DROP SNAPSHOT sp1;
-- Expected-Rows: 0
DROP SNAPSHOT sp2;
-- Expected-Rows: 0
DROP TABLE test.t1;

Note

When comparing two snapshots of the same table, the system compares based on incremental changes between snapshots. In this example, although the UPDATE operation modified the row where a=1, due to the special mechanism of snapshot comparison, only the newly inserted row is shown.

Example 4: Get Only Difference Count

-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 SELECT result, result FROM generate_series(1, 1000) g;

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t1;
-- Expected-Rows: 0
INSERT INTO test.t2 SELECT result, result FROM generate_series(1001, 2000) g;
-- Expected-Rows: 100
DELETE FROM test.t2 WHERE a <= 100;

-- Expected-Rows: 1
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT COUNT;
+----------+
| COUNT(*) |
+----------+
|     1100 |
+----------+

-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;

Example 5: Limit Returned Rows

-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 SELECT result, result FROM generate_series(1, 100) g;

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t1;
-- Expected-Rows: 0
INSERT INTO test.t2 SELECT result, result FROM generate_series(101, 200) g;

-- Expected-Rows: 5
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT LIMIT 5;
+--------------------+--------+------+------+
| diff t2 against t1 | flag   | a    | b    |
+--------------------+--------+------+------+
| t2                 | INSERT |  106 |  106 |
| t2                 | INSERT |  107 |  107 |
| t2                 | INSERT |  117 |  117 |
| t2                 | INSERT |  124 |  124 |
| t2                 | INSERT |  156 |  156 |
+--------------------+--------+------+------+

-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;

Note

The rows returned by OUTPUT LIMIT are not necessarily ordered by primary key, but rather return the first N difference rows based on internal storage order.

Example 6: Export Differences as SQL File

-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (1, 1), (2, 2);

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t1;
-- Expected-Rows: 0
INSERT INTO test.t2 VALUES (3, 3);
-- Expected-Rows: 1
UPDATE test.t2 SET b = 10 WHERE a = 1;
-- Expected-Rows: 1
DELETE FROM test.t2 WHERE a = 2;

-- Expected-Success: false
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT FILE '/tmp/diff_output/';
+------------------------------------------+------------------------------------------+
| FILE SAVED TO                            | HINT                                     |
+------------------------------------------+------------------------------------------+
| /tmp/diff_output/diff_t2_t1_20241225.sql | DELETE FROM test.t1, REPLACE INTO test.t1 |
+------------------------------------------+------------------------------------------+

-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;

Output File Types

The system generates different file formats based on the difference type:

Scenario	File Format	Description
Incremental sync (target table non-empty)	`.sql`	Contains `DELETE FROM ...` and `REPLACE INTO ...` statements
Full sync (target table empty)	`.csv`	CSV format, can be imported via `LOAD DATA`

Replaying Patch Files

Replay SQL file (incremental sync):

mysql -h <mo_host> -P <mo_port> -u <user> -p <db_name> < diff_t2_t1_20241225.sql

Import CSV file (full sync):

-- Expected-Success: false
LOAD DATA LOCAL INFILE '/tmp/diff_output/diff_xxx.csv'
INTO TABLE test.t1
FIELDS ENCLOSED BY '"' ESCAPED BY '\\' TERMINATED BY ','
LINES TERMINATED BY '\n';

Example 6b: Export Differences to Stage (Object Storage)

Stage is a logical object in MatrixOne for connecting to external storage (such as S3, HDFS). You can output difference files directly to object storage for cross-cluster/cross-environment data synchronization.

-- Expected-Rows: 0
CREATE STAGE my_stage URL = 's3://my-bucket/diff-output/?region=us-east-1&access_key_id=xxx&secret_access_key=yyy';

-- Expected-Success: false
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT FILE 'stage://my_stage/';
+-------------------------------------------------+------------------------------------------+
| FILE SAVED TO                                   | HINT                                     |
+-------------------------------------------------+------------------------------------------+
| stage://my_stage/diff_t2_t1_20241225.sql        | DELETE FROM test.t1, REPLACE INTO test.t1 |
+-------------------------------------------------+------------------------------------------+

-- Expected-Success: false
SELECT load_file(CAST('stage://my_stage/diff_t2_t1_20241225.sql' AS DATALINK));

-- Expected-Rows: 0
DROP STAGE my_stage;

Advantages of using Stage:

Security: No need to expose AK/SK in every SQL statement; administrators configure once
Convenience: Encapsulate complex URL paths into simple object names
Cross-cluster sync: Source writes to object storage, target reads and executes directly

Example 7: Detect Update Operations

-- Expected-Rows: 0
CREATE TABLE test.t0 (a INT PRIMARY KEY, b INT, c INT);
-- Expected-Rows: 0
INSERT INTO test.t0 SELECT result, result, result FROM generate_series(1, 100) g;

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t1 FROM test.t0;
-- Expected-Rows: 3
UPDATE test.t1 SET c = c + 1 WHERE a IN (1, 50, 100);

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t0;
-- Expected-Rows: 3
UPDATE test.t2 SET c = c + 2 WHERE a IN (1, 25, 75);

-- Compare differences, will detect conflicting updates
-- Expected-Rows: 6
DATA BRANCH DIFF test.t2 AGAINST test.t1;
+--------------------+--------+------+------+------+
| diff t2 against t1 | flag   | a    | b    | c    |
+--------------------+--------+------+------+------+
| t2                 | UPDATE |    1 |    1 |    3 |
| t1                 | UPDATE |    1 |    1 |    2 |
| t2                 | UPDATE |   25 |   25 |   27 |
| t1                 | UPDATE |   50 |   50 |   51 |
| t2                 | UPDATE |   75 |   75 |   77 |
| t1                 | UPDATE |  100 |  100 |  101 |
+--------------------+--------+------+------+------+

-- Expected-Rows: 0
DROP TABLE test.t0;
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;

Example 8: Difference Comparison for Composite Primary Key Tables

-- Expected-Rows: 0
CREATE TABLE test.orders (
    tenant_id INT,
    order_code VARCHAR(8),
    amount DECIMAL(12,2),
    PRIMARY KEY (tenant_id, order_code)
);

-- Expected-Rows: 0
INSERT INTO test.orders VALUES
    (100, 'A100', 120.50),
    (100, 'A101', 80.00),
    (101, 'B200', 305.75);

-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.orders_branch FROM test.orders;

-- Make modifications on branch
-- Expected-Rows: 1
UPDATE test.orders_branch SET amount = 130.50 WHERE tenant_id = 100 AND order_code = 'A100';
-- Expected-Rows: 1
DELETE FROM test.orders_branch WHERE tenant_id = 100 AND order_code = 'A101';
-- Expected-Rows: 0
INSERT INTO test.orders_branch VALUES (102, 'C300', 512.25);

-- Compare differences
-- Expected-Rows: 3
DATA BRANCH DIFF test.orders_branch AGAINST test.orders;
+-----------------------------------+--------+-----------+------------+--------+
| diff orders_branch against orders | flag   | tenant_id | order_code | amount |
+-----------------------------------+--------+-----------+------------+--------+
| orders_branch                     | UPDATE |       100 | A100       | 130.50 |
| orders_branch                     | DELETE |       100 | A101       |  80.00 |
| orders_branch                     | INSERT |       102 | C300       | 512.25 |
+-----------------------------------+--------+-----------+------------+--------+

-- Expected-Rows: 0
DROP TABLE test.orders;
-- Expected-Rows: 0
DROP TABLE test.orders_branch;
-- Expected-Rows: 0
DROP DATABASE test;

Notes

Table Structure Consistency: The two tables being compared must have the same table structure (column names, column types).
Primary Key Requirement: Although tables without primary keys are supported, it is recommended to use tables with primary keys for more accurate difference results.
Performance Considerations: For large tables, difference comparison may take a long time. It is recommended to use OUTPUT COUNT first to understand the scale of differences.
Snapshot Validity: When comparing using snapshots, ensure the snapshots exist and are valid.
Output File: When using OUTPUT FILE, ensure the target directory exists and has write permissions. The generated SQL file can be executed directly to synchronize data.
LCA Detection: The system automatically detects LCA without manual specification. The existence of LCA affects how differences are calculated.