DATA BRANCH DIFF
语法说明
DATA BRANCH DIFF 语句用于比较两个表之间的数据差异。该功能类似于 Git 的 diff 命令,可以显示两个数据分支之间的插入、删除和更新操作。
系统会自动识别两个表之间的最近公共祖先(LCA,Lowest Common Ancestor),并基于此计算差异。差异结果包括:
- INSERT:目标表中存在但基准表中不存在的行
- DELETE:基准表中存在但目标表中不存在的行
- UPDATE:两个表中主键相同但其他列值不同的行
语法结构
DATA BRANCH DIFF target_table [{ SNAPSHOT = 'snapshot_name' }]
AGAINST base_table [{ SNAPSHOT = 'snapshot_name' }]
[OUTPUT output_option]
输出选项
output_option:
COUNT -- 仅返回差异行数
| LIMIT number -- 限制返回的差异行数
| FILE 'directory_path' -- 将差异导出为 SQL 文件
| AS table_name -- 将差异保存到表中(暂不支持)
语法释义
参数说明
| 参数 | 说明 |
|---|---|
target_table |
目标表(要比较的表) |
base_table |
基准表(作为比较基准的表) |
SNAPSHOT = 'snapshot_name' |
可选参数,指定使用某个快照时刻的数据进行比较 |
OUTPUT COUNT |
仅返回差异的行数统计 |
OUTPUT LIMIT number |
限制返回的差异行数 |
OUTPUT FILE 'path' |
将差异导出为 SQL 文件到指定目录,支持本地路径或 Stage 路径(如 stage://stage_name/) |
输出列说明
默认输出包含以下列:
| 列名 | 说明 |
|---|---|
diff target against base |
显示比较的表名 |
flag |
差异类型:INSERT、DELETE 或 UPDATE |
| 其他列 | 表的所有可见列 |
使用说明
LCA(最近公共祖先)
系统会自动检测两个表之间的分支关系:
- 无 LCA:两个表没有共同的祖先,直接比较所有数据
- 有 LCA:两个表有共同的祖先,基于祖先计算增量差异
- 自身为 LCA:一个表是另一个表的祖先
支持的表类型
- 带主键的表(推荐)
- 复合主键的表
- 无主键的表(使用隐藏的 fake primary key)
示例
示例 1:基本差异比较
比较两个没有共同祖先的表:
-- Expected-Rows: 0
CREATE DATABASE test;
-- Expected-Rows: 0
USE test;
-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b VARCHAR(10));
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (1, '1'), (2, '2'), (3, '3');
-- Expected-Rows: 0
CREATE TABLE test.t2 (a INT PRIMARY KEY, b VARCHAR(10));
-- Expected-Rows: 0
INSERT INTO test.t2 VALUES (1, '1'), (2, '2'), (4, '4');
-- Expected-Rows: 2
DATA BRANCH DIFF test.t2 AGAINST test.t1;
+-------------------+--------+------+------+
| diff t2 against t1 | flag | a | b |
+-------------------+--------+------+------+
| t2 | INSERT | 4 | 4 |
| t1 | INSERT | 3 | 3 |
+-------------------+--------+------+------+
-- Expected-Rows: 2
DATA BRANCH DIFF test.t1 AGAINST test.t2;
+-------------------+--------+------+------+
| diff t1 against t2 | flag | a | b |
+-------------------+--------+------+------+
| t1 | INSERT | 3 | 3 |
| t2 | INSERT | 4 | 4 |
+-------------------+--------+------+------+
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;
示例 2:比较分支表(有共同祖先)
-- Expected-Rows: 0
CREATE TABLE test.t0 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t0 VALUES (1, 1), (2, 2), (3, 3);
-- 从 t0 创建两个分支
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t1 FROM test.t0;
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (4, 4);
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t0;
-- Expected-Rows: 0
INSERT INTO test.t2 VALUES (5, 5);
-- Expected-Rows: 2
DATA BRANCH DIFF test.t2 AGAINST test.t1;
+-------------------+--------+------+------+
| diff t2 against t1 | flag | a | b |
+-------------------+--------+------+------+
| t2 | INSERT | 5 | 5 |
| t1 | INSERT | 4 | 4 |
+-------------------+--------+------+------+
-- Expected-Rows: 0
DROP TABLE test.t0;
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;
示例 3:使用快照比较
-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (1, 1), (2, 2);
-- Expected-Rows: 0
CREATE SNAPSHOT sp1 FOR TABLE test t1;
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (3, 3);
-- Expected-Rows: 1
UPDATE test.t1 SET b = 10 WHERE a = 1;
-- Expected-Rows: 0
CREATE SNAPSHOT sp2 FOR TABLE test t1;
-- Expected-Rows: 1
DATA BRANCH DIFF test.t1{SNAPSHOT='sp2'} AGAINST test.t1{SNAPSHOT='sp1'};
+--------------------+--------+------+------+
| diff t1 against t1 | flag | a | b |
+--------------------+--------+------+------+
| t1 | INSERT | 3 | 3 |
+--------------------+--------+------+------+
-- Expected-Rows: 0
DROP SNAPSHOT sp1;
-- Expected-Rows: 0
DROP SNAPSHOT sp2;
-- Expected-Rows: 0
DROP TABLE test.t1;
Note
当比较同一张表的两个快照时,系统会基于快照之间的增量变化进行比较。在此示例中,虽然 UPDATE 操作修改了 a=1 的行,但由于快照比较的特殊机制,只显示了新增的行。
示例 4:仅获取差异数量
-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 SELECT result, result FROM generate_series(1, 1000) g;
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t1;
-- Expected-Rows: 0
INSERT INTO test.t2 SELECT result, result FROM generate_series(1001, 2000) g;
-- Expected-Rows: 100
DELETE FROM test.t2 WHERE a <= 100;
-- Expected-Rows: 1
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT COUNT;
+----------+
| COUNT(*) |
+----------+
| 1100 |
+----------+
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;
示例 5:限制返回行数
-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 SELECT result, result FROM generate_series(1, 100) g;
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t1;
-- Expected-Rows: 0
INSERT INTO test.t2 SELECT result, result FROM generate_series(101, 200) g;
-- Expected-Rows: 5
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT LIMIT 5;
+--------------------+--------+------+------+
| diff t2 against t1 | flag | a | b |
+--------------------+--------+------+------+
| t2 | INSERT | 106 | 106 |
| t2 | INSERT | 107 | 107 |
| t2 | INSERT | 117 | 117 |
| t2 | INSERT | 124 | 124 |
| t2 | INSERT | 156 | 156 |
+--------------------+--------+------+------+
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;
Note
OUTPUT LIMIT 返回的行不一定是按主键顺序排列的,而是按照内部存储顺序返回前 N 行差异。
示例 6:导出差异为 SQL 文件
-- Expected-Rows: 0
CREATE TABLE test.t1 (a INT PRIMARY KEY, b INT);
-- Expected-Rows: 0
INSERT INTO test.t1 VALUES (1, 1), (2, 2);
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t1;
-- Expected-Rows: 0
INSERT INTO test.t2 VALUES (3, 3);
-- Expected-Rows: 1
UPDATE test.t2 SET b = 10 WHERE a = 1;
-- Expected-Rows: 1
DELETE FROM test.t2 WHERE a = 2;
-- Expected-Success: false
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT FILE '/tmp/diff_output/';
+------------------------------------------+------------------------------------------+
| FILE SAVED TO | HINT |
+------------------------------------------+------------------------------------------+
| /tmp/diff_output/diff_t2_t1_20241225.sql | DELETE FROM test.t1, REPLACE INTO test.t1 |
+------------------------------------------+------------------------------------------+
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;
输出文件类型说明
根据差异类型,系统会生成不同格式的文件:
| 场景 | 文件格式 | 说明 |
|---|---|---|
| 增量同步(目标表非空) | .sql |
包含 DELETE FROM ... 和 REPLACE INTO ... 语句 |
| 全量同步(目标表为空) | .csv |
CSV 格式,可通过 LOAD DATA 导入 |
回放补丁文件
回放 SQL 文件(增量同步):
mysql -h <mo_host> -P <mo_port> -u <user> -p <db_name> < diff_t2_t1_20241225.sql
导入 CSV 文件(全量同步):
-- Expected-Success: false
LOAD DATA LOCAL INFILE '/tmp/diff_output/diff_xxx.csv'
INTO TABLE test.t1
FIELDS ENCLOSED BY '"' ESCAPED BY '\\' TERMINATED BY ','
LINES TERMINATED BY '\n';
示例 6b:导出差异到 Stage(对象存储)
Stage 是 MatrixOne 用于连接外部存储(如 S3、HDFS)的逻辑对象,可以将差异文件直接输出到对象存储,便于跨集群/跨环境同步数据。
-- Expected-Rows: 0
CREATE STAGE my_stage URL = 's3://my-bucket/diff-output/?region=us-east-1&access_key_id=xxx&secret_access_key=yyy';
-- Expected-Success: false
DATA BRANCH DIFF test.t2 AGAINST test.t1 OUTPUT FILE 'stage://my_stage/';
+-------------------------------------------------+------------------------------------------+
| FILE SAVED TO | HINT |
+-------------------------------------------------+------------------------------------------+
| stage://my_stage/diff_t2_t1_20241225.sql | DELETE FROM test.t1, REPLACE INTO test.t1 |
+-------------------------------------------------+------------------------------------------+
-- Expected-Success: false
SELECT load_file(CAST('stage://my_stage/diff_t2_t1_20241225.sql' AS DATALINK));
-- Expected-Rows: 0
DROP STAGE my_stage;
使用 Stage 的优势:
- 安全:无需在每次 SQL 中暴露 AK/SK,管理员配置一次即可
- 便捷:将复杂的 URL 路径封装成简单的对象名
- 跨集群同步:源端写到对象存储,目标端直接读取执行
示例 7:检测更新操作
-- Expected-Rows: 0
CREATE TABLE test.t0 (a INT PRIMARY KEY, b INT, c INT);
-- Expected-Rows: 0
INSERT INTO test.t0 SELECT result, result, result FROM generate_series(1, 100) g;
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t1 FROM test.t0;
-- Expected-Rows: 3
UPDATE test.t1 SET c = c + 1 WHERE a IN (1, 50, 100);
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.t2 FROM test.t0;
-- Expected-Rows: 3
UPDATE test.t2 SET c = c + 2 WHERE a IN (1, 25, 75);
-- 比较差异,将检测到冲突的更新
-- Expected-Rows: 6
DATA BRANCH DIFF test.t2 AGAINST test.t1;
+--------------------+--------+------+------+------+
| diff t2 against t1 | flag | a | b | c |
+--------------------+--------+------+------+------+
| t2 | UPDATE | 1 | 1 | 3 |
| t1 | UPDATE | 1 | 1 | 2 |
| t2 | UPDATE | 25 | 25 | 27 |
| t1 | UPDATE | 50 | 50 | 51 |
| t2 | UPDATE | 75 | 75 | 77 |
| t1 | UPDATE | 100 | 100 | 101 |
+--------------------+--------+------+------+------+
-- Expected-Rows: 0
DROP TABLE test.t0;
-- Expected-Rows: 0
DROP TABLE test.t1;
-- Expected-Rows: 0
DROP TABLE test.t2;
示例 8:复合主键表的差异比较
-- Expected-Rows: 0
CREATE TABLE test.orders (
tenant_id INT,
order_code VARCHAR(8),
amount DECIMAL(12,2),
PRIMARY KEY (tenant_id, order_code)
);
-- Expected-Rows: 0
INSERT INTO test.orders VALUES
(100, 'A100', 120.50),
(100, 'A101', 80.00),
(101, 'B200', 305.75);
-- Expected-Rows: 0
DATA BRANCH CREATE TABLE test.orders_branch FROM test.orders;
-- 在分支上进行修改
-- Expected-Rows: 1
UPDATE test.orders_branch SET amount = 130.50 WHERE tenant_id = 100 AND order_code = 'A100';
-- Expected-Rows: 1
DELETE FROM test.orders_branch WHERE tenant_id = 100 AND order_code = 'A101';
-- Expected-Rows: 0
INSERT INTO test.orders_branch VALUES (102, 'C300', 512.25);
-- 比较差异
-- Expected-Rows: 3
DATA BRANCH DIFF test.orders_branch AGAINST test.orders;
+-----------------------------------+--------+-----------+------------+--------+
| diff orders_branch against orders | flag | tenant_id | order_code | amount |
+-----------------------------------+--------+-----------+------------+--------+
| orders_branch | UPDATE | 100 | A100 | 130.50 |
| orders_branch | DELETE | 100 | A101 | 80.00 |
| orders_branch | INSERT | 102 | C300 | 512.25 |
+-----------------------------------+--------+-----------+------------+--------+
-- Expected-Rows: 0
DROP TABLE test.orders;
-- Expected-Rows: 0
DROP TABLE test.orders_branch;
-- Expected-Rows: 0
DROP DATABASE test;
注意事项
-
表结构一致性:进行差异比较的两个表必须具有相同的表结构(列名、列类型)。
-
主键要求:虽然支持无主键的表,但建议使用带主键的表以获得更准确的差异结果。
-
性能考虑:对于大表,差异比较可能需要较长时间。建议使用
OUTPUT COUNT先了解差异规模。 -
快照有效性:使用快照比较时,确保快照存在且有效。
-
输出文件:使用
OUTPUT FILE时,确保目标目录存在且有写入权限。生成的 SQL 文件可以直接执行以同步数据。 -
LCA 检测:系统会自动检测 LCA,无需手动指定。LCA 的存在会影响差异计算的方式。