金融行业审计相关的表结构设计
即便是普通的应用开发中,大部分的表结构都会考虑数据的几个基本的时间点和变更人来记录一条记录的审计信息,比如
id | account | stock | quantity | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 1000 | 2022-01-30T09:04:35.517Z | 2022-02-13T09:04:35.517Z | null | Damon Yuan |
2 | A | 601318.SH | 100 | 2022-03-29T09:04:35.517Z | 2022-03-31T09:04:35.517Z | 2022-03-31T09:04:35.517Z | Tony Ma |
-- 获取账户 A 6000519.SH 的最近持仓
select p.quantity
from position p
where p.account = 'A' and p.stock = '600519.SH' and deleted_at is null
但是这样会丢失以下几条信息点
- 数据库记录的变更历史丢失。以上表为例,对于 A 账户中的每只股票,只留下了最新的账户持仓,而数量变更产生的入库记录都丢失了。
- 数据库记录的变更人历史丢失。以上表为例,对于 A 账户的每一只股票,只留下了最后的变更人及更新时间,而之前的记录创建人信息,其他的变更人信息以及相应的时间戳都无法保存下来。
业务发生历史丢失。这条信息点比较难理解,我们假设对于 A 账户 600519.SH 的持仓是由以下几次业务变更产生的
- 在 2022-01-30 开仓 +500 合计 500,簿记入库
- 在 2022-02-11 补仓 +100,簿记失败未入库,合计仍然为 500
- 在 2022-02-12 补仓 +500 合计 1000,簿记入库
- 同一天马上发现实际补仓数量是 +400,先反向平仓之前那一条记录,再增加一条 +400 的记录
- 在 2022-02-13 发现第二条簿记失败,补记该条记录,合计更新为 1000
这些信息都无法从数据库记录中体现。注意这里和第一条信息点的区别,我们会在下面对两者进行重点比较。
常见的审计相关表结构
为了能够记录数据库的历史变更,通常在业务中常见的审计相关表结构可以调整为如下两种格式:
- 交易流水模式
id | account | stock | quantity | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | null | Damon Yuan |
2 | A | 600519.SH | 500 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | null | Alex Wu |
3 | A | 600519.SH | -500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | null | Alex Wu |
4 | A | 600519.SH | 400 | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | null | Alex Wu |
5 | A | 600519.SH | 100 | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | null | Lily Liu |
-- 获取账户 A 6000519.SH 的最近持仓
with cumulative_change as (
select pc1.id as id, sum(pc2.quantity) as cumulative_change
from position_change pc1
join position_change pc2 on pc1.id >= pc2.id
)
select cc.cumulative_change as quantity
from cumulative_change cc
where cc.id = (
select max(id) from cumulative_change
)
- 交易持仓模式
id | account | stock | quantity | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Damon Yuan |
2 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu |
3 | A | 600519.SH | 500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu |
4 | A | 600519.SH | 900 | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu |
5 | A | 600519.SH | 1000 | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | null | Lily Liu |
-- 获取账户 A 6000519.SH 的最近持仓
select p.quantity
from position p
where p.account = 'A' and p.stock = '600519.SH' and deleted_at is null
本质上这两种模式都通过增加新的记录来达到记录数据变更历史的目的,都可以解决数据库记录的变更历史丢失和数据库记录的变更人历史丢失的问题。但是它们都不能解决第三个问题 - 业务发生历史的丢失。正是因为缺乏业务历史信息,我们无法通过数据复盘历史上某一个时间点的准确仓位。 例如,从上面两种表结构,我们都无法知道第五条记录业务发生的时间实际上是 2022-02-11,因此时间轴上无法查询获取 2022-02-12T09:05:45.517Z
这个时间点上的准确仓位。
-- 交易流水模式查询 2022-02-12T09:05:45.517Z 这个时间点的仓位
with cumulative_change as (
select pc1.id as id,
sum(pc2.quantity) as cumulative_change,
pc1.created_at as start_at,
pc3.created_at as end_at
pc1.created_at,
pc1.updated_at,
pc1.deleted_at
from position_change pc1
join position_change pc2 on pc1.id >= pc2.id
join position_change pc3 on pc1.id + 1 = pc3.id
)
select cc.cumulative_change as quantity
from cumulative_change cc
where p.account = 'A' and p.stock = '600519.SH'
and cc.start_at <= date'2022-02-12T09:05:45.517Z'
and (cc.end_at is null or cc.end_at > date'2022-02-12T09:05:45.517Z')
这里查询出来的结果为 500,然而实际上应该为 600。
-- 交易持仓模式查询 2022-02-12T09:05:45.517Z 这个时间点的仓位
select *
from position p
where p.account = 'A' and p.stock = '600519.SH'
and p.created_at <= date'2022-02-12T09:05:45.517Z'
and (p.deleted_at is null or p.deleted_at > date'2022-02-12T09:05:45.517Z')
这里查询出来的结果同样为 500,然而实际上应该为 600。
双时间轴审计表结构
为了记录业务发生时间,我们可以在原有的记录变更时间轴的基础上,再加一条业务变更时间轴;同时由于我们记录了所有的变更,所以理论上 updated_at 时间戳可以删除,但是保留下来和 created_at 做对照可以验证该条记录是否被人为修改过。基于交易持仓模式的表结构因而可以变更为如下结构,结合业务场景分析如下(注意各个时间戳的变化):
- 在 2022-01-30 开仓 +500 合计 500,簿记入库
id | account | stock | quantity | start_at | end_at | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | Damon Yuan |
- 在 2022-02-11 补仓 +100,簿记失败未入库,合计仍然为 500,表记录不变
- 在 2022-02-12 补仓 +500 合计 1000,簿记入库
id | account | stock | quantity | start_at | end_at | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Damon Yuan | |
2 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Alex Wu | |
3 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Alex Wu |
这里我们先通过 deleted_at 逻辑删除之前那条记录,再生成两条新的记录。注意第二条记录的 end_at 时间戳必须是第三条记录的 start_at 时间戳,这样才能保证业务事务发生时间轴的连续性。注意为了保证原子性,这些操作应该包含在一个 Transaction 实现。
- 同一天马上发现实际补仓数量是 +400,先反向平仓之前那一条记录,
id | account | stock | quantity | start_at | end_at | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Damon Yuan | |
2 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu |
3 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu | |
4 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | Damon Yuan | |
5 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu | |
6 | A | 600519.SH | 500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu |
这里我们先逻辑删除第二和第三条记录,然后通过 5 和 6 两条记录相互抵消掉簿记错误。同样为了保证原子性,这些操作应该包含在一个 Transaction 实现。
再增加一条 +400 的记录,
id | account | stock | quantity | start_at | end_at | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Damon Yuan | |
2 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu |
3 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu | |
4 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | Damon Yuan |
5 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu |
6 | A | 600519.SH | 500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu |
7 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | Damon Yuan | |
8 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu | |
9 | A | 600519.SH | 500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu | |
10 | A | 600519.SH | 900 | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu |
- 在 2022-02-13 发现第二条簿记失败,补记该条记录,合计更新为 1000
id | account | stock | quantity | start_at | end_at | created_at | updated_at | deleted_at | updated_by |
---|---|---|---|---|---|---|---|---|---|
1 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | Damon Yuan | |
2 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu |
3 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | Alex Wu | |
4 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | Damon Yuan |
5 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu |
6 | A | 600519.SH | 500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | Alex Wu |
7 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | Damon Yuan |
8 | A | 600519.SH | 1000 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu |
9 | A | 600519.SH | 500 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu |
10 | A | 600519.SH | 900 | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu | |
11 | A | 600519.SH | 500 | 2022-01-30T09:04:35.517Z | 2022-02-11T09:04:35.517Z | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | Damon Yuan | |
12 | A | 600519.SH | 600 | 2022-02-11T09:04:35.517Z | 2022-02-12T09:04:35.517Z | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | Lily Liu | |
13 | A | 600519.SH | 1100 | 2022-02-12T09:04:35.517Z | 2022-02-12T09:05:35.517Z | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu | |
14 | A | 600519.SH | 600 | 2022-02-12T09:05:35.517Z | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu | |
15 | A | 600519.SH | 1000 | 2022-02-12T09:06:35.517Z | 2022-02-13T09:04:35.517Z | 2022-02-13T09:04:35.517Z | Alex Wu |
-- 获取账户 A 6000519.SH 的最近持仓
select p.quantity
from position p
where p.account = 'A' and
p.stock = '600519.SH' and
p.deleted_at is null and
p.end_at is null
-- 查询 2022-02-12T09:05:45.517Z 时刻账户 A 6000519.SH 的持仓
select p.quantity
from position p
where p.account = 'A'
and p.stock = '600519.SH'
and p.deleted_at is null
and p.start_at <= date'2022-02-12T09:05:45.517Z'
and (p.end_at > date'2022-02-12T09:05:45.517Z' or p.end_at is null)
查询结果 600,这是和正确的历史记录相符的。
综合以上几个步骤可以发现通过这种方法我们可以复盘整个业务历史和数据库操作历史,从而查询历史上某一时刻的准确仓位。其代价是大量的存储空间。一般这种情况会通过 data retention policy 设定历史记录(deleted_at 不为空)的存储时间,同时每天对当天记录进行历史归档。这对于量化回测尤为重要。
问题
最大的问题为数据量的问题。假设每日产生数据为 k, 则第 n 日需要存储的数据量为 kn, [1, n] 日总数据量为 k * (1 + 2 + 3 + ... + n) = kn(n + 1) / 2,空间复杂度为 O(n^2)。
解决方案一
细化 bitemporal 数据的粒度,比如每次更新 stock position 都通过 Account, Stock ID 选取最细粒度的 subset 进行更新,这样可以减少需要更新 deleted_at 的数据的量。
在业务时间轴上一般选择日为单位时间。
解决方案二
TODO