MT19937¶

MT19937即梅森旋转算法（Mersenne twister）由松本眞（日语：松本真）和西村拓士在1997年开发，基于二进制有限域 $\mathbb{F}_2$ 上的矩阵线性递归，可以快速产生高质量的伪随机数。

该算法的周期为 $2^{19937}-1$ ，故名为MT19937。该算法具有以下优点
1. 周期非常长，为 $2^{19937}-1$
2. 在 $1\le k \le 623$ 都满足 $k$ -分布
3. 能比硬件实现的方法更快产生随机数
$k$ -分布：一个周期为 $P$ 的 $w$ 位整数的伪随机数序列 $x_i$ ，如果下列成立则称其 $v$ -比特精度的 $k$ -分布成立。

令 $\mathrm{trunc_v(x)}$ 表示由 $x$ 的前 $v$ 位形成的数，并考虑 $P$ 中 $k$ 个 $v$ 位向量。

$(\mathrm{trunc_v(x_i)},\mathrm{trunc_v(x_{i+1})},\dots,\mathrm{trunc_v(x_{i+k-1})})\ (0\le i< P)$

然后， $2^{kv}$ 个组合中每一个都在一个周期内出现次数相同（全0组合出现次数较少除外）。

代码实现¶

MT19937算法可分为三个部分

初始化
旋转状态
提取伪随机数

其中32位的MT19937 Python代码实现为：

def _int32(x):
    return int(0xFFFFFFFF & x)

class MT19937:
    # 初始化
    def __init__(self, seed):
        self.mt = [0] * 624
        self.mt[0] = seed
        self.mti = 0
        for i in range(1, 624):
            self.mt[i] = _int32(1812433253 * (self.mt[i - 1] ^ self.mt[i - 1] >> 30) + i)

    # 提取伪随机数
    def extract_number(self):
        if self.mti == 0:
            self.twist()
        y = self.mt[self.mti]
        y = y ^ y >> 11
        y = y ^ y << 7 & 2636928640
        y = y ^ y << 15 & 4022730752
        y = y ^ y >> 18
        self.mti = (self.mti + 1) % 624
        return _int32(y)

    # 旋转状态
    def twist(self):
        for i in range(0, 624):
            y = _int32((self.mt[i] & 0x80000000) + (self.mt[(i + 1) % 624] & 0x7fffffff))
            self.mt[i] = (y >> 1) ^ self.mt[(i + 397) % 624]

            if y % 2 != 0:
                self.mt[i] = self.mt[i] ^ 0x9908b0df

安全分析¶

逆向extract_number函数¶

对于extract_number函数，可以发现就是将状态中的数据进行位运算后给出。

显然我们可以发现这里的四次位运算类似，我们以y = y ^ y << 7 & 2636928640为例来进行逆向处理。

为了方便描述，我们将代码改为y = x ^ ((x << 7) & 2636928640)，其中 $y$ 可以看作已知， $x$ 未知。

$\begin{aligned} x &\rightarrow \mathtt{aaaabbbbccccddddee{\color{blue}eeffffg}\color{red}{ggghhhh}}\\ &\oplus\\ x<<7 &\rightarrow \mathtt{bccccddddee{\color{blue}eeffffg}{\color{red}{ggghhhh}}0000000}\\ &\&\\ 2636928640 &\rightarrow \mathtt{10011101001011000101011010000000}\\ &=\\ y &\rightarrow \mathtt{naannnbnccncnndden{\color{purple}enfnnfn}\color{red}{ggghhhh}} \end{aligned}$

这里为了方便编写便使用相同的字母作为一段，实际上其值可能不同。

显然我们可以发现 $y$ 的低7位就是 $x$ 的低7位和2636928640的低7位异或的结果。且2636928640低7位为0，所以 $y$ 的低7位就是 $x$ 的低7位。即我们能够得到 $\mathtt{\color{red}{ggghhhh}}$ 的结果，基于此，我们便可以向前恢复得到 $\mathtt{\color{blue}eeffffg}$ 。

$\mathtt{\color{blue}eeffffg} = (\mathtt{\color{red}{ggghhhh}} \&\mathtt{0101101}) \oplus \mathtt{\color{purple}enfnnfn}$

同理可以向前恢复得到 $x$ 的所有值。完整Python代码实现为
```
o = 3989032602

def inverse_right_mask(res, shift, mask=0xffffffff, bits=32):
    tmp = res
    for i in range(bits // shift):
        tmp = res ^ tmp >> shift & mask
    return tmp

def inverse_left_mask(res, shift, mask=0xffffffff, bits=32):
    tmp = res
    for i in range(bits // shift):
        tmp = res ^ tmp << shift & mask
    return tmp

def extract_number(y):
    y = y ^ y >> 11
    y = y ^ y << 7 & 2636928640
    y = y ^ y << 15 & 4022730752
    y = y ^ y >> 18
    return y&0xffffffff

def recover(y):
    y = inverse_right_mask(y,18)
    y = inverse_left_mask(y,15,4022730752)
    y = inverse_left_mask(y,7,2636928640)
    y = inverse_right_mask(y,11)
    return y&0xffffffff

print(recover(extract_number(o)) == o)
```

预测随机数¶

当我们能够逆向extract_number中的数据时，这也意味着我们能够提取得到state中的原始数据，同时可以发现后续的随机数完全依赖于上一轮的state，所以如果我们能够得到某一轮的全部state数据，便可以向后调用twist来预测随机数。

此类题型在CTF中经常出现，部分题型没有明确的给出随机数信息，但可以泄漏的其他信息得到随机数，解题时可以考虑加以注意是否能够直接或间接的得到题目中的随机数信息。

逆向twist函数¶

在上文中我们提到如果得到了某一轮state的全部信息便可以向后预测随机数，那么如果我们需要向前恢复随机数，则需要对twist函数进行逆向。

$\begin{aligned} y \rightarrow& \mathtt{aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa{\color{blue}{u}}\color{red}{x}}\\ y \gg 1 \rightarrow& \mathtt{0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa{\color{blue}{u}}}\\ \oplus&\\ \mathrm{0x9908b0df} \rightarrow& \mathtt{10011001000010001011000011011111}\\ =& \mathtt{bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb}\\ \oplus&\\ s_{i+397} =& \mathtt{ccccccccccccccccccccccccccccccc{\color{blue}{v}}}\\ =&\\ out=& \mathtt{ddddddddddddddddddddddddddddddd{\color{blue}{d}}} \end{aligned}$

其中y = state[i] & 0x80000000 | state[i + 1] & 0x7fffffff，我们可以发现新的值只和 $s_i,s_{i+1},s_{i+397}$ 有关，在上面的两次异或中，第一次异或和 $y$ 的奇偶性来确定

由于 $y\gg 1$ 的最高位必定为0，此时我们可以列出下列关系表，

0 1

0 N Y

1 Y N

其中行为最终输出out的最高位值，列为 $s_{i+397}$ 的最高位值，单元值代表是否发生第一次异或。由此通过判断是否发生异或，可以得到 $y$ 的最低位值。

此时按照extract_number中一样的方法，我们有

$\mathtt{\color{blue}{u}} = {\color{blue}{d}}\oplus {\color{blue}{v}} \oplus {\color{red}{x}}\ ,\mathrm{or}\ \mathtt{\color{blue}{u}} = {\color{blue}{d}}\oplus {\color{blue}{v}} \oplus {\color{red}{x}}\oplus 1$

同理，我们便可以向前恢复得到 $y$ 的所有值。

同时 $y$ 的值即为 $s_i$ 的最高位和 $s_{i+1}$ 的第[2, 32]位。同时我们可以对 $s_{i+1}$ 和 $s_{i+2}$ 生成的 $y$ 值做同样操作得到同样 $s_{i+1}$ 的最高位，对 $s_{i-1}$ 和 $s_i$ 生成的 $y$ 值做同样操作得到 $s_i$ 的第[2, 32]位，至此原state恢复完成。

Python代码实现为：
```
def inv_twist(state):
    high = 0x80000000
    low = 0x7fffffff
    mask = 0x9908b0df

    def recover(i):
        y = state[i + 624] ^ state[i + 397]
        if y & high == high: # 异或了常数
            y ^= mask
            y <<= 1
            y |= 1
        else: # 没有异或常数
            y <<= 1
        return y

    for i in range(len(state)-625, -1, -1):
        # 得到s_i的最高位
        state[i] = recover(i) & high
        # 对s_{i-1}做同样操作得到2-32位
        state[i] |= recover(i-1) & low
    return state
```

逆向init函数¶

我们可以发现第一轮的初始状态是通过seed生成的。

关键操作为

self.mt[i] = _int32(1812433253 * (self.mt[i - 1] ^ self.mt[i - 1] >> 30) + i)

显然这里的加法和乘法都存在逆运算，而中间的 $mt_{i-1}\oplus (mt_{i-1} \gg 30)$ 可以通过上文所述方法进行逆运算。

from gmpy2 import invert

def _int32(x):
    return int(0xFFFFFFFF & x)

def init(seed):
    mt = [0] * 624
    mt[0] = seed
    for i in range(1, 624):
        mt[i] = _int32(1812433253 * (mt[i - 1] ^ mt[i - 1] >> 30) + i)
    return mt

seed = 3989032602

def invert_right(res,shift):
    tmp = res
    for i in range(32//shift):
        res = tmp^res>>shift
    return _int32(res)

def recover(last):
    n = 1<<32
    inv = invert(1812433253,n)
    for i in range(623,0,-1):
        last = ((last-i)*inv)%n
        last = invert_right(last,30)
    return last

state = init(seed)

print(recover(state[-1]) == seed)

扩展¶

在上文中我们介绍了对于MT19937各个函数的逆向分析以及算法实现，但其中我们针对的都是以32bit为一组，并且能够获取到足够多连续随机数值的情况，那么如果我们得到的并不是连续的随机数此时我们该做何分析。

以Python中的random实现的getrandbits为例。

// getrandbits(k) -> x.  Generates an int with k random bits.
static PyObject *
_random_Random_getrandbits_impl(RandomObject *self, int k)
/*[clinic end generated code: output=b402f82a2158887f input=8c0e6396dd176fc0]*/
{
    int i, words;
    uint32_t r;
    uint32_t *wordarray;
    PyObject *result;

    if (k < 0) {
        PyErr_SetString(PyExc_ValueError,
                        "number of bits must be non-negative");
        return NULL;
    }

    if (k == 0)
        return PyLong_FromLong(0);

    if (k <= 32)  /* Fast path */
        return PyLong_FromUnsignedLong(genrand_uint32(self) >> (32 - k));

    words = (k - 1) / 32 + 1;
    wordarray = (uint32_t *)PyMem_Malloc(words * 4);
    if (wordarray == NULL) {
        PyErr_NoMemory();
        return NULL;
    }

    /* Fill-out bits of long integer, by 32-bit words, from least significant
    to most significant. */
#if PY_LITTLE_ENDIAN
    for (i = 0; i < words; i++, k -= 32)
#else
    for (i = words - 1; i >= 0; i--, k -= 32)
#endif
    {
        r = genrand_uint32(self);
        if (k < 32)
            r >>= (32 - k);  /* Drop least significant bits */
        wordarray[i] = r;
    }

    result = _PyLong_FromByteArray((unsigned char *)wordarray, words * 4,
                                PY_LITTLE_ENDIAN, 0 /* unsigned */);
    PyMem_Free(wordarray);
    return result;
}

可以发现，当参数为0时，返回0，当参数小于32时，生成一个32位的随机数取其高位，当参数大于32时，生成多个随机数进行拼接。我们不妨考虑一些最极端的情况

情况一¶

考虑我们不能获取连续的32bit随机数，只能隔1个取1个，考虑此时恢复随机数。

显然这对我们来说没有太多影响，因为 $s_{i+624}$ 只和 $s_i,s_{i+1},s_{i+397}$ 有关。假设此时 $i$ 为偶数，且我们只取第奇数位数，则我们可以得到 $s_{i+1}, s_{i+397}$ 。

$\begin{aligned} y \rightarrow& \mathtt{{\color{red}{a}}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa{\color{blue}{u}}x}\\ y \gg 1 \rightarrow& \mathtt{0{\color{red}{a}}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa{\color{blue}{u}}}\\ \oplus&\\ \mathrm{0x9908b0df} \rightarrow& \mathtt{10011001000010001011000011011111}\\ =& \mathtt{bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb}\\ \oplus&\\ s_{i+397} =& \mathtt{{\color{blue}{v}}{\color{blue}{c}}ccccccccccccccccccccccccccccc}\\ =&\\ out=& \mathtt{{\color{blue}{d}}{\color{purple}{d}}ddddddddddddddddddddddddddddd} \end{aligned}$

此时我们可以得到 $y$ 的[2-32]位，则可以由此得到 $s_{i+624}$ 的第1和第[3-32]位。同时又有第2位的约束

$\begin{cases} \mathtt{{\color{red}{a}}}\oplus\mathtt{{\color{blue}{c}}} = \mathtt{{\color{purple}{d}}}\\ \end{cases}$

显然这是一个较强的约束，因为 $s_{i+624}$ 中的错误会扩散至 $s_{i},s_{i+624-397},s_{i+624-1},...$

而此时我们能够所有第奇数个数的值和第偶数个数的 ${1,3-32}$ 位值。故我们可以得到 $s_{i+624}$ 以及 $s_{i}$ 的全部信息。

情况二¶

考虑我们现在只能获取所有第偶数个数的情况，此时显然我们只能够获取 $s_{i+397}$ 的第2位。但是 $s_{i+397}$ 的第二位配合 $s_{i+397+1}$ ，我们便能知道

y = _int32((self.mt[i+397] & 0x80000000) + (self.mt[(i+397+1) % 624] & 0x7fffffff))

的全部信息，此时再异或 $s_{i+397+397}$ 便可以得到 $s_{i+397+624}$ 的所有信息。最终通过此信息我们便能向前回推出 $s_j\ (1 \le j)$ 的所有信息。

情况三¶

现在我们考虑一种更加极端的情况，假设我们只能获取每个随机数的第 $k$ 位比特信息。我们是否还能利用扩散得到足够多的约束信息。

这里我们可以回顾之前的内容，我们的本质是通过已知信息推出与之直接关联的未知信息，再利用已知信息和未知信息之间的约束条件进行扩散，因为我们有足够多的信息，我们总是能够利用足够多的约束来求解所有信息。

对于这种情况，我们来考虑MT19937在 $\mathbb{F}_2$ 上的表现形式。

设state[i]的二进制表示为：

$\mathbf{x} = \{x_0,x_1,\dots,x_{30},x_{31}\}$

输出output的二进制表示位：

$\mathbf{z} = \{z_0,z_1,\dots,z_{30},z_{31}\}$

我们有

$\begin{aligned} z_0=&x_0 \oplus x_4 \oplus x_7 \oplus x_{15}\\ z_1=&x_1 \oplus x_5 \oplus x_{16}\\ z_2=&x_2 \oplus x_6 \oplus x_{13} \oplus x_{17} \oplus x_{24}\\ z_3=&x_3 \oplus x_{10}\\ z_4=&x_0 \oplus x_4 \oplus x_8 \oplus x_{11} \oplus x_{15} \oplus x_{19} \oplus x_{26}\\ z_5=&x_1 \oplus x_5 \oplus x_9 \oplus x_{12} \oplus x_{20}\\ z_6=&x_6 \oplus x_{10} \oplus x_{17} \oplus x_{21} \oplus x_{28}\\ z_7=&x_3 \oplus x_7 \oplus x_{11} \oplus x_{14} \oplus x_{18} \oplus x_{22} \oplus x_{29}\\ z_8=&x_8 \oplus x_{12} \oplus x_{23}\\ z_9=&x_9 \oplus x_{13} \oplus x_{20} \oplus x_{24} \oplus x_{31}\\ z_{10}=&x_6 \oplus x_{10} \oplus x_{17}\\ z_{11}=&x_0 \oplus x_{11}\\ z_{12}=&x_1 \oplus x_8 \oplus x_{12} \oplus x_{19}\\ z_{13}=&x_2 \oplus x_9 \oplus x_{13} \oplus x_{17} \oplus x_{20} \oplus x_{28}\\ z_{14}=&x_3 \oplus x_{14} \oplus x_{18} \oplus x_{29}\\ z_{15}=&x_4 \oplus x_{15}\\ z_{16}=&x_5 \oplus x_{16}\\ z_{17}=&x_6 \oplus x_{13} \oplus x_{17} \oplus x_{24}\\ z_{18}=&x_0 \oplus x_4 \oplus x_{15} \oplus x_{18}\\ z_{19}=&x_1 \oplus x_5 \oplus x_8 \oplus x_{15} \oplus x_{16} \oplus x_{19} \oplus x_{26}\\ z_{20}=&x_2 \oplus x_6 \oplus x_9 \oplus x_{13} \oplus x_{17} \oplus x_{20} \oplus x_{24}\\ z_{21}=&x_3 \oplus x_{17} \oplus x_{21} \oplus x_{28}\\ z_{22}=&x_0 \oplus x_4 \oplus x_8 \oplus x_{15} \oplus x_{18} \oplus x_{19} \oplus x_{22} \oplus x_{26} \oplus x_{29}\\ z_{23}=&x_1 \oplus x_5 \oplus x_9 \oplus x_{20} \oplus x_{23}\\ z_{24}=&x_6 \oplus x_{10} \oplus x_{13} \oplus x_{17} \oplus x_{20} \oplus x_{21} \oplus x_{24} \oplus x_{28} \oplus x_{31}\\ z_{25}=&x_3 \oplus x_7 \oplus x_{11} \oplus x_{18} \oplus x_{22} \oplus x_{25} \oplus x_{29}\\ z_{26}=&x_8 \oplus x_{12} \oplus x_{15} \oplus x_{23} \oplus x_{26}\\ z_{27}=&x_9 \oplus x_{13} \oplus x_{16} \oplus x_{20} \oplus x_{24} \oplus x_{27} \oplus x_{31}\\ z_{28}=&x_6 \oplus x_{10} \oplus x_{28}\\ z_{29}=&x_0 \oplus x_{11} \oplus x_{18} \oplus x_{29}\\ z_{30}=&x_1 \oplus x_8 \oplus x_{12} \oplus x_{30}\\ z_{31}=&x_2 \oplus x_9 \oplus x_{13} \oplus x_{17} \oplus x_{28} \oplus x_{31} \end{aligned}$

显然我们可以构造一个 $\mathbb{F}_2^{32}$ 方阵 $T$ 满足

$\mathbf{x}\cdot T = \mathbf{z}$

对于如何构造这个方阵有很多方式，在[1]中使用了一种选择乘数的方式得到 $T$ 的每一行。

现在我们把这种表示扩展到更高位空间，对于MT19937中的某一轮，其state可以表示为

$\mathbf{x} = \{x_0,x_1,\dots,x_{19966},x_{19967}\}$

此时如果我们能获取每个随机数的第 $k$ 位比特，则我们可以得到

$s = \{z_{i+k}\}$

若我们能够获取足够多的数据，则可以构造

$\mathbf{s} = \{s_0,s_1,s_{19966},s_{19967}\}$

同时构造

$\mathbf{x}\cdot T = \mathbf{s}$

最后在GF2上求解这个代数方程即可。

参考¶

[1]. 浅析MT19937伪随机数生成算法