Keywords: QC, boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues
1. Boundary erosion caused by complex models
Strict abstraction boundaries help express the invariants and logical consistency of the information flowing into and out of a given component. Note that ML is required in exactly those cases where the desired behavior cannot be effectively expressed in software logic without a dependency on external data. We can hardly assume any independence of the objects we extract from the real world (after all, it is a black-box system). (In summary: the effects produced by boosting, the effects produced by bagging, and the effects of ensembles with possibly undefined behavior.)
1.1 Entanglement
We can hardly find a set of signals in which any one signal is orthogonal to the closed span generated by the other signals. This phenomenon is described by the CACE principle (Changing Anything Changes Everything).
One solution is to isolate models and serve ensembles (question: how should we define the isolation of a model?). This works well when the subproblems decompose naturally, such as in disjoint multi-class settings (recall the structure of LDA). Ensembles work well when the errors of the component models are uncorrelated. However, an ensemble can make predictions worse when improving one component model increases the correlation within the system of component models.
The second solution is to focus on detecting changes in prediction behavior.
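As a minimal sketch of this second mitigation (the reference data, tolerance value, and function names below are hypothetical, not from the paper), one can compare the prediction distribution of a candidate model against the current model on a fixed reference set and alert when the shift exceeds a tolerance:

```python
import numpy as np

def prediction_shift(old_scores, new_scores, n_bins=20):
    """Total-variation distance between two score histograms computed on the same reference set."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    p_old, _ = np.histogram(old_scores, bins=bins)
    p_new, _ = np.histogram(new_scores, bins=bins)
    p_old = p_old / p_old.sum()
    p_new = p_new / p_new.sum()
    return 0.5 * np.abs(p_old - p_new).sum()

# Hypothetical usage: scores of the current and candidate model on a frozen reference set.
rng = np.random.default_rng(0)
old = rng.beta(2, 5, size=10_000)                            # current model's scores
new = np.clip(old + rng.normal(0, 0.02, old.shape), 0, 1)    # candidate model's scores

shift = prediction_shift(old, new)
if shift > 0.05:                                             # tolerance is an arbitrary example value
    print(f"ALERT: prediction behavior changed (TV distance = {shift:.3f})")
else:
    print(f"OK: TV distance = {shift:.3f}")
```

Any distance between score distributions would serve here; total variation over histogram bins is just a simple choice.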
1.2 Correction cascades
There are often situations in which a model $m_{A}$ for problem A exists, and we are tempted to use $m_{A}$ to solve a slightly different problem $A^{\prime}$. (The vector representation of random variables can be justified via isomorphism; here we are effectively modeling only the residual from A to $A^{\prime}$.) The problem is that we create a new system dependency on the model $m_{A}$, which makes it more difficult to evaluate improvements to $m_{A}$ (if A is close to B and C is close to A, changing C so that it moves closer to A does not guarantee that C stays close to B). It is therefore possible that improving an individual model in the system leads to a system-level detriment. The mitigation is to augment $m_{A}$ to learn the corrections directly within the same model by adding features to distinguish among the cases, or to accept the cost of creating a separate model for $A^{\prime}$.
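The cascade versus the suggested mitigation can be sketched as follows (a toy illustration with synthetic data and scikit-learn, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y_a = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(0, 0.1, size=1000)   # problem A
y_ap = y_a + 0.3 * X[:, 0]                                             # slightly different problem A'

# Existing model for problem A.
m_a = LinearRegression().fit(X, y_a)

# Correction cascade: learn A' from m_A's predictions plus the raw features,
# which creates a new system dependency on m_A.
m_corr = LinearRegression().fit(np.column_stack([m_a.predict(X), X]), y_ap)

# Mitigation sketched above: a single augmented model trained on both problems,
# with an indicator feature distinguishing the A and A' cases.
X_aug = np.vstack([np.column_stack([X, np.zeros(1000)]),   # case A
                   np.column_stack([X, np.ones(1000)])])   # case A'
y_aug = np.concatenate([y_a, y_ap])
m_aug = LinearRegression().fit(X_aug, y_aug)
```

The cascaded pair improves or degrades together with $m_A$, whereas the augmented model can be evaluated and retrained on its own.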
1.3 Undeclared Consumers
The two scenarios above may be caused by undeclared consumers, and these consumers can be a source of hidden technical debt (entanglement and correction cascades). The only way to prevent these detrimental events is to set up strict service-level agreements.
2. Data dependencies
2.1 Unstable data dependency
For example, adding a new category to a one-hot encoding destabilizes the data input. One common strategy is to create a versioned copy of a given signal: freeze the mapping and keep it frozen until the quality of the next version is fully under control. Note that version control has its own cost.
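A minimal sketch of such a versioned, frozen signal (the vocabulary and version names are made up for illustration):

```python
# Frozen, versioned one-hot vocabulary: new categories do not silently
# change the encoding that downstream models were trained on.
VOCAB_V1 = {"cat": 0, "dog": 1, "bird": 2}   # frozen together with the model that consumes it
UNKNOWN = len(VOCAB_V1)                      # reserved bucket for categories unseen in version 1

def one_hot_v1(category: str) -> list[int]:
    vec = [0] * (len(VOCAB_V1) + 1)
    vec[VOCAB_V1.get(category, UNKNOWN)] = 1
    return vec

print(one_hot_v1("dog"))     # [0, 1, 0, 0]
print(one_hot_v1("lizard"))  # [0, 0, 0, 1] -- new category, encoding stays stable
# A VOCAB_V2 would only replace VOCAB_V1 once the consuming model is re-validated against it.
```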
2.2 Underutilized data dependency
Underutilized dependencies are signals that add little incremental value to the prediction task. Typical examples of such data include:
a) Legacy features: a feature $F$ is included in a model early in development. Later we add a new feature, for example $f(F) = F + 1$, which makes the feature $F$ redundant.
b) Bundled features: a group of features is evaluated together and found to be beneficial; however, some of the signals in the bundle actually have little or no value. (Question: how do we define "has little value"?)
c) $\epsilon$-Features: poorly set engineering goals lead researchers to pursue models with tiny gains and high complexity.
d) Correlated features (important): a machine learning model can hardly recognize the difference between two correlated features A and B. For example, A may be the more directly causal signal while B merely happens to be more strongly correlated at the moment. The model may then treat B as the better signal, and the residual of the projection on B has only limited projection on A. The result is brittleness if world behavior later changes the correlations. (The author's description here is not conceptually rigorous, but the point stands: such features make the model less robust and vulnerable to even small changes in the features.)
Underutilized dependencies can be detected via exhaustive leave-one-feature-out evaluations, which should be run regularly to identify and remove unnecessary features. Another solution is a feature management system that allows data sources and features to be annotated; this kind of tooling makes migration and deletion much safer in practice.
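A minimal leave-one-feature-out sketch (synthetic data, an arbitrary model, and an arbitrary 0.005 tolerance, all chosen only for illustration): retrain with each feature removed and flag features whose removal barely changes validation performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
# Only features 0 and 1 carry signal; the rest are noise and should be flagged.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=2000) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
base = LogisticRegression().fit(X_tr, y_tr).score(X_va, y_va)

for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    acc = LogisticRegression().fit(X_tr[:, keep], y_tr).score(X_va[:, keep], y_va)
    drop = base - acc
    flag = "  <- candidate for removal" if drop < 0.005 else ""
    print(f"feature {j}: accuracy drop {drop:+.4f}{flag}")
```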
3. Feedback loops
The key feature of a live machine learning system is that researchers can hardly predict its performance before it is released (the behavior of an online machine learning model is difficult to analyze). The reasons include:
a) Direct feedback loops: a model is influenced by its own choices. This can be addressed with a randomized policy (though such policies are costly to analyze); see the sketch below.
b) Hidden feedback loops: two systems influence each other indirectly through the world (for example, market-impact problems, or the game played among multiple prediction models).
Remark: in practice, the performance of a machine learning model is affected by many factors.
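A minimal sketch of the randomized-policy idea from (a) (the epsilon value and item names are arbitrary examples): with small probability, serve a uniformly random choice so that part of the feedback is collected independently of the model's own decisions.

```python
import random

def choose_item(model_ranking, all_items, epsilon=0.05):
    """Serve the model's top choice most of the time, but explore uniformly
    with probability epsilon so that future training data is not entirely
    determined by the model's own past decisions."""
    if random.random() < epsilon:
        return random.choice(all_items), "explore"   # feedback unbiased by the model
    return model_ranking[0], "exploit"               # model-driven feedback

# Hypothetical usage
items = ["a", "b", "c", "d"]
ranking = ["c", "a", "d", "b"]                       # model's preference order
print(choose_item(ranking, items))
```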
4. ML-system anti-patterns
Avoid the patterns listed below.
a) Glue code: using generic packages often results in a glue-code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages. Create a clean native solution rather than re-use a generic package; an important strategy is to wrap black-box packages behind common APIs (a sketch appears after this list). (Resist over-engineering, starting with ourselves.)
b) Pipeline jungles: do not write too many data pipelines. A major problem with data pipelines is that faults are hard to localize.
c) Dead experimental codepaths: these are a consequence of too many generic packages and data pipelines; once a pipeline changes, they can cause system-wide failures. A necessary preventive measure is strict code testing.
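A sketch of the "wrap black-box packages behind common APIs" strategy (the `Predictor` interface and adapter classes below are illustrative names, not from the paper): the rest of the system depends only on one thin interface, so swapping the underlying package does not spread glue code.

```python
from typing import Protocol, Sequence

class Predictor(Protocol):
    """The one interface the rest of the system is allowed to depend on."""
    def predict(self, features: Sequence[float]) -> float: ...

class SklearnPredictor:
    """Adapter hiding a scikit-learn model behind the common API."""
    def __init__(self, model):
        self._model = model
    def predict(self, features: Sequence[float]) -> float:
        return float(self._model.predict([list(features)])[0])

class ConstantPredictor:
    """Trivial implementation, useful for tests and fallbacks."""
    def __init__(self, value: float):
        self._value = value
    def predict(self, features: Sequence[float]) -> float:
        return self._value

def serve(p: Predictor, features: Sequence[float]) -> float:
    # Caller code is identical regardless of which package sits behind `p`.
    return p.predict(features)

print(serve(ConstantPredictor(0.5), [1.0, 2.0]))
```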
5. Common smells
a) Plain-old-data type smell: rich information is all too often encoded with plain data types such as raw floats and integers.
b) Multiple-language smell: writing parts of a system in different languages makes it hard to transfer ownership to other individuals.
c) Prototype smell: testing only on small samples can hide a lack of robustness at full scale, and continuing to rely on small-sample data incurs extra cost when the system goes to production.
6. Dealing with maintenance cost in the External World
a) Fixed thresholds in dynamic systems: a decision threshold manually picked for a given model to perform some action becomes stale as the system evolves; instead, learn the threshold via simple evaluation on held-out validation data (a sketch appears after this list).
b) Testing issues: end-to-end tests can hardly verify whether a live system is working as intended. Useful monitoring signals include:
Prediction bias: the distribution of predicted labels should match the distribution of observed labels; slicing along a dimension $z$ relies on the law of total expectation, $\mathbb{E}(x) = \int_{\Omega_{z}}\mathbb{E}(x \mid z)\, dP(z)$.
Action limits: once a specified action is triggered, a manual check should fire automatically.
Upstream producers: monitor the behavior of upstream apps to identify problems in the underlying logic; these upstream checks should be propagated down layer by layer.
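Returning to (a), a minimal sketch of learning a decision threshold from held-out validation data rather than hand-fixing it (the precision target and data are hypothetical; the routine simply picks the lowest threshold meeting the target, so it can be re-run whenever the model is updated):

```python
import numpy as np

def learn_threshold(val_scores, val_labels, target_precision=0.9):
    """Return the lowest threshold whose precision on held-out data meets the target."""
    for t in np.sort(np.unique(val_scores)):
        preds = val_scores >= t
        if preds.sum() == 0:
            break
        precision = val_labels[preds].mean()
        if precision >= target_precision:
            return float(t)
    return 1.0  # no threshold meets the target; take no action

# Hypothetical held-out validation scores and binary labels.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=5000)
scores = np.clip(0.6 * labels + rng.normal(0.2, 0.2, size=5000), 0, 1)
print(learn_threshold(scores, labels, target_precision=0.9))
```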