Dissecting the TFs+500 dataset used in gene regulatory network (GRN) inference

For GRN tasks, the TFs+500 and TFs+1000 benchmark datasets are hard to avoid. Look closely at the name, though, and the dataset turns out to be rather puzzling.

As the figure below shows, the gene column has neither 500 entries nor a count that obviously matches TFs+500.

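To resolve this, I had to go back to the paper. Unfortunately, the paper covers this part only briefly. The code and datasets are public, however, and I found the relevant information in a GitHub issue:
where is the tfs+500 and tfs+1000 datasets?: https://github.com/Murali-group/Beeline/issues/98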

Following the reply there, I downloaded and ran the relevant code.
code: https://github.com/Murali-group/Beeline
dataset: https://doi.org/10.5281/zenodo.3378975

Overall architecture

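According to the description in the code, the TFs+500 mHSC-E dataset is generated as follows (the values in parentheses give the number of genes at each step):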

The input consists of four parts:

gene expression data
known TFs
ground truth GRN
gene ordering: roughly indicates how useful each gene is.

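You may have noticed that the union of the TFs and the 500 genes comes to exactly 500 + 204 = 704 genes, with no gene counted twice. That is because the code works like this: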

variable_genes = []
if include_tfs:
    ...
    variable_tfs=...
    # The TFs are removed from gene_df here
    gene_df.drop(labels = variable_tfs, axis='index', inplace = True)
...
if num_genes > 0:
    gene_df.sort_values(by=var_col, inplace=True, ascending = False)
    # Take the top num_genes (500) genes from gene_df, which no longer contains the TFs
    variable_genes_new = gene_df.iloc[:num_genes].index.values
    variable_genes = set(variable_genes_new) | set(variable_tfs)
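
To check the counting argument independently of the Beeline repository, here is a minimal self-contained sketch on synthetic data. It is not the original script: the helper `select_genes`, the column name `Variance`, the gene names, and the choice of which genes count as TFs are all assumptions made for illustration, and the steps elided with `...` in the excerpt above are not reproduced. It only shows that dropping the TFs before taking the top 500 makes the two sets disjoint, so the union has exactly 204 + 500 genes.

```python
import numpy as np
import pandas as pd


def select_genes(gene_df, known_tfs, num_genes, var_col="Variance"):
    """Toy version of the drop-then-rank logic (hypothetical helper)."""
    # TFs that actually appear in the gene ordering table
    variable_tfs = gene_df.index.intersection(known_tfs)
    # Remove the TFs first, so they cannot be picked again below
    remaining = gene_df.drop(labels=variable_tfs, axis="index")
    # Rank the remaining genes and keep the top num_genes
    remaining = remaining.sort_values(by=var_col, ascending=False)
    top_genes = remaining.iloc[:num_genes].index
    return set(variable_tfs), set(top_genes)


# Synthetic gene ordering table: 2000 genes with random scores (assumed data)
rng = np.random.default_rng(0)
names = [f"g{i}" for i in range(2000)]
gene_df = pd.DataFrame({"Variance": rng.random(2000)}, index=names)

# Pretend the first 204 genes are the known TFs (assumption for the demo)
known_tfs = names[:204]

tfs, top = select_genes(gene_df, known_tfs, num_genes=500)
selected = tfs | top
print(len(tfs), len(top), len(selected))  # 204 500 704
```

If the TFs were not dropped before ranking, any TF that also ranked in the top 500 would be counted only once in the union, and the total would fall below 704.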