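Guillaume Filion of the Centre for Genomic Regulation and Lucas Carey of Pompeu Fabra University downloaded PubMed records for papers published between January 2012 and April 2014. Using natural language processing, they combed through the abstracts of nearly 2 million papers, hoping to identify the emerging hot topics of 2014.
Among otherwise unremarkable results, one finding surprised them: a database that had rarely been mentioned before began appearing about once a week starting in February 2014. The database is CISCOM (Centralised Information Service for Complementary Medicine), a little-known resource run by the Research Council for Complementary Medicine in London.
Filion and Carey went on to find 32 papers on differing topics that shared odd characteristics: all were meta-analyses or reviews of already-published data drawn from CISCOM together with more commonly used databases such as Google Scholar, PubMed, and Web of Science. All of them came from China, written by 28 different research groups spread across several cities.
Yet the discussion sections of these papers all contain similar statements, with only minor changes. For example, one paper reads, “Importantly, the inclusion criteria of cases and controls were not well defined in all included studies and thus might have influenced our results.” Another states, “Importantly, the inclusion criteria of cases and controls were not well defined in all included studies, which might also have influenced our results.”
Four further papers share the same grammatical error, the superfluous “had” in “our results had lacked sufficient statistical power.” Filion and Carey found that the papers appeared to draw on multiple templates, which suggests that the writers actively shuffled passages of text. This is a technique for evading plagiarism detection software, known as text laundering by analogy with money laundering.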
Copycat papers flag continuing headache in China
SHANGHAI, CHINA—Two computational biologists searching for trends in journals indexed in the search engine PubMed stumbled across signs that China’s paper-selling companies remain active, 1 year after Science published a detailed undercover investigation describing a highly sophisticated and lucrative industry.
Guillaume Filion of the Centre for Genomic Regulation and Lucas Carey from Pompeu Fabra University, both in Barcelona, downloaded all PubMed records for papers published between January 2012 and this past April. Combing over the abstracts for those 2 million papers using a big data technique called natural language processing, they isolated terms that spiked in use in 2014.
They hoped to find “new topics about to detonate,” Filion says. Not surprisingly, they found an uptick in papers mentioning cutting-edge topics like CRISPR, a gene-editing technique that was named a runner-up for Science’s 2013 Breakthrough of the Year, and lncRNA, or long non-coding RNA, an unusually long form of RNA that is now a hot topic in genomics.
But alongside those more predictable trends, one term stuck out: a little-known database run by the Research Council for Complementary Medicine in London called CISCOM, or the Centralised Information Service for Complementary Medicine. Until 2013, the scholars note, the term “CISCOM” appeared in only two to three papers per year. In February, the database began cropping up once a week.
Looking more closely, Filion and Carey found a group of 32 papers on varying topics that nonetheless shared some curious characteristics. All were meta-analysis or review papers that analyzed already-published data in CISCOM, along with more commonly used databases like Google Scholar, PubMed, and Web of Science. Moreover, all originated in China, from 28 different research groups spread out across several cities.
Filion, who described what he calls the “disturbingly similar” papers in a blog post published on 4 October, set out with Carey to determine what was going on. They downloaded complete versions of the 25 papers to which they had access through various institutional subscriptions or other means. (All but two papers are behind a paywall.) Running the papers through the plagiarism detection program iThenticate turned up no red flags.
But the discussion sections of all the papers contain similar statements, with only minor changes. For example, one paper reads, “Importantly, the inclusion criteria of cases and controls were not well defined in all included studies and thus might have influenced our results.” Another states, “Importantly, the inclusion criteria of cases and controls were not well defined in all included studies, which might also have influenced our results.”
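To make the degree of overlap concrete, here is a minimal sketch that scores the two quoted sentences against each other with Python’s standard-library difflib. It is only an illustration: the helper name word_similarity is hypothetical, and nothing here is meant to represent how Filion and Carey compared the papers or how iThenticate works internally.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio computed over lowercase word tokens.

    Hypothetical helper for illustration; not the method used in the
    analysis described in the article.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# The two sentences quoted in the article.
first = (
    "Importantly, the inclusion criteria of cases and controls were not well "
    "defined in all included studies and thus might have influenced our results."
)
second = (
    "Importantly, the inclusion criteria of cases and controls were not well "
    "defined in all included studies, which might also have influenced our results."
)

score = word_similarity(first, second)
print(f"word-level similarity: {score:.2f}")
```

A score close to 1 means the sentences share most of their words in the same order. Tokens are split on whitespace only, so a word followed by a comma counts as different from the bare word.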
Four of the papers include the same grammatical error: the extraneous “had” in “our results had lacked sufficient statistical power.” But in mapping out the relationships among the papers, the duo noticed that the writers seemed to be drawing from multiple templates. That suggests, Filion says, “that the writers actively shuffle the texts,” a method of evading plagiarism detection software known as text laundering.
Most of the papers were submitted in late 2013, making it impossible that some authors plagiarized others after publication. Filion and Carey thus hypothesized that the papers might all be the work of a single company. With help from Yao Yu, a geneticist at Fudan University in Shanghai, the scholars identified an outfit whose website advertises tailored meta-analysis papers and contacted the company to inquire about its services. The company reportedly offers meta-analysis papers for journals with an impact factor of 2 or 3 for about $10,000.
A 5-month investigation published in Science last year found dozens of similar companies offering an array of services aimed at securing publication in journals indexed in Thomson Reuters’ Science Citation Index, Thomson Reuters’ Social Sciences Citation Index, or Elsevier’s Engineering Index—which at many Chinese institutions are critical to securing promotions. In addition to preparing original papers from scratch with data provided by their clients, China’s paper-selling companies fabricate data, arrange to add scientists’ names to already accepted papers, and sell finished manuscripts.
Among the most popular options for finished manuscripts are meta-analyses, perhaps because they require no original data. One legitimate analysis published in PLOS ONE in June 2013 found that from 2003 to 2011, meta-analysis papers from China rose more than 16 times faster than did such papers from the United States. Combing PubMed for other trends might turn up more evidence of malfeasance. But Filion says he and Carey now plan to turn their attention to other topics: “We are not witch-hunters, we are big data analysts.”