In what may serve as great news for knowledge-seekers, activists have created a searchable index of over 107 million science articles. A total of 107,233,728 papers have been cataloged into the General Index, which is said to have a size of a whopping 38 terabytes.
The Index is a searchable collection of short sentences and keywords from published articles, which can be used to provide a gateway to scientific knowledge. The compressed version alone has a size of 8.5 terabytes, and can be accessed through archive.org, in a process that, despite being direct, is rather cumbersome.
Keywords and N-Grams to Help You Track Articles
But the world is full of nice people: users on the /r/DataHoarder subreddit have uploaded the data to a remote server, and are also in the process of spreading it across BitTorrent.
However, the General Index doesn’t contain the journal articles in their entirety. Instead, it holds only keywords and n-grams (short phrases that contain a keyword), which make tracking articles easier.
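To make the idea of keyword n-grams concrete, here is a minimal Python sketch of how such phrases could be pulled from a sentence. It is only an illustration and not the General Index's actual extraction pipeline: the function names, the simple tokenizer, and the `max_n` limit are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngrams_containing(text, keyword, max_n=3):
    """Collect all n-grams (n = 1..max_n) that include the keyword as a word.

    max_n is an illustrative choice, not a value taken from the General Index.
    """
    tokens = tokenize(text)
    keyword = keyword.lower()
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if keyword in gram.split():
                found.append(gram)
    return found

sample = "Gene editing improves crop yield"
result = ngrams_containing(sample, "crop", max_n=2)
print(result)  # ['crop', 'improves crop', 'crop yield']
```

An index built this way stores the phrases together with a pointer to the article they came from, so a search for a phrase leads back to the paper without the full text ever being shared.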
Work in Progress
Public.Resource.org founder and General Index co-creator Carl Malamud has said that the version that has been released is still a “work in progress.” He has highlighted how the process was not an easy one, with text extraction failing sometimes, and metadata not being available at other times. This, he says, is the reason why the “corpus,” despite being large, is still not up to date or complete.
Nevertheless, the General Index represents, at least to Malamud, a “lookup tool, a dictionary of knowledge, a map to knowledge,” something he believes is necessary to the modern practice of science. He further adds that his team views the database as a public utility, and is not keen to assert any type of ownership of the tool.
Trying to Avoid the Law
At the same time though, publicly sharing scientific articles that exist behind paywalls was, and still is, illegal. Take for example Sci-Hub, which has been facing the ire of world governments for years. Malamud is hoping that the General Index will be able to enter the public domain thanks to the innovative approach his team has taken to the database. He has, however, landed in trouble for similar efforts before, when the State of Georgia accused him of terrorism and sued him, following his attempts to post the State’s laws online for the world to read.