Wikipedia:Wikipedia Signpost/2018-10-28/Special report 2

Now Wikidata is six: SPARQL adds sparkle to WMF projects.

Charles Matthews began regular Wikipedia editing in 2003. He is currently Wikimedian in Residence at ContentMine, in Cambridge, working on the ScienceSource project. He has created over 12,000 articles on the English Wikipedia where he has made over 300,000 edits from a global count of almost one million edits to Wikimedia projects.

Photograph of Magnus Manske in October 2015, accompanied by two women, holding up a large framed award
When Wikidata was three: Magnus Manske (centre) with an award for Wikidata


Wikidata's sixth birthday falls on 29 October, and will be celebrated by 34 Wikimedia events worldwide. Last year there was WikidataCon in Berlin. This time around, the cakes will be distributed far and wide.

Some people, I suppose, will still not buy into the acclaim. Here's a personal story.

I get started on Wikidata

My earliest Wikidata edits had slipped my mind. It turns out that what I added initially was the first actual statement to the item on Winston Churchill. When that item had only sitelinks to Wikipedias, I linked it also to Churchill's father Randolph, in February 2013. It was a few days after I had set up a Cambridge meetup, which was probably the reason why I thought I should take a look.

The mix'n'match tool, one of Wikidata's huge successes, was written by Magnus Manske in 2013, though in nothing like today's form. After I made a feature request for a Wikisource tool at a meetup, he replied "I have a better idea", which is now undeniable. In any case, the initial datasets on the tool were catalog 1, from the Oxford Dictionary of National Biography (ODNB); and catalog 2, now the Art UK person ID, that started off under the older name "BBC Your Paintings".

It was not, however, until after Wikimania 2014 in London that I was really drawn into Wikidata editing. That was by a problem to solve, namely how many ODNB biographies were of BBC Your Paintings artists. These days I take it for granted that I can write an easy SPARQL query and bring up an answer: as of this writing, Wikidata knows about 2088 of these matches. At the time, the two British cultural institutions were interested in this question, and were going at it by traditional methods. I also had at the back of my mind another problem: how many ODNB women were missing from the English Wikipedia? Carbon Caryatid had asked me that.
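A counting query of that kind can be sketched as follows. This is a minimal illustration rather than the query actually used: it assumes that P1415 is the ODNB identifier property and P1367 the Art UK artist identifier property (both worth checking on Wikidata), and that the public endpoint is at query.wikidata.org. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed property IDs; verify them on Wikidata before relying on them.
ODNB_ID = "P1415"    # Oxford Dictionary of National Biography ID
ART_UK_ID = "P1367"  # Art UK artist ID

ENDPOINT = "https://query.wikidata.org/sparql"


def count_items_with_both(prop_a: str, prop_b: str) -> str:
    """Return a SPARQL query counting items that carry both properties."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?matches) WHERE {\n"
        f"  ?item wdt:{prop_a} ?a ;\n"
        f"        wdt:{prop_b} ?b .\n"
        "}"
    )


def request_url(query: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = count_items_with_both(ODNB_ID, ART_UK_ID)
print(query)
print(request_url(query))
```

The query text can equally be pasted into the query service's web form; the URL helper only shows how the same text would be sent programmatically.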

So, as Wikidata turned two, I started to put time into the particular biographical area that was being opened up. By 2018 standards this was still pioneer stuff. There was no SPARQL yet, though there was Magnus's substitute, the Wikidata Query tool (WDQ). Not only that, but matching assumes you can solve the disambiguation question for people, and in the worst cases that can be really hard. When a common name, say "William Smith", came up, you were faced with a list that could run to several pages of hits on the exact name. Typically most of those items were undescribed, and very sparse when it came to biographical facts.

Becoming serious

Portrait photograph of Andrew Gray in 2014
Andrew Gray in 2014

As an extension of what I had been doing on Wikisource with the first edition and English Wikipedia, matching the ODNB on Wikidata was a natural project. Andrew Gray did early spadework there, with Magnus providing the tech support, and we three began a long series of emails, wondering about getting other biographical datasets into mix'n'match. Andrew's pair of Wikidata blogposts from 2014 give the flavour.

Much heavy lifting was going on in early 2015. At that time the mix'n'match tool in gamified mode was my usual approach; improved automatching has since taken some of the fun, and most of the low-hanging fruit, from that mode of using it. Importantly, as the Wikidata community relaxed its view on notability, non-matched catalog entries were no longer just parked for later consideration: either they were created as items at once, or they were marked "N/A" and left on the back-burner.

So arose the project of creating items for all ODNB entries. In other words, all ODNB topics, around 60,000 of them, which include academic arcana such as "Women in trade and industry in York (act. c. 1300 – c. 1500)" and its 14 examples, would be considered Wikidata-notable. This liberal interpretation of notability that came on stream in 2015 had some odd effects, when minor figures from other catalogs got items, but now that Wikidata is at the scale of 50 million items one hardly sees it as causing genuine problems. Typical databases in the cultural sector, for example the British Museum's, contain many sparse entries and, worse, often entries that one cannot match because no adequate identification is provided. Wikidatans do delete such things.

In mid-2015, I pushed through the final stages of ODNB matching into Wikidata and, with Magnus, I helped select which further Wikidata items for Art UK artists should be created. The very interesting BBC television series Britain's Lost Masterpieces makes Art UK's work vividly accessible, as the presenter Bendor Grosvenor pokes around in gallery storage looking for "old masters". Wikidata allows easier access to minor artists, with less dust and cobwebs, since over 22,000 Art UK artists, some 60% of the total, are represented. These can often be identified and merged with other items, though the scholarly challenges in a Wikidata merge can be quite serious (and instructive) just because the notability standards are quite relaxed. Consequently, I have been in a number of meetings with Art UK, explaining Wikidata.

To answer the "missing women from ODNB" question, there was the matter of filling in the "sex or gender" field, and then writing some standard SPARQL. I could see that the answer was about 2,000. The advance from the number to an actual redlink list, created by the ListeriaBot, was still major.
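The shape of such a "missing women" query can be sketched like this. It is an illustration under stated assumptions, not the query actually run: P21 ("sex or gender") and Q6581072 ("female") are the standard Wikidata identifiers, while P1415 as the ODNB identifier property is an assumption to check, and the function name is hypothetical. The query keeps items that have the catalog identifier and are marked female, then excludes any that already have an English Wikipedia sitelink.

```python
FEMALE = "Q6581072"  # Wikidata item for "female"


def missing_women_query(catalog_prop: str,
                        wiki: str = "https://en.wikipedia.org/",
                        limit: int = 100) -> str:
    """Return SPARQL listing women in a catalog with no article on `wiki`."""
    return f"""SELECT ?item ?itemLabel WHERE {{
  ?item wdt:{catalog_prop} ?id ;
        wdt:P21 wd:{FEMALE} .
  # Drop every item that already has a sitelink to the chosen wiki.
  FILTER NOT EXISTS {{
    ?article schema:about ?item ;
             schema:isPartOf <{wiki}> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}"""


# P1415 is assumed to be the ODNB ID property; verify before use.
query = missing_women_query("P1415")
print(query)
```

Replacing the SELECT clause with a COUNT gives the bare number; the list form is what a redlink page would be built from.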

More discovery

Chart of pages on the English Wikisource as of 30 April 2013 depicting two lines being graphed: one of the "naked" pages not backed by scans, while blue shows pages backed by scans
English Wikisource, growth of proof-reading directly against scans in the early 2010s

I came to the ODNB through the Dictionary of National Biography (DNB), specifically the Victorian version edited initially by Leslie Stephen, and I found the DNB on Wikisource. Back in 2010, at the Annual General Meeting of Wikimedia UK, I gave a talk based on about a year of proof-reading the DNB. There was and still is a discoverability issue with Wikisource, where the French version alone has a million texts. How does one locate texts on a given topic? The category system is not really designed for that, and in any case is used inconsistently by the various languages.

It turned out in 2015 that Wikidata potentially could solve this problem. Any Wikisource text, however short, warrants its own Wikidata item about the text itself. That item can have a "main subject" statement highlighting the topic. Run queries on such statements and you have a language-independent mechanism for finding material you want on Wikisource. Bots started posting Wikidata items for all the articles in big works such as the DNB (around 30,000 of them), and in doing so created outline entries for them in Wikidata.
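As a rough sketch of that mechanism, the snippet below builds a query for text items whose "main subject" (P921) is a given item, together with their pages on any Wikisource. It assumes the query service exposes sitelinks through schema:about and schema:isPartOf and labels Wikisource sites with the wikibase:wikiGroup value "wikisource"; the function name is hypothetical, and Q8016 (Winston Churchill) is used only as an example subject.

```python
def texts_about(subject_qid: str, limit: int = 200) -> str:
    """Return SPARQL for Wikisource pages whose item has this main subject."""
    return f"""SELECT ?text ?page WHERE {{
  ?text wdt:P921 wd:{subject_qid} .        # main subject of the text
  ?page schema:about ?text ;               # sitelink for the text item
        schema:isPartOf ?site .
  ?site wikibase:wikiGroup "wikisource" .  # any language's Wikisource
}}
LIMIT {limit}"""


query = texts_about("Q8016")  # example subject: Winston Churchill
print(query)
```

Because the filter is on the site group rather than on one language edition, the same query returns texts from every Wikisource that has a matching item.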

I took up this direction in 2016, completing the identification of main subjects for the 30,000 metadata items of the DNB. I remember it being incredibly hard graft, even though the previous round of ODNB work meant that all the subjects of the biographies were already there. A proper matching process got into side issues and required, for good conscience, large amounts of cleanup with plenty of merging of duplicates. Fortunately, merging is easier on Wikidata than on Wikipedia.

In 2017, I was rewarded for investing so much of my time: once I had made a key advance in my SPARQL understanding, I was able to write queries that removed the need for the patrolling I used to do on Wikisource to see which Wikipedia articles covered the DNB topics. In a neat plot twist, it was a tool for this kind of patrolling that comes into my anecdote about the origin of mix'n'match above. I learned the facet of SPARQL I needed from Jheald. With the help of a PetScan query, I can do my patrolling without much effort, and so thank those here who create articles on DNB topics. The articles show up once they have a Wikidata item (caveat here about some needed merging), which frees my time to be better spent working on the backlog of articles that don't.

SPARQL users, I should say parenthetically, form a good and collaborative community in my experience. I use the amazing full text search in mix'n'match most days for biographical research—and it too has functioning communities, the matchers and the uploaders of datasets. The more recent TopicMatcher tool supports main subject work (including "depicts" for pictures), and therefore has the potential to take some of the grind out of the discovery trail. Specialised software is a large factor in the development of Wikidata, not just bots, though they still play a massive role as well.

Official logo for the ContentMine ScienceSource

WikiFactMine and ScienceSource

In April 2017, I started work at ContentMine, an unconventional Cambridge tech startup, as Wikimedian in Residence for the WikiFactMine project supported by the Foundation. There I had T Arrow as a colleague. Over five months, the first half of it based at the Betty and Gordon Moore Library where I held training sessions and blogged, I saw the WikiCite initiative to get control of science bibliography take off as Tom Arrow's fatameh tool was exploited to the full by bot operators on Wikidata. Tom now works as a Wikidata contractor.

Photograph of Tom Arrow standing and speaking at Wikimania 2017
T Arrow at Wikimania 2017

Wikidata's store of items about individual scientific papers shot up, from about half a million, and reached 5 million by August that year. It now tops 18 million, with over 150 million citation statements. In October 2017, I went to WikidataCon in Berlin, on my first Wikimedia scholarship.

There was a further Wikimedia grant to ContentMine for the ScienceSource project that started in June this year, centred on the Wikibase site at the ScienceSource Wiki. Wikibase is to Wikidata as MediaWiki is to Wikipedia: it means essentially the same software as Wikidata, if without some features, but set up as an independent site and community. There is a Wikimedia UK blogpost titled "Science Source seeks to improve reliable referencing on Wikipedia's medical articles" which is about ScienceSource, along with a set of basic introductory videos.

Photograph of the Betty and Gordon Moore Library
The Betty and Gordon Moore Library, on the West Cambridge mathematics campus

The underlying idea of ScienceSource is to be more systematic about searching the biomedical literature for medical facts that can be passed with good references to Wikidata. It applies text mining, as did WikiFactMine before it, but aims to bring it closer into the Wikimedia fold by posting the results to its Wikibase site, where SPARQL can be run over them. It will engage with WP:MEDRS, the major reliable sources guideline that applies to Wikipedia's health information, again by use of SPARQL applied to metadata. Details aside, it is a project that could hardly have been conceived without the development of Wikidata and its supporting tools. Why not on Wikisource, you cry? Here's why:

The social realities

Group photograph of the Cambridge Wikidata Workshop, with cake
Group photo at the Cambridge Wikidata Workshop, 20 October 2018, in Makespace, 16 Mill Lane, Cambridge UK

I've never been seriously involved with infoboxes, either here or from the Wikidata end, yet I have ended up in a project that takes their role for granted: when ScienceSource adds statements to Wikidata, they can appear in infoboxes on Wikipedias in 300 languages.

Last time I wrote for The Signpost about Wikidata, my major theme was the "integration" of Wikimedia sites, facilitated by Wikidata. The infobox mechanism should become infrastructure for that integration and, in the way of such things, ultimately be taken for granted. "Citation reform" here can in principle be carried out as an infrastructural project using the same family of techniques around Lua, though the social realities mean that the frictional forces may be a serious factor for delay.

From my current interests, I would single out the "SPARQL aggregate" as potentially having the same range of benefits for Wikimedia. SPARQL itself may become the first thought for issues of discoverability, because it can cope with disparate inputs as long as their relational structure is clear ("find me authors born in India, writing in English but with a Spanish mother"). What SPARQL aggregates do is to tack onto purely list-making queries any columns that may be computed spreadsheet-style from associated data. It appears to me a rather powerful model.