Wednesday 18 April 2012

The Wikification of ESTC

ESTC are asking for feedback (here) on a planned redesign.

"Big changes are underway" (the email alert reads) because the ESTC cannot keep up with the growth in submitted records, online digital facsimiles etc. Having received funding for a redesign, they now want the imprimatur of users (and perhaps a bell or whistle), before applying for a second round of funding to implement the redesign.

The planned redesign will enable "users to assist in the bibliographical 'detective work'":

ESTC will welcome users’ contributions in correcting, refining, and expanding the information in both its bibliographic records (for instance, by supplying evidence for a publication date differing from that in the imprint) and holdings data (for example, by linking digitized works to physical copies).

Information about holdings contributed by users will remain separate from that provided by libraries and will be subject to versioning but only until "a prescribed [but undisclosed] number of users match to an existing ESTC record". Then the record will be removed from the provisional “contributed” corpus and added to ESTC proper.

The proposal is that one's curatorial standing as a contributor to the new ESTC will be determined by "the number of edits a user has made"; this standing will, like the Google search algorithm, assign a "confidence rating" to each user. So, for example, three confirmations of a publication date by editors with a level three or higher rating could result in a confidence rating of five for an entry.
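To make the mechanics concrete, here is a minimal sketch of how such a scheme might be computed. It is only an illustration: the proposal does not publish its thresholds, so EDITS_PER_LEVEL, MAX_LEVEL, the class and function names, and the fallback value of 0 are all assumptions of mine. Only the "three confirmations", "level three or higher" and "rating of five" figures come from the example above.

```python
from dataclasses import dataclass
from typing import List

# Assumed values: the proposal does not say how edits map to rating levels.
EDITS_PER_LEVEL = 50
MAX_LEVEL = 5

# Taken from the worked example in the proposal.
MIN_CONFIRMER_LEVEL = 3
CONFIRMATIONS_NEEDED = 3
CONFIRMED_ENTRY_RATING = 5


@dataclass
class User:
    name: str
    edit_count: int

    @property
    def rating(self) -> int:
        # Standing is derived from the number of edits alone.
        return min(MAX_LEVEL, 1 + self.edit_count // EDITS_PER_LEVEL)


def entry_confidence(confirmers: List[User]) -> int:
    """Return the confidence rating for an entry given the users who confirmed it."""
    qualified = [u for u in confirmers if u.rating >= MIN_CONFIRMER_LEVEL]
    if len(qualified) >= CONFIRMATIONS_NEEDED:
        return CONFIRMED_ENTRY_RATING
    return 0  # assumed: no rating until enough qualified confirmations arrive


busy_users = [User("a", 100), User("b", 120), User("c", 400)]
print(entry_confidence(busy_users))
```

Nothing in this calculation asks whether any of the edits were correct; only their number counts.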

* * * * *

I do not know how many survey responses ESTC have received so far. I hope it is a lot. But there are only nine replies on the ESTC21 "blog", so I suspect not. (NB: it isn't a blog, it is a website. A blog is a "personal journal … consisting of discrete entries … displayed in reverse chronological order". But I digress.)

The nine replies on the ESTC21 "blog" are mostly critical and a number oppose the changes altogether. As do I—I think it is a deeply flawed plan, as I explained in my survey response this morning. But it is pretty clear from the information provided on the site that there is no chance that those who oppose the wikification of ESTC will prevail.

The main problem with the existing plan is pretty simple: there is no quality control. Sheer weight of numbers determines one's curatorial standing and will allow "provisional" records to be incorporated into the core ESTC. It is a system which allows—indeed it will encourage—enthusiastic but ill-informed users to corrupt the existing ESTC database.

The end result will not resemble Wikipedia (with its 31.7 million registered user accounts) generating a passable hive-mind consensus, but that bibliographical cesspit, OCLC WorldCat (with only a score of authority contributors, and tens of thousands of contributing libraries).

* * * * *

In my Bibliography of Eliza Haywood I politely explained the hierarchy I applied to books whose location I recorded:

Copies listed in major union catalogues, library catalogues, bibliographies and works of reference are grouped according to their source, since these sources differ considerably in reliability. These appear in order of importance to this bibliography. The three most important sources used are ESTC, the National Union Catalog and OCLC WorldCat.

With NUC, "matching" of records was undertaken, on a massive scale, by grad students, not trained cataloguers or librarians. Locations of "identical" works were recorded on a single card, and usually only that one card appears in NUC. Not surprisingly, the grad students made mistakes (we all do), but they also could not and did not distinguish issues and editions, and they confused microfilm and facsimile reprints with original editions, etc. It is a dog's breakfast. WorldCat is a lot, lot worse than NUC.

Sue Waterman noted in a 2007 discussion on the SHARL-List "WorldCat lacks about 28% of the entries in the NUC pre-1956 Imprints"—and it contains a smaller percentage of pre-1800 imprints. But the real issue, as Richard Noble explained, is …

The database itself consists of an accumulation of records of varying quality, from very high to simply wretched. In the case of older materials, these consist in large part of records "converted" from paper files of information all too often so minimal as to preclude the identification of edition or issue without recourse to the artifact.

In theory, once a record has been input by one institution, other institutions simply add their holding symbols--a process which in many cases involves a good deal of more or less educated guesswork, under economic conditions that discourage the asking or answering of questions.

The records against which this matching is done are rife with duplicates based on false distinctions or legitimate doubt concerning the entity represented by the record, as well as conflations (some resulting from crude matching protocols) and incorrect holdings statements based on bad matches. The hardest cases, obviously, bring out the worst.


* * * * *

ESTC has avoided the sort of problems that beset NUC and OCLC WorldCat—and it was the most reliable union catalogue I used—precisely because ESTC staff were trained to distinguish editions and issues and to vet copy matching.

In many ways the worst result of the Wikification of ESTC will be in the area of attributions. The plan suggests users will be asked to provide the "names of authors suggested by other sources."

ESTC users can look forward to having all of the attributions that have ever been proposed, no matter how ridiculous/idiotic/mistaken/accidental, being added to the author fields of each entry. Furbank and Owens will have wasted their time with their Defoe De-Attributions (1994) because half of the publications from the entire eighteenth century will be re-attributed to Defoe, or have his name added to the author field by well-meaning but ill-informed contributors. Ditto Eliza Haywood.

And I have to note: the forty-five works mis-attributed to Haywood, which I list in my Bibliography, were all "suggested by sources", many of them by multiple sources. The number of contributions made by a single contributor is no guide to how reliable that contributor is, and the number of those contributions confirmed by other busy users is no guide either. In the same way, the number of sources that suggest an author for a work is not a guide to the authorship of that work.

Allowing users to add any and all attributions rests on the mistaken assumption that the truth will out, because those who are well-informed will be sufficiently numerous or motivated to repeatedly edit away all the ill-informed attributions. That is a fantasy. As in Wikipedia, editors for entries and fields will be needed. And in contentious cases, particularly those that have a high profile in the public arena, the entries will have to be locked down as they are in Wikipedia. And if there are administrators, there will be "administrator abuse", which is considered to be the major reason for the decline in Wikipedia editor numbers since 2006 (a striking reversal from its exponential growth between 2001–5).

And here is the essential difference between Wikipedia and an ESTC-wiki: Wikipedia works (mostly) because of its vast user-base. ESTC does not have anything like a sufficient number of users for this system to work. It can only muster nine comments on its "blog"! With such small numbers, edits will go unchallenged, commonplace errors will be "confirmed" and ESTC will lose the authority it has enjoyed for thirty years. If editorial control is relaxed to the extent planned, it will result in ESTC becoming a kind of grease trap, like WorldCat, where bibliographical refuse accumulates.
