Friday, January 29, 2010

Who Owns Koha?


In New Zealand, Maori customs are taken seriously. For example, in 2002, the route of a new highway through a swamp had to be altered because it was believed that three taniwha - Karutahi, Waiwai, and Te Iaroa - lived there, and were being disturbed by the road, causing an unusual number of accidents. Taniwha are mythical beings that act as guardian spirits. Many taniwha arrived in New Zealand as guardians of specific ancestral canoes and then took on a protective role over the descendants of the canoe's crew.

Another Maori custom, one that has crossed over into general New Zealand culture, is that of "koha". "Koha" is often translated as "gift", but according to Chris Cormack, one of the original developers of the free and open-source software (FOSS) library system of the same name, a more accurate translation would be "gift with expectations". Cormack got his B.A. in both mathematics and Maori Studies, so he should know. A koha is a gift that is offered with an expectation that it will be reciprocated.

In the U.S. (and, to a lesser extent, in Europe), it's our lawyers that we seem to take seriously. And so when we let people use software that we've developed, it's not enough to offer it as a koha; we have to use a legal license to spell out the terms of the release. The license that has become popular because of the expectations of reciprocity built into it is the GNU General Public License (GPL).

Software released under the GPL cannot be thought of as an unconditional gift by its developer. GPL software is not in the public domain; it is copyrighted. A copyright owner can exert control over the use of the copyrighted material; the GPL uses that power to require licensees to publish any modifications they make if they want to redistribute the modified work. When the Horowhenua Library Trust was choosing a license for Koha (the library system software), they chose the GPL (version 2) because they thought it would prevent Koha from being further developed as non-open software.

Included in the recently announced (but not yet completed) acquisition of LibLime by PTFS were Koha-related assets, including source code copyrights, trademarks and the koha.org website. It may be difficult for the casual observer to understand what value these assets have, especially in light of the GPL license attached to Koha. Could these assets be used to privatize Koha in some way? The short answer is "No", but it gets complicated.

Some FOSS companies use "dual licensing" as a business model. They release their code under GPL, but if another company wants to use the software in a way that would not be allowed under GPL, it has the option to pay for a commercial license. Dual-licensing can only be done by the original copyright owner. Index Data has used this model for years to the great benefit of libraries everywhere. MySQL AB is an example of a company which was extremely successful with this model; it was acquired by Sun Microsystems for about a billion dollars. (See this article for an overview of how GPL licensing fared in Oracle's subsequent acquisition of Sun.)

The GPL makes it very difficult, however, for anyone to change licensing terms after software has been released into the world. That's because of the way copyright law determines the copyright holder of derivative works. In general, if I take a piece of software that you have written, and modify it in such a way that involves creative effort on my part, then I own the copyright to the changes that I've made and the resulting work is a derivative work that both you and I have a copyright interest in. Even though you are the original copyright holder, you would need my permission to release the derivative work under any license other than the GPL.

Unlike Index Data's software, Koha has included significant contributions from many developers, including many who have never worked for LibLime or Katipo Communications (which sold its copyrights to Koha source code to LibLime in 2007). So although LibLime probably owned clear copyright to a majority of Koha at some point, Koha is still a collective work locked into GPL, version 2, and it is unlikely that LibLime or PTFS would be able to distribute Koha under terms other than GPL without doing a thorough rewrite of the software.

Trademarks are a different story. LibLime owns the US trademark for Koha; a European trademark is held by BibLibre. Trademarks are frequently used by open source projects to prevent splintering. The excellent primer on legal issues by the Software Freedom Law Center puts it this way:
FOSS applications develop reputations over time as users come to associate an application’s name with a particular standard of quality or set of features. Trademark law can help protect this relationship of trust and reliance that a project develops with its users; it allows the project to maintain a certain amount of control over the use of its brand.
Since GPL and other FOSS licenses allow anyone to modify and distribute software as long as the license conditions are met, they frequently spawn variants. The owner of a trademark can prevent these variants from using the trademarked name, and thus enforce unity in a project.

In the case of Koha, there are currently two parallel tracks of development being pursued, one inside LibLime, and the other by the community of developers outside LibLime. I will have to postpone a discussion of the issues surrounding these development tracks to yet another article, but for now, let's just assume there will be two main versions of Koha, LibLime Koha and Community Koha. In the US, LibLime could theoretically prevent anyone from using the name "Koha" without its authorization, and could strip Community Koha of the right to use "Koha" in its name. In fact, LibLime and BibLibre threatened to use this power a year ago to regulate PTFS's use of Koha trademarks in the marketing of its Koha support services. LibLime could even apply the Koha name to non-open-source software. Similarly, BibLibre could regulate the use of the Koha name in Europe, preventing LibLime from marketing LibLime Koha there.

Based on discussions I've had with the leaders of almost every open-source library system company, I think it is unlikely that there will be any such "trademark war". Even if the development of Koha continues on separate but related tracks, the success of every Koha-based company is tied to the success of Koha as a whole, and vice versa. It would be advantageous for every stakeholder if the two trademark owners developed some sort of "big tent" system of Koha trademark governance. Assuming PTFS's acquisition of LibLime is completed, such governance will need to be acceptable to both PTFS and BibLibre, and will need to accommodate differing styles of software development.

Until a general agreement on the use of Koha trademarks is reached, Koha stakeholders would be well advised to recognize that collective copyrights tie them into the same canoe and that they should avoid disturbing the taniwha that guards and protects them.

This article is the second part of a series. Part 1 is here. Part 3 is here.

Tuesday, January 26, 2010

Deconstructing the Attributor Book Piracy Study

It was a lot of fun to have a post slashdotted and read by 100 times more people than any of my previous posts. However, the fact that it made fun of the way a study on piracy was "spun" overshadowed some serious points:
  1. The study in question, from Attributor, in fact has some substance beneath the spin.
  2. The number of books people get from libraries is comparable to the number they get from bookstores.
First, the Attributor study. Looking past the silly projections of book industry lost sales, there is some useful information there. In the study, illicit copies of 913 books were found on four one-click download sites which display download counts. The reported download counts for these 913 books totaled 3.2 million over 90 days. On average, each book was downloaded 3,500 times total, or 39 times per day.
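Spelling out the arithmetic behind those averages, using the totals reported in the study:

$$
\frac{3{,}200{,}000 \text{ downloads}}{913 \text{ titles}} \approx 3{,}500 \text{ downloads per title},
\qquad
\frac{3{,}500}{90 \text{ days}} \approx 39 \text{ downloads per title per day}
$$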

These numbers are in rough agreement with a much smaller sample I collected for myself. I looked at only 10 books in a single category on a single site, and observed that download totals ranged from 0 to 5,000, with an average of 578.

I followed up with some questions for Rich Pearson, the General Manager at Attributor. Most interesting to me, and probably of most concern to publishers, was this: of the 913 titles, chosen from Amazon lists to get a broad distribution rather than to focus on popularity, Attributor was able to find illicit copies of 90% of the books it looked for. I had expected that the most popular titles would be easy to find, but this breadth surprised me.

The extrapolations made to translate these download numbers into economic impact, while understandable from a marketing point of view, are lacking in rigor in several ways.

The first assumption made by the study is that the "downloads" reported by download sites represent potential readers of books. Attributor does not have any special relationship with the download sites that would give it access to download statistics; it simply relies on the download counters published by the sites. Assuming that the counts are not simply manufactured (and I've seen download sites that do just this), it's unlikely that the download sites bother to distinguish between robot activity and human activity. Robot activity is important because many of the download sites do not offer search themselves. This is a tactic that allows them to be unaware of, and thus not liable for, the content that they host. The sites depend on other sites linking to them to drive traffic; they monetize the traffic by throttling downloads and offering "premium" subscriptions to people who want to remove the throttles.

The second questionable assumption in the Attributor study is the way the numbers reported by 4 sites were extrapolated to the 25 sites that were monitored. What Attributor did was to weight the sites based on the distribution of the 52,000+ takedown notices it has sent out since launching its monitoring service in July of '09. One problem with this is that while the scope of the takedowns was limited to Attributor customers, the 913 monitored books were limited to non-customers. If Attributor's customers were concentrated in textbooks, for example, this would result in a bias towards sites used to host illicit textbooks. The other problem is "lamp post" bias: the extrapolation is based on what Attributor can find; sites hidden from Attributor would result in undercounting.

Third, the sampling extrapolation to cover the entire industry has significant uncertainty. I worry most about price bias. 1,132 downloads of Freakonomics were reported, which seems insignificant against sales of 2.5 million copies. Freakonomics, ranked #77 at Amazon, can be bought new for $9.35. By contrast, Architect's Drawings, which was downloaded over 10,000 times and ranks #547,491 at Amazon, is a $60 hardcover. (The high download count is likely due to use as a textbook.) Thus the "lost sales" for Architect's Drawings will be hugely overweighted in the extrapolation.

Finally, while the Attributor study does note that the actual effect of downloads on sales is speculative, there is a significant question as to whether the people downloading the book copies could have purchased the books even if they had wanted to. It is evident from the chatter around the book downloads that many of the downloaders speak Spanish, French, Portuguese, Arabic, and Indonesian. According to Pearson, Attributor scans sites in many languages linking to illicit books, including Chinese, Japanese, Korean, French, Spanish, German, Italian, Czech, Polish, Russian, Portuguese and Austrian. The impact of piracy is likely to be very different in export markets. Not that this is shocking news to anyone.

Attributor is currently targeting its service (which starts at around $10,000) at large publishers, though it is working with a reseller called Author Guard to serve smaller publishers. It points to its superior ability to find and take down illicit content as its comparative advantage. (I've previously surveyed other companies in this space; somehow I managed to overlook Attributor.)

Although many publishers will look at the numbers and conclude that services like Attributor's are not worth the expense, I think the real danger from piracy is collective rather than individual, and thus the response should be collective rather than individual. The book publishing industry will only suffer if piracy becomes so widespread that it gains cultural acceptance. To prevent this from happening, book publishers need to lead the culture with both carrot and stick. Legal, almost-free access to books must be provided, as occurs today in libraries, while illicit content should be taken down as efficiently as possible. This means that services such as Attributor's should be deployed on behalf of the entire industry, perhaps through a consortium, and not just by individual publishers. Finally, the book industry needs to figure out how to use ebooks to effectively address the needs of the developing world, or else huge markets will be forever out of reach.

The book publishing industry is entering some scary times and needs to decide who its friends are. I don't know whether technology companies with scary marketing will prove to be reliable friends or not. Amazon and Apple might end up being saviors, but publishers shouldn't expect them to be buddies. I'm pretty sure of one thing, though. Libraries are definitely not the enemy.

Monday, January 25, 2010

8 One-Way Business Models for Linked Data

One of the videos that I was forced to watch many times when my boys were younger was There Goes a Train. It's a pretty good video. I learned many things, including the fact that locomotives don't have steering wheels. Yeah. Pretty obvious if you think about it for even a moment. It's so unfair. Without the switches, the rail network would be pretty useless, but no one will ever make a video entitled There Sits a Switch.

In electronics though, the switch is the star. With a switch, you can modify and route information; with a wire you can only send it from one point to another. You need good switches to make a computer or a network; even though photons are faster and easier to move from one place to another, computers are still based on electrons because electronic switches are so much better than optical switches.

Linked Data is a label for a set of technologies that are trying to make information move around the internet more easily and with more meaning. The Linked Data vision is one where many entities acting cooperatively and globally create a web of data much more powerful and meaningful than any single entity could bring about.

For the Linked Data vision to become a reality, each entity must have a strong motivation to cooperate; each entity must have a viable business model. If the business models were easy, the Linked Data vision would already be a vibrant reality.

Scott Brinker recently launched a round of discussion about seven business models that can make Linked Data viable. Leigh Dodds contributed some important insights in his followup, prompting Brinker to add an eighth model.

Here are Brinker's eight business models for Linked Data (somewhat relabeled based on who's writing checks):
  1. Subsidy. Entities such as governments with a mandate to make information available will pay to have it linked into a global web of information.
  2. Subscription. People will pay for valuable data, and will pay more for data that has been linked to a global web of information.
  3. Advertising. Advertisers will pay to place ads alongside raw data feeds.
  4. Authority. People will pay for the validation and certification of data.
  5. Affiliate marketing. Merchants will pay sales commissions on sales resulting from affiliate links embedded in the global web of data.
  6. Service Enhancement. People will pay for services which have been enhanced by data from a global web.
  7. Search Engine Optimization. Search engines will send you more traffic if you give them more meaningful data.
  8. Brand Enhancement. Your reputation will be burnished if you emit lots of good information.
(I should note that Brinker describes each model a bit differently so that he can add a dimension that characterizes whether data is delivered raw or as an application. I find that this dimension is not at all orthogonal. A data-driven subscription service is a service that makes use of data, but the core business model is not to sell a data subscription.)

There are difficulties with all of these business models, but it strikes me that each of them will only work in one direction, like a train track without switches. Either they work for emitting data, or they work for consuming data, but none of the models work in both directions at the same time. If you're providing a service that's either based on Linked Data or enhanced by it, you can pay for the data, but if you send that data back out, your competitors get the data for free. Conversely, if you're emitting data, it's hard for you to pay for it.

Imagine you're in the book metadata business. You can use several of these models to support creation of book metadata, or you can consume book-related metadata to provide book-related services. But what if you want to support an activity of aggregating book data or fixing errors in book metadata? None of these models will work for you because you'll either be competing with the entities you get data from, or you'll be competing with entities you send data to.
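To be concrete about what that aggregation role involves technically, here is a minimal sketch in Python using the rdflib library: take in book descriptions from two hypothetical sources, link the records that share an identifier, and re-emit the merged description as Linked Data. The URIs, vocabulary, and data below are invented for illustration; they are not any real publisher's or library's feeds.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL

# Invented vocabulary and URIs, for illustration only.
EX = Namespace("http://example.org/vocab/")

# Source A: a publisher-style description of a book.
source_a = Graph()
book_a = URIRef("http://example.org/source-a/book/1")
source_a.add((book_a, EX.title, Literal("An Example Book")))
source_a.add((book_a, EX.isbn13, Literal("9780000000002")))

# Source B: a library-style record for what turns out to be the same book.
source_b = Graph()
book_b = URIRef("http://example.org/source-b/record/42")
source_b.add((book_b, EX.isbn13, Literal("9780000000002")))
source_b.add((book_b, EX.author, Literal("Jane Q. Author")))

# The aggregator (the "switch" role I come back to below): merge the graphs,
# assert that records sharing an ISBN describe the same thing, and re-emit.
merged = source_a + source_b
for isbn in set(merged.objects(None, EX.isbn13)):
    subjects = list(merged.subjects(EX.isbn13, isbn))
    for other in subjects[1:]:
        merged.add((subjects[0], OWL.sameAs, other))

print(merged.serialize(format="turtle"))
```

The interesting part is the last step: the value the aggregator adds (the sameAs links and the enriched description) goes right back out the door with the data, which is exactly why none of the eight models above pays for it.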

What's missing from this list is a business model for the Linked Data switch. Entities that take in Linked Data, improve it or otherwise add value and reemit it as Linked Data have no solid business model to run on. Everyone active so far in the Linked Data business is either a data sink or a data source. To realize the full potential of Linked Data, there need to be viable switches, both collecting and emitting Linked Data.

Thursday, January 21, 2010

PTFS to Acquire LibLime and Move to Library Systems Premier League

Update Feb.12 - the acquisition is not happening
Update Mar. 16- the acquisition closed after all.

In 2009, the New York Yankees had a payroll almost ten times that of the Florida Marlins. The reason that baseball lives with that disparity is that the financial interests of many owners do not align with those of their fans: the owners take in roughly the same amount of money no matter how the team performs.

It's different in the English Football Leagues. Teams which fall to the bottom of the standings in the Premier League are relegated to the second division, the equivalent of baseball's minor leagues. At the same time, the best teams in the second division are promoted to the Premier League, giving them a chance to make much more money. There is a clear alignment between the interests of the fans and the owners.

The library industry has likewise been troubled by misalignment of interests between the owners of the companies and their customers. That's why it's important for libraries to pay close attention to the frequent mergers and acquisitions of the companies that serve them. These transactions are often announced just before an ALA meeting, and this past weekend's ALA Midwinter Meeting was no exception.

The big story of the weekend was the pending acquisition of Koha support vendor LibLime by PTFS (Progressive Technology Federal Systems, Inc.). (The acquisition is still in the due diligence phase and is expected to close in early February; terms were not disclosed.) The surprising part of the announcement was the sudden emergence of PTFS, which has had a very low profile in the library industry, into the top tier of integrated library system vendors.

Here I must digress to discuss a bit about business models in the library industry. Libraries have traditionally viewed their catalog system vendors as long term partners; the migration of data from one system to another is a major project, not lightly taken, and preferably not attempted more than once a decade. The choice of a new system touches almost all the library's processes, and thus involves many consultations and lengthy RFPs.

From the vendor's point of view, the sales process is very expensive. Promises to customize the system to address customer peculiarities are common, and these add to the cost of system maintenance. Once the system has been sold, a proprietary system vendor has a guarantee of continuing profits from support contracts. Only the vendor has the system knowledge (and sometimes even the system access) to make even the most trivial changes. It's in the support phase that the vendor and customer interests can become misaligned. The vendor has every incentive to do the least work at the highest price possible. The customer is locked into whatever system they have chosen.

Companies with strong cash flow have been attractive acquisition targets for private equity firms. Once acquired, the company's new management focuses on eliminating expenses by cutting support staff and cleaning up the balance sheet by offloading liabilities such as unfinished development, thus making the company very profitable. The company can then be resold at a good mark-up. Customers often become very unhappy during the process. The company they "hired" during their system selection process transforms into something different.

The recent popularity of open source library management systems is in large part a search for business models that better align the interests of vendor and customer during the support phase. If the support vendor doesn't perform to the library's expectations, the library can hire a new support vendor without ditching their automation system. If a library wants to add a new feature to their system, or integrate it with a system from another vendor, they can hire a developer based on qualifications rather than access to source. The important thing to the library is not so much the access to source or the cost of the license, it's the absence of vendor lock-in.

The reason that PTFS is not widely known is that it specializes in an obscure segment of the market: it supports libraries predominantly in the government and the military. Founded in 1995, PTFS has been installing ILS systems, doing conversions, and supporting systems in the unique security environment of government systems. John Yokley, a co-founder and the CEO of PTFS, has spent 20 years, 5 more than the age of PTFS, working in the library industry, including 13 years as a Sirsi system administrator and programmer at the U.S. Courts, NASA, and the University of Virginia Health Sciences Library. Yokley himself worked in a government library for a short period in the early 90s, designing and building virtual library technology. The company has experienced steady 20% per year growth and today has 120 employees. PTFS is particularly proud of its development of the US Government Printing Office's Federal Digital System (FDsys), which supports over a thousand libraries, but the company also has a library staffing component and a digitization facility.

Although PTFS has had a strategic partnership with SirsiDynix to market ArchivalWare, a digital content management system that grew out of technology developed for FDsys and the Naval Research Laboratory's (NRL) TORPEDO project, it found itself hamstrung in supporting its customers because of the lack of access to source code of the proprietary systems it was supporting. About 18 months ago, PTFS decided that Koha was the integrated library system that it could most easily integrate with ArchivalWare, and it began to offer support for Koha. Koha is generally considered to be the first open source integrated library system; it was initially developed in New Zealand by Katipo Communications Ltd. and first deployed in January of 2000 for Horowhenua Library Trust.

LibLime (which is actually a trade name of Columbus, Ohio based Metavore, Inc.) was started in 2005 by Joshua Ferraro, Tina Berger and two others. LibLime has been the hardest-charging and fastest-growing proponent of the Koha Library System in the world. Over the intervening years, LibLime has acquired key Koha-related assets, including the US trademark, copyrights to Koha source code, and the Koha website. The combination of PTFS and LibLime will be supporting 640 installations of Koha under 123 contracts. The combined business will have Koha-related development contracts totaling $1.7 million. Despite the state of the economy, LibLime has actually had an increase in business over the past few months.

Recently, Ferraro and his co-principals at Metavore became very interested in and excited by an opportunity outside of the library space. As the LibLime business grew, they recognized that they couldn't pursue both the new opportunity and LibLime, and they began to look for an acquisition partner. PTFS was the first company they went to. Given the reasons for the sale, only Ferraro among the Metavore principals will remain with LibLime, and he will stay only for 18 months to oversee the completion of planned development.

PTFS will keep the LibLime name and fold its own Koha support business into LibLime, which will be run by Patrick Jones. At the press conference held at ALA Midwinter in Boston, PTFS CEO John Yokley indicated that PTFS was committed to the concept of user-driven development and the open source concept, but also emphasized that he was still learning about open source and he was reviewing the LibLime business model; there is much left to be decided about how the LibLime business will move forward.

I spoke with Yokley afterwards. In his conversations with LibLime customers, he has found that their top priority for adopting Koha was to avoid vendor lock-in: their systems should be expandable by LibLime, by the library, or by another vendor. He sees Koha as a component of a fully capable integrated library system, and vowed that in two years Koha would be fully capable of running a major academic library. The integration of Koha and ArchivalWare will be only the first phase. Although his team has discussed making ArchivalWare into an open source project, there are issues with third-party components used which may prevent that from happening.

Yokley's clarity on avoiding vendor lock-in will be reassuring to customers, particularly with respect to LibLime Enterprise Koha (LLEK), a service announced by LibLime in September of 2009. LLEK is perhaps the most exciting asset being acquired by PTFS, and also the most controversial. The controversy deserves another article entirely, as it represents a break between LibLime and other developers supporting Koha. I plan to write that article in the coming week; please e-mail me if you wish to comment.

LLEK represents the evolution of LibLime's entry into cloud computing (also known as "software-as-a-service"). Unlike vendors whose idea of cloud computing is simply to offer fully hosted services, LibLime's implementation of the cloud is more in line with that of modern "lean startups" that don't even own their own servers. By using Amazon EC2, LibLime has access to instantly expandable, low-cost computing resources. LibLime is able to provision, configure, and implement a new Koha server in less than an hour. To accomplish this, LibLime has developed sophisticated deployment software (which it does not intend to release).
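Since LibLime's deployment software isn't public, the following is only a rough sketch of what automated provisioning of this kind can look like, written in Python against the EC2 API using boto3. The AMI, instance type, and bootstrap script are placeholders of my own invention, not anything LibLime actually uses.

```python
# Hypothetical sketch of automated Koha provisioning on EC2 (not LibLime's tooling).
import boto3

BOOTSTRAP = """#!/bin/bash
# Placeholder bootstrap: install Koha's Debian packages and create an instance.
apt-get update && apt-get install -y koha-common
koha-create --create-db examplelibrary
"""

ec2 = boto3.resource("ec2", region_name="us-east-1")
instances = ec2.create_instances(
    ImageId="ami-0000000000000000",  # placeholder AMI id
    InstanceType="t3.small",
    MinCount=1,
    MaxCount=1,
    UserData=BOOTSTRAP,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "koha-examplelibrary"}],
    }],
)
print("Launched", instances[0].id)
```

The point of a sketch like this is that the whole sequence (launch, bootstrap, configure) can be scripted end to end, which is what makes "a new Koha server in under an hour" plausible.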

PTFS is already doing a sort of software-as-a-service, building private clouds for its military customers who don't have the option of going out on the open internet. As Yokley explained to me, "Economies of scale are an interesting thing. We've had a few large customers, but now with LibLime, we can provide services to large numbers of small libraries."

Welcome to the big leagues, PTFS!

This article is the first part of a series. Part 2 is here. Part 3 is here.


Monday, January 18, 2010

Google Exposes Book Metadata Privates at ALA Forum

At the hospital, nudity is no big deal. Doctors and nurses see bodies all the time, including ones that look like yours, and ones that look a lot worse. You get a gown, but its coverage is more psychological than physical!

Today, Google made an unprecedented display of its book metadata private parts, but the audience was a group of metadata doctors and nurses, and believe me, they've seen MUCH worse. Kurt Groetsch, a Collections Specialist in the Google Books Project, presented details of how Google processes book metadata from libraries, publishers, and others to the Association for Library Collections and Technical Services Forum during the American Library Association's Midwinter Meeting.

The Forum, entitled "Mix and Match: Mashups of Bibliographic Data", began with a presentation from OCLC's Renée Register, who described how book metadata gets created and flows through the supply chain. Her blob diagram conveyed the complexity of data flow, and she bemoaned the fact that library data is largely walled off from publisher data by incompatible formats and cataloging practice. OCLC is working to connect these data silos.

Next came friend-of-the-blog Karen Coyle, who's been a consultant (or "bibliographic informant") to the Open Library project. She described the violent collision of library metadata with internet database programmers. Coyle's role in the project is not to provide direction, but to help the programmers decode arcane library-only syntax such as "ill. (some col)". The one instance where she tried to provide direction turned out to be something of a mistake. She insisted that, to allow proper sorting, the incoming data stream should try to keep track of the end of leading articles in title strings. So for example, "The Hobbit" should be stored as "(The )Hobbit". This proved to be very cumbersome. Eventually the team tried to figure out when alphabetical sorting was really required, and the answer turned out to be "never".

Open Library does not use data records at all; instead, every piece of data is typed with a URI. This architecture aligns with W3C web standards for the semantic web, and allows much more flexible searching and data mining than would be possible with a MARC record.
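As a rough illustration of the difference (the URIs below are invented; this is not Open Library's actual vocabulary or schema), compare a conventional flat record with the same facts expressed as URI-typed statements:

```python
# A conventional "record": one bag of fields describing a book.
flat_record = {
    "title": "An Example Book",
    "author": "Jane Q. Author",
    "isbn": "9780000000002",
}

# The same facts as individual statements, each property identified by a URI.
statements = [
    ("http://example.org/book/1", "http://example.org/vocab/title", "An Example Book"),
    ("http://example.org/book/1", "http://example.org/vocab/author", "Jane Q. Author"),
    ("http://example.org/book/1", "http://example.org/vocab/isbn", "9780000000002"),
]

# Statement-level data can be queried without knowing any record layout,
# e.g. "find every subject that carries this ISBN":
matches = [s for (s, p, o) in statements
           if p == "http://example.org/vocab/isbn" and o == "9780000000002"]
print(matches)
```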

Finally, Groetsch reported on Google's metadata processing. They have over 100 bibliographic data sources, including libraries, publishers, retailers, and aggregators of reviews and jacket covers. The library data includes MARC records, anonymized circulation data, and authority files. The publisher and retailer data is mostly ONIX-formatted XML. They have amassed over 800 million bibliographic records containing over a trillion fields of data.

Incoming records are parsed into simple data structures which looked similar to Open Library's, but without the URI-ness. These structures are then transformed in various ways for Google's use. The raw metadata structures are stored in an SQL-like database for easy querying.

Groetsch then talked about the nitty-gritty details of the data. For example, the listing of an author on a MARC record can only be used as an "indication" of the author's name, because MARC gives weak indications of the contributor role. ONIX is much better in this respect. Similarly, "identifiers" such as ISBN, OCLC number, LCCN, and library barcode number are used as key strings but are only identity indicators of varying strength. One ISBN with a Chinese publisher prefix was found on records for over 24,000 different books; ISBN reuse is not at all uncommon. One librarian had mentioned to Groetsch that in her country, ISBNs are pasted onto books to give them a greater appearance of legitimacy.
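Google didn't describe its matching code, but the flavor of the problem is easy to sketch. Here is a toy Python example (the records and the weighting rule are my own invention, not Google's) of treating an ISBN as a weaker clustering signal the more distinct titles it appears on:

```python
from collections import defaultdict

# Invented records for illustration; real inputs would be parsed MARC/ONIX.
records = [
    {"title": "Gulliver's Travels", "isbn": "9780000000002"},
    {"title": "Gulliver's Travels", "isbn": "9780000000002"},
    {"title": "Introductory Widgetry", "isbn": "9781111111111"},
    {"title": "A Completely Different Book", "isbn": "9781111111111"},
]

# Count how many distinct titles each ISBN is attached to.
titles_per_isbn = defaultdict(set)
for record in records:
    titles_per_isbn[record["isbn"]].add(record["title"])

def isbn_strength(isbn):
    """Toy rule: an ISBN shared by many distinct titles is a weak identity signal."""
    return 1.0 / len(titles_per_isbn[isbn])

for isbn in titles_per_isbn:
    print(isbn, round(isbn_strength(isbn), 2))
```

An ISBN that turns up on 24,000 apparently different books would score near zero under a rule like this, which is the practical meaning of "identity indicators of varying strength".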

Echoing comments from Coyle, Groetsch spoke with pride of the progress the Google Books metadata team has made in capturing series and group data. Such information is typically recorded in mushy text fields with inconsistent syntax, even in records from the same library.

The most difficult problem faced by the Google Books team is garbage data. Last year, Google came under harsh criticism for the quality of its metadata, most notably from Geoffrey Nunberg. (I wrote an article about the controversy.) The most hilarious errors came from garbage records. For example, certain ONIX records describing Gulliver's Travels carried an author description of the wrong Jonathan Swift. When one of these garbage records is found, the same problems can almost always be found in other metadata sources. Google would like to find a way to get corrected records back into the library data ecosystem so that it doesn't have to fix them again, but there are issues with data licensing agreements that still need to be worked out. Articles like Nunberg's have been quite helpful to the Google team. Every indication is that Google is in the metadata slog for the long term.

One questioner asked the panel what the library community should be doing to prevent "metadata trainwrecks" from happening in the future. Groetsch said without hesitation, "Move away from MARC." There was nodding and murmuring in the audience (the librarian equivalent of an uproar). He elaborated that the worst parts of MARC records were the free-text data, and that normalization of data would be beneficial wherever possible.

One of the Google engineers working on record parsing, Leonid Taycher, added that the first thing he had had to learn about MARC records was that the "Machine Readable" part of the MARC acronym was a lie. (MARC stands for MAchine Readable Cataloging) The audience was amused.

The last question from the audience was about the future role of libraries in the production of metadata. Given the resources being brought to bear on book metadata by OCLC, Google, and others, should libraries be doing cataloguing at all? Karen Coyle's answer was that libraries should concentrate their attention on the rare and unique material in their collections; without their work, these materials would continue to be almost completely invisible.

Saturday, January 16, 2010

Library Automation Vendors Discuss Bicycles at ALA Midwinter

For the last 20 years, Rob McGee, a library consultant, has run a "president's panel" at ALA Midwinter. I try to go whenever I can, because you can talk with the people who run the library automation industry and hear what they're thinking. Yesterday's was typical, interspersing a fair amount of tedium with a few lively and enlightening moments. I won't attempt to cover everything that was said, but I will tell you about the liveliest moment.

Last year, in Denver, I sat in the audience next to Andrew Pace, who has often been a questioner. In past years, he would write about the panel in his American Libraries blog, Hectic Pace. This year, he was behind the podium, representing OCLC. He's not been made President or anything, but he has been leading OCLC's effort to produce cloud-based library management services. (Disclosure: I worked with Andrew when I was at OCLC.)

Andrew woke me from my Twitter-distracted slumber with a story about his bicycle. It seems that when Andrew was a boy, he had a bicycle, and it was a plain Schwinn without the banana seats and gears that all the other boys had. (I apologize for inaccuracies in my account; Andrew managed to make it sound un-dorky.) His next-door neighbor was about 2 years older than he was, and suggested that he could make Andrew's bike much better by taking it apart and putting it back together.

Andrew's parents came home to find the driveway covered with bike parts. It took a while to get the bike back together again, but once that was done, it was pretty much the same bike.

"We don't need a next generation integrated library system", Pace went on. That would be like taking apart my bike and putting it back together again. No matter what you do, if you have the same pieces, you'll get the same bike. What we need, according to Pace, are entirely new pieces that work together so that libraries can manage their libraries at a lower cost.

Suddenly, everyone in the room was awake. Gary Rautenstrauch, CEO of SirsiDynix, quickly responded "You know, I had the same bike". Rautenstrauch said that he took his bike, and stuck a playing card in between the spokes, which made a really cool sound, and then added some streamers to the handlebars. And that was a really good bike.

Next up with the bike story was Catherine Wilt, President of Lyrasis, the nation's largest regional membership organization serving libraries and information professionals. She said that Pace didn't really want a bike at all; he wanted a Razor.

"No" said Pace, "what I really wanted was a jet-pack".

Friday, January 15, 2010

Offline Book "Lending" Costs U.S. Publishers Nearly $1 Trillion

Hot on the heels of the story in Publishers Weekly that "publishers could be losing out on as much as $3 billion to online book piracy" comes a sudden realization of a much larger threat to the viability of the book industry. Apparently, over 2 billion books were "loaned" last year by a cabal of organizations found in nearly every American city and town. Using the same advanced projective mathematics used in the study cited by Publishers Weekly, Go To Hellman has computed that publishers could be losing sales opportunities totaling over $100 billion per year, losses which extend back to at least the year 2000. These lost sales dwarf the online piracy reported yesterday, and indeed, even the global book publishing business itself.

From what we've been able to piece together, the book "lending" takes place in "libraries". On entering one of these dens, patrons may view a dazzling array of books, periodicals, even CDs and DVDs, all available to anyone willing to disclose valuable personal information in exchange for a "card". But there is an ominous silence pervading these ersatz sanctuaries, enforced by the stern demeanor of staff and the glares of other patrons. Although there's no admission charge and it doesn't cost anything to borrow a book, there's always the threat of an onerous overdue bill for the hapless borrower who forgets to continue the cycle of not paying for copyrighted material.

To get to the bottom of this story, Go To Hellman has dispatched its Senior Piracy Analyst (me) to Boston, where a mass meeting of alleged book traffickers is to take place. Over 10,000 are expected at the "ALA Midwinter" event. Even at the Amtrak station in New York City this morning, at the very heart of the US publishing industry, book trafficking culture was evident, with many travelers brazenly displaying the totebags used to transport printed contraband.

As soon as I got off the train, I was surrounded by even more of this crowd. Calling themselves "Librarians", they talk about promoting literacy, education, culture and economic development, which are, of course, code words for the use and dispersal of intellectual property. They readily admit to their activities, and rationalize them because they're perfectly legal in the US, at least for now.

Typical was Susanne from DC, who told me that she's been involved in lending operations for over 15 years. This confirms our estimate that "lending" has been going on for over ten years, beyond even Google's memory. Our trillion dollar estimate may thus be on the conservative side. Of course, it's impossible to tell how many of these lent books would have been purchased legally if "libraries" were not an option, but we're not even considering the huge potential losses to publishers when "used" books are resold for pennies on the black market.

The communications backbone for this vast enterprise appears to be Twitter. Already, there is constant chatter on the #alamw10 hashtag. Most messages are clearly coded references to illicit transactions. For example a trafficker with the alias "@libacat" tweets "Have to be on the bus to the airport at 6:41 tomorrow morning to make it to the airport to get on my plane to #alamw10". At first glance, it seems like a mundane tweet about travel plans, but the breathtaking ordinariness and triple redundancy is more likely a secret code. How else to understand @scolford's (correction: retweet of @SonjaandLibrary replying to @BPLBoston) tweet; "curling my toes in joy at the thought of visiting your library"?

I've attended this meeting before. When I register for the book lending confab, I'll be presented with an encrypted document labeled the "program", which once decoded, will tell me where I can meet other book traffickers, discuss arcane trafficker lore, and drink trafficker beer. It's thick with secret code words like YALSA, LITA and NMRT, and no apparent rhyme or reason in its layout, evidently to frustrate outside investigators. I'll be lucky if I can find a bathroom.

Two places I'll be sure to find this weekend will be the OCLC Blog Salon on Sunday evening and the Chinatown Storefront Library on Saturday afternoon. Say hello if you see me.

A more serious post on Attributor is forthcoming.

Update: here's my post on "Deconstructing the Attributor Study".

Monday, January 11, 2010

Business Idea #2: eBooks are not Books

After finishing my article on the economic reasons that libraries exist, I briefly considered titling this followup "2030: No More Libraries". But as I've said before, I'm an optimist about the future of libraries, and my message is that a new sort of public library will thrive in 2030, even if the current sort doesn't survive.

To quickly sum up my previous article, in the 90's, Prof. Hal Varian derived an equation (pdf) that describes when library-like sharing can benefit both producers and consumers. Varian's equation boils down to comparing the transaction cost of sharing to the producer's marginal production cost. When the cost of producing and selling an additional book is higher than a library's cost to loan a book, society will benefit from the existence of libraries. This is the environment that has allowed libraries to thrive for hundreds of years.
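In symbols (my shorthand, not Varian's notation): if $t$ is the transaction cost of a library loan and $c$ is the publisher's marginal cost to produce and sell one more copy, library-like sharing leaves both sides better off when

$$ t < c $$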

eBooks are not like printed books. The cost to produce an additional book is essentially zero. The cost to deliver an additional book is essentially zero. The only marginal costs a publisher is likely to have are author royalties and the financial overhead of the sales channel. Thus the likelihood that public libraries can generate economic value by mere sharing of ebooks is minuscule.

I want to emphasize that I'm not saying that public libraries don't generate economic value, just that libraries generate a lot of value in the print economy that won't be generated in the digital economy. There's been a lot of reaction to a short post by Seth Godin in which he asks "What should libraries do to become relevant in the digital age?" Some of the umbrage is directed at his seeming unawareness of what libraries are already doing in the digital age, but to say, as another blog does, that the answer is "nothing" is just head-sand-sticking.

Another economic benefit of circulating libraries described by Varian's equations kicks in when consumer access preferences are heterogeneous. This benefit remains in the digital book economy, but oddly, it is tied to the inconvenience of using libraries. To put it another way, libraries help to segment the ebook market only if ebook access through the library is sufficiently inconvenient to shield publishers' higher priced access options. If you don't believe me, go and look at how ebooks are offered at the New York Public Library; you'll have to pretend that ebooks have all the inconvenient characteristics of real books.

Demand aggregation is a function that libraries have fulfilled in the print economy. If 10,000 people are willing to pay $10 for shared access to a book collection, they can buy 1,000 books that cost $100 each. But local libraries lose their advantage in collective acquisition when books become digital, because there is no longer any necessity for users to be geographically close to books. Smart publishers will want in on this action, as will other entrepreneurs. As Evan Schnittman, a senior executive at Oxford University Press, told me, "Lending models of scale are coming... but I doubt consumers will turn to libraries en masse to get their ebooks, as capitalism has a funny way of turning demand (at any price) into financial opportunity."

There's no doubt there will be collective acquisition of ebooks; the question is the shape it will take. Today I read about the problems of tropical fish farmers in Florida. This group no doubt has a distinctive set of information needs. They need access to information about tropical fish, weather forecasting, pond digging, Florida real estate, agricultural regulations, and international trade. Few of them have enough money to access all the types of information they need, and they're too small a group to attract the attention of a publisher-sponsored "ebook club". As a collective, however, it should be theoretically possible to aggregate their demand, improve information access for members, and increase publisher revenue.

Alas, the Florida Tropical Fish Farms Association would have a number of difficulties in building the ebook library it needs. Making deals with all the different publishers would be prohibitively difficult; technical infrastructure to support the use and administration of such a library may not exist. Perhaps Gluejar can make it possible for communities of interest to nucleate and create libraries of ebooks.

Locally funded public libraries will continue to be efficient demand aggregators to the extent that they focus on the particular interests and needs of their communities. In this respect they will repeat the experiences of academic and corporate libraries that have seen their importance increase with the transition to electronic information. But on the whole, public libraries will need to reprioritize and reduce their budgets, reconfigure their physical assets, scrap expensive infrastructure exclusively focused on inventory management and adopt infrastructure that supports community involvement and community development.

They'll need our support.

Saturday, January 9, 2010

Why Libraries Exist

Imagine a world in which you share every book you buy with nine strangers, and you don't buy nine of every ten books you read. What would that world be like?

If you're a librarian, you're probably thinking it would be a reader's paradise. If you're a publisher, you're probably thinking it would be an author's nightmare. If you're an economist, you're probably computing demand curves, and concluding that both librarian and publisher have got it completely wrong.

Demand curves are used by economists to characterize the market for products. They plot the amount the market would buy at a given price. For example, the demand curve for a book would show that many people will buy the book at a low price, and few will buy it at a high price. If you are an exclusive seller in a market, as most publishers are, you can choose any price you like, but your revenue will be maximized if you choose a price somewhere in the middle of the curve. For the demand curve in this example, the best price for the book is $10, which results in 1,000 sales, for a revenue of $10,000.

What happens to the demand curve in our hypothetical everybody-shares world? The sales go way down of course, but the people willing to get the book for $1 will get to read books priced at $10. The people willing to spend $10 will be able to read the book even if it's priced at $100.

If the publisher took economics in college, he would look at the demand curve and price the book at $100 instead of $10, and guess what? The publisher revenue would be $10,000, or exactly the same as before! Since the publisher doesn't have to incur the printing costs of the larger print run, he makes more profit.
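To make the arithmetic explicit: with ten readers sharing each copy, the same 1,000 readers generate only 100 sales, but at ten times the price, so revenue is unchanged:

$$ 1{,}000 \times \$10 \;=\; \$10{,}000 \;=\; \frac{1{,}000}{10} \times \$100 $$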

Is this realistic? Librarians can't have missed the fact that books meant for the library market are invariably priced at five times what the book would command in the trade market. On the other hand, if publishers were better at math, well... they'd have become bankers, and we'd probably all be better off in 2010 than we are!

The real world makes things complicated, and economists like UC Berkeley's Hal Varian studied these situations in the 90's. They wrote lots of interesting articles, including ones filled with math (pdf) and others fit to be read by librarians and publishers. Varian included the effects of transaction costs, production costs, and the different values of owning and sharing, and found that library-like sharing benefits both publishers and consumers when the transaction cost of sharing is less than the marginal production cost:
1) more books will be read; 2) consumers will pay a lower price per reading; 3) the sellers will make a higher profit; and, 4) consumers will be better off.
To put it another way, libraries can be economically justified if the cost to lend a book is less than the cost to produce and sell a book. As I discussed in my previous article, the Institute of Museum and Library Services publishes a survey (available here) that tells us that US public libraries performed 2.17 billion circulation transactions in 2007 at an operating cost of $8.86 billion. If we ignore all the other services that libraries provide, that gives us a cost per transaction of $4.09. So public libraries are inherently beneficial if it costs publishers more than $4.09 to make, sell, and deliver an extra book.

Varian also notes that consumer sharing yields benefits when consumer preferences are heterogeneous. Some people want to read a book the day it comes out; other people are perfectly happy to wait until the paperback comes out, and others will prefer to get it from the library. This heterogeneity allows market segmentation, which improves the efficiency of the market. The book publisher can charge $30 for the hardcover edition, but can still make money off of people who are only willing to pay $10 by issuing a paperback a year later. The publisher even makes money from library sales, which allow use by people not willing to pay anything for the book!

Libraries benefit society in many ways, but it's important for both publishers and librarians to understand the economic role that libraries have played in the book industry: by aggregating demand and helping to segment the market, they benefit publishers, authors, and readers alike.

Next... ebooks.

Thursday, January 7, 2010

Numbers for Libraries and the Book Market

The two best things about 10-year prognostication are that no one knows if you're a fool or a genius until it's too late to spoil the fun, and that you start to understand how little you know about the present.

In trying to predict how libraries will fit into the ebook economy (and I've decided to start leaving out the dash) I began by trying to explain the role of libraries in the print book economy. I quickly ran into the fact that it's difficult to get good estimates of how large the market role of libraries really is!

I talked to Outsell Chief Analyst Leigh Watson Healy, co-author of the Book Industry Study Group's Book Industry TRENDS report (highlights are here). Outsell estimates that the total US book market in 2009 was 41 billion dollars. They estimate that book sales to libraries were a bit over 1.6 billion dollars, which is almost 4% of the total market.

Another way to estimate the economic footprint of libraries is to look at library budgets. The Institute of Museum and Library Services publishes an annual survey (available here) of all the government-supported public libraries in the US. For the last year available, 2007, they compute that US public libraries spent 934 million dollars on their print collections (including serials). IMLS is counting something different from Outsell and there are a number of reasons the numbers would be different; my guess is the Outsell numbers, which are gathered from publishers, include a lot of libraries that either don't fit into the IMLS definitions or are outside the US.

Similar information for the UK is available from the British Booksellers Association and the Public Library Materials Fund and Budget Survey. UK public libraries spent 77 million pounds on books in 2008, while the British public spent a total of 2.3 billion pounds on books, suggesting that public library spending was 3.3% of the total.

Public libraries are only part of the library market, of course. Academic, school, medical, and corporate libraries are also purchasers of books. Altogether, it seems likely that libraries account for at least 5% of the book market, which is nothing to sneeze at. If we assume that public libraries buy books that are predominantly in the juvenile and adult trade categories, then even the IMLS total is over 6% in those categories. In some segments of the book market, such as academic monographs, libraries probably make up much more of the total market.

The IMLS report also has some numbers relevant to my prediction of fewer libraries, more locations. In 2007, there were 9,040 "central" public libraries, 7,564 "branches" and 808 bookmobiles in the US. The per capita expenditures on public libraries was $37.21, and Americans borrowed an average of 7.3 books per year.

I plotted some of the IMLS data to see if my fewer-libraries prediction is just an extrapolation of a trend. It's not. The total number of libraries and library locations held pretty steady between 2003 and 2007, after rising slightly from 1998 to 2003.

Part of the reason libraries are not merging is expressed by Katherine Gould in her comment on my fewer libraries post. Her library district has had good success with a small satellite branch, but the idea of merging libraries isn't one that appeals to her. It shouldn't; the tradeoffs of libraries vs. locations are painful, and only likely to occur in times of economic hardship. Oh.

A good example of the "storefront libraries" that I described is the Chinatown Storefront Library in Boston. An experimental, temporary library in a vacant storefront, the Storefront Library (唐人街店面图) is the ultimate "lean" library. It uses donated books, LibraryThing cataloguing, volunteer labor, community support, and a surplus of optimism about libraries. It will close the week after ALA Midwinter comes to town; I definitely recommend going by to take a look at the future of library locations. I certainly will!

Tuesday, January 5, 2010

The Rock-Star Librarian and Objective Selector


You probably can't name any musical performers from the 18th or 19th century. But you've probably heard of Enrico Caruso. Caruso had a sharp business sense to go along with a legendary voice, and he took advantage of cutting-edge technology to make his voice heard by more people than perhaps any other human before him. He earned millions of dollars from the sales of his recordings before his untimely death in 1921. He was the first rock-star opera singer.
In my article about the changing role of public libraries in the ebook economy, I observed that libraries would have a diminished economic role when most books had become digital. How will the role of librarians change when this happens?

Librarians have already seen their roles change drastically as library operations have moved onto the internet. Cataloging has changed profoundly, reference has been googlized and pre-internet licensing was almost nonexistent. But these changes have occurred in the context of relatively stable institutions. In what ways will the technological shift to ebooks transform both the role and the context of librarians?

Here's one possibility: objective selection.

I mentioned tropical fish farmers as an example of a group with distinctive information needs, and thus in need of a specialized collection of ebooks. Someone needs to do the selection. But why limit the luxury of customized collections to obscure trade groups? Shouldn't knitters be able to support a custom-selected ebook library? What about bass guitarists? Erlang programmers? Hula dancers?

Some book publishers imagine that their future is tied to the development of "verticals", or units that specialize in a single subject and thereby develop strong relationships with their audience. They imagine developing libraries of digital content to satisfy market needs and selling these libraries by subscription. And while this is likely to be a sensible strategy, it is at the same time rather limiting. It seems to me that the libraries that best serve the consumer's needs will be built by objective selectors, not marketing mavens.

The importance of objective selection is borne out in today's book business, in which the most powerful person is Oprah Winfrey. Her viewers trust her to select books based on their merit and her good taste, rather than their profit potential. A book selected for Oprah's Book Club will rack up hundreds of thousands of sales.

The reason the transition to ebooks could amplify the role of selectors is that new models for selling books are possible. In the print world, you would never think of buying a collection of 1000 books at any price, even if it was selected for you by Oprah herself. In the e-book world, one could easily imagine wanting to buy subscriptions to 1000-book collections, not so much to read, but to keep on the computer and the iPhone just in case.

If you accept the idea of thousand-ebook libraries and think about how you would market them, you are inevitably led to the concept of rock-star librarians. Collections can't be marketed by their specific content, or else single-item sales would be cannibalized. Marketing of collections would have to be centered on the selectors. Think how much some people would pay to have Warren Buffett's librarian selecting their thousand-ebook libraries! Selectors would no longer be anonymous John Does. They would develop their own followings, their own brands, their own communities. These communities would not be bounded by geography; a top selector would reach patrons around the world, just as Enrico Caruso's voice did.

If you're a publisher, your first reaction is probably that this is nonsense, and that publishers would be better positioned to develop brands and communities. Why would publishers ever offer discounted ebooks through objective selectors, let alone allow a percentage for the selectors to live on? The reason, of course, is the demand curve. Objective selectors will earn their economic keep by helping to aggregate demand and segment the market. Their libraries will provide access to books that the consumer needs but wouldn't buy on their own; those are the sales that publishers will try their best to keep to themselves.

Having an objective intermediary between consumer and publisher could have other types of benefits, most notably privacy. Librarians have a strong code of ethics surrounding the rights of patrons to read without fear of having someone looking over their shoulder, and these values could be built into an ebook selection and collection platform. Publishers may well find it easier to sell certain kinds of content when it comprises part of a discreetly selected library.

It could happen. The other possibility is that punk rock stars could take jobs now and then as librarians. That could happen, too.

Sunday, January 3, 2010

2020: Fewer Libraries, More Locations

In the middle of the block I live on, there are two fire hydrants right next to each other. The reason for this is that half the block is in the town of Glen Ridge, New Jersey, and half the block is in Montclair, New Jersey. In the past there were incidents when fire trucks from the two towns rushed to the scene of a fire alarm, only to get into lengthy discussions about which town had responsibility. That doesn't happen any more. In 1991, Glen Ridge closed its fire department and contracted with Montclair to supply fire protection services.

Will the same sort of consolidation happen with libraries? I think it will.

In my "Ten Predictions for the Next Ten Years" article, my first prediction was that the number of public libraries in 2020 would be half of what it is today. I also predicted that the number of public library locations would increase by 50%. I got plenty of feedback on Twitter that these predictions needed some explanation. Roy Kenagy thought that my prediction couldn't possibly apply to Iowa, where "new [libraries] sprout like weeds and people tend to them as their own".

There were two considerations, book digitization and the shift to e-books, that led me to these predictions, and neither is peculiar to New Jersey. I admit, though, that New Jersey's high taxes and density of services affected my estimate of the magnitude of coming changes.

Over the next ten years, book digitization will completely change the way most people use libraries. Instead of browsing the stacks or searching a catalog, people will increasingly make use of full text indexes and digitized resources to find books. This already happens with Google Books. They will then try to obtain the physical book in the library, or alternatively, use an e-book reader. Public libraries will need to adapt their physical plants to accommodate this changed usage pattern. Stacks will become more warehouse-like; public spaces will have fewer books and more coffee. Patrons will demand larger collections, but will accept less physical access to print. Home delivery of library materials will become much more common.

At the same time, libraries will struggle to adapt to the e-book economy. The most likely outcome will be a shift to licensed resources. Publishers will discover the benefits of putting much larger numbers of titles into e-book subscription packages such as those currently offered by OverDrive, NetLibrary, ebrary, and others. When these packages can be used on patrons' Kindles and other e-readers, libraries will need to have them.

All of these trends will put pressure on libraries to work together on shared services, and ultimately to merge. Larger libraries will be more effective at delivering both print books and e-books, and patrons will care less about where the print books are stored when they're not being lent. Smaller libraries will find it difficult to support the technical and operational expertise needed to run the public library of 2020.

While the shift to digital media will cause library organizations to become larger and fewer through mergers, it will also allow branches to be effective at smaller sizes. Without the need to store a critical mass of books, tiny storefront branches will become more practical and cost-efficient. Guys in vans carrying books will become more important. When people go to their local branch, they'll be able to use the free Google Books terminal (libraries are to get one free for every building) or other computers, check out some books, then have a coffee and socialize for an hour or so until the van makes its hourly delivery. Or they'll do their shopping rounds and come back to pick up the bag of books waiting for them. Establishing branches in shopping areas is not only a smart thing for libraries to do, it's also very cost-efficient.

In my own town, it seems that almost every year there's talk of closing the branch to save money. If you look at it, you can see why: the building is massive and has to be very expensive to operate. Eventually it will be shuttered and sold, but a storefront branch down the block could deliver the same services and cost much less to run. Does it make sense for the town high school to run its own library? Not really, but that could be another branch. We'll have fewer libraries, but more locations.

While consolidation and mergers will reduce the number of libraries, it can't be ignored that public library budgets are being slashed, and some libraries are being closed for purely financial reasons. Part of this is that the perceived value of libraries is less than it used to be. Many critical information services that used to be available only through libraries are now readily available through the internet.

There's also the possibility that public library services could be outsourced. My town's library gets annual funding of $3.8 million, roughly $100/resident, paid through our property taxes. If, in 2020, people are mostly reading books on e-readers, how much will they be willing to spend on a library? Will people prefer to spend the $100 on a commercial e-book subscription? I'm not sure, but I'm guessing that some towns will go the outsourcing route.
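To put some rough numbers on that question, here is the arithmetic implied by the figures above (the resident count is simply what $3.8 million at about $100 per head works out to; it isn't an official population figure):

    # Rough arithmetic behind the outsourcing question above.
    annual_library_budget = 3_800_000   # dollars per year, via property taxes
    cost_per_resident = 100             # approximate per-resident share

    # Implied number of residents sharing the cost.
    print(annual_library_budget // cost_per_resident)  # 38000

    # The question: would each resident rather put that ~$100 toward
    # a commercial e-book subscription than toward the library?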

As I've said before, I'm an optimist about the ability of libraries to adapt to changes in media. If you look carefully at my picture of the Montclair Public Library van guy, you'll notice that what he's collected from this drop box is a newspaper, some VCR tapes and a whole bunch of DVDs. The print books were in another box, and there was not a single e-book to be carried to the main library. Makes you wonder...

Update 1/8: Some follow-up here.