Creating Scalable Digital Libraries

Like many other media content providers, libraries and museums are increasingly moving their content onto the Web.  While the move itself is no easy process (with digitization, web development, and training costs), being able to successfully deliver content to a wide audience is an ongoing concern, particularly for large libraries.

Much of the concern is financial, as most libraries do not have the internal budget or outside investors that for-profit businesses enjoy.  Even large university libraries face budget constraints that other university departments, such as science and technology, would not.

Creating a scalable infrastructure for distributing a large digital collection that can handle many simultaneous requests requires planning of a kind many librarians have never had to do.  They must stop thinking in terms of "one-item-per-customer" and start thinking in terms of numerous users accessing the same information simultaneously.

Content Delivery Network

One option to consider is using a CDN, or Content Delivery Network, to remove much of the load from the web application server.  Rather than hosting every single image, video, text document, PDF, and so on on the application server, a CDN hosts the static content on multiple redundant servers spread across numerous geographic locations.  When a user accesses a particular collection piece, the CDN locates the server closest to the user, with the fewest network hops, and delivers the content from there.

It is commonly estimated that 80 to 90 percent of end-user response time is spent downloading images, video, and other static content; for digital libraries the figure is probably 90 percent or higher.  To reduce costs, libraries and museums can consider pay-as-you-go CDN services, such as those offered by Amazon Web Services, to handle content delivery.
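
To make this concrete, here is a minimal sketch of how an application might rewrite static asset paths to point at a CDN hostname; the domain cdn.example.org and the path prefixes are hypothetical.

    # Sketch: send static assets to a CDN hostname so the web application
    # server only handles dynamic requests. The CDN domain and path
    # prefixes below are placeholders.
    CDN_BASE = "https://cdn.example.org"
    STATIC_PREFIXES = ("/images/", "/video/", "/pdf/", "/css/", "/js/")

    def asset_url(path: str) -> str:
        """Return a CDN URL for static content, or the original path otherwise."""
        if path.startswith(STATIC_PREFIXES):
            return CDN_BASE + path
        return path

    # Example: a page template calls asset_url("/images/persian-sculpture.jpg")
    # and the browser fetches the file from the CDN edge nearest the user.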

Metadata Server

A large part of the data distributed through a digital library is metadata.  Whereas a site like Flickr may attach a few optional tags to each item, a digital library's metadata is usually far more extensive and in-depth.  That is a large amount of data in its own right, and it should be stored on a separate server from the web application itself.

As for the format of the data, that largely depends on the needs of your particular collection.  For some, an RDBMS (relational database management system) may be sufficient; others may rely on XML-based metadata file sets or some other solution.
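
As a rough illustration of the XML-based approach, the sketch below writes a single Dublin Core-style record using only Python's standard library; the field values and filename are invented for the example.

    # Sketch: write one Dublin Core-style metadata record as XML.
    # The record values and output filename are invented for illustration.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    record = ET.Element("record")
    for element, value in [
        ("title", "Relief fragment, Persepolis"),
        ("creator", "Unknown"),
        ("subject", "Achaemenid sculpture"),
        ("date", "c. 500 BCE"),
        ("identifier", "col-001-0042"),
    ]:
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    ET.ElementTree(record).write("col-001-0042.xml", encoding="utf-8",
                                 xml_declaration=True)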

Query Routing

Although browsing is not a completely lost art, searching is the wave of the future.  People expect fast searches that produce quality results.  As a site grows and visitors increase, managing multiple queries can create bottlenecks and even interruptions in service.

On many small library websites, the user enters a search query and the search engine scours the entire database for matching keywords, subjects, authors, and/or titles.  On a large-scale site with millions of users, this creates unnecessary load that can spike to dangerous levels.

By routing queries to appropriate nodes, you can save time and bandwidth.  Methods to accomplish this include mediator-based routing and peer-to-peer routing.

With mediator routing, a mediator application sits between the search interface and the database, routing the user's queries to likely nodes that hold content summaries and metadata rather than the content itself, much as a search engine consults its own index rather than crawling the web for every search.
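
A toy version of the idea, with invented node names and keyword summaries, might look like this:

    # Sketch: a mediator keeps lightweight content summaries (keyword sets)
    # for each node and forwards a query only to nodes whose summary matches.
    # Node names and summaries are invented for illustration.
    NODE_SUMMARIES = {
        "node-maps":        {"map", "atlas", "cartography"},
        "node-manuscripts": {"manuscript", "codex", "folio"},
        "node-photographs": {"photograph", "daguerreotype", "negative"},
    }

    def route(query):
        """Return the nodes whose summaries share at least one query term."""
        terms = set(query.lower().split())
        return [node for node, summary in NODE_SUMMARIES.items() if terms & summary]

    # route("medieval manuscript folio") -> ["node-manuscripts"]
    # Only that node runs the full search; the others are never touched.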

Peer-to-peer routing is more complex but can be effective when large numbers of users enter identical or similar queries.  In this model, queries are spread across redundant nodes rather than funneled through a single server, so no one node bears the full load and individual response times stay short.
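
The sketch below is not a full peer-to-peer implementation, just one simple way to spread work across redundant search nodes: hash the normalized query so identical searches always land on the same replica, where the result is likely already cached.  The node addresses are placeholders.

    # Sketch: choose one of several redundant search nodes by hashing the
    # normalized query. Identical queries map to the same node, so repeated
    # searches reuse that node's warm cache instead of hitting every node.
    import hashlib

    REPLICAS = ["search-1.internal", "search-2.internal", "search-3.internal"]

    def pick_replica(query):
        normalized = " ".join(query.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        return REPLICAS[int(digest, 16) % len(REPLICAS)]

    # pick_replica("Persian sculpture") and pick_replica("persian  SCULPTURE")
    # both return the same node.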

Caching

With a multitude of static content and redundant queries, caching simply makes sense.  If 5,000 people per day are looking at the same picture of a Persian sculpture, caching the image and its metadata will reduce response time.  Furthermore, if 3,000 of those 5,000 users arrived with exactly the same search terms, the server should not have to perform that search again; it can retrieve the cached result set and display it automatically.
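
A trivial in-process version of that idea, assuming a hypothetical run_search() function that performs the real database lookup, could be as simple as:

    # Sketch: memoize search results so repeated, identical queries never
    # reach the database twice. run_search() stands in for the real lookup
    # and is hypothetical.
    from functools import lru_cache

    def run_search(normalized_query):
        ...  # expensive database or index query goes here
        return []

    @lru_cache(maxsize=10_000)
    def cached_search(normalized_query):
        return tuple(run_search(normalized_query))

    def search(query):
        # Normalize so "Persian  Sculpture" and "persian sculpture"
        # share one cache entry.
        return cached_search(" ".join(query.lower().split()))

In production this role is usually played by a shared cache such as memcached or Redis rather than a per-process cache, so that every application server benefits from the same cached entries.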

Final Notes

  • If you have the money to spend, buy good hardware.  More RAM and faster disks will help as your site grows.
  • Test your website under heavy loads, monitor the statistics, find the bottlenecks, fix them, and retest until you are satisfied.
  • Web applications are increasingly reliant on background jobs, so while optimizing your code is important, do not ignore the background processes that may actually be slowing down your site.  Consider job queue software to reduce the latency of page views.
  • Do not scale before you have to.  It is better to have a single optimized server for the current amount of traffic than additional nodes that sit idle, costing you money.
  • Use cookies whenever possible, as this offloads a lot of state into the user's browser rather than onto the server.  For security, sign and encrypt the cookies (a minimal signing sketch follows this list).
  • Think in terms of packets rather than bytes.  Rather than trying to squeeze every last byte out of files that are already tiny, focus on delivering as few packets as possible.
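
For the cookie-signing point above, a bare-bones sketch using only the standard library might look like the following; the secret key is a placeholder that should be loaded from configuration, and sensitive values should also be encrypted.

    # Sketch: sign a cookie value with HMAC so the server can detect tampering.
    # SECRET_KEY is a placeholder; real deployments load it from configuration.
    import hashlib
    import hmac

    SECRET_KEY = b"replace-with-a-long-random-secret"

    def sign(value):
        mac = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return value + "|" + mac

    def verify(cookie):
        value, _, mac = cookie.rpartition("|")
        expected = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return value if hmac.compare_digest(mac, expected) else None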

Whether you are a non-profit traditional library, an independent movie database, a photo sharing site, or a large content wiki, your digital library will need to be prepared for voluminous content distribution and a multitude of user requests.  Optimize what you have first and configure your servers for easy scaling as needed.

Written by Tavis Hampton of ServerSchool.com