Your tweets, saved for eternity
Apr 14th, 2010 by Isaiah Beard

With over 12 billion 140-character messages and counting, Twitter has exploded onto the social networking scene since the first Tweet was posted roughly four years ago. Those tiny text-based messages add up: that’s an estimated 1.5 terabytes of data, and growing!

It looks like the Library of Congress sees the social impact and significance of the medium, and even believes there is a potential academic treasure trove waiting to be unearthed within this mass of single-sentence missives.  And so, the LoC has announced – via Twitter, of course – that it has acquired the entire Twitter archive.

That’s right. Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress.

So if you think the Library of Congress is “just books,” think of this: The Library has been collecting materials from the web since it began harvesting congressional and presidential campaign websites in 2000. Today we hold more than 167 terabytes of web-based information, including legal blogs, websites of candidates for national office, and websites of Members of Congress.

Twitter also made its own announcement via its blog:

It is our pleasure to donate access to the entire archive of public Tweets to the Library of Congress for preservation and research. It’s very exciting that tweets are becoming part of history. It should be noted that there are some specifics regarding this arrangement. Only after a six-month delay can the Tweets be used for internal library use, for non-commercial research, public display by the library itself, and preservation.

The specific details of the arrangement are still a bit sketchy, and I do have some questions about how this will play out. For instance, there’s not much direct mention of whether this archive will include the numerous photos and videos that are frequently linked from users’ tweets but hosted on third-party add-on sites such as TwitPic and Posterous. A lot of Twitter users treat the platform as a springboard for linking to websites and other external content, the permanence of which can be pretty dubious.

This is still a very promising start though, and hopefully the archived twittersphere will in fact prove useful to researchers in the future.

Some may question the importance or significance of this decision. But Twitter isn’t just mindless banter. The LoC lists a few socially significant tweets in the archive, among them the first “Victory tweet” by a president-elect. There’s also quite a bit of historical influence that was set in motion by Twitter: political prisoners in the Middle East have used it to get their message across to followers; sometimes it was the very medium that got them into trouble, and other times it spread the word that helped set them free. Politicians in the West from all ends of the political spectrum have used, and continue to use, Twitter to marshal their troops, as it were. And the media have documented cases where Twitter became the source of social change in countries ruled with an iron hand, so much so that a potential outage of the service due to maintenance was once considered a serious threat to activism. There’s PLENTY of social significance there.

NY Times Article on the realities and costs of Born Digital preservation
Mar 16th, 2010 by Isaiah Beard

Salman Rushdie. Source: Wikipedia.

The New York Times today published an article that reflects some of the challenges of preserving born digital content – that is, documents, data and other content that has been created digitally, on a computer or electronic device, and for which there is no physical original (such as on paper).

In particular, they highlight the efforts of Emory University, in preserving Salman Rushdie’s archival materials.

Among the archival material from Salman Rushdie currently on display at Emory University in Atlanta are inked book covers, handwritten journals and four Apple computers (one ruined by a spilled Coke). The 18 gigabytes of data they contain seemed to promise future biographers and literary scholars a digital wonderland: comprehensive, organized and searchable files, quickly accessible with a few clicks.

But like most Rushdian paradises, this digital idyll has its own set of problems. As research libraries and archives are discovering, “born-digital” materials — those initially created in electronic form — are much more complicated and costly to preserve than anticipated.

Electronically produced drafts, correspondence and editorial comments, sweated over by contemporary poets, novelists and nonfiction authors, are ultimately just a series of digits — 0’s and 1’s — written on floppy disks, CDs and hard drives, all of which degrade much faster than old-fashioned acid-free paper. Even if those storage media do survive, the relentless march of technology can mean that the older equipment and software that can make sense of all those 0’s and 1’s simply don’t exist anymore.

Imagine having a record but no record player.

An interesting aspect of this collection and its exhibition is that it emulates the experience Rushdie had in creating the content.  Rather than just viewing the finished documents, you get to see the computer desktop as he saw it, open up the same applications he used, all in the 1980s and 1990s technological contexts… and not using the modern, Web 2.0, Windows 7 or Mac OS X trappings we’re accustomed to in today’s computers.

I think this article is an excellent read, irrespective of what one’s views may be on the subject matter.  Material of all kinds, in increasing amounts, faces the same perils as this collection every day, and archivists everywhere, including this one, wrestle with how best to retain it all.  So far, the only tried and true method for such types of preservation is to obsessively manage and migrate the content, and that requires making tough decisions as to how to proceed, what formats to migrate to, and hoping the decisions made are the right ones to keep the content viable, at least until the next generation of technology requires that the hard decisions be made again.

New Scientist article on “Digital Doomsday”
Feb 3rd, 2010 by Isaiah Beard

One of the topics I like to bring up in the discussion of preserving digital data is the idea of a Digital Dark Age… the notion of a period in our historic knowledge that ends up getting lost due to a failure to plan and preserve our early digital content.

The New Scientist, however, recently published an article (Feb 2, 2010) on something a bit more cataclysmic: the concept of Digital Doomsday. From the article:

Suppose, for instance, that the global financial system collapses, or a new virus kills most of the world’s population, or a solar storm destroys the power grid in North America. Or suppose there is a slow decline as soaring energy costs and worsening environmental disasters take their toll. The increasing complexity and interdependency of society is making civilisation ever more vulnerable to such events (New Scientist, 5 April 2008, p 28 and p 32).

Whatever the cause, if the power was cut off to the banks of computers that now store much of humanity’s knowledge, and people stopped looking after them and the buildings housing them, and factories ceased to churn out new chips and drives, how long would all our knowledge survive? How much would the survivors of such a disaster be able to retrieve decades or centuries hence?

The article is a compelling read, and offers an intellectual exercise on how much of our “stuff” would survive such a catastrophe. Ironically, the logic is that the digital content with the most copies in existence may win out. So, while scholarly works, theses, research and other important scientific data would be at risk, pop music may survive just fine.

Designing and Implementing a Center for Digital Curation Research
Nov 17th, 2009 by Isaiah Beard

The facility I work in at Rutgers, known as the Scholarly Communication Center (SCC), has a fairly short history in the grand scheme of academia, and yet a fairly long one when it comes to the rapid changes in technology it has seen in its lifetime. It was started in 1996, and was meant to be a location for university students and faculty to access a growing, then-nascent collection of digital content.

Back then, the internet still wasn’t very fast and wasn’t nearly as media-rich as it seems today. And so, most of the data-heavy reference materials arriving in digital form came to the SCC as CD-ROMs (and later, DVDs). To accommodate this, the SCC had a lab of ten desktop computers (known as the Datacenter), dedicated solely to accessing this type of material.

But the times changed, and so did the way people accessed digital material.  As the ‘net grew in size and capacity, it no longer made sense to ship reference material on disc, and so the access moved online.  Students migrated from visiting computer labs to bringing their own laptops (and later, netbooks and handheld mobile devices).  Traffic at the datacenter dropped to virtually nothing.  The space had to be re-tooled to continue to be relevant and useful.

And so, with my taking on the newly-minted role of Digital Data Curator, and in collaboration with my colleagues, a new plan for the former datacenter was developed.  Instead of being a place to merely access content, we would be a place to create it.  Analog items that needed to be digitized would be assessed and handled here.  New born-digital content would be edited, packaged, and prepared for permanent digital archiving in our repository.  We would be a laboratory where students getting into the field – and even faculty and staff who have been here a good while – would learn, hands-on, how to triage and care for items of historical significance, both digital and analog, and prepare them for online access.

The concept for a new facility was born.  And we call it the Digital Curation Research Center.

The center is still in “beta,” as we plug along with some internal projects for testing purposes along with a couple of willing test subjects within the university and surrounding community.  This is so we can test out the workflow of the space and make tweaks and optimizations as needed.  Our plan is to officially launch the space in the Spring of 2010, with a series of workshops and how-to sessions for the various things that make digital curation vital (e.g. digital photography, video editing, audio and podcasting, and scanning).

The plan is that this will be a continual, evolving learning experience for all involved.  People who have never really used cameras and recording equipment in a historical context will learn just how increasingly valuable the content they create, and the stories it will tell, can become over time.  And those of us in the DCRC day in and day out will encounter things that we’ve never run into before, and will have to wrap our heads around the issue of preserving it effectively.

Below are related documents that provide additional information about the DCRC.  More information will be coming up as we get closer to the official launch:

The case for improved large file support in digital repositories
Nov 2nd, 2009 by Isaiah Beard

As the person responsible for handling the various file formats in RUcore, the digital library repository for Rutgers University Libraries, I’ve been looking with trepidation at the increasing sizes of the digital assets people are starting to create.  In 2004 when the architecture for this was first envisioned, very few digital items grew past the hundred-megabyte point.

How things have changed! Video and even audio files are routinely pushing into the gigabytes, now that technology has progressed to the point where high-definition video and audio can be originated on ubiquitous mobile devices. And as RUcore and other large repositories seek to preserve this content, we are finding ourselves running into a hurdle we did not anticipate: the ability of our architectures to handle these very large digital files. In particular, files larger than 2 gigabytes have posed some problems for FEDORA, our infrastructure of choice, and this is a very big deal for video content in particular. Consider that 2 gigabytes can comprise less than 5 minutes of HD content, and you can see our dilemma.
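To put the "less than 5 minutes" figure in perspective, here is a rough back-of-the-envelope sketch. It assumes a gigabyte means 2^30 bytes and a constant bitrate, and the function names are made up for illustration; none of it reflects how FEDORA or RUcore actually measures anything.

```python
# Rough arithmetic only. Assumptions (not from the post): binary gigabytes,
# constant bitrate, and hypothetical helper names.
TWO_GB = 2 * 1024**3  # 2 gigabytes expressed in bytes


def implied_bitrate_mbps(size_bytes, minutes):
    """Average bitrate (megabits per second) needed to fill size_bytes in the given minutes."""
    return size_bytes * 8 / (minutes * 60) / 1_000_000


def minutes_of_video(size_bytes, megabits_per_second):
    """Minutes of video that fit in size_bytes at a constant bitrate."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return size_bytes / bytes_per_second / 60


# Bitrate at which 2 GB lasts exactly 5 minutes; anything higher hits the limit sooner.
print(round(implied_bitrate_mbps(TWO_GB, 5), 2))

# How long 2 GB lasts at an illustrative 100 Mbit/s capture rate.
print(round(minutes_of_video(TWO_GB, 100), 2))
```

In other words, any HD source recorded at more than roughly 57 megabits per second crosses the 2 gigabyte mark in under five minutes, so a limit at that size gets in the way almost immediately for preservation-quality video.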

Added mechanisms to support these large items have been slow in coming, and have presented some implementation difficulties of their own. For this reason, I’ve drafted a document which explains our position on why we need uniform large file support in digital repositories. Feel free to have a look and provide feedback.

With any luck, developers will heed the call presented here and at other institutions, and work to make better support for big files a reality.
