Cloud computing is a style of computing where IT-related capabilities are provided as a service, allowing users to access technology-enabled services without knowledge of, expertise with, or control over the technology infrastructure that supports them.
As we face the need to convert several hundred years of paper documents to digital files for quick and easy access, cloud computing may provide a good solution for us at the Government Printing Office. Efforts are underway at GPO and other Federal agencies to scan documents and produce collections of TIFF images that can later be processed to create accessible versions. To date, we have collected several terabytes of TIFF data that need to be converted, but this is just the tip of the iceberg: we anticipate this unprocessed data will grow to multiple petabytes. To process this data, we must either build an in-house computing capability to convert these documents, outsource the conversion, or get creative and explore capabilities like cloud computing.
Fortunately, some benchmarks are emerging that we can use to help model this problem and guide us to a solution. It appears that processing a page of text in TIFF form and converting it to a searchable PDF takes about 1 minute with OCR engines working on a typical computing platform available today. This will certainly improve over time as
One of the key barriers that we need to anticipate is licensing the specific applications that will be used to accomplish this conversion. In particular, licensing OCR technology for use in a cloud of thousands of virtual machines that process a large collection of TIFFs in parallel appears to be one of our major hurdles to clear.
As virtualization technologies continue to emerge and mature, they will play a big role in solving some complex and, at times, short-term computing tasks, minimizing the need to build large, on-site computing facilities.
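To get a feel for the scale involved, the one-minute-per-page benchmark mentioned earlier can be turned into a rough back-of-the-envelope estimate. The sketch below is only illustrative: the function name, the average page size of 1 MB, and the machine counts are assumptions made for the example, not GPO figures, and it assumes the work splits perfectly across machines with no overhead for data transfer or scheduling.

```python
import math


def estimate_conversion(total_bytes, bytes_per_page, minutes_per_page=1.0, machines=1):
    """Rough estimate of the OCR conversion workload.

    Assumes every page takes the same time and that the work divides
    evenly across machines (no transfer, queuing, or licensing limits).
    """
    # Number of pages implied by the collection size and average page size.
    pages = math.ceil(total_bytes / bytes_per_page)
    # Total compute effort, independent of how many machines share it.
    machine_minutes = pages * minutes_per_page
    # Elapsed time if the pages are spread evenly over all machines.
    wall_clock_days = machine_minutes / machines / (60 * 24)
    return pages, machine_minutes, wall_clock_days


# Hypothetical inputs: 1 TB of TIFF data at an assumed 1 MB per page.
ONE_TERABYTE = 10**12
ASSUMED_PAGE_BYTES = 10**6  # assumption for illustration only

for machines in (1, 1000):
    pages, minutes, days = estimate_conversion(
        ONE_TERABYTE, ASSUMED_PAGE_BYTES, minutes_per_page=1.0, machines=machines
    )
    print(f"{machines} machine(s): {pages} pages, {days:.2f} days of elapsed time")
```

Under these assumptions, a single terabyte works out to about a million pages, which is nearly two years of continuous processing on one machine but well under a day on a thousand. The same arithmetic scaled to petabytes is what makes a short-lived pool of virtual machines attractive, and it is also why per-machine OCR licensing becomes such a significant factor.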