In an earlier blog post, I discussed how cloud computing might be a good option for converting paper publications to searchable digital files. We ran a pilot test of cloud computing to evaluate this option.
The cloud was used to convert uncompressed TIFF images into searchable PDFs, requiring an optical character recognition (OCR) engine and a PDF distiller. This is a computationally intensive operation typical of image processing and therefore seemed to be a good application for distributed cloud computing.
Our pilot at the Government Printing Office consisted of processing nearly 20,000 pages of the Statutes at Large collection that had been scanned to our specifications. To assess the cloud's capabilities, samples of the collection were processed with four different instance types: small; large; high-CPU, medium storage; and high-CPU, large storage. The characteristics of these instances are outlined in this table:
Virtual instance details
Results were developed to assess the cost of processing a 100-page publication, which is roughly a 90 MB package of TIFFs. Calculated costs included the upload/download transmission cost and the computing cost. The results were interesting: they indicated that additional processing capability (more virtual cores) improves performance, but only up to a point.
The transmission costs actually turned out to be higher than the processing costs. Even so, it was clear that the high-CPU, medium storage instance was the most cost-effective processing solution. Licensing costs for the OCR tool were not included in this evaluation.
Pilot cost summary
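For readers who want to run the same kind of comparison, here is a minimal sketch of the cost model described above: the cost per publication is the transmission cost (upload plus download) plus the computing cost. Every name and parameter in it is hypothetical, and none of the numbers in the usage example come from the pilot. It also assumes compute time is billed pro rata by the minute, which is a simplification.

```python
from dataclasses import dataclass

MB_PER_GB = 1024  # assumption: binary gigabytes


@dataclass
class InstanceType:
    """Hypothetical description of one virtual instance type."""
    name: str
    hourly_rate_usd: float       # placeholder price, not a real quote
    minutes_per_package: float   # placeholder processing time for one package


def transfer_cost(upload_mb, download_mb, upload_usd_per_gb, download_usd_per_gb):
    """Cost of sending the TIFF package up and bringing the PDF back down."""
    upload = (upload_mb / MB_PER_GB) * upload_usd_per_gb
    download = (download_mb / MB_PER_GB) * download_usd_per_gb
    return upload + download


def compute_cost(instance):
    """Cost of the instance time, assuming pro rata billing by the minute."""
    return instance.hourly_rate_usd * instance.minutes_per_package / 60


def total_cost(instance, upload_mb, download_mb, upload_usd_per_gb, download_usd_per_gb):
    """Transmission cost plus computing cost for one package."""
    return (transfer_cost(upload_mb, download_mb, upload_usd_per_gb, download_usd_per_gb)
            + compute_cost(instance))


if __name__ == "__main__":
    # Illustrative placeholders only; substitute measured times and current rates.
    candidates = [
        InstanceType("example-small", hourly_rate_usd=0.10, minutes_per_package=60.0),
        InstanceType("example-high-cpu", hourly_rate_usd=0.20, minutes_per_package=15.0),
    ]
    for inst in candidates:
        cost = total_cost(inst, upload_mb=90, download_mb=10,
                          upload_usd_per_gb=0.10, download_usd_per_gb=0.15)
        print(f"{inst.name}: ${cost:.4f} per package")
```

Keeping the transmission and computing terms separate makes it easy to see which one dominates for a given package size, which was the comparison that mattered in our pilot.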
Special thanks to our awesome engineering co-op student, Isaac Jones, for developing the test plan and collecting the data for this evaluation.
Amazon Elastic Compute Cloud
FDsys Operational Specification for Converted Content