Previous work

During initial investigations, we had successfully uploaded temporary test files to the Internet Archive (IA) using the same method which would form the basis of our proof-of-concept workflow. As briefly discussed in my earlier post, both Thoth and the Internet Archive offer APIs (Application Programming Interfaces) as a simple, standardised way for software programs to interact with their databases. They also both offer open-source software libraries in the Python programming language, packages of “canned code” for performing common tasks which can be utilised when developing new programs instead of writing everything from scratch. This meant we could quickly write a piece of Python software which would do the following:

1. Given the Thoth ID of a work, obtain its full metadata in an easily-digestible format (using the Thoth Python library)
2. From the metadata, extract the URL where the PDF of the work’s content can be publicly accessed online
3. Use this URL to download a copy of the PDF content file
4. Rearrange the work metadata into the format used by the Internet Archive
5. Log in to an appropriate IA user account, and send the PDF and the formatted metadata to the Archive to create a new archive copy of the work there (using the Internet Archive Python library)

For the proof-of-concept workflow, we decided to perform a one-time upload of real-world files and metadata from publishers’ full back catalogues[1]. Open Book Publishers (OBP) and punctum, as key COPIM partners, elected to participate in the upload, and the development team consulted with them throughout in determining their approach.

The first decision was the choice of platform. As mentioned in the earlier post, the original investigations focused mainly on institutional repositories as archiving platforms; similarly, the Thoth Archiving Network aims to bring together a group of institutions who are willing to host open-access works from smaller publishers in their repositories.
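As a rough illustration of the five-step workflow listed under "Previous work", the sketch below shows one way the steps could fit together in Python. It is not the project's actual code. The shape of the Thoth record (keys such as `fullTitle`, `publications` and `fullTextUrl`) is an assumption modelled on Thoth's GraphQL schema, the `thoth-` identifier prefix is a hypothetical naming scheme, and the metadata fetch, download and upload steps are passed in as functions so that the sketch does not presume any particular call in the Thoth library.

```python
import io

# Hypothetical Thoth work record. The key names are assumptions modelled
# on Thoth's GraphQL schema, not verified output of the Thoth library.
SAMPLE_WORK = {
    "fullTitle": "An Example Book: With a Subtitle",
    "publicationDate": "2021-05-01",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "longAbstract": "A placeholder abstract.",
    "contributions": [{"fullName": "A. Author"}, {"fullName": "B. Editor"}],
    "publications": [
        {"publicationType": "PAPERBACK", "locations": []},
        {
            "publicationType": "PDF",
            "locations": [{"fullTextUrl": "https://example.org/book.pdf"}],
        },
    ],
}


def extract_pdf_url(work: dict) -> str:
    """Step 2: find the public URL of the work's PDF in its metadata."""
    for publication in work.get("publications", []):
        if publication.get("publicationType") != "PDF":
            continue
        for location in publication.get("locations", []):
            if location.get("fullTextUrl"):
                return location["fullTextUrl"]
    raise ValueError("no PDF full-text URL found in work metadata")


def thoth_to_ia_metadata(work: dict) -> dict:
    """Step 4: rearrange Thoth metadata into Internet Archive fields."""
    return {
        "mediatype": "texts",
        "title": work["fullTitle"],
        "creator": [c["fullName"] for c in work.get("contributions", [])],
        "date": work.get("publicationDate", ""),
        "description": work.get("longAbstract", ""),
        "licenseurl": work.get("license", ""),
    }


def archive_work(work_id, fetch_metadata, download, upload) -> str:
    """Run steps 1-5 for one work and return the IA identifier used.

    fetch_metadata(work_id) -> dict, download(url) -> bytes and
    upload(identifier, files, metadata) are supplied by the caller.
    """
    work = fetch_metadata(work_id)                 # step 1
    pdf_bytes = download(extract_pdf_url(work))    # steps 2 and 3
    ia_metadata = thoth_to_ia_metadata(work)       # step 4
    identifier = f"thoth-{work_id}"                # hypothetical naming scheme
    files = {f"{work_id}.pdf": io.BytesIO(pdf_bytes)}
    upload(identifier, files, ia_metadata)         # step 5
    return identifier
```

In a real run, `fetch_metadata` would wrap a call to the Thoth Python library, `download` could be a thin wrapper around `urllib.request.urlopen`, and `upload` could wrap `internetarchive.upload(identifier, files=files, metadata=metadata, access_key=..., secret_key=...)` from the Internet Archive Python library. Passing these in as functions also makes the metadata mapping easy to test without touching either service.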
This blog post will explore the steps taken to accomplish this, providing pointers for anyone looking into implementing a similar system themselves, as well as giving some background for publishers interested in joining the Thoth programme to take advantage of this feature. All code used in the process is available on GitHub under an open-source licence, as is standard for the COPIM project. The post will also outline our plans for building on this initial work as we start to develop the Thoth Archiving Network.

Just some of the uploaded works now present at the Thoth Archiving Network collection, as displayed within the Internet Archive interface