Finding library- and information-themed webinars and conferences that do not over-hype artificial intelligence (AI) can be a challenge. This is why I was pleasantly surprised by the ILI Bitesized Conference - small is indeed beautiful! The session that most resonated with me was the one on machine learning by Bohyun Kim. She outlined some of the pioneering attempts to apply different forms of AI in the library, museum and archive context.
What is the difference between AI and machine learning?
I have written about the human element of AI in a previous post but I’ve never differentiated between the terms machine learning and artificial intelligence. These are not synonymous. AI is the broader term, and encompasses a number of methods including natural language processing (NLP) and computer vision. Machine learning is one branch of AI, in which systems learn patterns from data rather than following rules written by hand.
Computer vision has huge potential for libraries, art galleries, museums and archives. After all, collections contain many images, from photographs, paintings and videos to engravings and other types of illustrations. If AI enables computers to think, computer vision enables them to see, observe and understand - and organise!
What type of machine learning is out there?
Bohyun Kim introduced the session with an overview of the various branches of machine learning, which I have supplemented using the excellent IBM site.
Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy and takes steps to minimise errors.
Unsupervised learning uses machine learning algorithms to analyse and cluster unlabelled data sets. These algorithms discover hidden patterns in data without the need for human intervention, hence, they are “unsupervised”.
Ultimately, it depends on what you want out of your data. Supervised learning needs labelled data, and labelling big data takes considerable extra effort, but the results tend to be more accurate and trustworthy. In contrast, unsupervised learning can handle large volumes of data in real time, but the lack of transparency about how the data is grouped brings a greater risk of inaccurate results.
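To make the contrast concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the features (words per page, images per page), the "text" and "image" labels, and the function names are invented, and real projects would use a proper machine learning library. The supervised part learns from labels supplied by a person; the unsupervised part (a bare-bones k-means) only ever sees the numbers.

```python
from math import dist
from statistics import mean

# Hypothetical features for digitised items: (words per page, images per page).
labelled = [
    ((400.0, 0.0), "text"),
    ((380.0, 1.0), "text"),
    ((20.0, 4.0), "image"),
    ((35.0, 5.0), "image"),
]


def centroid(points):
    """Average a list of 2-D points coordinate by coordinate."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def train_supervised(examples):
    """Supervised: learn one centroid per human-supplied label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest to the new item."""
    return min(model, key=lambda label: dist(model[label], features))


def cluster_unsupervised(points, k=2, rounds=10):
    """Unsupervised: k-means groups items without ever seeing a label."""
    centres = list(points[:k])  # naive initialisation, fine for a sketch
    groups = []
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(centres[i], p))
            groups[nearest].append(p)
        # Keep the old centre if a group ends up empty.
        centres = [centroid(g) if g else centres[i] for i, g in enumerate(groups)]
    return groups


model = train_supervised(labelled)
print(predict(model, (50.0, 3.0)))
print(cluster_unsupervised([features for features, _ in labelled]))
```

Note that the clustering step returns groups but cannot name them. Someone still has to look at each group and decide what it means, which is where the transparency concern comes in.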
Reinforcement machine learning is similar to supervised learning, but the algorithm isn’t trained using sample data. This model learns as it goes by using trial and error. A sequence of successful outcomes will be reinforced to develop the best recommendation or policy for a given problem.
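Trial and error can be sketched in a few lines too. The example below is an epsilon-greedy "bandit", one of the simplest reinforcement-style methods and my own choice of illustration, not something from the session. The click rates, the function name and the scenario (choosing which of three items to recommend) are all hypothetical.

```python
import random


def run_bandit(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over a set of options.

    click_rates are hidden "true" success probabilities (invented here);
    the learner only ever sees the 0/1 reward after each try.
    """
    rng = random.Random(seed)
    counts = [0] * len(click_rates)
    estimates = [0.0] * len(click_rates)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try an option at random.
            choice = rng.randrange(len(click_rates))
        else:
            # Exploit: otherwise pick the option that looks best so far.
            choice = max(range(len(click_rates)), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < click_rates[choice] else 0.0
        counts[choice] += 1
        # Incremental mean: successful outcomes pull the estimate upwards.
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 0.8])
best = max(range(len(estimates)), key=lambda i: estimates[i])
print("Recommended option:", best)
```

There is no training set here at all. The learner starts knowing nothing and its recommendation improves only because successful choices are reinforced over many attempts.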
Some essential “Machine Learning in Libraries” reading
She recommended we read the 2020 Library of Congress Report called “Machine Learning + Libraries: A Report on the State of the Field” and it is a page-turner! It provides a historical overview of the application of AI in an information setting, machine learning “cautions” (including issues around algorithmic bias) and other challenges, and concludes with some recommendations.
However, the best section in the report is the outline of the most common applications of machine learning in libraries. In collection management, for example, these include document discovery, optical character recognition (OCR), handwriting recognition, metadata extraction, and visual data annotation. There are also applications for end-user management, education and outreach.
Another recent report worth reading is the “AI in relation to GLAMs Task Force Report” (September 2021). The task force provides an update on various projects which, in line with Bohyun Kim’s outline, deal with a variety of digital collections: 16 text-based projects (including scanned/OCRed and handwritten documents) and 12 image/photo projects. Other types of content are used less frequently, with 5 projects processing audio/video, 6 processing various types of metadata, and occasional mentions of other content like 3D and maps.
We live in a visual world: metadata generation for image archives initiative
Although OCR technology has limitations, such as minimal description and metadata, it has been around for some time and has provided basic digital access to text-heavy collections. However, what do we do with the growing number of image-heavy archives? Processing images is challenging for both humans and computers, so it should come as no surprise that a number of the projects Bohyun Kim highlighted were instigated by image collections.
There have been a number of high profile public examples of applying tech to images in museum and art gallery collections. Angie Judge, CEO of Dexibit, outlined some examples of AI currently used in museums from visitation forecasting to understanding collections by using machine vision to help recognise, classify, or pattern images.
"Notably, the world is still in the phase of ‘training the toddler’ when it comes to AI, helping it deal with real life situations as they emerge," Judge says. "And it is definitely always being used in a hybrid human-machine decision context, where real people are still very much involved in contextualizing AI outputs and ultimately making decisions."
Websites, chatbots, "collections in the home" and interactive displays might assist collections with public engagement, but machine learning enables curators, archivists and information specialists to delve deeper into collections to generate scholarly interest. Training is most definitely the key, though, as the following projects demonstrate.
CAMPI: Computer-Aided Metadata Generation for Photoarchives Initiative
This project was inspired by a request from the Carnegie Mellon University Marketing and Communications team, which regularly works with the University Archives to source images for online and print materials. The images in this collection are in demand so any improvement to the metadata would make a difference. Even though the project is just a prototype, it has been a great start.
They explain that data from the tagging and deduplication work done during this project will be used as the photographs are migrated to a new digital collections system that will make them publicly accessible. In their White Paper, the team reported that of the more than 43,000 tagging decisions made, a little over 1 in 5 could be automatically distributed across a set, saving metadata editors more than 9,000 decisions. So much time has been saved!
They also noted that the prototype applications identified around 28% of the collection as sets with duplicates. These results demonstrate how machine learning can be integrated into the existing metadata creation and editing workflow at libraries and archives. The report also includes a high-level technical architecture that discusses how such a system would connect to existing collection catalogues and image databases that libraries and archives already use.
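The idea of distributing one tagging decision across a set can be sketched as follows. This is not the CAMPI team's code: the photo ids, set ids, tags and the function name are all hypothetical, and the sketch assumes the sets (for example, groups of near-duplicate shots) have already been identified by some other step.

```python
from collections import defaultdict


def propagate_tags(set_of, decisions):
    """Copy each tagging decision to every photo in the same set.

    set_of maps photo id -> set id (e.g. a group of near-duplicate shots);
    decisions maps photo id -> tag chosen by a metadata editor.
    Returns (tags for all reached photos, decisions the editor was spared).
    """
    members = defaultdict(list)
    for photo, set_id in set_of.items():
        members[set_id].append(photo)

    # Explicit editor decisions always win over propagated ones.
    tags = dict(decisions)
    for photo, tag in decisions.items():
        for sibling in members[set_of[photo]]:
            tags.setdefault(sibling, tag)

    saved = len(tags) - len(decisions)
    return tags, saved


# Hypothetical data: three shots of one event form set "A".
set_of = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "C"}
decisions = {"p1": "Commencement", "p4": "Football", "p5": "Portrait"}

tags, saved = propagate_tags(set_of, decisions)
print(tags)
print("Decisions saved:", saved)
```

In this toy collection the editor makes three decisions and two more are filled in automatically. The reported figures work the same way at scale: roughly 9,000 saved out of roughly 43,000 is about 21%, or a little over 1 in 5.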
Find out more here:
- CAMPI: Computer-Aided Metadata Generation for Photoarchives Initiative
- Libraries use Computer Vision to explore archival photo collection
Image Classifier ML Algorithm for the Frick Collection
Another example of an AI project for an image collection is from the Frick Art Reference Library in New York. They launched a pilot project with Stanford University, Cornell University, and the University of Toronto to develop an algorithm that applies a local classification system based on specific visual elements to the library’s digitised Photoarchive.
As a test case, the Cornell/Toronto/Stanford team focused on a dataset of digital reproductions of North American paintings and drawings and employed machine learning to produce automatic image classifiers. These have the potential to become powerful tools in metadata creation and image retrieval, saving archivists and researchers time and effort.
Find out more here:
- AI and the digitized Photoarchive: Promoting access and discoverability
- AI and the digital Photoarchive
AMPPD: Audiovisual Metadata Platform Pilot Development
This audiovisual project started in 2018 and was designed to create an automated metadata generation mechanism and integrate it with the human metadata generation process.
Summarising documents using Natural Language Processing (NLP)
Repeated testing and training are required to maximise AI capabilities in machine learning and natural language processing. Hesburgh Libraries took three NLP automated summarisation techniques and tested them on a special collection of Catholic Pamphlets. The automated summaries were generated after feeding the pamphlets as .pdf files into an OCR pipeline.
I’ve already mentioned that OCR can have limitations and this project faced many challenges. Firstly, the newly digitised documents required extensive data cleaning and text preprocessing before the computer summarisation algorithms could be started. Secondly, the Latin language caused issues.
The outcome was generally successful, and they concluded that "using the standard ROUGE F1 scoring technique, the Bert Extractive Summarizer technique had the best summarization score. It most closely matched the human reference summaries". Their experiences will be useful in other NLP projects.
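For anyone wondering what a ROUGE F1 score measures, here is a minimal sketch. ROUGE comes in several variants and the quote does not say which one the team used, so I am assuming the simplest, ROUGE-1, which counts single words shared between an automated summary and a human reference. The function name and example sentences are mine, and real implementations add steps such as stemming and more careful tokenisation.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the smaller count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate words matched
    recall = overlap / sum(ref.values())      # share of reference words matched
    return 2 * precision * recall / (precision + recall)


machine = "the pamphlet explains the mass"
human = "the pamphlet describes the mass"
print(rouge1_f1(machine, human))
```

Four of the five words match in each direction here, so precision and recall are both 0.8 and so is the F1. A higher score means the automated summary shares more of its wording with the human one, which is what "most closely matched the human reference summaries" refers to.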
Image Analysis for Archival Discovery (AIDA)
How many times have you been unable to find something which you know is buried deep in the text? How often have researchers needed to compare the full text of various documents? Despite the additional catalogue metadata, full content is often the only way the end-user can locate an obscure reference or carry out other computational analyses. AI is useful when you want to attempt full content extraction.
Depending on the type of content you are extracting, this might require OCR for text, speech recognition software for auditory data, and computer vision for photographs, illustrations, and other graphic information. Bohyun Kim highlighted the issues around tabular data. These pose problems because tables cannot be as easily identified as structured data so they need a combination of table detection, cell recognition, and text extraction algorithms.
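To give a feel for why tables need extra steps, here is a rough sketch of just one of them: turning loose OCR words back into rows and cells using their positions on the page. It assumes a table region has already been detected and that an OCR engine has supplied each word with pixel coordinates. The input format, the function name and the tolerance value are all hypothetical, and real tables with merged cells or multi-word entries need far more than this.

```python
def rows_from_boxes(boxes, row_tolerance=10):
    """Group OCR word boxes into table rows by vertical position.

    boxes is a list of (text, x, y) tuples, where x and y are the top-left
    pixel coordinates an OCR engine might report (hypothetical input).
    """
    rows = []
    for text, x, y in sorted(boxes, key=lambda b: b[2]):
        # A word joins the current row if it sits at about the same height.
        if rows and abs(rows[-1]["y"] - y) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, read cells left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Words arrive in no particular order, as they might from an OCR pass.
boxes = [
    ("1921", 200, 52),
    ("Title", 10, 10),
    ("Year", 200, 12),
    ("Pamphlet", 10, 50),
]
for row in rows_from_boxes(boxes):
    print(row)
```

Plain text extraction would return these four words as a flat string. It is the position-based grouping that recovers the fact that "1921" belongs under "Year".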
One project set out to explore what was possible. "Digital Libraries, Intelligent Data Analytics, and Augmented Description: A Demonstration Project" sought to:
- Develop and investigate the viability and feasibility of textual and image-based data analytics approaches to support and facilitate discovery
- Understand technical tools and requirements for the Library of Congress to improve access and discovery of its digital collections
- Enable the Library of Congress to plan for future possibilities
What AI should libraries/archives be focusing on in the future?
Finally, she outlined some thoughts for the information industry:
- All of the projects mentioned reported a need for a field-wide, specialised computer vision training set. Projects require more data and ground truth for ML algorithm training and evaluation.
- There needs to be an exploration of pre-trained and off-the-shelf commercial ML models. Google is at the forefront of research and collaboration, so how can museums, archives, and other organisations partner up whilst maintaining an ethical stance on things such as privacy and data protection?
- These projects are only the tip of the iceberg - there is so much more to be done and to be explored.
It’s an exciting time to be in information management. What is the best AI-based library project you’ve come across recently?