Source: AIM
J.P. Morgan has introduced DocLLM, a generative language model designed for multimodal document understanding. DocLLM is a lightweight extension to LLMs for analysing enterprise documents such as forms, invoices, reports, and contracts, which carry intricate semantics at the intersection of textual and spatial modalities.
Unlike existing multimodal LLMs, DocLLM avoids expensive image encoders and relies exclusively on bounding box information to capture spatial layout structure. The model introduces a disentangled spatial attention mechanism that decomposes the attention computation of classical transformers into a set of disentangled matrices.
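To make the idea of disentangled attention more concrete, here is a minimal sketch in plain Python. It is not J.P. Morgan's implementation. It assumes that text tokens and their bounding boxes have already been projected into separate query and key matrices, and that the attention score is a weighted sum of four terms: text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial. The function name, the `lam_*` weights, and the square-root scaling are illustrative assumptions.

```python
import math


def matmul_t(a, b):
    """Return a @ b^T for two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def disentangled_attention(q_text, k_text, q_box, k_box,
                           lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
    """Hypothetical sketch: combine text and spatial (bounding box) scores.

    Each argument is an n x d matrix (one row per token). The lam_* weights
    are assumed hyperparameters controlling the cross-modal terms.
    """
    n, d = len(q_text), len(q_text[0])
    tt = matmul_t(q_text, k_text)  # text query vs. text key
    ts = matmul_t(q_text, k_box)   # text query vs. spatial key
    st = matmul_t(q_box, k_text)   # spatial query vs. text key
    ss = matmul_t(q_box, k_box)    # spatial query vs. spatial key
    scale = math.sqrt(d)           # assumed scaling, as in standard attention
    scores = [
        [(tt[i][j] + lam_ts * ts[i][j] + lam_st * st[i][j] + lam_ss * ss[i][j]) / scale
         for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in scores]


# Toy example: two tokens with 2-dimensional text and box projections.
q_t = [[1.0, 0.0], [0.0, 1.0]]
k_t = [[1.0, 0.0], [0.0, 1.0]]
q_s = [[0.5, 0.5], [0.2, 0.1]]
k_s = [[0.3, 0.7], [0.9, 0.4]]
weights = disentangled_attention(q_t, k_t, q_s, k_s)
```

Setting all three `lam_*` weights to zero reduces the sketch to ordinary text-only attention, which shows how the spatial terms act as an add-on and do not replace the text pathway.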
For pre-training DocLLM, data was gathered from two primary sources: IIT-CDIP Test Collection 1.0 and DocBank. The former comprises over 5 million documents related to legal proceedings against the tobacco industry during the 1990s, while the latter consists of 500,000 documents, each featuring distinct layouts.
Read full article: https://analyticsindiamag.com/jpmorgan-announces-docllm-for-multimodal-document-understanding/