Innovatives Supercomputing in Deutschland
inSiDE • Vol. 9 No. 2 • Autumn 2011

UIMA-HPC: High-Performance Knowledge Mining

UIMA-HPC will enable data mining applications to make efficient use of High Performance Computing resources. The focus will be on the bio-pharmacological area, in which, for example, the PubMed database alone holds more than 20 million entries. Researchers in this field need to find answers to questions such as the following: For a given base structure, are there any structure variants already mentioned in the literature, and if so, are there any indications of their effects? Are structure variants protected by third-party rights or are they freely available? These questions cannot be answered by keyword searches alone.

The project therefore aims to realize an HPC-based solution for the automated analysis of multi-modal pharmaco-chemical document databases, taking the patent-search use case as the initial driver of the solution design. The combination of text and structure analysis is an innovative approach, but it will be based on an existing and well-tested data analysis architecture: the Unstructured Information Management Architecture (UIMA). UIMA is a software architecture which specifies component interfaces, design patterns and development roles for creating, describing, discovering, composing and deploying multi-modal analysis capabilities. The UIMA specification is being developed by a technical committee at OASIS (Organization for the Advancement of Structured Information Standards).

The UIMA-HPC approach centres on workflows for the automated annotation of a document corpus, each workflow comprising analysis components within the UIMA architecture. The individual “annotation engines”, such as text mining of a document or analysis of diagrams within a document based on Optical Character Recognition (OCR), are computationally complex enough that parallelization at the level of the heterogeneous “node” of a modern HPC system is highly appropriate, that is, parallelization for deployment on multi-core and/or GPU-accelerated processors. The large quantity of documents to be analyzed by independent instantiations of the annotation engines, and the load-balancing issues that arise because the computational cost varies from document to document, are handled at the level of the nodes of the HPC compute system as a whole; this will be realized within an adaptation of the UNICORE software system. An example workflow is shown in Figure 1, where red-framed boxes denote UIMA analysis pipelines and orange triangles split or collect data to achieve load balancing and parallel execution of pipelines.
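The split and collect steps described above can be illustrated with a small Python sketch. It is not the project's implementation and uses neither the UIMA nor the UNICORE API: the per-document cost estimates, the function names (`split_by_cost`, `annotate`, `run_workflow`) and the use of a local thread pool in place of HPC nodes are all assumptions made for illustration. The split uses a simple greedy heuristic (most expensive document first, assigned to the least loaded stream), which is one common way to balance work of uneven cost.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_by_cost(docs, n_streams):
    """Split (doc_id, estimated_cost) pairs into n_streams batches.

    Greedy heuristic: take documents from most to least expensive and
    give each one to the stream with the smallest accumulated cost.
    """
    heap = [(0, i) for i in range(n_streams)]  # (accumulated cost, stream index)
    heapq.heapify(heap)
    streams = [[] for _ in range(n_streams)]
    for doc_id, cost in sorted(docs, key=lambda d: d[1], reverse=True):
        total, i = heapq.heappop(heap)
        streams[i].append(doc_id)
        heapq.heappush(heap, (total + cost, i))
    return streams


def annotate(doc_id):
    # Hypothetical stand-in for one analysis pipeline (text mining, OCR, ...).
    return doc_id, f"annotated:{doc_id}"


def run_pipeline(stream):
    # One pipeline instance processes its batch of documents sequentially.
    return [annotate(doc_id) for doc_id in stream]


def run_workflow(docs, n_streams):
    streams = split_by_cost(docs, n_streams)           # "split" step
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        batches = list(pool.map(run_pipeline, streams))
    # "collect" step: merge the per-stream results into one mapping.
    return dict(pair for batch in batches for pair in batch)


# Hypothetical corpus: document ids with made-up relative cost estimates.
DOCS = [("d1", 8), ("d2", 3), ("d3", 5), ("d4", 2), ("d5", 6)]

if __name__ == "__main__":
    print(split_by_cost(DOCS, 2))
    print(run_workflow(DOCS, 2))
```

In a real deployment the cost estimate would have to come from document properties such as length or number of embedded images, and the parallel streams would run on separate compute nodes rather than in threads.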

UIMA-HPC is a collaborative project funded in part by the German Federal Ministry for Education and Research (BMBF – Bundesministerium für Bildung und Forschung, Förderkennzeichen 01IH11012) and running for three years; the Consortium is led by FHG-SCAI and includes Forschungszentrum Jülich GmbH, scapos AG, and Taros Chemicals GmbH.

The Jülich Supercomputing Centre focuses its R&D effort on developing algorithms and tools for the distribution and collection of data, for calculating the appropriate number of parallel analysis streams, and for monitoring.

Figure 1: Example data mining workflow

• Mathilde Romberg
Jülich Supercomputing Centre
