UIMA-HPC: High-Performance
Knowledge Mining
UIMA-HPC will enable data mining applications to make efficient use of High
Performance Computing resources.
The focus will be on the bio-pharmacological domain, for which the PubMed
database, for example, holds more than 20 million
entries. Researchers in this field need
to find answers to questions such as
the following: For a given base structure,
are there any structure variants already mentioned in literature, and if so,
are there any indications of their effects?
Are structure variants protected by
third-party rights or are they freely available? These questions cannot be
answered by keyword searches alone.
Therefore the project aims to realize
an HPC-based solution for the automated analysis of multi-modal pharmaco-chemical document databases,
taking the patent-search use-case as
an initial solution design driver. The
combination of text and structure
analysis is an innovative approach, but
will be based on an existing and well-tested data analysis architecture: the
Unstructured Information Management
Architecture (UIMA). UIMA is a software architecture which specifies
component interfaces, design patterns
and development roles for creating,
describing, discovering, composing and
deploying multi-modal analysis capabilities. The UIMA specification is being
developed by a technical committee at
OASIS (Organization for the Advancement
of Structured Information Standards).
The UIMA-HPC approach centres on
workflows for the automated annotation of a document corpus,
each workflow comprising analysis components
within the UIMA architecture.
The individual “annotation engines”, such
as text-mining of a document or analysis
of diagrams within a document based
on Optical Character Recognition (OCR),
are of a computational complexity
such that parallelization at the level of
the heterogeneous “node” of a modern
HPC system is highly appropriate,
meaning parallelization for deployment
on multi-core and/or GPU-accelerated
processors. The large quantity of documents,
and the load-balancing issues that arise because
individual documents differ in computational complexity,
are addressed at the level of the nodes of the HPC
compute system as a whole: the documents are
analyzed by independent instantiations
of the annotation engines for the workflow.
This distribution will be realized within an
adaptation of the UNICORE software
system. An example workflow is
shown in Figure 1, where red-framed
boxes denote UIMA analysis
pipelines and orange triangles split or
collect data to achieve load balancing
and parallel execution of the pipelines.
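
The split, analyze and collect pattern described above can be illustrated with a minimal Python sketch. It is not the project's implementation and uses neither the UIMA nor the UNICORE API: a thread pool stands in for the nodes of the HPC system, and every name in it (estimate_cost, split, run_pipeline, collect, annotate_corpus) is hypothetical. The cost model, which takes document length as a proxy for analysis time, is likewise only an assumption.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def estimate_cost(text):
    """Hypothetical cost model: assume analysis time grows with text length."""
    return len(text)


def split(docs, n_streams):
    """Greedy split: the most expensive unassigned document always goes
    to the stream that currently carries the lowest total load."""
    heap = [(0, i) for i in range(n_streams)]  # (current load, stream index)
    streams = [[] for _ in range(n_streams)]
    by_cost = sorted(docs, key=lambda d: estimate_cost(docs[d]), reverse=True)
    for doc_id in by_cost:
        load, i = heapq.heappop(heap)
        streams[i].append(doc_id)
        heapq.heappush(heap, (load + estimate_cost(docs[doc_id]), i))
    return streams


def run_pipeline(batch):
    """Stand-in for one analysis pipeline; here it only counts tokens."""
    return [(doc_id, len(text.split())) for doc_id, text in batch]


def collect(results):
    """Merge the per-stream results into a single annotation table."""
    merged = {}
    for stream_result in results:
        merged.update(stream_result)
    return merged


def annotate_corpus(docs, n_streams=4):
    """Split the corpus, run one pipeline per stream in parallel, collect."""
    streams = split(docs, n_streams)
    batches = [[(d, docs[d]) for d in stream] for stream in streams]
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        results = list(pool.map(run_pipeline, batches))
    return collect(results)
```

Calling annotate_corpus with a dictionary that maps document identifiers to text returns one result per document. Sorting by estimated cost before assignment keeps a few very large documents from ending up in the same stream, which is the kind of imbalance the split step is meant to avoid.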
UIMA-HPC is a collaborative project
funded in part by the German Federal
Ministry for Education and Research
(BMBF – Bundesministerium für Bildung
und Forschung, grant number
01IH11012) and running for three years;
the consortium is led by FHG-SCAI
and includes Forschungszentrum
Jülich GmbH, scapos AG, and Taros
Chemicals GmbH.
The Jülich Supercomputing Centre focuses
its R&D effort on the development of
algorithms and tools for distributing
and collecting data, for calculating
the appropriate number
of parallel analysis streams, and for
monitoring.
Figure 1: Example data mining workflow
• Mathilde Romberg
Jülich Supercomputing Centre