UIMA Java Framework
UIMA: Unstructured Information Management Architecture


Why establish an Apache incubator project when UIMA is already in open source on SourceForge?

We needed to adopt an approach for allowing multiple people and organizations to participate and contribute to the evolution of the UIMA framework. Our intention is to facilitate the development and interoperability of content analytics so that organizations can extract and utilize the full value of their unstructured information assets. We believe the best way to do this is to create an active development community around UIMA. The Apache Software Foundation provides a highly successful, globally supported, meritocratic model for software development that we think will foster innovation and accelerate industry adoption. The Apache incubator project allows us to foster a broader development community around UIMA.


Why is IBM delivering UIMA in open source?

Releasing the UIMA framework source code to the developer community will accelerate adoption, usage, and value-added innovation around UIMA. This will help to establish UIMA as a standard for integrating text analytics components, which will bring order and interoperability to what is now a highly fragmented market. Building an ecosystem of partners, customers and open source developers around UIMA has direct value for organizations by accelerating innovation around text analytics and advanced search applications. This is an important move for IBM, the search market, and developer and customer communities at large.


What exactly will be delivered into open source?

The UIMA Java framework is the basic Java implementation of the UIMA framework. It includes both the build-time and run-time components for developing and running UIMA-compliant analytic modules and processes that can extract latent meaning from unstructured information. Complementary capabilities (e.g. a semantic search engine) that facilitate building and running UIMA applications are available from IBM alphaWorks.


Is IBM delivering text analytics as part of the open source delivery?

No; however, the source code for some sample analytics that can serve as a guide to building UIMA-compliant analytics is available as part of the UIMA SDK on IBM alphaWorks. In addition, both free and commercial UIMA-compliant analytics are already available. Free analytics that can work with UIMA include those from the OpenNLP and GATE communities. The OpenNLP analytics are easily wrapped to become UIMA components; sample wrapping code is available in the UIMA SDK examples on alphaWorks. Commercial analytics are available from IBM, as well as from other ISVs such as ClearForest and Nstein. One objective of delivering UIMA into open source is to stimulate innovation and development of text analytics that enable organizations to derive greater value from their unstructured information assets.


How can someone make contributions to the source code? 

The standard Apache community process will be used for contributions to the UIMA open source implementation. Those wishing to participate in the evolution of UIMA and contribute to future releases are encouraged to join the Apache UIMA incubator project. The SourceForge delivery is a reference implementation only.


How is IBM building a community around UIMA? 

The Apache Software Foundation has evolved a model for community development that we think will work well with UIMA. In addition, the OASIS Technical Committee working on UIMA standardization has members from a number of different organizations, including IBM, EMC, Temis, Nstein, SRI International, Science Applications International Corporation, Thomson Scientific, Army Information and Intelligence Warfare Directorate, Pacific Northwest National Labs, Mayo Clinic College of Medicine, University of Manchester, University of Sheffield and Carnegie Mellon University.

IBM has been building a community around UIMA for some time. In previous years, UIMA has received significant support from the Defense Advanced Research Projects Agency (DARPA), the central research and development organization for the Department of Defense. DARPA and IBM formed a working group consisting of experienced research members who have contributed their expertise in unstructured information management to the evolution of UIMA. The contributors included several leading universities, along with industrial research and development organizations. Some of the universities that participated, such as Carnegie Mellon University, Columbia University, Stanford University and The University of Massachusetts Amherst, are already using UIMA in courses and research projects.

Other organizations are actively supporting and using UIMA, including BBN Technologies, The Mayo Clinic and MITRE Corporation. In addition, over 15 Independent Software Vendors have announced their support for UIMA.


How do I get started working with the source code? 

You can download the source code from the Apache UIMA project.

To obtain sources for earlier versions of the UIMA Framework (pre-Version 2.1), go to the SourceForge download page, then download and unzip the UIMA source. Detailed instructions on how to build the framework from the source and how to import the source into Eclipse are available in the readme.html contained within the zipped sources. You may also want to download the UIMA SDK from the alphaWorks site to obtain additional components.


How are the SourceForge and Apache UIMA code different? 

The SourceForge code pre-dates the Apache code base (pre-Version 2.1). All new enhancements (Version 2.1 onwards) are being made through Apache.


How do I participate in forums and mailing lists about UIMA? 

There is a user discussion mailing list at Apache: uima-user@incubator.apache.org. You can also join the Apache development community's mailing list for UIMA at uima-dev@incubator.apache.org.

For help with pre-Apache versions of UIMA, please post to the UIMA forum on the alphaWorks site.


What are some functional examples of UIMA-enabled text analytics components? 

UIMA provides an open framework and standard interfaces for creating and composing analytics capable of identifying and extracting the facts and relationships expressed in unstructured information. These analytics help glean latent meaning within documents and other text-based information so this knowledge can be exploited by search and business intelligence applications. Text analytics are essential for:

  • Improving search results by enhancing the metadata that describes documents and other text-based information, and by compensating for the fact that search terms are often too incomplete or ambiguous to yield meaningful results.
  • Making the relevant knowledge buried in mounds of unstructured information accessible to business intelligence and analytic applications by extracting insight from text and enabling its integration with database or data warehouse content where it can be reported on and analyzed in conjunction with structured data assets.

UIMA enables seamless integration of text analytics components that analyze documents and extract knowledge from unstructured content. These components can extract more meaning by intelligently:

  • Distinguishing between different semantics of the same term – for example rock (stone) vs. rock (music) vs. rock (to move back and forth)
  • Discovering information about higher-level concepts that is not explicitly expressed in text, including information about specific entities, such as people, places, organizations, parts, problems and conditions. This automatic discovery process will not just classify such terms but can infer from a problem report, for example, that the “fuel-pump” was likely the cause of the problem even though the word “cause” was never mentioned.
  • Identifying relationships between terms and understanding the meaning and intent of searches – for example, “Fred Center, the CEO of Center Micros, announced the company’s plans to improve its benefits program” may be analyzed to infer that Fred Center is an employee of Center Micros. Another analysis may infer, from reading a number of medical abstracts, that a particular drug may be involved in causing certain health problems where this relationship was not known or explicit in any one document.
  • Finding documents that contain specific facts – for example, a query about a fuel pump failure can, through advanced analytics, return documents describing problems related to fuel pump failures even when they do not contain the exact phrases “fuel pump” or “failure” anywhere in the text. This would be very difficult to do with keyword search alone.
  • Supporting vertical domain analytics, for example:
    • Automotive – knowing that alternator is an auto part, and “squeak” implies a problem
    • Financial Services – understanding what a 401(k) or a sell order is
    • Life Sciences – determining that “BIKE” is a gene, not a bicycle
    • Legal – knowing that “indemnification” is a contract clause and “discovery” is a legal process
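
The vertical domain examples above can be illustrated with a small, self-contained sketch. It is not part of UIMA and does not use the UIMA API; the class name DomainLexicon, the method typeOf, the domain keys and the type labels are all hypothetical, chosen only to show how the same term can receive a different type (or none) depending on the domain. A real UIMA analytic would typically record such results as annotations rather than return strings.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Toy domain lexicon (hypothetical, not a UIMA class): maps a term to a domain-specific type. */
final class DomainLexicon {

    // Assumed sample entries, taken from the vertical domain examples in the list above.
    private static final Map<String, Map<String, String>> LEXICONS = Map.of(
        "automotive", Map.of("alternator", "AutoPart", "squeak", "ProblemIndicator"),
        "life-sciences", Map.of("bike", "Gene"),
        "legal", Map.of("indemnification", "ContractClause", "discovery", "LegalProcess"));

    private DomainLexicon() { }

    /** Returns the type a term has in the given domain, or empty if the domain does not know it. */
    static Optional<String> typeOf(String domain, String term) {
        Map<String, String> lexicon = LEXICONS.getOrDefault(domain, Map.of());
        // Lookups are case-insensitive: "BIKE" and "bike" hit the same entry.
        return Optional.ofNullable(lexicon.get(term.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        // The same surface term is typed differently depending on the domain.
        System.out.println(typeOf("life-sciences", "BIKE").orElse("(unknown)"));
        System.out.println(typeOf("legal", "BIKE").orElse("(unknown)"));
        System.out.println(typeOf("legal", "discovery").orElse("(unknown)"));
    }
}
```

A plain dictionary lookup like this only covers the simplest case; distinguishing rock (stone) from rock (music) within a single domain would need context from the surrounding text, which this sketch does not attempt.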


What are some applications that can take advantage of UIMA-compliant text analytics? 

Among the most powerful aspects of UIMA is its flexibility to apply text analytics to a broad range of business applications that are dependent on or can be enhanced by unstructured information. While the complete list is practically unlimited, key application examples include:

Text Analytic Application Examples

Business intelligence – Enables business intelligence tools to extract facts from unstructured content assets and incorporate this knowledge into reporting and analysis for a more complete and accurate picture of an organization.

Product defect detection – Provides early insight into product defects and service issues before they become widespread, thus enabling quicker resolution and lower aftermarket service and recall costs.

Insurance fraud analysis – Analysis of claims documentation, reports, policies, customer information and public records to identify patterns and trends in claims activities, along with hidden relationships, to reduce incidents of fraud and “leakage” caused by unnecessary payouts.

Know your customer – Analysis of call center and sales notes, customer information, transaction histories, e-mail and public records to identify people, products, interests and relationships to achieve a more complete view of the customer to increase retention and identification of cross-sell and other revenue opportunities.

Automotive early warning – Analysis of warranty claims, repair requests and call center logs to identify parts, problems and conditions specific to vehicles, determine correlations and reduce the “detection-to-correction” window and the cost of recalls.

Advanced intelligence for anti-terrorism and law enforcement – Analysis of various information sources, such as field analyst reports, ship manifests and surveillance transcripts, along with public records, watch lists, news articles, publications and financial transactions enables analysts to uncover hidden patterns and identify potential criminal or terrorist activity by identifying people, organizations, actions, locations, events and related associations.

Competitive intelligence and brand awareness – Analysis of news feeds, announcements, published articles, competitor websites, and corporate financial statements to identify major events, competitor news, market feedback and customer sentiment so that companies can more quickly identify potential issues and determine appropriate responses and realignment to improve positioning.

Search Application Examples

Semantic search – Extends user queries beyond simple keywords and finds desired concepts and relationships that may appear in the text. UIMA-based applications provide a host of semantic search capabilities.

e-Discovery – Helps improve compliance and reduce litigation costs by providing highly targeted and accurate retrieval of information material to pending litigation or audit-related discovery requests.

Customer support and self-service – Analysis of call center logs, support e-mails, bug reports, knowledge bases, tech notes and support documentation to provide more accurate problem identification and understand the potential parts and correlations to better identify appropriate resolutions; integration into applications can provide additional context and enable recommendations to be automated and delivered inline with a customer support session.

e-Commerce and product finders – Increases transaction rates and order values by analyzing product content, customer profiles and sales notes to identify concepts, synonyms and relationships, ensuring that customers find both target and complementary products online and that sales reps can make the right cross-sell offer.


What UIMA-based solutions are currently available?

IBM offers IBM Quality Insight for Automotive, a solution that allows automakers and fleet owners to collect and analyze large quantities of data about their vehicles from a variety of sources to improve the identification of trends, better manage warranty coverage and help them adhere to government regulations. Delivered by IBM Business Consulting Services, it combines technology from IBM, ClearForest, and SAS.

IBM and Nstein also offer the Public Image Monitoring Solution, which enables businesses to make sense of the explosion of information from emerging social networks on the Web and delivers new insight into brand reputation and into customer, competitor and public opinion about their company. It is based on technology from IBM, Nstein, and Factiva.

In addition, IBM uses UIMA in consulting engagements where text analytics is required for the solution. There are a variety of ongoing government and life sciences projects.

How to work with the source code

Use the SourceForge site for access to reference copies of the pre-Apache source code. Use Apache UIMA to work with the source code for UIMA Version 2.1 and onwards.

For UIMA versions prior to 2.1, go to the download page, then download and unzip the UIMA source. Detailed instructions on how to build the framework from the source and how to import the source into Eclipse are available in the readme.html contained within the zipped sources. You may also want to download the UIMA SDK from the alphaWorks site to obtain additional components.

How to participate in forums and mailing lists about UIMA

For discussion of Apache UIMA please use the mailing list uima-user@incubator.apache.org. You can also join the development community's mailing list for Apache UIMA at uima-dev@incubator.apache.org.

For help with pre-Apache versions of UIMA, please post to the UIMA forum on the alphaWorks site.