A Monthly Article from our Speakers
Current Article of the month
Textual Analytics: Business Intelligence from a Textual Foundation
by Bill Inmon
Analytics have been around from the time the first computer program was written. Once the corporation began to generate data, there were financial analysts, sales analysts, marketing analysts and others eagerly waiting to use that data in novel and creative ways. In the early days, data from applications was hard to come by, and the tools the analysts used to access and analyze the data were crude. As time passed and the volume of data grew, so grew the opportunity to use analytics to compete in the business arena.
And over time the world discovered the data warehouse as a foundation for analytic processing. The data warehouse contained integrated, historical, and granular data gathered from a host of legacy systems. The data warehouse proved to be an ideal foundation for the analysis of data. Data from the data warehouse was predictable and easy to access. And because data in the data warehouse was granular, it could be reshaped for many different purposes.
Numerical Data – A Fundamental Limitation
But over time it was recognized that business analysis – analytics – had a very fundamental limitation. That limitation was that analytics operated only on numerical data. While analysis of numerical data was quite useful, the corporation in fact has massive amounts of data that are not numerical. In the corporation there exist massive amounts of unstructured textual data – from emails, medical records, contracts, warranties, reports, call centers, and so forth. In fact, most estimates show that 80% of the data in the corporation is in the form of text, not numbers.
And in that textual data that is owned by the corporation, there is a wealth of information. But unstructured textual data is not as neatly organized or as accessible as numerical data. Textual data just doesn’t lend itself to easy and facile analysis because the software and technology used for business analytics are almost 100% dedicated to handling well-structured numeric data. The very disorder of the textual data defeats (or at least greatly hampers!) any attempt at accessing and analyzing textual data in any sort of meaningful manner.
However, technology designed for textual analysis is now available. That technology is FOREST RIM TECHNOLOGY Textual Foundation software. IDS is designed to allow the organization to do textual analytics for the first time.
(Note: the following discussion of textual analytics makes free use of the many patents FOREST RIM TECHNOLOGY has on the process of doing textual analytics.)
Textual Analysis And Search Engines
When the subject of textual analytics arises, it is natural to think of search engines such as Google and Yahoo, among others. While a simple search of raw text can be considered to be a crude form of textual analytics, there are in fact many limitations to a simple textual search.
In order to do textual analytics in a sophisticated manner, first the unstructured textual data must be integrated. If raw text is not integrated before it is analyzed, the search of the raw text will produce truly sketchy and questionable results. Therefore, the first step in textual analytics is the integration of the raw text into an integrated form. In order for raw text to be integrated and to be fit for analysis –
- different terminology must be accounted for so as to yield consistent results, even though the original source text is different,
- alternate spellings (even common misspellings!) must be accounted for,
- words need to be stemmed to their Latin or Greek roots,
- and so forth.
(For an in depth treatment of the subject of the technology needed for textual integration, please refer to the white papers on the subject available from FOREST RIM TECHNOLOGY.)
Integrating Raw Text
In short, in order to do analytics on text, the raw text must first be integrated.
Fig 1 shows raw text and integrated text.
After the raw text is integrated, textual analytics can be done. Fig 2 shows that searches are done against raw textual data and that textual analytic processing is done against integrated text.
A search can be something as simple as – “Tell me where the term – Katherine Heigl – is mentioned”. In this case the search goes to the source or an index created from the source and looks for the term or part of the term that has been specified.
An analytical treatment of text might be – “Tell me about all the places where terms and information relating to Sarbanes-Oxley can be found”.
Fig 2 shows that searches are done on raw text whereas textual analytic processing is done on text that has been integrated.
The need for textual integration may not be obvious at all. To illustrate its importance, consider the following. Suppose a medical file needs to be analyzed. In the medical file is the term “ha”. If the raw data is searched on “ha”, there are many entries. But “ha” means little or nothing to the layman. So doing a search on “ha” is questionable. However, if the raw data is integrated before being searched, then for all cardiologists the term “ha” is converted to “heart attack”. For all endocrinologists the term “ha” is converted to “hepatitis A”. And for all general practitioners the term “ha” is converted to “headache”. After the conversion is done –
- there is no questionable term “ha” to be dealt with, and
- patients with heart attacks, headaches, and hepatitis A are not grouped together.
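The conversion just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table, the specialty labels, and the function name `expand_abbreviations` are hypothetical, and a real integration process would draw on a far larger, expert-maintained vocabulary.

```python
import re

# Hypothetical lookup: (specialty, abbreviation) -> expanded term.
# The three entries mirror the "ha" example in the text.
EXPANSIONS = {
    ("cardiology", "ha"): "heart attack",
    ("endocrinology", "ha"): "hepatitis A",
    ("general practice", "ha"): "headache",
}

def expand_abbreviations(note, specialty):
    """Replace known abbreviations in a note, based on the author's specialty."""
    def replace(match):
        token = match.group(0)
        # Unknown tokens (or unknown specialties) are returned unchanged.
        return EXPANSIONS.get((specialty, token.lower()), token)
    # Matching whole words keeps "ha" from being replaced inside "has" or "that".
    return re.sub(r"\b[A-Za-z]+\b", replace, note)

print(expand_abbreviations("Patient has ha since Monday", "cardiology"))
print(expand_abbreviations("Patient has ha since Monday", "general practice"))
```

The same raw note yields different integrated text depending on who wrote it, which is what keeps the three patient groups apart.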
From this simple example (and there are plenty more cases of textual data needing to be clarified before being analyzed), it is seen that integration of text unlocks the text so that effective textual analytics can be done on the text.
But the example shown is not the only reason for the need for textual integration as a foundation for analytics.
Searching For Categories Of Text
Suppose there is a body of text about ranching. Part of the body of text relates to horses. In some cases the type of horse is discussed. In other cases the age and maturity of the horse is discussed. In other cases the gender of the horse is addressed.
Now suppose that there is a desire to do analytical processing against this document or set of documents on ranching. Suppose that there is a desire to see information about horses. One way – the search engine way – to look at horses is to look for colts, then to look for ponies, then to look for studs, and so forth. The searcher must know beforehand what is being sought. Then the searcher must be able to gather all the information about horses together. Searching for a wide variety of information is tedious to do.
A better approach – the integrated text approach – is to gather all information about horses into a common category. Then the integration process identifies all the places in the text where those pieces of information about horses exist.
The kind of information that is returned when looking at integrated textual information about horses might include –
- types of hay
- horse whispering
- gelding, and so forth.
Now when the textual analyst wants to know information about horses, the textual analyst simply queries on the category – horses – and all information relating to horses is returned. Note just how different the results of a query are when textual analytic processing is done than when a simple search is done.
By integrating the raw data, the textual analyst has prepared the data for effective textual analytical processing.
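As a rough sketch of the category approach, the Python below tags every sentence that mentions any term belonging to a category. The vocabulary in `CATEGORIES` and the function name `find_category` are assumptions made for illustration; the sentence splitting is deliberately naive.

```python
import re

# Hypothetical category vocabulary, using terms from the ranching example.
# A real vocabulary would be built and maintained by subject experts.
CATEGORIES = {
    "horses": {"horse", "horses", "colt", "colts", "pony", "ponies",
               "stud", "studs", "gelding"},
}

def find_category(documents, category):
    """Return (doc_id, sentence) pairs that mention any term in the category."""
    terms = CATEGORIES[category]
    hits = []
    for doc_id, text in documents.items():
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if words & terms:
                hits.append((doc_id, sentence))
    return hits

docs = {"ranch": "The colts were moved in May. Fences need repair. A gelding was sold."}
for doc_id, sentence in find_category(docs, "horses"):
    print(doc_id, "->", sentence)
```

The analyst asks once for "horses" and never has to enumerate colts, ponies, and studs in separate searches.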
Recasting Textual Data
Fig 3 shows how raw textual data can be preprocessed and recast into an integrated form in order to set the stage for effective analytical processing.
There are indeed many forms of textual integration which can set the stage for effective textual analytical processing. Another simple example of integration that must be done to raw unstructured text before text analytics can be done is the recognition that there are multiple spellings of words, especially names. Fig 4 shows that the name – Osama Bin Laden – can be spelled many different ways.
By recognizing that there are multiple spellings of the same name, the text analytical processor will not miss mentions of Osama Bin Laden when the name is spelled differently. When a simple search engine is used, the search may fail to pick up important information about Osama Bin Laden because a variation of the name is used.
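One simple way to account for alternate spellings is to rewrite every known variant to a single canonical form before analysis. The sketch below assumes a hand-built variant list; the three variants shown are illustrative transliterations, not the list from Fig 4, and `normalize_name` is a hypothetical helper.

```python
import re

CANONICAL = "Osama Bin Laden"
# Illustrative variants only; a real list would be much longer.
VARIANTS = ["Usama bin Laden", "Osama bin Ladin", "Usama bin Ladin"]

def normalize_name(text):
    """Rewrite every known variant spelling to the canonical spelling."""
    alternatives = "|".join(re.escape(v) for v in VARIANTS)
    pattern = r"\b(?:" + alternatives + r")\b"
    # IGNORECASE also catches variants written in different capitalization.
    return re.sub(pattern, CANONICAL, text, flags=re.IGNORECASE)

print(normalize_name("Reports mention Usama bin Ladin twice."))
```

After this step a query on the canonical name finds every mention, however the source document happened to spell it.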
Stemming Raw Text
The need for integrated text only begins with the simple examples that have been described. Another way that textual data needs to be integrated is in terms of operating at the Latin or Greek stems of words.
Latin-based words tend to have similar but not quite identical spellings. If a literal search is made, the search will not recognize that one word is related to another when the two are not spelled exactly the same. As an example, consider the word – “move”. Some of the different forms of the word “move” are shown in Fig 5.
Fig 5 shows that there are different forms of the same word – “move”. If an effective analysis of the text is to be done, it must be recognized that words that have the same stem need to be considered as the same word.
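The idea can be illustrated with a deliberately crude suffix-stripping stemmer in Python. This is a sketch, not the stemming method of any particular product: the suffix list and the function name `crude_stem` are assumptions, and production work would normally rely on an established algorithm such as the Porter stemmer.

```python
from collections import defaultdict

# Suffixes are checked in order, so longer ones must come before shorter ones.
SUFFIXES = ("ements", "ement", "ing", "ers", "er", "ed", "es", "e", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(words):
    """Group words that share the same crude stem."""
    groups = defaultdict(list)
    for word in words:
        groups[crude_stem(word)].append(word)
    return dict(groups)

print(group_by_stem(["move", "moves", "moved", "moving", "mover", "movement"]))
```

All six forms land in one group, so an analysis that counts stems rather than literal words treats them as the same word.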
Indeed, there are many other considerations of the discipline of integrating text. Some of them include screening text to see if it is business relevant, punctuation removal, case sensitivity (or insensitivity), and so forth.
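Two of these considerations, punctuation removal and case insensitivity, are simple enough to sketch with the Python standard library. The function name `normalize` is hypothetical, and stripping all ASCII punctuation is an assumption that will not suit every corpus (it turns "don't" into "dont", for example).

```python
import string

def normalize(text):
    """Remove ASCII punctuation, lowercase the text, and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

print(normalize("Q3 Revenue,   up 4%!"))
```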
The Scope Of The Search And Analysis
One of the challenges of a search engine is the scope of the material accessed and analyzed by the query. A search engine is capable of drawing on vast amounts of source material (such as the Internet). A textual analytical tool, on the other hand, must access and draw upon data that it has access to and can manipulate. In other words, because textual analytics requires a serious amount of preprocessing of data in order to integrate the data, textual analytics is performed on a much smaller amount of data than searches of data. Fig 6 shows that there is a very profound difference between the scope of the data that the search engine and the analytic engine tools operate on.
It does not make sense that a search engine would integrate data before doing a search because the search engine does not have the ownership and control of the data that is being searched. Textual analytical tools, on the other hand, typically operate on data from the corporation. Indeed, there is the opportunity to access and integrate corporate data before textual analytics occurs.
Fig 7 shows the difference.
A Simple Query
In addition to standard search queries, there is a whole class of queries that the textual analyst needs to submit. One of the simplest is the query by class of data. Fig 8 shows such a query.
In Fig 8 the textual analyst has submitted a query for the category of financial information. The query for financial information includes many different terms, each of which relates to finance. Some terms that relate to finance include –
- profit, and so forth.
The query is submitted by a reference to finance. The results of the query are a reference back to each place where a term related to the query is found. This type of query is sometimes called an indirect query or a query by category.
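A query by category can be sketched as a lookup against a word index. In the Python below, "profit" comes from the text, while the other finance terms, the index layout, and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# "profit" is from the text; "revenue" and "expense" are assumed examples.
FINANCE_TERMS = {"profit", "revenue", "expense"}

def build_index(documents):
    """Map each word to the (doc_id, word_position) places where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[word].append((doc_id, position))
    return index

def query_by_category(index, terms):
    """Return every place where any term of the category is found."""
    return sorted(loc for term in terms for loc in index.get(term, []))

docs = {"memo": "Profit rose while revenue fell", "note": "No news"}
index = build_index(docs)
print(query_by_category(index, FINANCE_TERMS))
```

The analyst names only the category; the expansion into individual terms happens behind the query, which is why it is called indirect.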
An important type of query submitted by a textual analyst is that of a query looking for basic occurrences of information. In the case of Fig 9, a query has been made looking for all occurrences of the word “water”.
The query shown in Fig 9 shows that the simple search has looked for the term “water”.
Upon finding a reference to water, the next step is to do a search on the specific text preceding “water” and following “water”. These textual references to water along with their immediate text are called “snippets”.
A Snippet Search
By looking at each of the snippets, the analyst can determine the context of the word that has been sought. Fig 10 shows some snippets of text surrounding the word “water”.
Snippets are most useful for determining the context of a particular word. In Fig 10 it is seen that the term “water” refers to quite different things: a water table, a watermark, sea water that is menacing, and Waterford crystal.
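A snippet search of this kind can be sketched as follows. The function name `snippets` and the `width` parameter (the number of words kept on each side) are assumptions; the match is on substrings so that words such as "watermark" and "Waterford" are picked up along with "water".

```python
def snippets(text, term, width=3):
    """Return each occurrence of term with `width` words of context on both sides."""
    words = text.split()
    results = []
    for i, word in enumerate(words):
        # Substring match, case-insensitive: "water" also hits "Waterford".
        if term.lower() in word.lower():
            start = max(0, i - width)
            results.append(" ".join(words[start : i + width + 1]))
    return results

text = "The water table dropped last year while Waterford crystal sales rose"
for snippet in snippets(text, "water", width=2):
    print(snippet)
```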
A Proximity Search
Another type of query that the textual analyst sometimes needs to submit is a proximity query. In a proximity query, one or more documents are searched for two or more words that occur within a predetermined distance of each other. Fig 10 shows that a query is done for the words “equity” and “shares” that are in proximity to each other in the same document.
In Fig 10 a block of text is brought up where the two words are in proximity to each other.
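A minimal sketch of a two-word proximity query is shown below. Distance is measured in words, and the function name `within_proximity` and the default of five words are assumptions rather than anything specified in the text.

```python
import re

def within_proximity(text, word_a, word_b, max_distance=5):
    """Return (position_a, position_b) pairs no more than max_distance words apart."""
    words = re.findall(r"[a-z]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return [(a, b) for a in positions_a for b in positions_b
            if abs(a - b) <= max_distance]

text = ("The equity plan grants shares to staff. Later sections discuss "
        "bonds, notes, loans and, finally, more shares.")
print(within_proximity(text, "equity", "shares"))
```

Only the first mention of "shares" qualifies; the second is too far from "equity" to count as being in proximity.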
Of course proximity analysis can be done for lists of words as well as individual words. Fig 11 shows such an analysis.
In Fig 11 it is seen that there are two lists of words – one list for terms relating to Sarbanes-Oxley and another list relating to finance. A textual analytical proximity search is then done for places where a word from one list occurs in proximity to a word from the other.
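The list-based form can be sketched by checking a window of words around each hit from the first list. The word lists, the function name `list_proximity`, and the five-word window are assumptions; for simplicity the sketch handles single-word terms only, so a multi-word term would need to be matched as a phrase first.

```python
import re

def list_proximity(text, list_a, list_b, max_distance=5):
    """Return (term_a, term_b, distance) for terms from both lists found close together."""
    set_a = {t.lower() for t in list_a}
    set_b = {t.lower() for t in list_b}
    words = re.findall(r"[a-z]+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word in set_a:
            start = max(0, i - max_distance)
            window = words[start : i + max_distance + 1]
            for offset, other in enumerate(window):
                if other in set_b:
                    hits.append((word, other, abs(start + offset - i)))
    return hits

text = "The audit of Sarbanes Oxley controls delayed the revenue report."
print(list_proximity(text, ["sarbanes", "audit"], ["revenue", "profit"]))
```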