Using Natural Language Processing to Perform Information Extraction
December 13, 2012
I. PROBLEM STATEMENT
Most large corporations provide various information technologies to their workers. To support the provided hardware and software, they often have a help desk or service desk made up of people and software. Service desk employees are often entry-level and have little understanding of the business and its specific technical problems, so software is employed to route tickets to the appropriate staff and to record the knowledge that is learned along the way. For improved efficiency, it is important to be able to access the information that has been amassed in the service desk software, but simple text queries often fall short. The purpose of this experiment is to use natural language processing techniques to extract information from a series of service desk tickets, then structure the information so that it can be retrieved with more precision than simple text queries provide.
Modern America is a divided society. Maybe it is a side effect of our socioeconomic system, which prides itself on thriving through competition, or maybe it is natural. Whatever the cause, America must strive to identify and eliminate those divisions which have a net negative effect on its people. Of those negative divisions, race is probably the most pernicious. Scholars such as W. Lloyd Warner and Michelle Alexander have argued that the racial divide is so extreme that it almost resembles a caste system. Regardless of exactly how divisive race is, it is clearly still an issue in contemporary America, and it carries incalculable costs. As W.E.B. Du Bois said of racial discrimination, “[It] is morally wrong, politically dangerous, industrially wasteful, and socially silly.” (Katz and Sugrue 1998, 55) All Americans should support actions to reduce the burden of historical racism. Racial division was originally a direct result of laws created within the United States, so it should be possible to use the same system of laws to redress the issue of racial discrimination and division.
The software development process for a large midwestern manufacturing firm is likely to be much less efficient than that of a fresh young startup in Silicon Valley. Many of these large firms have been around for years, and their processes were likely developed prior to the digital revolution. According to Lehman’s third law, on self-regulation, iterations of a software package will take on a uniform set of characteristics over time; this is true for business processes as well, since firms tend to keep doing what has made them successful in the past. Unfortunately, applying industrial-revolution-era practices to digital-revolution-era technologies yields less than optimal results. However, with an open mind and an able guide, it is possible for these established firms to implement an efficient, modern software development strategy.
My wife and I enjoy playing the iPhone game LetterPress, which was recently released by atebits. Atebits’s lead developer, Loren Brichter, who also wrote Tweetie (which became the official Twitter app for the iPhone), writes some great software. So, if you have not done so already, you should definitely try this free game [and then make the in-app purchase if you enjoy it]. That being said, I did come up with one tiny complaint.
Recently a board came up with five “Z”s, two “V”s, two “K”s, an “X”, and a “J”. Those are five of the seven least used letters in the English dictionary; only “Q” and “W” were missing. Almost half of the game board was composed of very infrequently used letters. The game was still playable, but I thought that gameplay might be a little better if concentrations of infrequently used letters were minimized. I think that the chance of getting a “Z” on any given tile should be less than 1 in 26, and instead of just complaining about the game I decided to see what it would take to create an improved random letter generator.
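One way to make a “Z” less likely than 1 in 26 is to draw each tile from a weighted distribution instead of a uniform one. The following is only a minimal sketch of that idea, not the generator LetterPress uses. The weights are approximate English text frequencies that I am assuming for illustration, and the names LETTER_WEIGHTS and generate_board, as well as the default of 25 tiles, are hypothetical.

```python
import random

# Approximate relative letter frequencies in English text (percent).
# Assumed values for illustration only; they are not taken from the game.
LETTER_WEIGHTS = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7,
    "S": 6.3, "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8,
    "U": 2.8, "M": 2.4, "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0,
    "P": 1.9, "B": 1.5, "V": 1.0, "K": 0.8, "J": 0.15, "X": 0.15,
    "Q": 0.1, "Z": 0.07,
}


def generate_board(size=25, rng=None):
    """Return a list of `size` letters drawn with frequency-based weights.

    `rng` may be a seeded random.Random instance for repeatable boards.
    """
    rng = rng or random.Random()
    letters = list(LETTER_WEIGHTS)
    weights = list(LETTER_WEIGHTS.values())
    # random.choices samples with replacement, so duplicates are still
    # possible; rare letters are simply much less likely to appear.
    return rng.choices(letters, weights=weights, k=size)


if __name__ == "__main__":
    board = generate_board()
    # Print the tiles as a 5 x 5 grid.
    for row in range(0, len(board), 5):
        print(" ".join(board[row:row + 5]))
```

Because every tile is drawn independently, a cluster of rare letters can still occur; it just becomes far less probable than with a uniform draw. A stricter generator could also cap the number of rare letters per board.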
Preparing SAP data to be printed as a linear barcode can be a pain at times; when 2D barcodes with proper field separators are required the pain factor goes up. Most of the problems come from trying to convince SAP to assign non-printing control characters to the output variable. So, I have written a small example which demonstrates both how to generate a PDF417 barcode and the steps required to assign special characters to a string in ABAP. SAPscript cannot handle 2D barcodes, so this code will only really work in SmartForms.
My boys and I enjoy playing a mobile version of the classic battleship game when we are waiting our turn at the barbershop. However, the artificial intelligence algorithm this specific game uses is so feeble that even my youngest son can consistently beat the computer player. So, I started thinking about improving the algorithm. I searched the web to see if there was already an established, dominant algorithm. Although I found several clever implementations, including one that used probabilities and another based upon a checkerboard pattern, I did not find one that I particularly enjoyed. After thinking about the problem further I came to the conclusion that this problem would be well suited for a dynamic programming algorithm.
From my perspective, the best approach to take when searching for the opponent’s ship is to target a square that is in the center of the longest line of unmarked squares. It would be even better to find a target which is at the intersection of two long lines of unmarked squares. To me, this is an effective divide and conquer approach, similar in spirit to the concept of binary trees; the problem is finding an efficient algorithm. The problem seems to lend itself perfectly to the dynamic programming approach.
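The targeting idea can be sketched with four run-length tables filled in by two dynamic programming passes over the grid. This is only an illustration under my own assumptions, not a finished or established algorithm: the board is a non-empty list of rows of booleans where True means the square has not been fired upon, and the scoring rule (how close a square is to the middle of its horizontal run plus the same for its vertical run) is one possible reading of “center of the longest line”. The name best_target is hypothetical.

```python
def best_target(unmarked):
    """Return (row, col) of the highest-scoring unmarked square, or None.

    `unmarked` is a non-empty list of equal-length rows; True marks a
    square that has not been fired upon yet.
    """
    rows, cols = len(unmarked), len(unmarked[0])
    left = [[0] * cols for _ in range(rows)]
    right = [[0] * cols for _ in range(rows)]
    up = [[0] * cols for _ in range(rows)]
    down = [[0] * cols for _ in range(rows)]

    # Forward pass: length of the unmarked run ending at (r, c) when
    # arriving from the left and from above, reusing the neighbor's value.
    for r in range(rows):
        for c in range(cols):
            if unmarked[r][c]:
                left[r][c] = 1 + (left[r][c - 1] if c > 0 else 0)
                up[r][c] = 1 + (up[r - 1][c] if r > 0 else 0)

    # Backward pass: the same run lengths toward the right and downward.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            if unmarked[r][c]:
                right[r][c] = 1 + (right[r][c + 1] if c < cols - 1 else 0)
                down[r][c] = 1 + (down[r + 1][c] if r < rows - 1 else 0)

    # Assumed scoring rule: min(left, right) is largest at the center of a
    # long horizontal run, and min(up, down) does the same vertically, so
    # the sum favors intersections of two long lines.
    best, best_score = None, 0
    for r in range(rows):
        for c in range(cols):
            if unmarked[r][c]:
                score = min(left[r][c], right[r][c]) + min(up[r][c], down[r][c])
                if score > best_score:
                    best, best_score = (r, c), score
    return best
```

Each table entry is computed from a single neighbor, so the whole search is linear in the number of squares. Ties are broken in favor of the first square found in row-major order.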
Replacing characters in a string is a simple programming task which should be ridiculously easy, right? It turns out that is not always the case, especially if you have to write SAP ABAP code. Let us assume that there is a variable with a base type of CHAR30 (a string of 30 characters) which needs to have its internal spaces replaced with hyphens. In theory the following statement should work.
DATA: lv_char30 TYPE char30.
lv_char30 = 'AAAA BBBB CCC'.
REPLACE ALL OCCURRENCES OF ' ' IN lv_char30 WITH '-'.
SAP ABAP is a difficult programming language to learn; not only is it old and clunky, but there are also huge barriers to accessing the only programming environment available for it. To write ABAP programs you need a complete, working SAP system, and even professionally managed development systems are difficult to program on due to the lack of data in those systems. I think that this at least partially contributes to the lack of good online reference material. I know that it is possible to search the SAP help forums, but the code there often looks like a giant glob of unformatted nonsense. Since I do not have time right now to create a decent site which can explain ABAP to people, I thought I would simply share an example program which shows many ABAP program/report programming techniques all in one spot.