Astronomical Data Analysis Software and Systems VI
ASP Conference Series, Vol. 125, 1997
Gareth Hunt and H. E. Payne, eds.

Filtering KPNO LaTeX Observing Proposals with Perl

David J. Bell
National Optical Astronomy Observatories, Tucson, AZ 85726-6732

Abstract. An automated observing proposal processing system has been in use at KPNO during the past three years. LaTeX proposal templates are filled out by users and sent to KPNO via electronic mail. Observer and proposal information fields in these files are well-marked with LaTeX tags, thus allowing automated extraction and importation to observatory databases. A significant complication of this process is that although the fields are well-marked, the information they contain often arrives in a variety of formats that must be recognized and standardized. Perl's regular expression and text manipulation capabilities make it an excellent tool for performing these functions. This paper outlines the filtering system in use at KPNO and discusses some of the ways Perl has proven useful for parsing LaTeX documents.

1. Introduction

The Kitt Peak observing proposal handling system (Bell et al. 1996) is an automated system for distributing, receiving, and processing LaTeX observing forms and associated PostScript figures. It has already handled several thousand files for Kitt Peak alone, and has also been in use at two other observatories. Twice a year, as proposals are arriving, information must be quickly imported and updated in observatory databases. In the past this was performed through manual entry while inspecting each paper copy, a tedious and sometimes overwhelming task, particularly when over 100 proposals arrive on the final day.

This paper describes a new and better approach. The proposals are run through a filtering program that locates the desired fields in the proposal form, optionally parses each field into several database subfields, and rearranges the information into a standardized format. When confused about an entry, the program attempts to make a good guess but also flags that item for human inspection. Such an approach can save much time, while still ensuring the accuracy of imported data.

2. Why LaTeX?

LaTeX is a widely and freely available text formatting language that gets significant use throughout the astronomical community. Even inexperienced users can fill in a well-designed template form using any text editor, and submit it with any e-mail program. The completed forms serve a dual role: they can be printed to produce nicely formatted paper documents, and, since needed information is well-tagged by the structural markup, they can also be processed by automated scripts to extract data fields. If needed, they can be edited locally (impractical with PostScript documents, for example), and by modifying a single style file one can quickly reformat hundreds of documents.

© Copyright 1997 Astronomical Society of the Pacific. All rights reserved.

However, problems can occur during automated filtering. Some of the fields, such as addresses, consist of just one line on the form, but need to be split into several subfields for the database. The form could be changed, but there are already thousands of existing documents in the present format, and it would be inconsiderate to force users to split entries into many subfields that are all recombined on the printed form. A second problem is that users often embed commands into the fields, and it would be confusing if we mailed out envelopes with raw TeX in the addresses. The biggest drawback, however, is that there are no limits over what users type into the fields, whereas the database needs standardized formats.

3. Why Perl?

Perl is a widely and freely available scripting language that is best known for its many uses for system administration tasks and, more recently, as being the CGI-programming language. What makes it so good at these things is that it is a superb text-processing language. It includes some of the most efficient and powerful regular expression and string manipulation operators available anywhere, allowing it to quickly locate and manipulate myriads of tiny parcels of text. Powerful string manipulation code can be written quickly and compactly, making it great for processing LaTeX files (for instance, the popular latex2html program is a Perl script). Perl has a familiar C-like syntax and a very forgiving nature. If users leave fields blank, or type in full sentences when the form asked for an integer, the script doesn't bomb out, but can easily be programmed to do something reasonable and move on.

4. Filtering Strategies and Examples

Many of the fields extracted from the forms require very little processing other than reformatting, and this can often be done in just a few lines of code. For example, the lines:

    $phone = "($1) $2-$3 $4"
        if ($phone =~ /^\D*(\d{3})\D*(\d{3})\D*(\d{4})\s?(.*)$/);

will recognize a US phone number in practically any likely format, such as "800-5551212ext123," and standardize it to "(800) 555-1212 ext123." Some more interesting examples will be discussed in the following sections.

4.1. Names and Addresses

Name and address entries on the form need to be split into subfields for the database. LaTeX codes are stripped out (e.g., diacritical marks) or replaced (e.g., non-English characters). Names are split into an array based on punctuation and whitespace, and then compared to lists of titles to be thrown away, or likely surname components that need to be recombined into a last-name field.
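The name handling just described might look roughly like the following. This is a minimal sketch, not the KPNO script: the helper names strip_latex and parse_name, and the short title, particle, and suffix lists, are assumptions made for illustration, since the paper does not list the tables actually used.

```perl
use strict;
use warnings;

# Hypothetical lookup lists; the real lists are not given in the paper.
my %TITLE    = map { $_ => 1 } qw(prof dr mr mrs ms);
my %PARTICLE = map { $_ => 1 } qw(van de der den von la le di da);
my %SUFFIX   = map { $_ => 1 } qw(jr sr ii iii);

# Strip or replace LaTeX markup in a field, leaving plain text.
sub strip_latex {
    my ($s) = @_;
    $s =~ s/\\([Oo])(?![a-zA-Z])\s*/$1/g;      # \O, \o (slashed o) -> plain letter
    $s =~ s/\\[\^'`"~=.]\s*\{(\w)\}/$1/g;      # accent with braces: \^{o} -> o
    $s =~ s/\\[\^'`"~=.]\s*(\w)/$1/g;          # accent without braces: \^o -> o
    $s =~ s/\\[a-zA-Z]+\s*//g;                 # other control words, e.g. \small
    $s =~ s/\\(\W)/$1/g;                       # escaped symbols: "\ " and "\&"
    $s =~ tr/{}//d;                            # grouping braces
    $s =~ tr/~/ /;                             # ties become ordinary spaces
    return $s;
}

# Split a name into first name, middle initial(s), and last name.
sub parse_name {
    my ($raw) = @_;
    my @words = grep { length } split /[\s.,]+/, strip_latex($raw);
    @words = grep { !$TITLE{lc $_} } @words;             # throw away titles

    # A trailing suffix such as "Jr" is kept with the surname.
    my $suffix = (@words && $SUFFIX{lc $words[-1]}) ? pop(@words) : '';

    # The last word is the surname; pull in preceding particles.
    my @last = @words ? (pop(@words)) : ();
    unshift(@last, pop(@words)) while @words && $PARTICLE{lc $words[-1]};

    my $first = @words ? shift(@words) : '';
    my $mi    = join ' ', map { substr($_, 0, 1) } @words;
    my $ln    = join ' ', map { ucfirst(lc($_)) } @last;
    $ln .= ', ' . ucfirst(lc($suffix)) if $suffix;
    return (FN => $first, MI => $mi, LN => $ln);
}

my %name = parse_name('Prof.~Dr.~Ant{\^o}nio-Ryan M.\ VAN DE W\O RF, Jr.');
print "$_: $name{$_}\n" for qw(FN MI LN);
```

Keeping the titles and surname particles in lookup tables means a new title or particle needs only a new list entry, not a change to the parsing logic.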



    \name{Prof.~Dr.~Ant{\^o}nio-Ryan M.\ VAN DE W\O RF, Jr.}
    \address{\small Dept.~of Phys.~\& Astr.; Mail Stop 16; rm 101; VICTORIA B.C.\ V8 x4 m6~~Canada}

    FN: Antonio-Ryan
    MI: M
    LN: Van De Worf, Jr
    A1: Dept. of Phys. & Astr
    A2: Mail Stop 16, rm 101
    CY: Victoria
    ST: BC
    ZC: V8X 4M6
    CO: CANADA

Figure 1. LaTeX Name and Address Fields and Filter Output. Formatting and special-character codes have been stripped and the entries correctly parsed into subfields. The province, postal code, and country name have been standardized.
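The address half of Figure 1 could be approximated as below. This is a sketch under stated assumptions rather than the script in use at KPNO: the function name parse_address, the two-entry city and country lists, and the FLAG field are hypothetical, and the input is assumed to have had its LaTeX markup removed already. It follows the known-city search that Section 4.1 describes, falling back to a punctuation-based guess that is flagged for inspection.

```perl
use strict;
use warnings;

# Hypothetical search lists; per the paper, the real ones come from past proposals.
my @CITIES    = ('Victoria', 'Tucson');
my @COUNTRIES = ('Canada', 'USA');

# Split a one-line address (LaTeX markup already removed) into subfields.
sub parse_address {
    my ($addr) = @_;
    my %f = (FLAG => 0);

    # Country: a known name at the very end of the line.
    for my $co (@COUNTRIES) {
        if ($addr =~ s/[\s,;]*\b\Q$co\E\.?\s*$//i) {
            $f{CO} = uc $co;
            last;
        }
    }

    # City: text before it is the street part, text after it is region and code.
    my ($street, $rest);
    for my $cy (@CITIES) {
        if ($addr =~ /^(.*?)[\s,;]*\b\Q$cy\E\b[\s,;]*(.*)$/i) {
            ($street, $rest, $f{CY}) = ($1, $2, $cy);
            last;
        }
    }
    unless (defined $street) {
        # Unknown city: guess from punctuation and flag for human inspection.
        my @parts = split /\s*[;,]\s*/, $addr;
        $rest    = @parts > 1 ? pop(@parts) : '';
        $street  = join '; ', @parts;
        $f{FLAG} = 1;
    }

    # Street part: the first item is line 1, the remainder is line 2.
    my @lines = split /\s*;\s*/, $street;
    s/[.,\s]+$// for @lines;
    $f{A1} = @lines ? shift(@lines) : '';
    $f{A2} = join ', ', @lines;

    # Region and postal code, e.g. "B.C. V8 x4 m6" -> "BC" and "V8X 4M6".
    if ($rest =~ /^([A-Za-z.]+)\s+(.+)$/) {
        my ($st, $zc) = (uc($1), uc($2));
        $st =~ tr/.//d;
        $zc =~ s/\s+//g;
        $zc =~ s/^([A-Z]\d[A-Z])(\d[A-Z]\d)$/$1 $2/;    # Canadian postal code layout
        @f{qw(ST ZC)} = ($st, $zc);
    }
    return %f;
}

my %addr = parse_address(
    'Dept. of Phys. & Astr.; Mail Stop 16; rm 101; VICTORIA B.C. V8 x4 m6  Canada');
print "$_: ", $addr{$_} // '', "\n" for qw(A1 A2 CY ST ZC CO);
```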

    \telescope{ 4.0-meter~~~}
    \instrument{Prime focus camera with the new 4$\times$2 CCD mosaic}

    TE: 4m
    IN: PF
    DE: MOSA

Figure 2. LaTeX Telescope and Instrument Fields and Filter Output. Regular expression hashes have been used to identify items and return standardized database codes.
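The regular-expression hashes mentioned in the caption can be sketched as follows. Only the RCSP pattern comes from the paper; the PF and MOSA patterns, the case-insensitive matching, and the helper name lookup_code are assumptions made so that the example reproduces the codes shown in Figure 2.

```perl
use strict;
use warnings;

# Standardized code => pattern matching the ways users ask for that item.
# RCSP is the paper's pattern; the others are hypothetical.
my %INSTRUMENT = (
    RCSP => qr/r[-\.\s]*c[-\.\s]*sp/i,
    PF   => qr/prime\s*focus|\bpf\b/i,
);
my %DETECTOR = (
    MOSA => qr/mosaic/i,
);

# Step through a table and return the first code whose pattern matches.
# Returns nothing (undef in scalar context) when no pattern matches.
sub lookup_code {
    my ($entry, $table) = @_;
    for my $code (sort keys %$table) {
        return $code if $entry =~ $table->{$code};
    }
    return;
}

my $request = 'Prime focus camera with the new 4x2 CCD mosaic';
for my $pair ([IN => \%INSTRUMENT], [DE => \%DETECTOR]) {
    my ($field, $table) = @$pair;
    my $code = lookup_code($request, $table);
    print "$field: ", defined $code ? $code : '??', "\n";
}
```

An entry that matches no pattern comes back undefined, which gives the caller a natural place to flag the field for human inspection.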

    \optimaldates{2/16-3/2, 4/14-5/1 or 5/13-31}
    \acceptabledates{21DEC1996---07JUN1997\hfil}
    \optimaldates{\it Feb.~1st--27th or March 2nd--23rd, 1997}
    \acceptabledates{Late sept.\ through early-april}

    OD: Feb 16 - Mar 2, Apr 14 - May 1, May 13 - May 31
    AD: Dec 21 - Jun 7
    OD: Feb 1 - Feb 27, Mar 2 - Mar 23
    AD? Sep 20 - Apr 10

Figure 3. LaTeX Date Fields and Filter Output. Since the last range is somewhat vague, the filter has flagged the field for human inspection.
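One way to reproduce the output in Figure 3 is sketched below, following the steps of Section 4.3: replace range and list words with symbols, drop years and ordinal suffixes, split, and standardize months. It is an illustration, not the KPNO code. The function names, the assumption that LaTeX markup has already been removed, and the day chosen for "mid" are all assumptions; the days for "late" (20) and "early" (10) are inferred from the figure.

```perl
use strict;
use warnings;

my @MON   = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my $NAMES = lc(join('|', @MON));
# Day numbers for vague words: 20 and 10 match Figure 3, 15 is an assumption.
my %VAGUE = (early => 10, mid => 15, late => 20);

# Turn one endpoint ("2/16", "21dec", "late sept.") into (month, day, flag).
sub parse_endpoint {
    my ($tok, $default_mon) = @_;
    my ($mon, $day, $flag) = ($default_mon, undef, 0);
    if ($tok =~ s{(\d+)\s*/\s*}{}) {                       # numeric month: "2/16"
        $mon = ($1 >= 1 && $1 <= 12) ? $MON[$1 - 1] : undef;
    }
    elsif ($tok =~ s/(?<![a-z])($NAMES)[a-z]*\.?//) {      # month name, any spelling
        $mon = ucfirst $1;
    }
    if    ($tok =~ /(\d+)/)            { $day = $1 + 0 }
    elsif ($tok =~ /(early|mid|late)/) { $day = $VAGUE{$1}; $flag = 1 }
    $flag = 1 unless defined $mon && defined $day;
    return ($mon, $day, $flag);
}

# Return the standardized date string and a flag for human inspection.
sub parse_dates {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/\b(?:through|thru|to)\b/-/g;            # range words -> "-"
    $s =~ s/\b(?:or|and)\b/,/g;                     # list words  -> ","
    $s =~ s/(?<=\d)(?:st|nd|rd|th)\b//g;            # ordinal abbreviations
    $s =~ s/,?\s*(?<!\d)(?:19|20)\d\d(?!\d)//g;     # four-digit years
    $s =~ s/\b(early|mid|late)\b[-\s]*/$1 /g;       # "early-april" -> "early april"
    $s =~ s/\s*-+\s*/-/g;                           # "--", "---", " - " -> "-"

    my ($flag, @out) = (0);
    for my $range (grep { /\S/ } split /,/, $s) {
        my ($mon, @ends);
        for my $tok (split /-/, $range) {
            my ($m, $d, $f) = parse_endpoint($tok, $mon);
            $mon = $m;                              # a later bare day inherits this month
            $flag ||= $f;
            push @ends, join(' ', $m // '???', $d // '??');
        }
        push @out, join(' - ', @ends);
    }
    return (join(', ', @out), $flag);
}

for my $entry ('2/16-3/2, 4/14-5/1 or 5/13-31', '21DEC1996---07JUN1997',
               'Feb. 1st--27th or March 2nd--23rd, 1997',
               'Late sept. through early-april') {
    my ($dates, $flagged) = parse_dates($entry);
    print(($flagged ? '? ' : ': '), $dates, "\n");
}
```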

Address parsing is more difficult, due to widely varying punctuation and foreign addresses. For this reason, lists of cities and countries from past proposals are first searched, and once a city is found, the parsing is almost always right. When new cities show up, the script makes an attempt at parsing based on punctuation, flags the data, and logs the new city for possible addition to the search list. See Figure 1.

4.2. Telescopes, Instruments, and Detectors

The primary problem in these cases is in recognizing the many ways in which the same item can be requested. This is achieved by stepping through associative arrays in which a standardized code is the key and a regular expression that matches various forms of that item is the value. For instance the hash element

    RCSP => r[-\.\s]*c[-\.\s]*sp

can be used to map user entries such as "r-c spectrograph" or "R. C. Spec" to the standardized code "RCSP." When new instruments become available, the script can be quickly updated simply by adding a new code-regex pair to the hash. See Figure 2.

4.3. Observing Dates

Interpreting date strings requires parsing a string into several date ranges, then each range into dates, and finally each date into a day and month. Commonly used English words like "through" and "or" are first replaced with symbols like "-" and "," respectively, before splitting on these symbols. Unneeded information, such as years and ordinal abbreviations, is removed. Months are then standardized with a series of substitutions, so that the strings "09/", "SEP", "september", etc., will all be turned into "Sep". Once this month string is pulled out, what's left is hopefully a day number; if not, special cases like "mid" are considered. See Figure 3.

5. Conclusion

A Perl script has been developed for filtering KPNO LaTeX observing proposals. Although the desired information arrives in a large variety of formats, Perl's powerful text manipulation capabilities allow the script to accurately identify and reformat entries for database import. Only a few standardization problems remain, and these may be eliminated in the future with pre-configured menus and buttons on an HTML form; such an interface to the LaTeX template is currently being designed.

References

Bell, D., Biemesderfer, C. D., Barnes, J., & Massey, P. 1996, in ASP Conf. Ser., Vol. 101, Astronomical Data Analysis Software and Systems V, ed. G. H. Jacoby & J. Barnes (San Francisco: ASP), 451